cs.CL

[1] Collaborative and Proactive Management of Task-Oriented Conversations

Arezoo Saedi,Afsaneh Fatemi,Mohammad Ali Nematbakhsh,Sophie Rosset,Anne Vilnat

Main category: cs.CL

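TL;DR: This paper proposes a model for task-oriented dialogue systems based on the information state approach. It uses the in-context learning of large language models and incorporates constructive intermediate information for goal-aware planning, improving task completion rate and user satisfaction.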

Details

Motivation: Existing task-oriented dialogue systems often overlook effective goal-aware planning, which is crucial for task completion. This paper therefore aims to improve dialogue system performance through a better planning mechanism.
Method: A dialogue management system is built on the information state approach. Informational components for predefined slots and a text part are defined to model user preferences, and informational components for critical circumstances are identified, yielding a limited set of information states. Dialogue moves and an update strategy are designed over these states, and the model is implemented with the in-context learning of large language models.
Result: Evaluated on the MultiWOZ dataset, the model achieves maximal inform and success rates on single-domain conversations, improving on previous methods.
Conclusion: By effectively integrating intermediate information and goal-aware planning, the proposed model significantly improves the performance of task-oriented dialogue systems, performing especially well in handling user preferences and database queries.
Abstract: Task-oriented dialogue (TOD) systems complete particular tasks based on user preferences across natural language interactions. Considering the impressive performance of large language models (LLMs) in natural language processing (NLP) tasks, most of the latest TODs are centered on LLMs. While proactive planning is crucial for task completion, many existing TODs overlook effective goal-aware planning. This paper creates a model for managing task-oriented conversations, conceptualized around the information state approach to dialogue management. The created model incorporates constructive intermediate information in planning. Initially, predefined slots and text part informational components are created to model user preferences. Investigating intermediate information, critical circumstances are identified. Informational components corresponding to these circumstances are created. Possible configurations for these informational components lead to limited information states. Then, dialogue moves, which indicate movement between these information states and the procedures that must be performed in the movements, are created. Eventually, the update strategy is constructed. The created model is implemented leveraging in-context learning of LLMs. In this model, database queries are created based on the indicated predefined slots, and the order of retrieved entities is determined based on the text part. This mechanism enables passing all entities that correspond to the preferences, in order of congruency. Evaluations exploiting the complete test conversations of MultiWOZ, with no more than one domain per conversation, illustrate maximal inform and success, and improvement compared with previous methods.
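The retrieval mechanism described in the abstract (query on predefined slots, then order results by the text part) might be sketched as follows. This is only an illustration under stated assumptions: the in-memory database, the field names, and the word-overlap ranking are all hypothetical, and the overlap score is a stand-in for the LLM-based ordering the paper describes.

```python
def query_entities(db, slots):
    """Keep entities whose fields match every predefined slot constraint."""
    return [e for e in db if all(e.get(k) == v for k, v in slots.items())]


def rank_by_text_part(entities, text_part):
    """Order entities by word overlap with the free-text preference.

    Word overlap is an assumption made for this sketch; it replaces the
    LLM-based congruency ordering mentioned in the abstract.
    """
    wanted = set(text_part.lower().split())

    def overlap(entity):
        return len(wanted & set(entity["description"].lower().split()))

    return sorted(entities, key=overlap, reverse=True)


# Hypothetical toy database (not MultiWOZ data).
DB = [
    {"name": "A", "area": "centre", "price": "cheap", "description": "quiet place with garden"},
    {"name": "B", "area": "centre", "price": "cheap", "description": "lively bar with live music"},
    {"name": "C", "area": "north", "price": "cheap", "description": "quiet garden terrace"},
]

matches = query_entities(DB, {"area": "centre", "price": "cheap"})
ranked = rank_by_text_part(matches, "somewhere quiet with a garden")
names = [e["name"] for e in ranked]
```

Here every entity that satisfies the slots is returned, and the text part only affects the order, which mirrors the "whole corresponding entities in the order of congruency" idea.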

[2] Trainable Reference-Based Evaluation Metric for Identifying Quality of English-Gujarati Machine Translation System

Nisheeth Joshi,Pragya Katyayan,Palak Arora

Main category: cs.CL

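TL;DR: This paper proposes a supervised-learning-based machine translation evaluation metric for Gujarati. Two models with different numbers of hidden layers are trained on 25 features, and on 1000 translation outputs the metric correlates with human ratings better than existing metrics.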

Details

Motivation: MT evaluation methods that work for English and other European languages perform poorly on Indian languages, so a more effective evaluation metric is needed for Gujarati.
Method: A reference-based MT evaluation metric based on supervised learning is proposed. Two neural network models are trained on 25 features, with 6 and 10 hidden layers respectively, each for 500 epochs.
Result: On a dataset of 1000 samples, the proposed metric correlates with human ratings better than other existing metrics.
Conclusion: The study develops an MT evaluation metric suited to Gujarati that significantly improves correlation with human judgments, offering an effective solution for MT evaluation of Indian languages.
Abstract: Machine Translation (MT) Evaluation is an integral part of the MT development life cycle. Without analyzing the outputs of MT engines, it is impossible to evaluate the performance of an MT system. Through experiments, it has been identified that what works for English and other European languages does not work well with Indian languages. Thus, in this paper, we have introduced a reference-based MT evaluation metric for Gujarati which is based on supervised learning. We have trained two versions of the metric, which use 25 features for training. Among the two models, one model is trained using 6 hidden layers with 500 epochs while the other model is trained using 10 hidden layers with 500 epochs. To test the performance of the metric, we collected 1000 MT outputs of seven MT systems. These MT engine outputs were compared with 1 human reference translation. While comparing the developed metrics with other available metrics, it was found that the metrics produced better human correlations.

[3] Hallucination is Inevitable for LLMs with the Open World Assumption

Bowen Xu

Main category: cs.CL

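TL;DR: This paper reframes "hallucination" in large language models as a manifestation of the generalization problem, analyzes its inevitability under the Closed World and Open World assumptions, and argues that it should be treated as a structural feature to be made compatible with human intelligence rather than as a mere defect.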

Details

Motivation: Given the conditions required for artificial general intelligence (AGI), the existing views of hallucination, either as a removable defect or as a theoretical inevitability, are both incomplete, and a deeper understanding is needed.
Method: The Closed World and Open World assumptions are introduced to build a classification of hallucination that distinguishes correctable types from unavoidable ones.
Result: Under the Open World assumption hallucination is inevitable and some types cannot be fully eliminated, whereas under the Closed World assumption hallucination may be mitigated.
Conclusion: Hallucination is not only an engineering defect but also a structural feature of model generalization; future system designs should tolerate and manage hallucination so that it works alongside human intelligence.
Abstract: Large Language Models (LLMs) exhibit impressive linguistic competence but also produce inaccurate or fabricated outputs, often called "hallucinations". Engineering approaches usually regard hallucination as a defect to be minimized, while formal analyses have argued for its theoretical inevitability. Yet both perspectives remain incomplete when considering the conditions required for artificial general intelligence (AGI). This paper reframes "hallucination" as a manifestation of the generalization problem. Under the Closed World assumption, where training and test distributions are consistent, hallucinations may be mitigated. Under the Open World assumption, however, where the environment is unbounded, hallucinations become inevitable. This paper further develops a classification of hallucination, distinguishing cases that may be corrected from those that appear unavoidable under open-world conditions. On this basis, it suggests that "hallucination" should be approached not merely as an engineering defect but as a structural feature to be tolerated and made compatible with human intelligence.

[4] Towards Structured Knowledge: Advancing Triple Extraction from Regional Trade Agreements using Large Language Models

Durgesh Nandini,Rebekka Koch,Mirco Schoenfeld

Main category: cs.CL

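TL;DR: This study examines how effectively large language models (LLMs) extract Subject-Predicate-Object triples from natural language text in the economics domain, applies them to information extraction from regional trade agreement texts, and compares zero-shot, one-shot, and few-shot prompting.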

Details

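Motivation: To use LLMs to automatically build structured knowledge graphs for the economics domain, improving the efficiency and accuracy of extracting key information from complex legal trade texts.
Method: The Llama 3.1 model is applied, with positive and negative examples, to triple extraction from regional trade agreement texts under zero-shot, one-shot, and few-shot prompting, and performance is evaluated with quantitative and qualitative metrics.
Result: The model extracts trade-related triples effectively across prompting settings; few-shot prompting with positive and negative examples performs best, though challenges such as entity ambiguity and long-text understanding remain.
Conclusion: LLMs show great potential for knowledge extraction from economic text, and well-designed prompting strategies can significantly improve performance, offering a feasible path toward knowledge graphs for the economics domain.
Abstract: This study investigates the effectiveness of Large Language Models (LLMs) for the extraction of structured knowledge in the form of Subject-Predicate-Object triples. We apply the setup to the domain of Economics. The findings can be applied to a wide range of scenarios, including the creation of economic trade knowledge graphs from natural language legal trade agreement texts. As a use case, we apply the model to regional trade agreement texts to extract trade-related information triples. In particular, we explore the zero-shot, one-shot and few-shot prompting techniques, incorporating positive and negative examples, and evaluate their performance based on quantitative and qualitative metrics. Specifically, we used the Llama 3.1 model to process the unstructured regional trade agreement texts and extract triples. We discuss key insights, challenges, and potential future directions, emphasizing the significance of language models in economic applications.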

[5] CARE: Cognitive-reasoning Augmented Reinforcement for Emotional Support Conversation

Jie Zhu,Yuanchen Zhou,Shuo Jiang,Junhui Li,Lifan Guo,Feng Chen,Chi Zhang,Fang Kong

Main category: cs.CL

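TL;DR: Proposes the CARE framework, which improves the quality of emotional support conversation by strengthening logical reasoning using the original training set, without relying on large-scale synthetic data.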

Details

Motivation: Existing work mostly focuses on data augmentation and synthetic corpus construction while overlooking the deeper cognitive reasoning behind effective emotional support.
Method: The original ESC training set is used to guide the model toward logically coherent and supportive responses, and reinforcement learning further refines the reasoning process.
Result: Experiments show that CARE significantly improves the logical soundness and supportive quality of responses.
Conclusion: CARE advances the development of empathetic, cognitively robust, and human-like emotional support systems.
Abstract: Emotional Support Conversation (ESC) plays a vital role in alleviating psychological stress and providing emotional value through dialogue. While recent studies have largely focused on data augmentation and synthetic corpus construction, they often overlook the deeper cognitive reasoning processes that underpin effective emotional support. To address this gap, we propose CARE, a novel framework that strengthens reasoning in ESC without relying on large-scale synthetic data. CARE leverages the original ESC training set to guide models in generating logically coherent and supportive responses, thereby explicitly enhancing cognitive reasoning. Building on this foundation, we further employ reinforcement learning to refine and reinforce the reasoning process. Experimental results demonstrate that CARE significantly improves both the logical soundness and supportive quality of responses, advancing the development of empathetic, cognitively robust, and human-like emotional support systems.

[6] MADS: Multi-Agent Dialogue Simulation for Diverse Persuasion Data Generation

Mingjin Li,Yu Liu,Huayi Liu,Xiang Ye,Chao Jiang,Hongguang Zhang

Main category: cs.CL

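TL;DR: MADS is a scalable framework that generates persuasive multi-turn dialogues through multi-agent self-play. User, dialogue, and optimization agents simulate and refine dialogue outcomes, significantly raising the conversion rate of small LLMs in a real-world marketing scenario.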

Details

Motivation: To address industry challenges such as lack of user data, cold-start evaluation difficulties, and prompt inefficiency by generating training data at low cost without human annotation.
Method: Three coordinated agents (User Agents, a Dialog Agent, and an Optimization Agent) simulate multi-turn dialogues, combined with Chain-of-Attitude (CoA) modeling and persuasion assessment by dedicated LLMs.
Result: In a real-world marketing scenario, MADS raises the organic traffic conversion rate of small LLMs from 1.83% to 2.24%, a relative increase of 22.4%.
Conclusion: MADS effectively generates high-quality persuasive dialogue data, significantly improves model persuasion capacity, and has clear business value.
Abstract: We propose MADS (Multi-Agent Dialogue Simulation), a scalable framework for generating persuasive multi-turn dialogues via agent self-play. MADS employs three coordinated agents: User Agents simulating diverse persona-driven behaviors, a Dialog Agent executing task-oriented persuasion strategies and an Optimization Agent evaluating and refining dialogue outcomes. We further validate its effectiveness through users' Chain-of-Attitude (CoA) modeling and dedicated LLMs' persuasion assessment. This approach enables low-cost generation of training data without human annotation, addressing key industry challenges such as lack of user data, cold-start evaluation difficulties, and prompt inefficiency. Applied to a real-world marketing scenario, MADS significantly improved the persuasion capacity of small LLMs, increasing the organic traffic conversion rate by 22.4% (from 1.83% to 2.24%), demonstrating clear business value.
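The 22.4% figure is a relative increase, not a gain in percentage points. A short check of the arithmetic, using only the two conversion rates stated in the abstract (the function name is illustrative):

```python
def relative_lift(before, after):
    """Relative change of `after` with respect to `before`."""
    return (after - before) / before


# Conversion rates in percent, as reported in the abstract.
lift = relative_lift(1.83, 2.24)   # relative increase
absolute_gain = 2.24 - 1.83        # gain in percentage points

print(f"relative lift: {lift:.1%}, absolute gain: {absolute_gain:.2f} points")
```

The absolute gain is 0.41 percentage points; dividing by the 1.83% baseline gives the reported relative lift of about 22.4%.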

[7] Catalog-Native LLM: Speaking Item-ID Dialect with Less Entanglement for Recommendation

Reza Shirkavand,Xiaokai Wei,Chen Wang,Zheng Hui,Heng Huang,Michelle Gong

Main category: cs.CL

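TL;DR: This paper proposes IDIOMoE, which treats item interaction histories as a native dialect within the language space, unifying the strengths of collaborative filtering and large language models to deliver both recommendation performance and text understanding.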

Details

Motivation: Modern recommendation systems need both the efficiency and accuracy of collaborative filtering and the semantic expressiveness of large language models, to meet growing user expectations for natural-language queries and explainability.
Method: The Item-ID + Oral-language Mixture-of-Experts Language Model (IDIOMoE) is proposed. Within a pretrained LLM, separate expert networks are set up for text and for item interaction histories, and token-type gating prevents interference between the two modalities.
Result: IDIOMoE performs well on multiple public and proprietary datasets, with strong recommendation performance while retaining the text understanding of the original LLM.
Conclusion: Integrating item interactions into an LLM as a language dialect is feasible and effective, and IDIOMoE offers a new paradigm for combining collaborative signals with semantic reasoning.
Abstract: While collaborative filtering delivers predictive accuracy and efficiency, and Large Language Models (LLMs) enable expressive and generalizable reasoning, modern recommendation systems must bring these strengths together. Growing user expectations, such as natural-language queries and transparent explanations, further highlight the need for a unified approach. However, doing so is nontrivial. Collaborative signals are often token-efficient but semantically opaque, while LLMs are semantically rich but struggle to model implicit user preferences when trained only on textual inputs. This paper introduces Item-ID + Oral-language Mixture-of-Experts Language Model (IDIOMoE), which treats item interaction histories as a native dialect within the language space, enabling collaborative signals to be understood in the same way as natural language. By splitting the Feed Forward Network of each block of a pretrained LLM into a separate text expert and an item expert with token-type gating, our method avoids destructive interference between text and catalog modalities. IDIOMoE demonstrates strong recommendation performance across both public and proprietary datasets, while preserving the text understanding of the pretrained model.
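Token-type gating, as described in the abstract, routes each position to one of two feed-forward experts according to whether the token is text or an item ID. The sketch below shows only that routing idea in plain Python. The two "experts" are hypothetical toy functions standing in for learned feed-forward networks, and nothing here reflects the authors' actual implementation.

```python
def text_expert(h):
    """Toy stand-in for the text feed-forward expert (assumption)."""
    return [2.0 * x for x in h]


def item_expert(h):
    """Toy stand-in for the item-ID feed-forward expert (assumption)."""
    return [x + 1.0 for x in h]


def gated_ffn(hidden_states, token_types):
    """Send each position to exactly one expert, chosen by its token type."""
    outputs = []
    for h, token_type in zip(hidden_states, token_types):
        expert = item_expert if token_type == "item" else text_expert
        outputs.append(expert(h))
    return outputs


hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
types = ["text", "item", "text"]
routed = gated_ffn(hidden, types)
```

Because the gate is a hard switch on token type rather than a learned softmax, item tokens never update or pass through the text expert, which is one way to read the abstract's claim about avoiding destructive interference.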

[8] Improving Metacognition and Uncertainty Communication in Language Models

Mark Steyvers,Catarina Belem,Padhraic Smyth

Main category: cs.CL

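TL;DR: Supervised finetuning can improve the ability of large language models to express uncertainty across tasks and domains, but different metacognitive skills do not transfer naturally to one another and need multitask training to improve together.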

Details

Motivation: Large language models are widely used in decision-making settings, but because they fail to signal low confidence accurately, users may rely on erroneous outputs. The explicit confidence that current models verbalize is usually poorly calibrated and does not separate correct from incorrect answers well, so improving how models express uncertainty is important.
Method: Two types of LLMs are finetuned with supervision on datasets covering general knowledge, mathematics, and open-ended trivia, and two metacognitive tasks are evaluated: single-question confidence estimation and pairwise confidence comparison. Generalization to unseen domains such as medical and legal reasoning is also tested.
Result: Finetuning significantly improves calibration and discrimination (higher confidence for correct answers) and holds up across domains, while accuracy is unchanged. The gains from single-task training do not transfer between tasks: training on single-question calibration does not improve pairwise comparison, and vice versa. Multitask finetuning yields broader improvements, with lower calibration error and stronger discrimination in out-of-domain evaluations.
Conclusion: The ability of LLMs to express uncertainty is trainable and generalizable, but different metacognitive skills do not naturally reinforce one another and must be developed together through multitask training.
Abstract: Large language models (LLMs) are increasingly used in decision-making contexts, but when they present answers without signaling low confidence, users may unknowingly act on erroneous outputs. While prior work shows that LLMs maintain internal uncertainty signals, their explicit verbalized confidence is typically miscalibrated and poorly discriminates between correct and incorrect answers. Across two types of LLMs, we investigate whether supervised finetuning can improve models' ability to communicate uncertainty and whether such improvements generalize across tasks and domains. We finetune the LLMs on datasets spanning general knowledge, mathematics, and open-ended trivia, and evaluate two metacognitive tasks: (1) single-question confidence estimation, where the model assigns a numeric certainty to its answer, and (2) pairwise confidence comparison, where the model selects which of two answers it is more likely to have correct. We assess generalization to unseen domains, including medical and legal reasoning. Results show that finetuning improves calibration (alignment between stated confidence and accuracy) and discrimination (higher confidence for correct vs. incorrect responses) within and across domains, while leaving accuracy unchanged. However, improvements are task-specific: training on single-question calibration does not transfer to pairwise comparison, and vice versa. In contrast, multitask finetuning on both forms of metacognition yields broader gains, producing lower calibration error and stronger discrimination in out-of-domain evaluations. These results show that while uncertainty communication in LLMs is trainable and generalizable, different metacognitive skills do not naturally reinforce one another and must be developed together through multitask training.
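The abstract defines calibration and discrimination in words but does not say which estimators the authors use. A minimal sketch, assuming two common choices: expected calibration error (ECE) with equal-width bins, and an AUC-style pairwise statistic for discrimination. The confidence values below are made up for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin answers by stated confidence, then average |confidence - accuracy|
    over bins, weighted by bin size."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece


def discrimination_auc(confidences, correct):
    """Probability that a correct answer gets higher confidence than an
    incorrect one (ties count as half)."""
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical stated confidences and correctness labels (1 = correct).
conf = [0.9, 0.8, 0.6, 0.3]
hit = [1, 1, 0, 0]
ece = expected_calibration_error(conf, hit)
auc = discrimination_auc(conf, hit)
```

In this toy case discrimination is perfect (every correct answer has higher confidence than every incorrect one) while calibration is not, which illustrates why the two properties are reported separately.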

[9] Advancing Automated Spatio-Semantic Analysis in Picture Description Using Language Models

Si-Ioi Ng,Pranav S. Ambadi,Kimberly D. Mueller,Julie Liss,Visar Berisha

Main category: cs.CL

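TL;DR: Proposes a BERT-based pipeline that automatically extracts and orders content information units (CIUs) from Cookie Theft picture descriptions, effectively characterizing the visual narrative path for assessing cognitive impairment.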

Details

Motivation: Existing automated methods for assessing cognitive-linguistic impairment often ignore the visual narrative path a speaker follows when describing a picture, and current analyses based on manual tagging or dictionary mapping are labor-intensive.
Method: A BERT model fine-tuned with binary cross-entropy and pairwise ranking loss is used to build an automated pipeline for CIU extraction and ordering, evaluated by 5-fold cross-validation.
Result: CIU detection reaches 93% median precision and 96% median recall, with a 24% sequence error rate. The extracted features correlate strongly (Pearson) with ground truth, surpass the dictionary-based baseline in external validation, and perform close to manually annotated features in ANCOVA analysis.
Conclusion: The pipeline effectively characterizes visual narrative paths and can be used for automated assessment of cognitive impairment; the models and implementation are open-sourced.
Abstract: Current methods for automated assessment of cognitive-linguistic impairment via picture description often neglect the visual narrative path - the sequence and locations of elements a speaker described in the picture. Analyses of spatio-semantic features capture this path using content information units (CIUs), but manual tagging or dictionary-based mapping is labor-intensive. This study proposes a BERT-based pipeline, fine-tuned with binary cross-entropy and pairwise ranking loss, for automated CIU extraction and ordering from the Cookie Theft picture description. Evaluated by 5-fold cross-validation, it achieves 93% median precision, 96% median recall in CIU detection, and 24% sequence error rates. The proposed method extracts features that exhibit strong Pearson correlations with ground truth, surpassing the dictionary-based baseline in external validation. These features also perform comparably to those derived from manual annotations in evaluating group differences via ANCOVA. The pipeline is shown to effectively characterize visual narrative paths for cognitive impairment assessment, with the implementation and models open-sourced to the public.

[10] Automated Alignment of Math Items to Content Standards in Large-Scale Assessments Using Language Models

Qingshu Xu,Hong Jiao,Tianyi Zhou,Ming Li,Nan Zhang,Sydney Peters,Yanbin Fu

Main category: cs.CL

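TL;DR: This study compares three automated approaches for aligning test items to content standards (four domain and nineteen skill labels) and finds that BERT-family models such as DeBERTa-v3-base and RoBERTa-large perform best, ahead of classical machine learning models and ensembles.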

Details

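Motivation: Accurate alignment of test items to content standards is critical for score interpretation in large-scale assessments. Manual alignment is costly and slow, so efficient and accurate automated methods are needed.
Method: Three approaches are used: (1) training classical machine learning models on embedding features and examining the effect of dimensionality reduction; (2) fine-tuning eight BERT models and variants for domain and skill alignment; (3) exploring ensemble learning with majority voting and stacking.
Result: DeBERTa-v3-base achieves a weighted F1 score of 0.950 for domain alignment and RoBERTa-large achieves an F1 score of 0.869 for skill alignment, the best results in each case. Ensemble models do not surpass the best language models. Dimensionality reduction improves linear classifiers but they still fall short of language models.
Conclusion: Transformer-based pretrained language models perform best on automated item-to-standard alignment and are a viable route to efficient and accurate alignment; model interpretability and cross-subject applicability are directions for future work.
Abstract: Accurate alignment of items to content standards is critical for valid score interpretation in large-scale assessments. This study evaluates three automated paradigms for aligning items with four domain and nineteen skill labels. First, we extracted embeddings and trained multiple classical supervised machine learning models, and further investigated the impact of dimensionality reduction on model performance. Second, we fine-tuned eight BERT models and variants for both domain and skill alignment. Third, we explored ensemble learning with majority voting and stacking with multiple meta-models. The DeBERTa-v3-base achieved the highest weighted-average F1 score of 0.950 for domain alignment while the RoBERTa-large yielded the highest F1 score of 0.869 for skill alignment. Ensemble models did not surpass the best-performing language models. Dimension reduction enhanced linear classifiers based on embeddings but did not perform better than language models. This study demonstrated different methods in automated item alignment to content standards.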
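Two pieces of the evaluation above lend themselves to a short illustration: the weighted-average F1 score and the majority-voting ensemble. The sketch below implements both from their standard definitions. The labels and model predictions are hypothetical and are not taken from the study.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    total = len(y_true)
    score = 0.0
    for label, support in Counter(y_true).items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = support - tp
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        score += support / total * f1
    return score


def majority_vote(predictions_per_model):
    """For each item, return the label predicted by the most models."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions_per_model)]


# Hypothetical domain labels and predictions from three classifiers.
y_true = ["algebra", "algebra", "geometry", "geometry"]
model_preds = [
    ["algebra", "geometry", "geometry", "geometry"],
    ["algebra", "algebra", "geometry", "algebra"],
    ["algebra", "algebra", "geometry", "geometry"],
]
ensemble = majority_vote(model_preds)
single_f1 = weighted_f1(y_true, model_preds[0])
ensemble_f1 = weighted_f1(y_true, ensemble)
```

In this toy example the ensemble only matches its best member (the third classifier is already perfect), a small-scale picture of how voting can help weaker models without surpassing the strongest one.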

[11] Submodular Context Partitioning and Compression for In-Context Learning-short paper

Shaoyi Zheng,Canyu Zhang,Tianyi Zhou,Shengjie Wang

Main category: cs.CL

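TL;DR: Proposes Sub-CP, a block-aware context selection framework that uses submodular objectives to control diversity across blocks, improving in-context learning in large language models.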

Details

Motivation: To address the information redundancy or under-representation that partition strategies cause in existing efficient in-context learning methods, which partition the context to work around the high input complexity of transformers.
Method: The Sub-CP framework uses submodular functions to flexibly balance diversity and coherence within partitioned context blocks, supporting precomputation and fine-grained control over semantic structure.
Result: Sub-CP is validated on multiple datasets and tasks and consistently improves performance across model scales.
Conclusion: Through block-aware diversified selection, Sub-CP significantly improves in-context learning and shows good generality and scalability.
Abstract: In-context learning (ICL) enables efficient few-shot learning in large language models (LLMs) without training, but suffers from the quadratic input complexity of transformers, limiting the maximum number of exemplars. While various efficient ICL approaches partition the context into blocks to process (e.g., ensembling, compression, cross-attention), they often ignore the information redundancy or under-representation caused by different partition strategies, leading to suboptimal performance. To tackle this problem, we propose Sub-CP, a block-aware context selection framework that leverages submodular objectives to control block diversity. Sub-CP supports a flexible spectrum of selection strategies, allowing each block to range from globally diverse to locally coherent. This allows fine-grained control over semantic structure while enabling precomputation. Extensive experiments across diverse tasks on multiple datasets show that Sub-CP consistently improves performance across model scales.
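The abstract does not specify which submodular objectives Sub-CP uses. As a generic illustration of submodular exemplar selection, the sketch below greedily maximizes a facility-location objective, a standard submodular function that rewards covering every exemplar with a similar selected one. The similarity matrix is hypothetical, and this is not presented as the Sub-CP algorithm.

```python
def facility_location(selected, sim):
    """Sum over all items of the similarity to their closest selected item."""
    if not selected:
        return 0.0
    return sum(max(sim[i][j] for j in selected) for i in range(len(sim)))


def greedy_select(sim, k):
    """Greedy maximization: repeatedly add the item with the largest objective value."""
    selected = []
    for _ in range(k):
        candidates = [j for j in range(len(sim)) if j not in selected]
        best = max(candidates, key=lambda j: facility_location(selected + [j], sim))
        selected.append(best)
    return selected


# Hypothetical pairwise similarities: items 0-1 form one cluster, items 2-4 another.
SIM = [
    [1.0, 0.9, 0.1, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.8, 0.8],
    [0.0, 0.1, 0.8, 1.0, 0.6],
    [0.1, 0.0, 0.8, 0.6, 1.0],
]
block = greedy_select(SIM, 2)
```

With a budget of two exemplars, the greedy pass first picks the item that best represents the larger cluster and then an item from the other cluster, so the block covers both rather than duplicating near-identical exemplars.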

[12] Rationale-Augmented Retrieval with Constrained LLM Re-Ranking for Task Discovery

Bowen Wei

Main category: cs.CL

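TL;DR: Proposes a hybrid semantic search system combining lightweight typo-tolerant lexical retrieval, embedding-based vector similarity, and constrained LLM re-ranking, to address the difficulty of finding Tasks in Head Start programs caused by specialized terminology and the limits of lexical search.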

Details

Motivation: New or rotating staff struggle to find the right Task modules quickly and accurately on the GoEngage platform, mainly because of domain-specific jargon, system-specific nomenclature, and the sensitivity of traditional lexical search to typos and word order.
Method: A hybrid semantic search approach combines typo-tolerant lexical retrieval, vector similarity matching, and constrained LLM re-ranking. It reuses the existing Task Repository and Knowledge Base infrastructure and improves efficiency and robustness through intelligent caching, shortlist generation, and graceful degradation.
Result: A complete evaluation framework is designed, covering offline evaluation (Hit@K, Precision@K, Recall@K, MRR) and online measurement (query success rate, zero-result rate, dwell-time proxies), along with a phased implementation strategy and resource plan.
Conclusion: While keeping false-positive rates low and remaining evolvable and cost-efficient, the approach significantly improves task discovery accuracy and user experience, and suits education management systems with complex terminology and frequent staff turnover.
Abstract: Head Start programs utilizing GoEngage face significant challenges when new or rotating staff attempt to locate appropriate Tasks (modules) on the platform homepage. These difficulties arise from domain-specific jargon (e.g., IFPA, DRDP), system-specific nomenclature (e.g., Application Pool), and the inherent limitations of lexical search in handling typos and varied word ordering. We propose a pragmatic hybrid semantic search system that synergistically combines lightweight typo-tolerant lexical retrieval, embedding-based vector similarity, and constrained large language model (LLM) re-ranking. Our approach leverages the organization's existing Task Repository and Knowledge Base infrastructure while ensuring trustworthiness through low false-positive rates, evolvability to accommodate terminological changes, and economic efficiency via intelligent caching, shortlist generation, and graceful degradation mechanisms. We provide a comprehensive framework detailing required resources, a phased implementation strategy with concrete milestones, an offline evaluation protocol utilizing curated test cases (Hit@K, Precision@K, Recall@K, MRR), and an online measurement methodology incorporating query success metrics, zero-result rates, and dwell-time proxies.
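The offline metrics named in the abstract (Hit@K, Precision@K, Recall@K, MRR) have standard definitions, sketched below. The task identifiers and test cases are hypothetical and only illustrate how a curated test set might be scored.

```python
def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(r in relevant for r in ranked[:k]) else 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(r in relevant for r in ranked[:k]) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    return sum(r in relevant for r in ranked[:k]) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for position, r in enumerate(ranked, start=1):
        if r in relevant:
            return 1.0 / position
    return 0.0


# Hypothetical curated test cases: (ranked task ids returned, relevant task ids).
cases = [
    (["enrollment", "attendance", "health"], {"attendance"}),
    (["health", "family", "enrollment"], {"health", "enrollment"}),
]
mrr = sum(reciprocal_rank(ranked, rel) for ranked, rel in cases) / len(cases)
```

MRR averages the reciprocal rank over queries, so it rewards putting the first correct Task near the top, which matches the use case of staff scanning a short result list.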

[13] Training Large Language Models To Reason In Parallel With Global Forking Tokens

Sheng Jia,Xiao Wang,Shiva Prasad Kasiviswanathan

Main category: cs.CL

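TL;DR: This paper proposes Set Supervised Fine-Tuning (SSFT), which introduces a set-based global loss and bipartite matching to improve the accuracy of large models on challenging problems while preserving reasoning diversity.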

Details

Motivation: Existing parallel reasoning methods tend to sacrifice accuracy when encouraging diversity, and on challenging problems in particular they struggle to trigger reasoning paths that are both diverse and correct. A method that secures both diversity and accuracy is therefore needed.
Method: Parallel reasoning is treated as a set-of-next-token-prediction problem. Set Supervised Fine-Tuning (SSFT) adds a set-based global loss to supervised fine-tuning and uses self-supervised bipartite matching to align global forking tokens with unique reasoning traces.
Result: Experiments show that SSFT outperforms standard SFT on multiple reasoning benchmarks under both Pass@1 and Cons@k, preserves multiple unique reasoning modes, and produces global forking tokens.
Conclusion: Through a structured loss design, SSFT addresses the trade-off between diversity and accuracy in reasoning and improves the reasoning performance of large language models on complex tasks.
Abstract: Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes are typically deep in the sampling tree. Consequently, common strategies to encourage diversity, such as temperature scaling, encounter a worsened trade-off between diversity and accuracy. Motivated by this challenge, we treat parallel reasoning as a set-of-next-token-prediction problem, and incorporate a set-based global loss into Supervised Fine-Tuning (SFT) using self-supervised bipartite matching between our global forking tokens and unique reasoning traces. We observe that, while naive fine-tuning with multiple reasoning traces collapses these unique reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT), preserves these modes and produces emergent global forking tokens. Experiments on multiple reasoning benchmarks show that our SSFT consistently outperforms SFT under both Pass@1 and Cons@k metrics.
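The bipartite matching step can be pictured as an assignment problem: each global forking token is paired with one reasoning trace so that the total loss over the set is minimal. The sketch below solves a tiny instance by brute force over permutations. The cost matrix is hypothetical, and brute force stands in for whatever matching algorithm the authors use (the abstract does not say).

```python
from itertools import permutations


def best_matching(cost):
    """Minimum-cost one-to-one assignment of rows (forking tokens) to
    columns (reasoning traces), found by exhaustive search."""
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(best), sum(cost[i][best[i]] for i in range(n))


# COST[i][j]: hypothetical loss of reasoning trace j when conditioned on forking token i.
COST = [
    [1.5, 0.2, 1.1],
    [0.4, 1.3, 0.9],
    [1.0, 1.2, 0.3],
]
assignment, set_loss = best_matching(COST)
```

Only the matched pairs contribute to the set-level loss, so each forking token is pushed toward a distinct trace instead of all tokens being trained on every trace, which is the collapse the abstract attributes to naive fine-tuning. Exhaustive search is factorial in the number of traces and is only reasonable for very small sets.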

[14] Characterizing Model Behavior Under Synthetic Data Training: An Empirical Study Across Scales and Mixing Ratios

Y. Du,G. Wu,G. Tang,W. Wang,Q. Fan

Main category: cs.CL

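TL;DR: This paper studies how the proportion of synthetic data affects performance, calibration, and output characteristics across language model scales. Up to 20% synthetic data keeps performance stable, degradation accelerates beyond 30%, and larger models are more robust to synthetic data.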

Details

Motivation: Although synthetic data is widely used in modern NLP training, how its proportion affects the behavior of models at different scales is not yet systematically understood.
Method: Controlled experiments use the Pythia model suite (410M-12B parameters) on five different tasks, evaluating models after one to three training iterations with synthetic data proportions of 0-50%.
Result: Performance is stable when synthetic data is at most 20% and degrades faster beyond 30%. Larger models (6.9B-12B) are more robust than smaller ones. Calibration degradation precedes accuracy loss, and reasoning tasks degrade faster than retrieval tasks. The high external data ratios used by current best practices (such as STaR and Self-Instruct) fall within the safe range.
Conclusion: The paper offers recommendations on synthetic data use based on model scale and task requirements, giving practical guidance for synthetic data budgets.
Abstract: Synthetic data generated by large language models has become integral to modern NLP training pipelines, from bootstrapping reasoning capabilities to augmenting instruction-following datasets. While recent work demonstrates successful applications maintaining high external data ratios, systematic understanding of how synthetic data proportion affects model behavior across different scales remains limited. This paper presents a controlled empirical study examining model performance, calibration, and output characteristics when trained on varying synthetic-to-external data ratios. Using the Pythia model suite (410M-12B parameters) across five diverse tasks, we evaluate models after one to three training iterations with synthetic data proportions ranging from 0-50%. Our key findings include: models maintain stable performance with up to 20% synthetic data, but degradation accelerates beyond 30%; larger models (6.9B-12B) show greater robustness to synthetic data than smaller models (410M-1.4B); calibration degradation precedes accuracy loss, providing an early warning signal; and task characteristics matter, with reasoning tasks degrading faster than retrieval tasks under synthetic data training. Importantly, we find that current best practices, such as those employed in STaR and Self-Instruct systems that maintain greater than 80% external data, operate well within safe regimes identified by our experiments. We provide practical guidance for practitioners on synthetic data budgets based on model scale and task requirements, alongside detailed comparison with concurrent work including Shumailov et al.'s model collapse findings.

[15] Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment

Vanya Bannihatti Kumar,Divyanshu Goyal,Akhil Eppa,Neel Bhandari

Main category: cs.CL

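TL;DR: Proposes a curiosity-driven LLM-as-a-judge method for personalized evaluation of creative writing, which outperforms standard supervised finetuning on the TTCW benchmark.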

Details

Motivation: Current large models perform poorly on subjective creativity evaluation, and annotators disagree with one another, so a more personalized evaluation method is needed.
Method: A curiosity-driven LLM-as-a-judge framework lets the model learn individual creative judgments, trained and evaluated on the TTCW benchmark.
Result: Across model sizes, the method outperforms the baseline SFT approach on metrics including Pearson correlation, Cohen's kappa, and F1, and is especially suited to subjective evaluation where annotators disagree.
Conclusion: The proposed method captures individualized creative judgments and improves LLM performance on subjective creativity evaluation.
Abstract: Modern large language models (LLMs) excel at objective tasks such as evaluating mathematical reasoning and factual accuracy, yet they falter when faced with the nuanced, subjective nature of assessing creativity. In this work, we propose a novel curiosity-driven LLM-as-a-judge for evaluating creative writing which is personalized to each individual's creative judgments. We use the Torrance Test of Creative Thinking (TTCW) benchmark introduced in Chakrabarty et al. (2024), which has stories annotated by expert humans across various subjective dimensions like Originality, to test our hypothesis. We show that our method enables models across various sizes to learn the nuanced creative judgments of different individuals, by showing improvements over the baseline supervised finetuning (SFT) method across various evaluation metrics like Pearson correlation, Cohen's kappa and F1 values. Our method is especially useful in subjective evaluations where not all the annotators agree with each other.

[16] Linguistic Characteristics of AI-Generated Text: A Survey

Luka Terčon,Kaja Dobrovoljc

Main category: cs.CL

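TL;DR: This survey reviews current research on the linguistic characteristics of AI-generated text, systematically organizes existing findings, and notes that studies concentrate on English and GPT models, lack cross-linguistic and cross-model comparison, and leave prompt sensitivity in need of closer study.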

Details

Motivation: As large language models spread across fields, there is a pressing need to systematically map the linguistic features of AI-generated text in order to understand its impact on linguistics and related disciplines.
Method: Existing studies are categorized along several dimensions, including levels of linguistic description, models covered, genres, languages, and approach to prompting, and the same scheme is used to summarize findings and trends.
Result: AI-generated text tends toward a more formal and impersonal style, with more nouns, determiners, and adpositions and fewer adjectives and adverbs. It shows lower lexical diversity and more repetition. Research concentrates on English and the GPT model family, and prompt sensitivity is often overlooked.
Conclusion: More cross-linguistic and cross-model comparison is needed, and future work should systematically control for prompt variation to deepen understanding of the linguistic features of AI-generated text.
Abstract: Large language models (LLMs) are solidifying their position in the modern world as effective tools for the automatic generation of text. Their use is quickly becoming commonplace in fields such as education, healthcare, and scientific research. There is a growing need to study the linguistic features present in AI-generated text, as the increasing presence of such texts has profound implications in various disciplines such as corpus linguistics, computational linguistics, and natural language processing. Many observations have already been made; however, a broader synthesis of the findings made so far is required to provide a better understanding of the topic. The present survey paper aims to provide such a synthesis of extant research. We categorize the existing works along several dimensions, including the levels of linguistic description, the models included, the genres analyzed, the languages analyzed, and the approach to prompting. Additionally, the same scheme is used to present the findings made so far and expose the current trends followed by researchers. Among the most-often reported findings is the observation that AI-generated text is more likely to contain a more formal and impersonal style, signaled by the increased presence of nouns, determiners, and adpositions and the lower reliance on adjectives and adverbs. AI-generated text is also more likely to feature a lower lexical diversity, a smaller vocabulary size, and repetitive text. Current research, however, remains heavily concentrated on English data and mostly on text generated by the GPT model family, highlighting the need for broader cross-linguistic and cross-model investigation. In most cases authors also fail to address the issue of prompt sensitivity, leaving much room for future studies that employ multiple prompt wordings in the text generation phase.
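Lexical diversity, one of the features the survey reports, is often approximated by the type-token ratio (unique words divided by total words). The sketch below computes it with a simple regular-expression tokenizer. The survey does not prescribe this particular measure, and both example sentences are invented for illustration.

```python
import re


def type_token_ratio(text):
    """Unique word forms divided by total word count (0.0 for empty text)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Invented examples: one varied sentence, one repetitive sentence.
varied = "The quick brown fox jumps over a lazy dog"
repetitive = "the model is good and the model is fast and the model is good"

ttr_varied = type_token_ratio(varied)
ttr_repetitive = type_token_ratio(repetitive)
```

The raw ratio falls as texts get longer, so comparisons between human and AI-generated text are only meaningful at matched lengths or with a length-corrected variant.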

[17] Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics

Maojia Song,Renhang Liu,Xinyu Wang,Yong Jiang,Pengjun Xie,Fei Huang,Soujanya Poria,Jingren Zhou

Main category: cs.CL

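TL;DR: Presents the WebDetective benchmark and evaluation framework, addressing reasoning-path leakage and coarse-grained evaluation of RAG systems and web agents on multi-hop search tasks. It reveals systematic weaknesses in knowledge utilisation and appropriate refusal, and improves on them with the EvidenceLoop workflow.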

Details

Motivation: Evaluation of RAG systems and web agents on multi-hop deep search tasks is limited by reasoning-path leakage and by reliance on a single pass rate, which fails to reflect a model's ability to discover reasoning chains on its own. A more rigorous benchmark and finer-grained evaluation are needed.
Method: WebDetective, a benchmark of hint-free multi-hop questions, is built together with a controlled Wikipedia sandbox. The evaluation framework separates search sufficiency, knowledge utilisation, and refusal behaviour, and 25 state-of-the-art models are tested. The EvidenceLoop agentic workflow is proposed to target the identified weaknesses.
Result: Current models broadly underuse knowledge and lack appropriate refusal: they struggle to use evidence even when it is sufficient and almost never refuse when it is lacking. EvidenceLoop improves both search and synthesis.
Conclusion: Current systems are good at executing given reasoning paths but struggle to discover them. WebDetective and its diagnostic framework can identify weaknesses and guide architectural improvements, supporting the development of genuinely autonomous reasoning systems.
Abstract: RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviours into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilisation despite having sufficient evidence and demonstrate near-absent appropriate refusal when evidence is lacking. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.

[18] LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review Generation

Gregory Hok Tjoan Go,Khang Ly,Anders Søgaard,Amin Tabatabaei,Maarten de Rijke,Xinyi Chen

Main category: cs.CL

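TL;DR: This paper proposes LiRA, a multi-agent collaborative framework for automatically generating scientific literature reviews, which outperforms existing methods in writing quality and citation accuracy while staying similar to human-written text.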

Details

Motivation: The rapid growth of scientific publications makes it hard to keep literature reviews comprehensive and up to date. Existing automation focuses mostly on retrieval and screening, while the writing phase, especially readability and factual accuracy, remains under-explored.
Method: LiRA, a multi-agent system emulating the human literature review process, is designed with specialized agents for content outlining, subsection writing, editing, and reviewing, which collaborate to produce cohesive and comprehensive review articles.
Result: On the SciReviewGen and ScienceDirect datasets, LiRA outperforms baselines such as AutoSurvey and MASS-Survey in writing and citation quality while remaining highly similar to human-written reviews, and shows good robustness to different reviewer models.
Conclusion: The study shows that agentic LLM workflows, without domain-specific tuning, hold promise for improving the reliability and usability of automated scientific writing.
Abstract: The rapid growth of scientific publications has made it increasingly difficult to keep literature reviews comprehensive and up-to-date. Though prior work has focused on automating retrieval and screening, the writing phase of systematic reviews remains largely under-explored, especially with regard to readability and factual accuracy. To address this, we present LiRA (Literature Review Agents), a multi-agent collaborative workflow which emulates the human literature review process. LiRA utilizes specialized agents for content outlining, subsection writing, editing, and reviewing, producing cohesive and comprehensive review articles. Evaluated on SciReviewGen and a proprietary ScienceDirect dataset, LiRA outperforms current baselines such as AutoSurvey and MASS-Survey in writing and citation quality, while maintaining competitive similarity to human-written reviews. We further evaluate LiRA in real-world scenarios using document retrieval and assess its robustness to reviewer model variation. Our findings highlight the potential of agentic LLM workflows, even without domain-specific tuning, to improve the reliability and usability of automated scientific writing.

[19] NLD-LLM: A systematic framework for evaluating small language transformer models on natural language description

Hamed Jelodar,Mohammad Meymani,Parisa Hamedi,Tochukwu Emmanuel Nwankwo,Samita Bai,Roozbeh Razavi-Far,Ali A. Ghorbani

Main category: cs.CL

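TL;DR: This paper proposes NLD-LLM, a framework for evaluating how well language models generate source code descriptions, and highlights that prompt engineering substantially improves the performance of small models.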

Details

Motivation: To systematically evaluate how well language models generate accurate and concise code descriptions in the natural language description task. Method: An evaluation framework covering a range of transformer models, with standardized prompt design and iterative refinement, analyzed using semantic and structural metrics. Result: Experiments show that good prompt engineering substantially improves model performance, letting small models compete with larger ones. Conclusion: Prompt engineering is critical for natural language description tasks; well-designed prompts can narrow the performance gap between models of different sizes. Abstract: Natural Language Description (NLD) is a Natural Language Processing (NLP) task that requires models to generate structured and meaningful outputs from natural language inputs. In this work, we propose NLD-LLM, a systematic NLP framework to evaluate the ability of language models to generate accurate and concise source code descriptions. This framework incorporates a diverse set of transformer models, including Qwen, DeepSeek, Phi, LLaMA, and Mistral, spanning various sizes, architectures, and training approaches. Central to NLD-LLM is a comprehensive prompt design strategy that includes standardized formatting, clear task guidance, and NLD prompting, ensuring fair and consistent evaluation. Additionally, we apply an iterative refinement process to improve output quality and assess the model's adaptability. Using semantic and structural metrics, our analysis demonstrates that prompt engineering significantly impacts the effectiveness of the model, with smaller models often performing competitively when supported by well-crafted prompts.

[20] To model human linguistic prediction, make LLMs less superhuman

Byung-Doh Oh,Tal Linzen

Main category: cs.CL

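TL;DR: Large language models (LLMs) predict the next word better than humans do, but they have become worse at modeling human reading behavior because they have "superhuman" long-term and short-term memory. The paper argues for building models with human-like memory and outlines directions for improvement and the human experimental data needed.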

Details

Motivation: Although LLMs excel at language prediction, their ability to predict human reading behavior is declining, which calls into question their validity as models of human cognition. The paper asks why stronger language models are worse at simulating human language processing and how to make them more cognitively realistic. Method: The authors analyze the differences between current LLMs and humans in linguistic prediction, identify two key drivers of 'superhuman' performance, namely stronger long-term memory (for facts and training examples) and stronger short-term memory (for context), and propose possible routes to models with human-like memory limits. Result: LLMs' 'superhuman' predictive ability stems from memory capacity far beyond that of humans, which leads them to underestimate the actual processing difficulty of human reading; existing LLMs are therefore ill-suited as precise cognitive models of human language. Conclusion: For LLMs to better model human language comprehension, the limitations of human memory need to be built in; in addition, currently available human behavioral data is insufficient to measure progress, so new psycholinguistic experiments are needed to fill the gap. Abstract: When people listen to or read a sentence, they actively make predictions about upcoming words: words that are less predictable are generally read more slowly than predictable ones. The success of large language models (LLMs), which, like humans, make predictions about upcoming words, has motivated exploring the use of these models as cognitive models of human linguistic prediction. Surprisingly, in the last few years, as language models have become better at predicting the next word, their ability to predict human reading behavior has declined. This is because LLMs are able to predict upcoming words much better than people can, leading them to predict lower processing difficulty in reading than observed in human experiments; in other words, mainstream LLMs are 'superhuman' as models of language comprehension. In this position paper, we argue that LLMs' superhumanness is primarily driven by two factors: compared to humans, LLMs have much stronger long-term memory for facts and training examples, and they have much better short-term memory for previous words in the text. We advocate for creating models that have human-like long-term and short-term memory, and outline some possible directions for achieving this goal. Finally, we argue that currently available human data is insufficient to measure progress towards this goal, and outline human experiments that can address this gap.

[21] Reliable End-to-End Material Information Extraction from the Literature with Source-Tracked Multi-Stage Large Language Models

Xin Wang,Anshu Raj,Matthew Luebbe,Haiming Wen,Shuozhi Xu,Kun Lu

Main category: cs.CL

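TL;DR: Proposes an LLM-based multi-stage information extraction pipeline that extracts 47 features spanning composition, processing, microstructure, and properties from the experimental materials literature, substantially improving extraction accuracy and reliability.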

Details

Motivation: Existing information extraction methods are usually limited to a small number of features and do not cover the integrated composition-processing-microstructure-property relationships, which makes it hard to build comprehensive materials databases. Method: A multi-stage extraction pipeline that combines iterative extraction with source tracking and uses large language models to extract 47 key material features from unstructured literature, evaluated at both the feature level and the tuple level. Result: F1 scores reach about 0.96 at both the feature and tuple levels; compared with single-pass extraction, F1 for the microstructure category improves markedly (+10.0% at the feature level, +13.7% at the tuple level), missed materials fall from 49 to 13 (miss rate from 12.4% to 3.3%), and there are zero false positives. Conclusion: The method enables high-precision, low-omission, scalable literature mining; the resulting datasets are suitable for machine learning and materials informatics, and the modular design generalizes to diverse material classes. Abstract: Data-driven materials discovery requires large-scale experimental datasets, yet most of the information remains trapped in unstructured literature. Existing extraction efforts often focus on a limited set of features and have not addressed the integrated composition-processing-microstructure-property relationships essential for understanding materials behavior, thereby posing challenges for building comprehensive databases. To address this gap, we propose a multi-stage information extraction pipeline powered by large language models, which captures 47 features spanning composition, processing, microstructure, and properties exclusively from experimentally reported materials. The pipeline integrates iterative extraction with source tracking to enhance both accuracy and reliability. Evaluations at the feature level (independent attributes) and tuple level (interdependent features) yielded F1 scores around 0.96. Compared with single-pass extraction without source tracking, our approach improved F1 scores of the microstructure category by 10.0% (feature level) and 13.7% (tuple level), and reduced missed materials from 49 to 13 out of 396 materials in 100 articles on precipitate-containing multi-principal element alloys (miss rate reduced from 12.4% to 3.3%). The pipeline enables scalable and efficient literature mining, producing databases with high precision, minimal omissions, and zero false positives. These datasets provide trustworthy inputs for machine learning and materials informatics, while the modular design generalizes to diverse material classes, enabling comprehensive materials information extraction.
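
As a quick check of the arithmetic quoted in the abstract, the miss rates follow directly from the reported counts (49 and 13 missed out of 396 materials). The helper name `miss_rate` is only illustrative.

```python
# Counts reported in the abstract: 396 materials across 100 articles.
total_materials = 396
missed_single_pass = 49   # single-pass extraction without source tracking
missed_multi_stage = 13   # multi-stage pipeline with source tracking

def miss_rate(missed, total):
    """Return the share of missed materials as a percentage."""
    return 100.0 * missed / total

before = miss_rate(missed_single_pass, total_materials)
after = miss_rate(missed_multi_stage, total_materials)
print(f"single-pass: {before:.1f}%  multi-stage: {after:.1f}%")
```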

[22] SynCED-EnDe 2025: A Synthetic and Curated English - German Dataset for Critical Error Detection in Machine Translation

Muskaan Chopra,Lorenz Sparrenberg,Rafet Sifa

Main category: cs.CL

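TL;DR: This paper presents SynCED-EnDe, a new dataset for critical error detection in machine translation that, compared with WMT21, is larger, has more balanced labels, covers more domains, and draws on more recent content.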

Details

Motivation: The existing WMT21 English-German critical error detection dataset is limited in scale, label balance, domain coverage, and temporal freshness, which makes it hard to support reliable error detection research. Method: The authors build an English-German sentence-pair dataset with 1,000 gold-labeled and 8,000 silver-labeled pairs drawn from diverse 2024-2025 sources, and introduce explicit error subclasses, structured trigger flags, and fine-grained auxiliary judgments (such as obviousness and severity). Result: Experiments show that XLM-R and related encoders achieve substantial performance gains on SynCED-EnDe compared with WMT21, attributed to balanced labels and refined annotations; the dataset is publicly released with documentation and baseline scripts. Conclusion: SynCED-EnDe is intended as a community resource for advancing the safe deployment of machine translation in information retrieval, conversational assistants, and emerging settings such as wearable AI devices. Abstract: Critical Error Detection (CED) in machine translation aims to determine whether a translation is safe to use or contains unacceptable deviations in meaning. While the WMT21 English-German CED dataset provided the first benchmark, it is limited in scale, label balance, domain coverage, and temporal freshness. We present SynCED-EnDe, a new resource consisting of 1,000 gold-labeled and 8,000 silver-labeled sentence pairs, balanced 50/50 between error and non-error cases. SynCED-EnDe draws from diverse 2024-2025 sources (StackExchange, GOV.UK) and introduces explicit error subclasses, structured trigger flags, and fine-grained auxiliary judgments (obviousness, severity, localization complexity, contextual dependency, adequacy deviation). These enrichments enable systematic analyses of error risk and intricacy beyond binary detection. The dataset is permanently hosted on GitHub and Hugging Face, accompanied by documentation, annotation guidelines, and baseline scripts. Benchmark experiments with XLM-R and related encoders show substantial performance gains over WMT21 due to balanced labels and refined annotations. We envision SynCED-EnDe as a community resource to advance safe deployment of MT in information retrieval and conversational assistants, particularly in emerging contexts such as wearable AI devices.

[23] Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs

Qi Li,Runpeng Yu,Haiquan Lu,Xinchao Wang

Main category: cs.CL

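TL;DR: This paper proposes a new model attribution method for discrete diffusion large language models (dLLMs), which identifies the source model using a Directed Decoding Map (DDM) and Gaussian-Trajectory Attribution (GTA).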

Details

Motivation: Existing model attribution methods work poorly for dLLMs, mainly because bidirectional decoding makes model confidence redundant and obscures structural information in the decoding process. A general attribution method is needed that works across different models, checkpoints, and backups. Method: The authors propose the Directed Decoding Map (DDM) to extract structural relationships between decoding steps, together with Gaussian-Trajectory Attribution (GTA), which fits a cell-wise Gaussian distribution at each decoding position and uses the log-likelihood of a trajectory as the attribution score. Result: Experiments show that the method clearly outperforms confidence-based approaches across a range of settings and more accurately distinguishes different models as well as different checkpoints of the same model. Conclusion: Combining DDM and GTA effectively exploits structural features of dLLM decoding and provides an efficient, robust framework for model attribution. Abstract: Discrete Diffusion Large Language Models (dLLMs) have recently emerged as a competitive paradigm for non-autoregressive language modeling. Their distinctive decoding mechanism enables faster inference speed and strong performance in code generation and mathematical tasks. In this work, we show that the decoding mechanism of dLLMs not only enhances model utility but also can be used as a powerful tool for model attribution. A key challenge in this problem lies in the diversity of attribution scenarios, including distinguishing between different models as well as between different checkpoints or backups of the same model. To ensure broad applicability, we identify two fundamental problems: what information to extract from the decoding trajectory, and how to utilize it effectively. We first observe that relying directly on per-step model confidence yields poor performance. This is mainly due to the bidirectional decoding nature of dLLMs: each newly decoded token influences the confidence of other decoded tokens, making model confidence highly redundant and washing out structural signal regarding decoding order or dependencies. To overcome this, we propose a novel information extraction scheme called the Directed Decoding Map (DDM), which captures structural relationships between decoding steps and better reveals model-specific behaviors. 
Furthermore, to make full use of the extracted structural information during attribution, we propose Gaussian-Trajectory Attribution (GTA), where we fit a cell-wise Gaussian distribution at each decoding position for each target model, and define the likelihood of a trajectory as the attribution score: if a trajectory exhibits higher log-likelihood under the distribution of a specific model, it is more likely to have been generated by that model. Extensive experiments under different settings validate the utility of our methods.
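
The Gaussian scoring step described above can be illustrated with a minimal sketch. It assumes each trajectory has already been reduced to a fixed-length list of per-position numeric features; the DDM extraction itself is not reproduced here. The function names (`fit_cellwise_gaussians`, `log_likelihood`, `attribute`) and the toy numbers are hypothetical and are not taken from the paper.

```python
import math

def fit_cellwise_gaussians(trajectories, eps=1e-6):
    """Fit an independent Gaussian (mean, variance) at every position."""
    n = len(trajectories)
    length = len(trajectories[0])
    params = []
    for j in range(length):
        column = [t[j] for t in trajectories]
        mean = sum(column) / n
        # eps keeps the variance strictly positive for degenerate columns
        var = sum((x - mean) ** 2 for x in column) / n + eps
        params.append((mean, var))
    return params

def log_likelihood(trajectory, params):
    """Sum of per-position Gaussian log-densities (positions treated as independent)."""
    total = 0.0
    for x, (mean, var) in zip(trajectory, params):
        total += -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
    return total

def attribute(trajectory, models):
    """Return the candidate model under which the trajectory is most likely."""
    return max(models, key=lambda name: log_likelihood(trajectory, models[name]))

# Toy reference trajectories for two hypothetical candidate models.
models = {
    "model_a": fit_cellwise_gaussians([[0.1, 0.9, 0.2], [0.2, 0.8, 0.1], [0.15, 0.85, 0.25]]),
    "model_b": fit_cellwise_gaussians([[0.9, 0.1, 0.7], [0.8, 0.2, 0.8], [0.85, 0.15, 0.75]]),
}
print(attribute([0.12, 0.88, 0.2], models))
```

Treating positions as independent keeps the score a simple sum of log-densities, which is one plausible reading of "cell-wise"; the paper may define the cells differently.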

[24] Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

Donghang Wu,Haoyang Zhang,Chen Chen,Tianyu Zhang,Fei Tian,Xuerui Yang,Gang Yu,Hexin Liu,Nana Hou,Yuchen Hu,Eng Siong Chng

Main category: cs.CL

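TL;DR: This paper proposes Chronological Thinking, a mechanism for improving response quality in full-duplex spoken dialogue language models. It reasons incrementally while listening to the user's speech, adds no extra latency, and is strictly causal.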

Details

Motivation: In existing full-duplex dialogue systems the model keeps predicting silence tokens during the listening phase and is effectively idle, unlike humans, who keep thinking lightly throughout a conversation. A real-time thinking mechanism closer to human communication is therefore needed. Method: Chronological Thinking uses strictly causal incremental reasoning: the model continuously updates its internal hypotheses as streaming audio arrives, amortizes reasoning over the listening window, and begins responding as soon as the user stops speaking, adding no extra latency. Result: Experiments show consistent improvements in response quality on both objective metrics and human evaluation; the method handles conversational dynamics robustly and performs competitively on full-duplex interaction metrics. Conclusion: Chronological Thinking offers full-duplex spoken dialogue systems an efficient real-time thinking paradigm in line with human conversational habits, improving response quality and the naturalness of interaction. Abstract: Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously perceive user speech streams while generating responses. This simultaneous listening and speaking design enables real-time interaction and lets the agent handle dynamic conversational behaviors like user barge-in. However, during the listening phase, existing systems keep the agent idle by repeatedly predicting the silence token, which departs from human behavior: we usually engage in lightweight thinking during conversation rather than remaining absent-minded. Inspired by this, we propose Chronological Thinking, an on-the-fly conversational thinking mechanism that aims to improve response quality in full-duplex SDLMs. Specifically, chronological thinking presents a paradigm shift from conventional LLM thinking approaches, such as Chain-of-Thought, and is purpose-built for streaming acoustic input. (1) Strictly causal: the agent reasons incrementally while listening, updating internal hypotheses only from past audio with no lookahead. (2) No additional latency: reasoning is amortized during the listening window; once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments demonstrate the effectiveness of chronological thinking: both objective metrics and human evaluations show consistent improvements in response quality. Furthermore, chronological thinking robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

[25] Exploring Large Language Models for Financial Applications: Techniques, Performance, and Challenges with FinMA

Prudence Djagba,Abdelkader Y. Saley

Main category: cs.CL

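TL;DR: This study examines the strengths and weaknesses of a domain-adapted large language model (FinMA) in financial natural language processing. It performs well on sentiment analysis and classification but struggles with numerical reasoning, entity recognition, and summarization.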

Details

Motivation: Financial applications place very high demands on accuracy, reliability, and domain adaptation, so the real-world performance of domain-adapted LLMs on specialized financial tasks needs close evaluation. Method: The study examines FinMA, a model built within the PIXIU framework, including its instruction tuning on the Financial Instruction Tuning (FIT) dataset and its evaluation under the FLARE benchmark. Result: FinMA performs well on sentiment analysis and classification but poorly on numerical reasoning, named entity recognition, and text summarization. Conclusion: Financial LLMs need targeted design and evaluation, with particular attention to complex reasoning and key information extraction, to better support financial decision-making. Abstract: This research explores the strengths and weaknesses of domain-adapted Large Language Models (LLMs) in the context of financial natural language processing (NLP). The analysis centers on FinMA, a model created within the PIXIU framework, which is evaluated for its performance in specialized financial tasks. Recognizing the critical demands of accuracy, reliability, and domain adaptation in financial applications, this study examines FinMA's model architecture, its instruction tuning process utilizing the Financial Instruction Tuning (FIT) dataset, and its evaluation under the FLARE benchmark. Findings indicate that FinMA performs well in sentiment analysis and classification, but faces notable challenges in tasks involving numerical reasoning, entity recognition, and summarization. This work aims to advance the understanding of how financial LLMs can be effectively designed and evaluated to assist in finance-related decision-making processes.

[26] A Single Character can Make or Break Your LLM Evals

Jingtong Su,Jianyu Zhang,Karen Ullrich,Léon Bottou,Mark Ibrahim

Main category: cs.CL

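TL;DR: This paper studies how the choice of delimiter between in-context examples affects large language model (LLM) evaluation. Different delimiters can shift performance by as much as ±23%, and stating the delimiter explicitly in the prompt improves robustness.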

Details

Motivation: To examine the often-overlooked choice of delimiter format in LLM evaluation and reveal its potentially large effect on output quality. Method: The authors test how different delimiters (comma, newline, semicolon, hashtag, and others) affect performance on MMLU and other tasks across leading model families (Llama, Qwen, Gemma), and analyze attention heads to see how delimiters steer attention toward key input tokens. Result: Delimiter choice significantly affects performance, with differences of up to ±23%; particular delimiters can change model rankings; the effect appears across tasks and models and does not improve with scale; good delimiters steer attention effectively. Conclusion: LLMs are highly sensitive to delimiter choice. Explicitly stating the delimiter in the prompt improves robustness, and delimiters should be chosen carefully and standardized in practice. Abstract: Common large language model (LLM) evaluations rely on demonstration examples to steer models' responses to the desired style. While the number of examples used has been studied and standardized, the choice of how to format examples is less investigated. In evaluation protocols and real-world usage, users face the choice of how to separate in-context examples: use a comma? new line? semi-colon? hashtag? etc.? Surprisingly, we find this seemingly minor choice can dramatically alter model response quality. Across leading model families (Llama, Qwen, Gemma), performance on MMLU for example can vary by $\pm 23\%$ depending on the choice of delimiter. In fact, one can manipulate model rankings to put any model in the lead by only modifying the single character separating examples. We find LLMs' brittleness pervades topics, model families, and doesn't improve with scale. By probing attention head scores, we find that good-performing delimiters steer attention towards key tokens in the input. Finally, we explore methods to improve LLMs' robustness to the choice of delimiter. We find specifying the selected delimiter in the prompt boosts robustness and offer practical recommendations for the best-performing delimiters to select.
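
To make the delimiter choice concrete, the sketch below builds the same few-shot prompt with different single-character separators and optionally names the separator in a header line, in the spirit of the mitigation mentioned in the abstract. The `Q:`/`A:` template, the header wording, and the function name `build_prompt` are assumptions for illustration; the paper's exact prompt format is not given here.

```python
def build_prompt(examples, query, delimiter="\n", name_delimiter=False):
    """Join in-context examples and the final query with a chosen delimiter."""
    parts = [f"Q: {q} A: {a}" for q, a in examples]
    parts.append(f"Q: {query} A:")
    body = delimiter.join(parts)
    if name_delimiter:
        # Hypothetical header that tells the model which separator is in use.
        return f"Examples are separated by {delimiter!r}.\n{body}"
    return body

examples = [("1+1", "2"), ("2+3", "5")]
for delim in [",", "\n", ";", "#"]:
    print(repr(build_prompt(examples, "4+4", delimiter=delim)))
```

Holding everything except `delimiter` fixed is what isolates the effect the paper measures; an evaluation harness would run each variant through the same model and compare accuracy.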

[27] Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs

Shenzhe Zhu,Shu Yang,Michiel A. Bakker,Alex Pentland,Jiaxin Pei

Main category: cs.CL

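TL;DR: This paper presents DeliberationBank, a large-scale human-annotated dataset for evaluating deliberation summaries, and trains DeliberationJudge to assess summary quality more accurately, revealing persistent weaknesses of existing LLMs in representativeness.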

Details

Motivation: LLMs that summarize large-scale public deliberations may overlook minority views and be affected by input order, raising fairness concerns; meanwhile, current evaluation that relies on LLMs as judges aligns poorly with human judgment, so more reliable large-scale evaluation is needed. Method: The authors build DeliberationBank, with opinion data from 3,000 participants and human ratings from 4,500 participants, covering ten deliberation questions and four evaluation dimensions (representativeness, informativeness, neutrality, policy approval), and fine-tune a DeBERTa model on it to obtain DeliberationJudge for automatic summary evaluation. Result: DeliberationJudge is more efficient than a range of LLM judges and better aligned with human judgment; an evaluation of 18 LLMs finds that they commonly underrepresent minority positions. Conclusion: The study provides a scalable and reliable framework for evaluating deliberation summaries, helping AI systems become more representative and equitable in policymaking. Abstract: Large-scale public deliberations generate thousands of free-form contributions that must be synthesized into representative and neutral summaries for policy use. While LLMs have been shown to be a promising tool to generate summaries for large-scale deliberations, they also risk underrepresenting minority perspectives and exhibiting bias with respect to the input order, raising fairness concerns in high-stakes contexts. Studying and fixing these issues requires a comprehensive evaluation at a large scale, yet current practice often relies on LLMs as judges, which show weak alignment with human judgments. To address this, we present DeliberationBank, a large-scale human-grounded dataset with (1) opinion data spanning ten deliberation questions created by 3,000 participants and (2) summary judgment data annotated by 4,500 participants across four dimensions (representativeness, informativeness, neutrality, policy approval). Using these datasets, we train DeliberationJudge, a fine-tuned DeBERTa model that can rate deliberation summaries from individual perspectives. DeliberationJudge is more efficient and more aligned with human judgements compared to a wide range of LLM judges. With DeliberationJudge, we evaluate 18 LLMs and reveal persistent weaknesses in deliberation summarization, especially underrepresentation of minority positions. Our framework provides a scalable and reliable way to evaluate deliberation summarization, helping ensure AI systems are more representative and equitable for policymaking.

[28] A novel hallucination classification framework

Maksym Zavhorodnii,Dmytro Dehtiarov,Anna Konovalenko

Main category: cs.CL

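TL;DR: Proposes a hallucination detection method based on prompt engineering and unsupervised learning, which identifies hallucinations in LLM outputs through cluster analysis in an embedding vector space.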

Details

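Motivation: LLMs are prone to hallucination (producing false information) during inference, which undermines their reliability, so such erroneous outputs need to be detected and distinguished automatically. Method: The authors construct a hallucination taxonomy and reproduce diverse hallucination types through prompt engineering; an embedding model maps the dataset into a vector space, and after dimensionality reduction, unsupervised learning is used to analyze how hallucinations and veridical responses are distributed. Result: Quantitative analysis shows that the severity of a hallucination correlates positively with its distance from the centroid of the cluster of correct outputs; even simple classifiers can distinguish hallucinations from accurate responses. Conclusion: The method offers a lightweight yet effective hallucination detection framework for a single LLM and helps improve the trustworthiness of model outputs. Abstract: This work introduces a novel methodology for the automatic detection of hallucinations generated during large language model (LLM) inference. The proposed approach is based on a systematic taxonomy and controlled reproduction of diverse hallucination types through prompt engineering. A dedicated hallucination dataset is subsequently mapped into a vector space using an embedding model and analyzed with unsupervised learning techniques in a reduced-dimensional representation of hallucinations together with veridical responses. Quantitative evaluation of inter-centroid distances reveals a consistent correlation between the severity of informational distortion in hallucinations and their spatial divergence from the cluster of correct outputs. These findings provide theoretical and empirical evidence that even simple classification algorithms can reliably distinguish hallucinations from accurate responses within a single LLM, thereby offering a lightweight yet effective framework for improving model reliability.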
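
The inter-centroid comparison can be sketched as follows, assuming responses have already been embedded and reduced to low-dimensional vectors. The 2-D toy points, the group labels, and the use of Euclidean distance are assumptions for illustration; the abstract does not state which embedding model, reduction method, or metric was used.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy reduced embeddings for three hypothetical response groups.
correct = [[0.0, 0.0], [0.2, 0.1]]
mild_hallucination = [[0.5, 0.4], [0.7, 0.6]]
severe_hallucination = [[2.0, 1.9], [2.2, 2.1]]

reference = centroid(correct)
d_mild = euclidean(centroid(mild_hallucination), reference)
d_severe = euclidean(centroid(severe_hallucination), reference)
print(f"mild: {d_mild:.3f}  severe: {d_severe:.3f}")
```

In this toy setup the more distorted group sits farther from the centroid of correct answers, which is the pattern the abstract reports; real embeddings would need the actual dataset to confirm it.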

[29] Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning

Chenghao Yang,Lin Gui,Chenxiao Yang,Victor Veitch,Lizhu Zhang,Zhuokai Zhao

Main category: cs.CL

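TL;DR: Proposes Exploratory Annealed Decoding (EAD), an exploration strategy that anneals the sampling temperature from high to low during generation, improving sample efficiency and training stability when LLMs are trained for reasoning with reinforcement learning.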

Details

Motivation: Standard fixed-temperature sampling struggles to balance sample quality against exploration, so a more effective exploration strategy is needed to improve LLM reasoning in reinforcement learning with verifiable rewards. Method: EAD uses an annealing schedule: high temperature early in the sequence to encourage exploration, then lower temperature later to exploit the current policy and preserve sample quality, with particular attention to the influence of early tokens on semantic direction. Result: EAD outperforms fixed-temperature sampling across several RLVR algorithms and model sizes, significantly improving sample efficiency while maintaining training stability and generation quality. Conclusion: Aligning the exploration strategy with the natural dynamics of sequential generation (explore early, exploit late) is an effective and robust way to improve LLM reasoning. Abstract: Reinforcement learning with verifiable rewards (RLVR) is a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs), yet its success hinges on effective exploration. An ideal exploration strategy must navigate two fundamental challenges: it must preserve sample quality while also ensuring training stability. While standard fixed-temperature sampling is simple, it struggles to balance these competing demands, as high temperatures degrade sample quality and low temperatures limit discovery. In this work, we propose a simpler and more effective strategy, Exploratory Annealed Decoding (EAD), grounded in the insight that exploration is most impactful on early tokens which define a sequence's semantic direction. EAD implements an intuitive **explore-at-the-beginning, exploit-at-the-end** strategy by annealing the sampling temperature from high to low during generation. This dynamic schedule encourages meaningful, high-level diversity at the start, then gradually lowers the temperature to preserve sample quality and keep the sampling distribution close to the target policy, which is essential for stable training. We demonstrate that EAD is a lightweight, plug-and-play method that significantly improves sample efficiency, consistently outperforming fixed-temperature sampling across various RLVR algorithms and model sizes. Our work suggests that aligning exploration with the natural dynamics of sequential generation offers a robust path to improving LLM reasoning.
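
A minimal sketch of the explore-early, exploit-late idea is shown below. It assumes a linear decay from a start temperature to an end temperature; the abstract only says the temperature is annealed from high to low, so the schedule shape, the default values `t_start=1.2` and `t_end=0.1`, and the toy logits are all illustrative assumptions rather than the paper's settings.

```python
import math
import random

def annealed_temperature(step, total_steps, t_start=1.2, t_end=0.1):
    """Linearly interpolate the temperature from t_start (first token) to t_end (last token)."""
    if total_steps <= 1:
        return t_end
    frac = step / (total_steps - 1)
    return t_start + (t_end - t_start) * frac

def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.5]  # stand-in for a model's next-token logits
total_steps = 8
schedule = [annealed_temperature(s, total_steps) for s in range(total_steps)]
tokens = [sample_with_temperature(toy_logits, t, rng) for t in schedule]
print([round(t, 2) for t in schedule])
```

In a real decoder the logits would change at every step; the fixed `toy_logits` here only serve to show how the temperature reshapes the sampling distribution over the course of a sequence.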

[30] Camellia: Benchmarking Cultural Biases in LLMs for Asian Languages

Tarek Naous,Anagha Savit,Carlos Rafael Catalan,Geyang Guo,Jaehyeok Lee,Kyungdon Lee,Lheane Marie Dizon,Mengyu Ye,Neel Kothari,Sahajpreet Singh,Sarah Masud,Tanish Patwa,Trung Thanh Tran,Zohaib Khan,Alan Ritter,JinYeong Bak,Keisuke Sakaguchi,Tanmoy Chakraborty,Yuki Arase,Wei Xu

Main category: cs.CL

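TL;DR: This paper introduces Camellia, a benchmark for measuring entity-centric cultural biases in nine Asian languages. Multilingual LLMs adapt poorly to cultural context in Asian languages, show culture-specific sentiment biases, and exhibit performance gaps between cultures in entity extraction.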

Details

Motivation: Because multilingual benchmarks are lacking, it is unclear whether LLMs show cultural biases in non-Western languages; cultural fairness in Asian languages in particular has not been evaluated systematically. Method: The authors build the Camellia benchmark, with 19,530 manually annotated entities and 2,173 masked contexts from social media, covering six Asian cultures and nine Asian languages, and evaluate four recent multilingual LLM families on cultural context adaptation, sentiment association, and entity extractive QA. Result: LLMs adapt weakly to cultural context in all Asian languages, with performance varying across models according to differences in training data; the model families show different culture-sentiment association biases; and there are clear performance gaps between cultures in entity extraction. Conclusion: Current multilingual LLMs commonly show cultural bias and limited understanding when handling Asian languages; more culturally relevant data and targeted modeling are needed to improve cultural fairness. Abstract: As Large Language Models (LLMs) gain stronger multilingual capabilities, their ability to handle culturally diverse entities becomes crucial. Prior work has shown that LLMs often favor Western-associated entities in Arabic, raising concerns about cultural fairness. Due to the lack of multilingual benchmarks, it remains unclear if such biases also manifest in different non-Western languages. In this paper, we introduce Camellia, a benchmark for measuring entity-centric cultural biases in nine Asian languages spanning six distinct Asian cultures. Camellia includes 19,530 entities manually annotated for association with the specific Asian or Western culture, as well as 2,173 naturally occurring masked contexts for entities derived from social media posts. Using Camellia, we evaluate cultural biases in four recent multilingual LLM families across various tasks such as cultural context adaptation, sentiment association, and entity extractive QA. Our analyses show that LLMs struggle with cultural adaptation in all Asian languages, with performance differing across models developed in regions with varying access to culturally-relevant data. We further observe that different LLM families hold distinct biases, differing in how they associate cultures with particular sentiments. Lastly, we find that LLMs struggle with context understanding in Asian languages, creating performance gaps between cultures in entity extraction.

[31] RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts

Yining She,Daniel W. Peterson,Marianne Menglin Liu,Vikas Upadhyay,Mohammad Hossein Chaghazardi,Eunsuk Kang,Dan Roth

Main category: cs.CL

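TL;DR: Studies the context robustness of LLM-based guardrails in retrieval-augmented generation (RAG) settings and finds that inserting benign documents changes guardrail judgments, exposing a context-robustness gap in current guardrails.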

Details

Motivation: As LLMs are widely adopted, ensuring their safety is critical. Existing external LLM-based guardrail models may fail under data distribution shift, especially in settings such as RAG that add extra context, so their robustness needs to be evaluated. Method: Using RAG as a case study, the authors systematically evaluate 3 Llama Guard and 2 GPT-oss models as input and output guardrails, analyze how retrieved documents, the user query, and the generated response each affect the judgment, and test two mitigation methods. Result: Inserting benign documents into the guardrail context changes about 11% of input-guardrail judgments and about 8% of output-guardrail judgments; the existing mitigation methods bring only minor improvements, revealing how fragile current guardrails are under context composition. Conclusion: Current LLM-based guardrails are not robust to context perturbation; more robust training and evaluation protocols are needed to handle the challenges posed by retrieval and query composition. Abstract: With the increasing adoption of large language models (LLMs), ensuring the safety of LLM systems has become a pressing concern. External LLM-based guardrail models have emerged as a popular solution to screen unsafe inputs and outputs, but they are themselves fine-tuned or prompt-engineered LLMs that are vulnerable to data distribution shifts. In this paper, taking Retrieval-Augmented Generation (RAG) as a case study, we investigated how robust LLM-based guardrails are against additional information embedded in the context. Through a systematic evaluation of 3 Llama Guards and 2 GPT-oss models, we confirmed that inserting benign documents into the guardrail context alters the judgments of input and output guardrails in around 11% and 8% of cases, respectively, making them unreliable. We separately analyzed the effect of each component in the augmented context: retrieved documents, user query, and LLM-generated response. The two mitigation methods we tested only bring minor improvements. These results expose a context-robustness gap in current guardrails and motivate training and evaluation protocols that are robust to retrieval and query composition.

[32] WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives

Yongan Yu,Xianda Du,Qingchen Hu,Jiahao Liang,Jingwei Ni,Dan Qiang,Kaiyu Huang,Grant McKenzie,Renee Sieber,Fengran Mo

Main category: cs.CL

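TL;DR: This paper introduces WeatherArchive-Bench, the first benchmark for evaluating retrieval-augmented generation (RAG) systems on historical weather archives. It comprises a retrieval task and an assessment task and reveals the limitations of existing dense retrievers and LLMs in handling historical terminology and the concepts of societal vulnerability and resilience.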

Details

Motivation: Historical weather archives contain rich qualitative information on how societies responded to extreme weather events, but their vast scale, poor digitization quality, and archaic language make them hard to turn into structured knowledge; effective methods are needed to support climate research. Method: The authors build WeatherArchive-Bench, which comprises WeatherArchive-Retrieval (retrieving relevant passages from over one million archival news segments) and WeatherArchive-Assessment (using LLMs to classify societal vulnerability and resilience indicators), and run extensive experiments with a range of retrievers and LLMs. Result: Dense retrievers often fail on historical terminology, while LLMs frequently misinterpret vulnerability and resilience concepts, exposing key weaknesses of current RAG systems in understanding complex societal indicators. Conclusion: The study highlights the challenges current techniques face with historical climate archives and offers insights for building more robust climate-focused RAG systems; the dataset and evaluation framework are publicly available. Abstract: Historical archives on weather events are collections of enduring primary source records that offer rich, untapped narratives of how societies have experienced and responded to extreme weather events. These qualitative accounts provide insights into societal vulnerability and resilience that are largely absent from meteorological records, making them valuable for climate scientists to understand societal responses. However, their vast scale, noisy digitized quality, and archaic language make it difficult to transform them into structured knowledge for climate research. To address this challenge, we introduce WeatherArchive-Bench, the first benchmark for evaluating retrieval-augmented generation (RAG) systems on historical weather archives. WeatherArchive-Bench comprises two tasks: WeatherArchive-Retrieval, which measures a system's ability to locate historically relevant passages from over one million archival news segments, and WeatherArchive-Assessment, which evaluates whether Large Language Models (LLMs) can classify societal vulnerability and resilience indicators from extreme weather narratives. Extensive experiments across sparse, dense, and re-ranking retrievers, as well as a diverse set of LLMs, reveal that dense retrievers often fail on historical terminology, while LLMs frequently misinterpret vulnerability and resilience concepts. These findings highlight key limitations in reasoning about complex societal indicators and provide insights for designing more robust climate-focused RAG systems from archival contexts. 
The constructed dataset and evaluation framework are publicly available at https://anonymous.4open.science/r/WeatherArchive-Bench/.

[33] Residualized Similarity for Faithfully Explainable Authorship Verification

Peter Zeng,Pegah Alipoormolabashi,Jihu Mun,Gourab Dey,Nikita Soni,Niranjan Balasubramanian,Owen Rambow,H. Schwartz

Main category: cs.CL

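TL;DR: Proposes Residualized Similarity (RS), a new method that combines interpretable features with a neural network to improve the performance of authorship verification systems while preserving interpretability.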

Details

Motivation: Existing neural methods are accurate but lack interpretability, whereas decisions in real-world use need interpretable features that can be traced back to the original texts. Method: A neural network predicts the residual (that is, the error) of the similarity measure produced by an interpretable system, improving performance while retaining interpretability. Result: Experiments on four datasets show that the method matches the performance of state-of-the-art authorship verification models while providing faithful and interpretable predictions. Conclusion: Residualized Similarity improves the performance of authorship verification systems based on interpretable features without sacrificing interpretability, making it suitable for settings that require responsible use. Abstract: Responsible use of Authorship Verification (AV) systems not only requires high accuracy but also interpretable solutions. More importantly, using such systems to make decisions with real-world consequences requires the model's prediction to be explainable using interpretable features that can be traced to the original texts. Neural methods achieve high accuracies, but their representations lack direct interpretability. Furthermore, LLM predictions cannot be explained faithfully -- if there is an explanation given for a prediction, it doesn't represent the reasoning process behind the model's prediction. In this paper, we introduce Residualized Similarity (RS), a novel method that supplements systems using interpretable features with a neural network to improve their performance while maintaining interpretability. Authorship verification is fundamentally a similarity task, where the goal is to measure how alike two documents are. The key idea is to use the neural network to predict a similarity residual, i.e. the error in the similarity predicted by the interpretable system. Our evaluation across four datasets shows that not only can we match the performance of state-of-the-art authorship verification models, but we can show how and to what degree the final prediction is faithful and interpretable.
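
The residual idea can be sketched in a few lines: the final score is the interpretable similarity plus a predicted correction, and the training target for the correction is the gap between the gold similarity and the interpretable one. In this sketch the interpretable similarity is assumed to be a cosine over hand-crafted feature vectors and the neural residual predictor is replaced by a constant stub; both choices, and all function names, are hypothetical stand-ins rather than the paper's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def residual_target(gold_similarity, interpretable_similarity):
    """Training target for the residual model: the interpretable system's error."""
    return gold_similarity - interpretable_similarity

def residualized_similarity(feat_a, feat_b, residual_model):
    """Combine the interpretable score with the predicted residual."""
    base = cosine(feat_a, feat_b)
    residual = residual_model(feat_a, feat_b)
    return {"interpretable": base, "residual": residual, "final": base + residual}

def toy_residual_model(feat_a, feat_b):
    """Stub standing in for a trained neural network."""
    return 0.05

result = residualized_similarity([1.0, 0.0, 2.0], [1.0, 0.5, 2.0], toy_residual_model)
print(result)
```

Keeping the two terms separate in the output is what lets a reader see how much of the final score comes from the interpretable features and how much from the opaque correction.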

[34] The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures

Alexander M. Fichtl,Jeremias Bohn,Josefin Kelber,Edoardo Mosca,Georg Groh

Main category: cs.CL

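TL;DR: This paper surveys recent work on overcoming the quadratic-complexity bottleneck of Transformer attention and assesses the potential of several alternative architectures.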

Details

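Motivation: Transformer attention has inherent quadratic complexity, which becomes a significant bottleneck as context length grows, so more efficient sequence modeling methods are needed. Method: The paper surveys and compares recent approaches, including sub-quadratic attention variants, recurrent neural networks, state space models, and hybrid architectures, and critically analyzes them in terms of compute and memory complexity, benchmark results, and fundamental limitations. Result: The survey systematically assesses the efficiency-performance trade-offs of each class of method and identifies the challenges pure-attention Transformers may face as well as directions for future work. Conclusion: Although Transformers remain dominant, emerging architectures show the potential to challenge that dominance in long-context settings. Abstract: Transformers have dominated sequence processing tasks for the past seven years -- most notably language modeling. However, the inherent quadratic complexity of their attention mechanism remains a significant bottleneck as context length increases. This paper surveys recent efforts to overcome this bottleneck, including advances in (sub-quadratic) attention variants, recurrent neural networks, state space models, and hybrid architectures. We critically analyze these approaches in terms of compute and memory complexity, benchmark results, and fundamental limitations to assess whether the dominance of pure-attention transformers may soon be challenged.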

[35] Context Length Alone Hurts LLM Performance Despite Perfect Retrieval

Yufeng Du,Minyang Tian,Srikanth Ronanki,Subendhu Rongali,Sravan Bodapati,Aram Galstyan,Azton Wells,Roy Schwartz,Eliu A Huerta,Hao Peng

Main category: cs.CL

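TL;DR: Even when LLMs can perfectly retrieve the relevant information, performance still degrades substantially as input length grows, showing that input length itself affects model performance. The paper proposes mitigating this by prompting the model to recite the retrieved evidence.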

Details

Motivation: To investigate why LLM performance on long-context tasks does not scale with supported context length, and to challenge the common assumption that performance will not degrade as long as retrieval is accurate. Method: Systematic experiments on 5 open- and closed-source LLMs across math, question answering, and coding tasks, controlling retrieval quality and testing the effect of input length, including settings that use whitespace filler, mask irrelevant tokens, and place the relevant evidence immediately before the question. Result: Even with perfect retrieval and no distraction, performance drops substantially (13.9%--85%) as input grows; the drop persists when models are forced to attend only to relevant tokens or when the evidence is placed right next to the question; having the model recite the evidence before solving improves GPT-4o by up to 4%. Conclusion: Input length itself harms LLM performance, independent of retrieval quality and distraction; the core challenges of long-context modeling need rethinking, and simple strategies such as evidence recitation can mitigate the problem. Abstract: Large language models (LLMs) often fail to scale their performance on long-context tasks in line with the context lengths they support. This gap is commonly attributed to retrieval failures -- the models' inability to identify relevant information in the long inputs. Accordingly, recent efforts often focus on evaluating and improving LLMs' retrieval performance: if retrieval is perfect, a model should, in principle, perform just as well on a long input as it does on a short one -- or should it? This paper presents findings that the answer to this question may be negative. Our systematic experiments across 5 open- and closed-source LLMs on math, question answering, and coding tasks reveal that, even when models can perfectly retrieve all relevant information, their performance still degrades substantially (13.9%--85%) as input length increases, even though it remains well within the models' claimed lengths. This failure occurs even when the irrelevant tokens are replaced with minimally distracting whitespace, and, more surprisingly, when they are all masked and the models are forced to attend only to the relevant tokens. A similar performance drop is observed when all relevant evidence is placed immediately before the question. Our findings reveal a previously-unrealized limitation: the sheer length of the input alone can hurt LLM performance, independent of retrieval quality and without any distraction. 
They motivate our simple, model-agnostic mitigation strategy that transforms a long-context task into a short-context one by prompting the model to recite the retrieved evidence before attempting to solve the problem. On RULER, we observe a consistent improvement of GPT-4o up to 4% on an already strong baseline.
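
The recite-then-solve mitigation described above can be sketched as a prompt template. This is a minimal illustration only: the function name `build_recitation_prompt` and the exact instruction wording are assumptions, not the prompt used in the paper.

```python
def build_recitation_prompt(long_context: str, question: str) -> str:
    """Build a prompt that asks the model to restate the evidence first.

    Hypothetical wording: the abstract only says the model is prompted to
    recite the retrieved evidence before attempting to solve the problem.
    """
    return (
        f"Context:\n{long_context}\n\n"
        f"Question: {question}\n\n"
        "Step 1: Recite the passages from the context that are relevant to the question.\n"
        "Step 2: Using only the recited passages, solve the problem.\n"
    )


prompt = build_recitation_prompt("...long document...", "What is the total cost?")
print(prompt)
```

The idea is that the answer is then produced next to a short, recited copy of the evidence, so the solving step resembles a short-context task.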

[36] Cross-Lingual Mental Health Ontologies for Indian Languages: Bridging Patient Expression and Clinical Understanding through Explainable AI and Human-in-the-Loop Validation

Ananth Kandala,Ratna Kandala,Akshata Kishore Moharir,Niva Manchanda,Sunaina Singh

Main category: cs.CL

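TL;DR: Proposes CL-PDE, a graph-based framework of cross-lingual patient stress expressions, for building mental health ontologies that span India's many languages and cultural contexts, addressing the lack of cultural diversity and linguistic inclusiveness in existing clinical NLP systems.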

Details

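Motivation: Existing mental health resources are mostly grounded in English or Western culture and struggle to represent how patients in India's multilingual settings express distress, leaving a representational gap when clinical NLP is applied locally. Method: Uses a graph-based approach to build a cross-lingual network of patient stress expressions (CL-PDE), modeling culturally embedded expressions of distress in different Indian languages and aligning them with clinical terminology to establish cross-lingual semantic links. Result: The framework captures and connects culturally relevant expressions of psychological stress across multiple Indian languages, improving the cultural adaptability and expression coverage of NLP systems in multilingual settings. Conclusion: CL-PDE offers a more inclusive, patient-centric path for AI support of mental health communication in multilingual, multicultural contexts, helping narrow cultural and linguistic gaps in global mental health technology. Abstract: Mental health communication in India is linguistically fragmented, culturally diverse, and often underrepresented in clinical NLP. Current health ontologies and mental health resources are dominated by diagnostic frameworks centered on English or Western culture, leaving a gap in representing patient distress expressions in Indian languages. We propose cross-linguistic graphs of patient stress expressions (CL-PDE), a framework for building cross-lingual mental health ontologies through graph-based methods that capture culturally embedded expressions of distress, align them across languages, and link them with clinical terminology. Our approach addresses critical gaps in healthcare communication by grounding AI systems in culturally valid representations, allowing more inclusive and patient-centric NLP tools for mental health care in multilingual contexts.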

[37] Aligning Language Models with Clinical Expertise: DPO for Heart Failure Nursing Documentation in Critical Care

Junyi Fan,Li Sun,Negin Ashrafi,Kamiar Alaei,Maryam Pishgar

Main category: cs.CL

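TL;DR: Uses Direct Preference Optimization (DPO) to fine-tune the Mistral-7B language model on 8,838 heart failure nursing notes from the MIMIC-III database and 21,210 expert-verified preference pairs, markedly improving ICU nursing documentation quality.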

Details

Motivation: Nursing documentation in intensive care units often suffers from inconsistent terminology and non-standardized formatting, which impairs clinical information transfer, particularly in heart failure care; a method that both preserves privacy and improves documentation quality is needed. Method: Fine-tunes Mistral-7B with Direct Preference Optimization (DPO), using heart failure nursing notes from the MIMIC-III database and a preference dataset built from expert-verified GPT outputs, model generations, and original notes. Result: BLEU rose by 84% (0.173 to 0.318), BERTScore improved by 7.6% (0.828 to 0.891), and expert ratings increased markedly for accuracy, completeness, logical consistency, readability, and structural clarity. Conclusion: DPO can effectively align lightweight clinical language models with expert standards, supporting privacy-preserving, AI-assisted documentation within electronic health record systems, reducing clinical burden and improving patient safety. Abstract: Nursing documentation in intensive care units (ICUs) provides essential clinical intelligence but often suffers from inconsistent terminology, informal styles, and lack of standardization, challenges that are particularly critical in heart failure care. This study applies Direct Preference Optimization (DPO) to adapt Mistral-7B, a locally deployable language model, using 8,838 heart failure nursing notes from the MIMIC-III database and 21,210 preference pairs derived from expert-verified GPT outputs, model generations, and original notes. Evaluation across BLEU, ROUGE, BERTScore, Perplexity, and expert qualitative assessments demonstrates that DPO markedly enhances documentation quality. Specifically, BLEU increased by 84% (0.173 to 0.318), BERTScore improved by 7.6% (0.828 to 0.891), and expert ratings rose across accuracy (+14.4 points), completeness (+14.5 points), logical consistency (+14.1 points), readability (+11.1 points), and structural clarity (+6.0 points). These results indicate that DPO can align lightweight clinical language models with expert standards, supporting privacy-preserving, AI-assisted documentation within electronic health record systems to reduce administrative burden and improve ICU patient safety.
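
The relative gains quoted in the abstract can be checked from the before/after scores. The short sketch below only reproduces that arithmetic; the helper name `relative_gain` is an illustrative assumption.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Scores as reported in the abstract.
bleu_gain = relative_gain(0.173, 0.318)
bertscore_gain = relative_gain(0.828, 0.891)

print(f"BLEU gain: {bleu_gain:.1f}%")
print(f"BERTScore gain: {bertscore_gain:.1f}%")
```

Both values round to the figures in the abstract (84% and 7.6%), which confirms these are relative improvements rather than absolute point differences.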

[38] A Lightweight Large Language Model-Based Multi-Agent System for 2D Frame Structural Analysis

Ziheng Geng,Jiachen Liu,Ran Cao,Lu Cheng,Haifeng Wang,Minghui Cheng

Main category: cs.CL

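TL;DR: Proposes an LLM-based multi-agent system that automates finite element modeling of 2D frames; experiments show accuracy above 80% in most cases, outperforming existing mainstream models.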

Details

Motivation: LLMs have shown promise in engineering but remain underused in structural engineering, especially in finite element modeling, where geometric modeling, complex reasoning, and integration of domain knowledge pose challenges; a dedicated system is needed to raise the level of automation. Method: Builds a multi-agent system on the Llama-3.3 70B Instruct model that decomposes structural analysis into subtasks: a problem analysis agent extracts input parameters, a geometry agent derives node and element connectivity, a translation agent generates OpenSeesPy code, a validation agent performs consistency checks, and a load agent applies load conditions. Result: Experiments on 20 benchmark problems show modeling accuracy above 80% in most cases across 10 repeated trials, outperforming Gemini-2.5 Pro and ChatGPT-4o. Conclusion: The proposed LLM-driven multi-agent system effectively automates finite element modeling of 2D frames with high accuracy and scalability, offering a feasible path toward intelligent modeling in structural engineering. Abstract: Large language models (LLMs) have recently been used to empower autonomous agents in engineering, significantly improving automation and efficiency in labor-intensive workflows. However, their potential remains underexplored in structural engineering, particularly for finite element modeling tasks requiring geometric modeling, complex reasoning, and domain knowledge. To bridge this gap, this paper develops an LLM-based multi-agent system to automate finite element modeling of 2D frames. The system decomposes structural analysis into subtasks, each managed by a specialized agent powered by the lightweight Llama-3.3 70B Instruct model. The workflow begins with a Problem Analysis Agent, which extracts geometry, boundary, and material parameters from the user input. Next, a Geometry Agent incrementally derives node coordinates and element connectivity by applying expert-defined rules. These structured outputs are converted into executable OpenSeesPy code by a Translation Agent and refined by a Model Validation Agent through consistency checks. Then, a Load Agent applies load conditions to the assembled structural model. Experimental evaluations on 20 benchmark problems demonstrate that the system achieves accuracy over 80% in most cases across 10 repeated trials, outperforming Gemini-2.5 Pro and ChatGPT-4o models.

[39] Self-Filtered Distillation with LLMs-generated Trust Indicators for Reliable Patent Classification

Yoo Yongmin,Zhang Xu,Cao Longbing

Main category: cs.CL

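TL;DR: Proposes Self-Filtered Distillation, a framework that uses unsupervised trust metrics to filter LLM-generated rationales for patent classification, improving model accuracy and interpretability.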

Details

Motivation: LLM-generated rationales often contain logical errors and label mismatches; using them directly as supervision introduces noise and harms training stability. Method: Proposes Self-Filtered Distillation, which combines three unsupervised trust metrics (self-consistency, class entailment alignment, and LLM agreement scoring) into a unified trust score used to weight or filter training samples. Result: On the USPTO-2M dataset, the method outperforms label-based learning and conventional distillation, improving accuracy, stability, and interpretability. Conclusion: The method provides a reliable paradigm for using reasoning-aware trust signals in patent analytics. Abstract: Large language models (LLMs) increasingly generate natural language rationales to enhance interpretability, but these often contain logical errors, label mismatches, and domain-specific misalignments. Directly using such rationales as supervision risks propagating noise and undermining training stability. To address this challenge, we introduce Self-Filtered Distillation, a framework specifically tailored for patent classification, which treats LLM-generated rationales as trust signals rather than ground-truth supervision. The framework employs selective distillation guided by three unsupervised trust metrics: (1) Self-Consistency, which measures the stability of LLM-generated rationales across multiple generations; (2) Class Entailment Alignment, which assesses semantic coherence with patent-specific class definitions; and (3) LLM Agreement Scoring, which validates rationale-label plausibility. These metrics are integrated into a unified trust score that primarily weights training samples while optionally filtering out extremely low-trust cases, enabling reasoning-aware supervision. Experiments on the USPTO-2M dataset, a widely used benchmark for patent classification, show that our method outperforms label-based learning and conventional distillation in accuracy, stability, and interpretability, establishing a reliable paradigm for leveraging reasoning-aware trust indicators in patent analytics.
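
A rough sketch of how a unified trust score might weight samples and optionally drop very low-trust ones is given below. The equal-weight average, the `min_trust` threshold, and both function names are assumptions made for illustration; the abstract does not specify how the three metrics are combined.

```python
def trust_score(self_consistency, entailment, agreement, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three trust metrics in [0, 1] into one score (assumed: weighted mean)."""
    metrics = (self_consistency, entailment, agreement)
    return sum(w * m for w, m in zip(weights, metrics))


def trust_weighted_loss(per_sample_losses, trust_scores, min_trust=0.1):
    """Weight each sample's loss by its trust score; drop samples below min_trust."""
    kept = [(loss, t) for loss, t in zip(per_sample_losses, trust_scores) if t >= min_trust]
    total_weight = sum(t for _, t in kept)
    if total_weight == 0:
        return 0.0  # every sample was filtered out
    return sum(loss * t for loss, t in kept) / total_weight


scores = [trust_score(0.9, 0.8, 0.7), trust_score(0.1, 0.0, 0.05)]
print(trust_weighted_loss([1.0, 3.0], scores))
```

Here the second sample falls below the threshold and is filtered, so only the first sample contributes to the loss.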

[40] SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?

Yao Dou,Michel Galley,Baolin Peng,Chris Kedzie,Weixin Cai,Alan Ritter,Chris Quirk,Wei Xu,Jianfeng Gao

Main category: cs.CL

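TL;DR: Introduces SimulatorArena, a benchmark of 909 human-LLM conversations for evaluating how reliably simulated users can stand in for real humans when rating LLM assistants on math tutoring and document creation tasks. Experiments show that simulators conditioned on user profiles align closely with human judgments (Spearman's ρ of 0.7) and can serve as an efficient, scalable alternative to human evaluation; the best simulators are used to benchmark 18 assistants, including GPT-5, Claude 4.1 Opus, and Gemini 2.5 Pro.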

Details

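Motivation: Human evaluation is costly, time-consuming, and hard to reproduce, so existing work has tried using LLMs to simulate users for automatic evaluation, but there is no systematic benchmark to verify their reliability. A standardized evaluation framework is therefore needed to measure whether simulated users can effectively replace real humans. Method: Proposes the SimulatorArena benchmark, containing 909 annotated conversations on two interactive tasks (math tutoring and document creation); designs metrics that measure how closely simulated users' messages match human behavior and how well their assistant ratings correlate with human judgments; compares multiple simulator methods, in particular modeling conditioned on user profiles (such as background and message style). Result: Profile-conditioned simulators reach a Spearman correlation of 0.7 with human judgments on both tasks, clearly outperforming profile-free baselines; the best simulators are used to rank 18 recent LLM assistants (such as GPT-5, Claude 4.1 Opus, and Gemini 2.5 Pro). Conclusion: LLM simulators that model user characteristics can reliably imitate human behavior and preferences and can serve as an effective, scalable alternative to human evaluation in interactive tasks; SimulatorArena provides an important benchmark for future automated evaluation. Abstract: Large language models (LLMs) are increasingly used in interactive applications, and human evaluation remains the gold standard for assessing their performance in multi-turn conversations. Since human studies are costly, time-consuming, and hard to reproduce, recent work explores using LLMs to simulate users for automatic assistant evaluation. However, there is no benchmark or systematic study to evaluate whether these simulated users are reliable stand-ins for real users. To address this, we introduce SimulatorArena, a benchmark of 909 annotated human-LLM conversations on two interactive tasks -- math tutoring and document creation. SimulatorArena evaluates simulators based on how closely their messages match human behavior and how well their assistant ratings align with human judgments. Experiments on various simulator methods show that simulators conditioned on user profiles, capturing traits like background and message styles, align closely with human judgments. They reach Spearman's ρ of 0.7 on both tasks, providing a practical, scalable alternative to human evaluation. Using the best simulator for each task, we benchmark 18 assistants, including the latest LLMs such as GPT-5, Claude 4.1 Opus, and Gemini 2.5 Pro.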
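
Agreement between simulator ratings and human ratings is reported as Spearman's ρ, the Pearson correlation computed on ranks. The pure-Python sketch below shows that calculation with average ranks for ties; the rating lists are made-up illustrative data, not values from the benchmark.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings for five conversations.
human_ratings = [1, 2, 3, 4, 5]
simulator_ratings = [2, 1, 4, 3, 5]
print(spearman_rho(human_ratings, simulator_ratings))
```

Because only ranks matter, a simulator that is systematically harsher or more lenient than humans can still score a high ρ as long as it orders assistants the same way.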

[41] AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering

Zheyuan Zhang,Kaiwen Shi,Zhengqing Yuan,Zehong Wang,Tianyi Ma,Keerthiram Murugesan,Vincent Galassi,Chuxu Zhang,Yanfang Ye

Main category: cs.CL

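TL;DR: Proposes AgentRouter, a knowledge-graph-guided routing framework for multi-agent question answering that uses a heterogeneous graph neural network and performance signals for task-aware agent routing, significantly outperforming single-agent and ensemble baselines.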

Details

Motivation: Different agents and LLM backbones have complementary strengths, and larger models are not always better; existing routing methods overlook the fine-grained contextual and relational structure of QA tasks, so a more refined adaptive routing mechanism is needed. Method: Converts each QA instance into a knowledge graph that jointly encodes queries, contextual entities, and agents, propagates information with a heterogeneous graph neural network, and learns task-aware routing distributions through soft supervision and weighted aggregation. Result: Across multiple benchmarks and LLM backbones, AgentRouter consistently outperforms single-agent and ensemble baselines, showing good generalization and performance gains. Conclusion: Knowledge-graph-supervised multi-agent routing effectively captures the complementarity among agents and is a robust, efficient way to improve performance on complex QA tasks. Abstract: Large language models (LLMs) and agent-based frameworks have advanced rapidly, enabling diverse applications. Yet, with the proliferation of models and agentic strategies, practitioners face substantial uncertainty in selecting the best configuration for a downstream task. Prior studies show that different agents and backbones exhibit complementary strengths, and that larger models are not always superior, underscoring the need for adaptive routing mechanisms. Existing approaches to agent routing, however, often emphasize cost efficiency while overlooking the fine-grained contextual and relational structure inherent in QA tasks. In this paper, we propose AgentRouter, a framework that formulates multi-agent QA as a knowledge-graph-guided routing problem supervised by empirical performance signals. Specifically, we convert each QA instance into a knowledge graph that jointly encodes queries, contextual entities, and agents, and then train a heterogeneous graph neural network (GNN) to propagate information across node types and produce task-aware routing distributions over agents. By leveraging soft supervision and weighted aggregation of agent outputs, AgentRouter learns principled collaboration schemes that capture the complementary strengths of diverse agents. Extensive experiments demonstrate that our framework consistently outperforms single-agent and ensemble baselines, while generalizing across benchmarks and LLM backbones. These results highlight the effectiveness and robustness of graph-supervised multi-agent routing for question answering.
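
The final step, weighted aggregation of agent outputs under a routing distribution, can be illustrated as follows. This sketch assumes the router has already produced one score per agent (in the paper these come from a heterogeneous GNN, which is not reproduced here) and uses a softmax-weighted vote, which is an assumed aggregation rule rather than the paper's.

```python
import math
from collections import defaultdict


def softmax(scores):
    """Turn raw routing scores into a probability distribution over agents."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def aggregate_answers(agent_answers, routing_scores):
    """Return the answer with the largest total routing weight."""
    weights = softmax(routing_scores)
    mass = defaultdict(float)
    for answer, weight in zip(agent_answers, weights):
        mass[answer] += weight
    return max(mass, key=mass.get)


# Three hypothetical agents; the router strongly prefers the first one.
print(aggregate_answers(["A", "B", "B"], [2.0, 0.0, 0.0]))
```

With uniform scores this reduces to a majority vote, while a confident router can let a single well-suited agent override the majority.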

Akhil Deo,Kate Sanders,Benjamin Van Durme

Main category: cs.CL

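TL;DR: Introduces SocialNLI (SoNLI), the first social dialogue inference dataset focused on complex social nuances such as sarcasm and irony, designed to evaluate and improve LLMs' theory-of-mind ability to understand social phenomena and perform multi-step counterfactual reasoning.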

Details

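Motivation: Current LLMs perform poorly at understanding complex social phenomena in dialogue, such as sarcasm and irony, and lack sufficient social reasoning ability, which limits the social intelligence of AI assistants. A dedicated dataset is therefore needed to systematically evaluate and improve models' social inference ability. Method: The authors build a new dataset, SocialNLI, containing hand-picked dialogue transcripts that cover complex social nuances, with inferences, likelihood scores, and human-written explanations for each sample. They evaluate the theory-of-mind ability of LLMs and reasoning models through multi-step counterfactual reasoning tasks. Result: SocialNLI is the first dataset for social dialogue inference; it exposes the shortcomings of existing models in understanding sarcasm, irony, and similar social phenomena, and supports in-depth analysis of models' theory-of-mind ability. Experiments show that current models still have significant deficiencies on such tasks. Conclusion: SocialNLI provides an important tool for evaluating and improving the social understanding and reasoning of AI models, advancing AI assistants with stronger social intelligence, especially in scenarios involving theory of mind and complex social interaction. Abstract: Making theory-of-mind inferences from human dialogue is a strong indicator of a model's underlying social abilities, which are fundamental for adept AI assistants. However, large language and reasoning models struggle to understand sophisticated social phenomena in transcript data, such as sarcasm and irony. To assess the weaknesses of current models and to identify their solutions, we introduce SocialNLI (SoNLI) -- the first social dialogue inference dataset. SoNLI consists of a collection of dialogue transcripts hand-picked to center complex social nuances like irony and sarcasm, paired with inferences, corresponding likelihood scores, and human-written explanations. We explore social inference analysis as a facet of theory-of-mind, and evaluate LLM and reasoning model theory-of-mind ability through multi-step counterfactual reasoning.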

[43] TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation

Adam Filipek

Main category: cs.CL

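TL;DR: Introduces TensorBLEU, a BLEU implementation designed for fast computation on batches of token IDs on the GPU; its fully vectorized, memory-efficient n-gram counting significantly speeds up in-training evaluation.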

Details

Motivation: NLP models keep growing in scale, but evaluation tools remain a computational bottleneck during training, especially in settings such as reinforcement learning that need efficient per-sentence evaluation of token-ID batches on the GPU; traditional CPU-based evaluation is slow and memory-hungry. Method: Proposes TensorBLEU, a fully vectorized BLEU computation implemented in PyTorch that uses torch.unique to build a compact, batch-specific n-gram dictionary, avoiding the high memory cost of traditional hashing and enabling efficient parallel computation on the GPU. Result: More than 13x faster than NLTK on an NVIDIA T4 GPU and more than 40x faster on an A100, sharply reducing evaluation time so that the former bottleneck becomes negligible. Conclusion: TensorBLEU offers an efficient, practical solution for fast in-training evaluation, particularly for RL and other settings that compute reward signals frequently, and helps accelerate research such as large-model fine-tuning. Abstract: Modern natural language processing models have achieved unprecedented scale, yet the tools for their evaluation often remain a computational bottleneck, limiting the pace of research. This is particularly acute for in-training evaluation metrics, such as per-sentence reward signals in Reinforcement Learning, which must operate efficiently on batches of token IDs directly on the GPU. In this paper, we introduce TensorBLEU, a novel implementation of the BLEU metric designed from the ground up for this specific use case. Our approach is fully vectorized for GPU-accelerated, per-sentence computation within PyTorch and introduces a memory-efficient counting mechanism. By creating a compact, batch-specific dictionary of n-grams using torch.unique, our method avoids the prohibitive memory costs of traditional hashing-based vectorization, making it practical for large-vocabulary models. We benchmark TensorBLEU against NLTK, the standard library for token-ID-based BLEU calculation on the CPU. Experiments show that TensorBLEU provides speedups of over 13x on consumer-grade GPUs (NVIDIA T4) and exceeding 40x on data-center-class hardware (NVIDIA A100). This performance transforms a significant bottleneck into a negligible part of the training loop. By clearly defining its role as a "Token-ID BLEU" for development purposes and open-sourcing our implementation, we provide a powerful tool for accelerating research in areas like RL-based model fine-tuning.
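
For orientation, the quantity at the core of per-sentence BLEU is the clipped (modified) n-gram precision over token IDs. The sketch below is a plain-Python CPU reference of that quantity using `collections.Counter`; it is not the TensorBLEU implementation, which the abstract describes as vectorized on the GPU with torch.unique.

```python
from collections import Counter


def ngrams(token_ids, n):
    """All contiguous n-grams of a token-ID sequence, as tuples."""
    return [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Modified n-gram precision: candidate counts are clipped by reference counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0  # candidate shorter than n
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


candidate = [7, 7, 7, 42]
reference = [7, 42, 13]
print(clipped_precision(candidate, reference, 1))  # unigram precision
print(clipped_precision(candidate, reference, 2))  # bigram precision
```

The per-sentence dictionaries built here are what a batch-level unique-n-gram index replaces in a vectorized version: each distinct n-gram in the batch gets a compact integer ID, so counting becomes an array operation instead of a Python loop.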

[44] Language Model as Planner and Formalizer under Constraints

Cassie Huang,Stuti Mohan,Ziyi Yang,Stefanie Tellex,Li Zhang

Main category: cs.CL

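TL;DR: By adding fine-grained natural language constraints to planning benchmarks, this paper shows that the planning ability of current LLMs in complex environments is overestimated, and finds that these constraints significantly reduce model performance and robustness.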

Details

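Motivation: Existing planning benchmarks contain only generic and simplistic environment descriptions, which may lead to overestimating the planning ability of LLMs and to safety concerns in downstream tasks. Richer constraints are therefore needed to assess models' true planning ability. Method: Manually adds fine-grained natural language constraints spanning four formally defined categories to widely used planning benchmarks, and evaluates across 4 state-of-the-art reasoning LLMs, 3 formal languages, 5 methods, and 4 datasets. Result: After constraints are introduced, model performance is roughly halved across the board, and robustness to problem complexity and lexical shift drops significantly. Conclusion: Current LLMs have limited planning ability under complex real-world constraints; future research should prioritize benchmarks with complex constraints to improve practicality and safety. Abstract: LLMs have been widely used in planning, either as planners to generate action sequences end-to-end, or as formalizers to represent the planning domain and problem in a formal language that can derive plans deterministically. However, both lines of work rely on standard benchmarks that only include generic and simplistic environmental specifications, leading to potential overestimation of the planning ability of LLMs and safety concerns in downstream tasks. We bridge this gap by augmenting widely used planning benchmarks with manually annotated, fine-grained, and rich natural language constraints spanning four formally defined categories. Over 4 state-of-the-art reasoning LLMs, 3 formal languages, 5 methods, and 4 datasets, we show that the introduction of constraints not only consistently halves performance, but also significantly challenges robustness to problem complexity and lexical shift.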

[45] LANTERN: Scalable Distillation of Large Language Models for Job-Person Fit and Explanation

Zhoutong Fu,Yihan Cao,Yi-Lin Chen,Aman Lunia,Liming Dong,Neha Saraf,Ruijie Jiang,Yun Dai,Qingquan Song,Tan Wang,Guoyao Li,Derek Koh,Haichao Wei,Zhipeng Wang,Aman Gupta,Chengming Jiang,Jianqiang Shen,Liangjie Hong,Wenjing Zhang

Main category: cs.CL

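TL;DR: Introduces LANTERN, an LLM knowledge distillation framework designed for job-person fit tasks, which improves performance and efficiency through multi-objective modeling and multi-level distillation.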

Details

Motivation: Because of the complexity of the domain and the need for structured outputs, directly applying open-source or fine-tuned LLMs rarely yields high-quality, actionable feedback for job-person fit, and their large size causes high inference latency that makes online deployment difficult. Method: Proposes the LANTERN framework, with an encoder model for classification and a decoder model for explanation; it uses multi-level knowledge distillation (data level and logit level) to transfer knowledge from a strong teacher model to downstream models, combined with post-training techniques and prompt engineering for domain adaptation. Result: Experiments show that LANTERN significantly improves metrics for job-person fit and explanation; online evaluation shows higher job seeker engagement, with a 0.24% increase in apply rate and a 0.28% increase in qualified applications. Conclusion: LANTERN effectively addresses the performance, latency, and scalability problems of applying LLMs in a specific domain, providing an efficient, practical solution for job-person fit. Abstract: Large language models (LLMs) have achieved strong performance across a wide range of natural language processing tasks. However, deploying LLMs at scale for domain-specific applications, such as job-person fit and explanation in job seeking platforms, introduces distinct challenges. At LinkedIn, the job-person fit task requires analyzing a candidate's public profile against job requirements to produce both a fit assessment and a detailed explanation. Directly applying open-source or finetuned LLMs to this task often fails to yield high-quality, actionable feedback due to the complexity of the domain and the need for structured outputs. Moreover, the large size of these models leads to high inference latency and limits scalability, making them unsuitable for online use. To address these challenges, we introduce LANTERN, a novel LLM knowledge distillation framework tailored specifically for job-person fit tasks. LANTERN involves modeling over multiple objectives, an encoder model for classification purpose, and a decoder model for explanation purpose. To better distill the knowledge from a strong black-box teacher model to multiple downstream models, LANTERN incorporates multi-level knowledge distillation that integrates both data and logit-level insights. In addition to introducing the knowledge distillation framework, we share our insights on post-training techniques and prompt engineering, both of which are crucial for successfully adapting LLMs to domain-specific downstream tasks. Extensive experimental results demonstrate that LANTERN significantly improves task-specific metrics for both job-person fit and explanation. 
Online evaluations further confirm its effectiveness, showing measurable gains in job seeker engagement, including a 0.24% increase in apply rate and a 0.28% increase in qualified applications.

[46] Prototype-Based Dynamic Steering for Large Language Models

Ceyhun Efe Kayan,Li Zhang

Main category: cs.CL

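TL;DR: Proposes PDS, a dynamic steering method that requires no instruction changes; it clusters activation differences into reasoning prototypes to amplify LLM reasoning at test time.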

Details

Motivation: Existing methods for enhancing LLM reasoning rely on explicit instructions or static steering, lack adaptivity, and require extra instruction intervention, which limits flexibility and efficiency. Method: Introduces the notion of "reasoning prototypes", obtained by clustering the activation differences between Chain-of-Thought (CoT) and neutral prompts; at inference, the input's hidden state is projected onto these prototypes to form an instance-specific steering vector. Result: On GSM8K, AQuA-RAT, and BIG-Bench tasks, PDS consistently improves accuracy without fine-tuning or prompt engineering, and the gains persist even when CoT is suppressed. Conclusion: Prototype-based dynamic steering is a lightweight, effective way to enhance reasoning, strengthening the model's latent reasoning ability without changing its architecture or prompts. Abstract: Despite impressive breadth, LLMs still rely on explicit reasoning instructions or static, one-size-fits-all steering methods, leaving a gap for adaptive, instruction-free reasoning amplification. We present Prototype-Based Dynamic Steering (PDS), a test-time method that amplifies large language model (LLM) reasoning without adding or altering instructions. We introduce "reasoning prototypes" by clustering activation differences between Chain-of-Thought (CoT) and neutral prompts. At inference, an input's hidden state is projected onto these prototypes to form an instance-specific steering vector. Evaluated on GSM8K, AQuA-RAT, and BIG-Bench tasks, PDS consistently improves accuracy without fine-tuning or prompt engineering. Notably, the gains persist even when CoT is explicitly suppressed to improve cost-efficiency, indicating that the intervention strengthens latent reasoning processes rather than inducing a superficial behavioral shift. These results position dynamic, prototype-guided steering as a lightweight alternative to training-time approaches for enhancing LLM reasoning.
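
The projection step can be pictured with plain vectors. In this sketch the prototypes are given (in the paper they come from clustering activation differences), and the steering vector is the sum of the hidden state's projections onto each prototype, scaled by a hypothetical `scale` factor. The exact weighting used by PDS is not stated in the abstract, so this is an assumed form.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steering_vector(hidden, prototypes, scale=1.0):
    """Sum of the projections of `hidden` onto each prototype direction."""
    steer = [0.0] * len(hidden)
    for proto in prototypes:
        coeff = dot(hidden, proto) / dot(proto, proto)  # projection coefficient
        for i, p in enumerate(proto):
            steer[i] += scale * coeff * p
    return steer


def apply_steering(hidden, steer):
    """Add the steering vector to the hidden state."""
    return [h + s for h, s in zip(hidden, steer)]


hidden_state = [2.0, 3.0, 0.0]
reasoning_prototypes = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
steer = steering_vector(hidden_state, reasoning_prototypes, scale=0.5)
print(apply_steering(hidden_state, steer))
```

Because the coefficients depend on the input's own hidden state, two different inputs receive different steering vectors, which is what distinguishes this from adding one fixed vector to every input.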

[47] CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension

Rui Li,Zeyu Zhang,Xiaohe Bo,Zihang Tian,Xu Chen,Quanyu Dai,Zhenhua Dong,Ruiming Tang

Main category: cs.CL

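TL;DR: Proposes Constructivist Agentic Memory (CAM), inspired by Piaget's constructivist theory, to improve the memory capability of LLMs in long-text reading comprehension.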

Details

Motivation: Current LLMs face information overload when processing long documents, and there is no systematic design principle for memory modules. Method: Drawing on Piaget's constructivist theory, the paper identifies three traits (structured schemata, flexible assimilation, and dynamic accommodation) and designs CAM accordingly, using an incremental overlapping clustering algorithm to build structured memory. Result: CAM outperforms existing methods in both performance and efficiency across diverse long-text comprehension tasks, such as question answering, query-based summarization, and claim verification. Conclusion: CAM gives LLMs a more robust and efficient memory system, moving them toward autonomous reading agents. Abstract: Current Large Language Models (LLMs) are confronted with overwhelming information volume when comprehending long-form documents. This challenge raises the imperative of a cohesive memory module, which can elevate vanilla LLMs into autonomous reading agents. Despite the emergence of some heuristic approaches, a systematic design principle remains absent. To fill this void, we draw inspiration from Jean Piaget's Constructivist Theory, illuminating three traits of the agentic memory -- structured schemata, flexible assimilation, and dynamic accommodation. This blueprint forges a clear path toward a more robust and efficient memory system for LLM-based reading comprehension. To this end, we develop CAM, a prototype implementation of Constructivist Agentic Memory that simultaneously embodies the structurality, flexibility, and dynamicity. At its core, CAM is endowed with an incremental overlapping clustering algorithm for structured memory development, supporting both coherent hierarchical summarization and online batch integration. During inference, CAM adaptively explores the memory structure to activate query-relevant information for contextual response, akin to the human associative process. Compared to existing approaches, our design demonstrates dual advantages in both performance and efficiency across diverse long-text reading comprehension tasks, including question answering, query-based summarization, and claim verification.

[48] KEO: Knowledge Extraction on OMIn via Knowledge Graphs and RAG for Safety-Critical Aviation Maintenance

Kuangshi Ai,Jonathan A. Karr Jr,Meng Jiang,Nitesh V. Chawla,Chaoli Wang

Main category: cs.CL

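TL;DR: Presents KEO, a domain-specific knowledge extraction and reasoning framework that combines LLMs with knowledge graphs in safety-critical settings; experiments on the OMIn dataset show it outperforms traditional text-chunk RAG on global sensemaking tasks.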

Details

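Motivation: In safety-critical settings, traditional text-chunk retrieval-augmented generation (RAG) struggles with system-level reasoning across documents, so a knowledge extraction framework that supports global sensemaking and actionable maintenance tasks is needed. Method: Proposes the KEO framework, which uses LLMs to build a structured knowledge graph (KG) from the OMIn dataset and integrates it into a RAG pipeline to support more coherent, dataset-wide reasoning. Result: Experiments show that KEO markedly outperforms traditional text-chunk RAG on global sensemaking tasks, revealing patterns and system-level insights, while text-chunk RAG remains advantageous for fine-grained procedural tasks that require localized retrieval. The evaluation uses locally deployed LLMs (such as Gemma-3, Phi-4, and Mistral-Nemo), with GPT-4o and Llama-3.3 as judge models. Conclusion: KG-augmented LLMs show great promise for safety-sensitive, domain-specific QA and high-stakes reasoning, and KEO offers a more reliable knowledge reasoning approach for complex settings such as industrial operations and maintenance. Abstract: We present Knowledge Extraction on OMIn (KEO), a domain-specific knowledge extraction and reasoning framework with large language models (LLMs) in safety-critical contexts. Using the Operations and Maintenance Intelligence (OMIn) dataset, we construct a QA benchmark spanning global sensemaking and actionable maintenance tasks. KEO builds a structured Knowledge Graph (KG) and integrates it into a retrieval-augmented generation (RAG) pipeline, enabling more coherent, dataset-wide reasoning than traditional text-chunk RAG. We evaluate locally deployable LLMs (Gemma-3, Phi-4, Mistral-Nemo) and employ stronger models (GPT-4o, Llama-3.3) as judges. Experiments show that KEO markedly improves global sensemaking by revealing patterns and system-level insights, while text-chunk RAG remains effective for fine-grained procedural tasks requiring localized retrieval. These findings underscore the promise of KG-augmented LLMs for secure, domain-specific QA and their potential in high-stakes reasoning.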

[49] H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference

Harshil Vejendla

Main category: cs.CL

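TL;DR: Proposes H1B-KV, a hybrid 1-bit KV cache compression scheme that represents keys with 1-bit binary sketches and values with 4-bit quantization, sharply reducing LLM inference memory while preserving the full context; it achieves a 70x cache reduction and matches full-precision performance on a range of downstream tasks.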

Details

Motivation: Autoregressive decoding in LLMs requires caching an ever-growing set of KV pairs, making long-context inference memory-bound. Existing methods such as quantization, token eviction, or key-only compression are incomplete or lose information, so a comprehensive and efficient KV cache compression scheme is needed. Method: Proposes H1B-KV, which represents each key vector with a 1-bit binary sketch that supports hardware-friendly bitwise attention, and applies 4-bit quantization to value vectors; combined with lightweight finetuning, this yields efficient KV cache compression that retains full-precision quality. Result: On a 7B-parameter LLM with an 8k-token context, KV cache memory stays under 60 MB (a 70x reduction) while matching full-precision performance on GSM8K, MMLU, HumanEval, and other tasks, significantly outperforming existing methods such as KIVI, SparseLLM, and Loki. Conclusion: With hybrid 1-bit key sketches and 4-bit value quantization, H1B-KV delivers efficient, low-memory KV cache compression without notable performance loss, making it a strong solution for deploying LLMs in memory-constrained environments. Abstract: Autoregressive decoding in large language models (LLMs) requires caching a growing list of past key-value (KV) pairs, making long-context inference a memory-bound problem. While recent methods have explored quantizing the cache, evicting tokens, or using binary sketches for keys (e.g., Loki), these approaches often provide an incomplete solution by leaving one component (like values) uncompressed or by discarding context information. This paper introduces the Hybrid One-Bit KV Cache (H1B-KV), a comprehensive compression scheme that radically reduces memory usage without sacrificing context. H1B-KV represents each key vector using a 1-bit binary sketch, enabling hardware-friendly bitwise attention, and further compresses value vectors using 4-bit quantization. This holistic, hybrid approach allows a 7-billion parameter LLM to handle an 8k-token context with under 60 MB of cache memory - a 70x reduction. We demonstrate that after a lightweight finetuning, H1B-KV matches full-precision performance not only on perplexity benchmarks but also on complex downstream tasks like mathematical reasoning (GSM8K), multi-task understanding (MMLU), and code generation (HumanEval). Our results show H1B-KV significantly outperforms leading quantization (KIVI), token eviction (SparseLLM), and key-only sketching (Loki) methods in quality-per-byte, establishing it as a robust solution for deploying LLMs in memory-constrained environments.
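
The two compression ideas can be illustrated on small vectors. The sketch below assumes the simplest possible variants: a sign-bit sketch for keys (compared with XOR and a bit count) and uniform min-max 4-bit quantization for values. The paper's actual sketch construction and quantizer are not given in the abstract, and all function names here are hypothetical.

```python
def one_bit_sketch(vec):
    """Pack the sign bits of a vector into one integer (bit i set if vec[i] >= 0)."""
    bits = 0
    for i, v in enumerate(vec):
        if v >= 0:
            bits |= 1 << i
    return bits


def sketch_similarity(a_bits, b_bits, dim):
    """Agreeing minus disagreeing sign positions, via XOR and a bit count."""
    hamming = bin(a_bits ^ b_bits).count("1")
    return dim - 2 * hamming


def quantize_4bit(values):
    """Uniform min-max quantization to integer codes in 0..15."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # avoid a zero scale for constant vectors
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_4bit(codes, lo, scale):
    return [lo + c * scale for c in codes]


key = [0.5, -1.0, 2.0, -0.1]
query = [0.4, -0.3, 1.0, 0.2]
print(sketch_similarity(one_bit_sketch(key), one_bit_sketch(query), dim=4))

codes, lo, scale = quantize_4bit([-1.0, -0.2, 0.3, 0.9])
print(codes, dequantize_4bit(codes, lo, scale))
```

Storing one bit per key dimension and four bits per value dimension, instead of 16-bit floats for both, is where the memory saving comes from; the similarity computed on sketches is only a coarse proxy for the full dot product.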

[50] On the Role of Difficult Prompts in Self-Play Preference Optimization

Yao Xiao,Jung-jae Kim,Roy Ka-wei Lee,Lidong Bing

Main category: cs.CL

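TL;DR: Studies how prompts of varying difficulty affect self-play preference optimization, finding that difficult prompts yield worse optimization performance and that adding them to training lowers overall performance; selectively removing a portion of difficult prompts improves overall performance.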

Details

Motivation: Prompts play a key role in self-play preference optimization, but the effect of their difficulty has not been fully explored; this paper analyzes how prompt difficulty affects optimization. Method: Uses the mean reward of N sampled responses as a proxy for prompt difficulty, compares self-play preference optimization on prompts of different difficulty, and explores how removing difficult prompts affects performance. Result: Difficult prompts show substantially worse self-play optimization performance, and adding them to training leads to a slight overall degradation; the gap between difficult and easy prompts narrows as model capacity increases; selectively removing a portion of difficult prompts improves overall performance. Conclusion: Prompt difficulty significantly affects self-play preference optimization, and careful prompt selection helps improve language model alignment. Abstract: Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs). It typically involves a language model to generate on-policy responses for prompts and a reward model (RM) to guide the selection of chosen and rejected responses, which can be further trained with direct preference optimization (DPO). However, the role of prompts remains underexplored, despite being a core component in this pipeline. In this work, we investigate how prompts of varying difficulty influence self-play preference optimization. We first use the mean reward of $N$ sampled responses of a prompt as a proxy for its difficulty. We find that difficult prompts exhibit substantially inferior self-play optimization performance in comparison to easy prompts for language models. Moreover, incorporating difficult prompts into training fails to enhance overall performance and, in fact, leads to slight degradation compared to training on easy prompts alone. We also observe that the performance gap between difficult and easy prompts closes as the model capacity increases, suggesting that difficulty interacts with the model capacity. Building on these findings, we explore strategies to mitigate the negative effect of difficult prompts on final performance. We demonstrate that selectively removing an appropriate portion of challenging prompts enhances overall self-play performance, while also reporting failed attempts and lessons learned.
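
The difficulty proxy and the prompt-removal step can be sketched in a few lines. The reward values, the `drop_fraction` parameter, and the function names are illustrative assumptions; the abstract does not state what portion of prompts is removed.

```python
def mean_reward(rewards):
    """Difficulty proxy: mean reward over N sampled responses (lower = harder)."""
    return sum(rewards) / len(rewards)


def drop_hardest(prompt_rewards, drop_fraction):
    """Return prompts kept after removing the hardest `drop_fraction` of them."""
    ranked = sorted(prompt_rewards, key=lambda p: mean_reward(prompt_rewards[p]))
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[n_drop:]  # hardest prompts sit at the front of `ranked`


# Hypothetical reward-model scores for N = 2 sampled responses per prompt.
prompt_rewards = {
    "prompt_a": [0.1, 0.2],
    "prompt_b": [0.9, 0.8],
    "prompt_c": [0.5, 0.5],
    "prompt_d": [0.3, 0.3],
}
print(drop_hardest(prompt_rewards, drop_fraction=0.25))
```

In this toy example `prompt_a` has the lowest mean reward, so it is treated as the hardest prompt and removed before training.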

[51] Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM

Ryan Solgi,Parsa Madinei,Jiayi Tian,Rupak Swaminathan,Jing Liu,Nathan Susanj,Zheng Zhang

Main category: cs.CL

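TL;DR: Proposes Pareto-Guided Singular Value Decomposition (PGSVD) for low-rank compression of LLMs and vision-language models, achieving better accuracy at the same compression level along with inference speedup.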

Details

Motivation: Large language models and vision-language models pose significant memory and compute challenges in deployment, and existing compression methods lack theoretical support and have limited effectiveness. Method: Bounds the change in network loss via layer-wise activation-based compression errors, formulates low-rank compression as a bi-objective optimization problem, and proposes PGSVD, which performs zero-shot compression with Pareto-guided rank selection and an alternating least-squares implementation. Result: Applied to LLMs and VLMs, PGSVD achieves higher accuracy than existing methods at the same compression level and speeds up inference. Conclusion: PGSVD is a theoretically grounded and practically efficient model compression framework that can substantially reduce deployment cost without sacrificing performance. Abstract: Large language models (LLM) and vision-language models (VLM) have achieved state-of-the-art performance, but they impose significant memory and computing challenges in deployment. We present a novel low-rank compression framework to address this challenge. First, we upper bound the change of network loss via layer-wise activation-based compression errors, filling a theoretical gap in the literature. We then formulate low-rank model compression as a bi-objective optimization and prove that a single uniform tolerance yields surrogate Pareto-optimal heterogeneous ranks. Based on our theoretical insights, we propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot pipeline that improves activation-aware compression via Pareto-guided rank selection and alternating least-squares implementation. We apply PGSVD to both LLM and VLM, showing better accuracy at the same compression levels and inference speedup.
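
The statement that one uniform tolerance yields heterogeneous ranks can be illustrated with a simple stand-in criterion: for each layer, pick the smallest rank whose relative truncation error (in Frobenius norm, computed from the singular values) is within the tolerance. This uses plain weight singular values rather than the activation-based errors the paper bounds, so it is an assumed simplification and not PGSVD itself.

```python
def rank_for_tolerance(singular_values, tol):
    """Smallest rank r with sqrt(sum_{i>r} s_i^2) / sqrt(sum_i s_i^2) <= tol.

    Assumes `singular_values` is sorted in descending order.
    """
    total = sum(s * s for s in singular_values)
    tail = total
    for r, s in enumerate(singular_values, start=1):
        tail -= s * s  # squared error left after keeping the top r values
        if tail <= (tol ** 2) * total:
            return r
    return len(singular_values)


# Hypothetical singular-value spectra for two layers.
layers = {
    "fast_decay_layer": [10.0, 1.0, 0.1],
    "flat_spectrum_layer": [5.0, 4.0, 3.0],
}
ranks = {name: rank_for_tolerance(sv, tol=0.3) for name, sv in layers.items()}
print(ranks)
```

A layer whose spectrum decays quickly can be cut to a very low rank, while a layer with a flat spectrum keeps most of its rank, even though both are held to the same tolerance.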

[52] Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations

Chengzhi Liu,Yuzhe Yang,Kaiwen Zhou,Zhen Zhang,Yue Fan,Yannan Xie,Peng Qi,Xin Eric Wang

Main category: cs.CL

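TL;DR: Proposes EvoPresent, a self-improvement agent framework for higher-quality academic paper presentations that combines coherent narratives, aesthetic-aware design, and delivery by virtual characters, with iterative refinement driven by PresAesth, a multi-task reinforcement learning aesthetic model.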

Details

Motivation: Existing automated paper presentation methods fall short in storytelling, aesthetic quality, and self-adjustment, and the lack of an effective evaluation mechanism hinders improvement. Method: Proposes the EvoPresent framework and the PresAesth multi-task reinforcement learning aesthetic model, and builds the EvoPresent Benchmark, comprising 650 top-tier AI conference papers and 2,000 slide pairs, to evaluate generation quality and aesthetic awareness. Result: Experiments show that high-quality feedback is essential for agent self-improvement; automated generation pipelines exhibit a trade-off between visual design and content construction; and multi-task reinforcement learning generalizes better on aesthetic awareness tasks. Conclusion: By unifying narrative, design, and virtual delivery, and relying on reliable aesthetic evaluation and feedback, EvoPresent achieves efficient self-improvement of academic presentations and offers a new paradigm for automated science communication. Abstract: The promotion of academic papers has become an important means of enhancing research visibility. However, existing automated methods struggle with limited storytelling, insufficient aesthetic quality, and constrained self-adjustment, making it difficult to achieve efficient and engaging dissemination. At the heart of those challenges is a simple principle: there is no way to improve it when you cannot evaluate it right. To address this, we introduce EvoPresent, a self-improvement agent framework that unifies coherent narratives, aesthetic-aware designs, and realistic presentation delivery via virtual characters. Central to EvoPresent is PresAesth, a multi-task reinforcement learning (RL) aesthetic model that provides reliable aesthetic scoring, defect adjustment, and comparative feedback, enabling iterative self-improvement even under limited aesthetic training data. To systematically evaluate the methods, we introduce EvoPresent Benchmark, a comprehensive benchmark comprising: Presentation Generation Quality, built on 650 top-tier AI conference papers with multimodal resources (slides, videos and scripts) to assess both content and design; and Aesthetic Awareness, consisting of 2,000 slide pairs with varying aesthetic levels, supporting joint training and evaluation on scoring, defect adjustment, and comparison. Our findings highlight that (i) High-quality feedback is essential for agent self-improvement, while initial capability alone does not guarantee effective self-correction. (ii) Automated generation pipelines exhibit a trade-off between visual design and content construction. 
(iii) Multi-task RL training shows stronger generalization in aesthetic awareness tasks.

[53] Mission Impossible: Feedback-Guided Dynamic Interactive Planning for Improving Reasoning on LLMs

Dong Yan,Gaochen Wu,Bowen Zhou

Main category: cs.CL

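TL;DR: This paper proposes Feedback-Guided Dynamic Interactive Planning (FGDIP), a new framework that uses dynamic and adaptive information-exploration strategies to improve large language models on open-domain multi-hop reasoning tasks.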

Details

Motivation: Existing methods perform poorly on open-domain multi-hop reasoning problems that require extensive information retrieval, because they rely on a fixed sequence of actions. Method: FGDIP identifies key entities in the question as starting points for reasoning, then generates and refines reasoning child nodes using historical error analysis and real-time feedback, combining depth-first search with a novel node generation technique for dynamic adjustment. Result: FGDIP reaches an F1 score of 54.47% on HotpotQA and 70.05% on StrategyQA, exceeding the best baseline by 5.03% and 7.25%, respectively. Conclusion: By dynamically adjusting its reasoning strategy, FGDIP expands the search space while keeping the reasoning process convergent, showing strong potential for enhancing language agents on multi-hop reasoning tasks. Abstract: Recent advancements in language agents have led to significant improvements in multi-hop reasoning tasks. However, because they rely on a fixed sequence of actions, existing approaches often struggle with open-domain problems, which require massive information retrieval. To address this, we propose Feedback-Guided Dynamic Interactive Planning (FGDIP), a novel framework tailored to enhance reasoning in LLMs by utilizing dynamic and adaptive strategies for information exploration in open-domain multi-hop reasoning tasks. Our approach begins by identifying key entities relevant to the problem, which serve as the initial nodes in the reasoning process. From these initial nodes, we then generate reasoning child nodes, with the process being refined through a combination of historical error analysis and real-time feedback, which allows the framework to dynamically adjust and optimize its reasoning strategies. By integrating depth-first search with an innovative node generation technique, our framework adapts based on both prior error paths and concurrently generated nodes at the same hierarchical level. This dynamic strategy effectively expands the search space while ensuring the reasoning process systematically converges toward accurate solutions. Experimental results show that FGDIP achieved up to 54.47% F1 score on the HotpotQA dataset and 70.05% on the StrategyQA dataset, surpassing the best baseline by 5.03% and 7.25% respectively, highlighting its versatility and potential to enhance language agents in multi-hop reasoning tasks.

[54] A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks

Shuzheng Si,Haozhe Zhao,Kangyang Luo,Gang Chen,Fanchao Qi,Minjia Zhang,Baobao Chang,Maosong Sun

Main category: cs.CL

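TL;DR: This paper proposes EAGLET, an efficient planner training method that uses a two-step process to strengthen the global planning ability of executor agents on long-horizon tasks without human effort, achieving state-of-the-art performance on three tasks while cutting training cost by 8x.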

Details

Motivation: LLM-based agents lack global planning in long-horizon tasks, which leads to blind trial-and-error and hallucinated actions, so an efficient planning-enhancement method that needs no human involvement is required. Method: Proposes a plan-and-execute framework and the EAGLET training method: high-quality plans are first synthesized from an advanced LLM with a homologous consensus filtering strategy and used for fine-tuning as a cold start; a rule-based reinforcement learning stage with an executor capability gain reward then further improves the planner. Result: Experiments on three long-horizon agent tasks show that executor agents equipped with the EAGLET planner outperform existing methods and reach state-of-the-art performance; training cost is 8x lower than RL-based baselines, with no manual annotation or extra training data. Conclusion: EAGLET is an efficient, effective planner training method that needs no human effort and substantially improves agent planning on complex long-horizon tasks, with good practicality and scalability. Abstract: Agents based on large language models (LLMs) struggle with brainless trial-and-error and generating hallucinatory actions due to a lack of global planning in long-horizon tasks. In this paper, we introduce a plan-and-execute framework and propose EAGLET, an efficient and effective planner training method to enhance the executor agent's planning abilities without human effort. Specifically, we train a plug-and-play global planner through a two-step process: we first synthesize high-quality plans from an advanced LLM using our proposed homologous consensus filtering strategy, and apply fine-tuning as a cold start. We then further improve the planner with a rule-based reinforcement learning stage using a novel executor capability gain reward, ensuring it can handle task instructions of varying difficulty. Experiments on three long-horizon agent tasks show that executor agents equipped with our planner outperform existing methods, achieving new state-of-the-art performance. Meanwhile, EAGLET reduces training costs by 8x compared to RL-based baselines, and it does not require manual effort or extra training data, offering an efficient and effective solution.

[55] MADIAVE: Multi-Agent Debate for Implicit Attribute Value Extraction

Wei-Chieh Huang,Cornelia Caragea

Main category: cs.CL

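TL;DR: Proposes a multi-agent debate framework for implicit attribute value extraction, in which multiple MLLM agents debate iteratively to improve the accuracy and robustness of implicit attribute inference on multimodal e-commerce data.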

Details

Motivation: Implicit attribute value extraction is essential in e-commerce, but existing methods are limited by the complexity of multidimensional data and gaps in vision-text understanding, so stronger multimodal understanding is needed. Method: Builds MADIAVE, a multi-agent debate framework in which multiple MLLM agents verify and revise each other's reasoning over several debate rounds, iteratively refining the output. Result: Experiments on the ImplicitAVE dataset show that even a few debate rounds significantly improve accuracy, especially for attributes with low initial performance; different debate configurations affect convergence dynamics. Conclusion: Multi-agent debate can overcome the limitations of single-agent approaches and offers a scalable, robust solution for implicit attribute extraction in multimodal e-commerce. Abstract: Implicit Attribute Value Extraction (AVE) is essential for accurately representing products in e-commerce, as it infers latent attributes from multimodal data. Despite advances in multimodal large language models (MLLMs), implicit AVE remains challenging due to the complexity of multidimensional data and gaps in vision-text understanding. In this work, we introduce MADIAVE, a multi-agent debate framework that employs multiple MLLM agents to iteratively refine inferences. Through a series of debate rounds, agents verify and update each other's responses, thereby improving inference performance and robustness. Experiments on the ImplicitAVE dataset demonstrate that even a few rounds of debate significantly boost accuracy, especially for attributes with initially low performance. We systematically evaluate various debate configurations, including identical or different MLLM agents, and analyze how debate rounds affect convergence dynamics. Our findings highlight the potential of multi-agent debate strategies to address the limitations of single-agent approaches and offer a scalable solution for implicit AVE in multimodal e-commerce.

[56] The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP

Sheriff Issaka,Keyi Wang,Yinka Ajibola,Oluwatumininu Samuel-Ipaye,Zhaoyi Zhang,Nicte Aguillon Jimenez,Evans Kofi Agyei,Abraham Lin,Rohan Ramachandran,Sadick Abdul Mumin,Faith Nchifor,Mohammed Shuraim,Lieqi Liu,Erick Rosas Gonzalez,Sylvester Kpei,Jemimah Osei,Carlene Ajeneza,Persis Boateng,Prisca Adwoa Dufie Yeboah,Saadia Gabriel

Main category: cs.CL

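TL;DR: This paper introduces the African Languages Lab (All Lab), which aims to address the severe underrepresentation of African languages in modern natural language processing technologies.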

Details

Motivation: 88% of African languages are ignored or severely under-resourced in computational linguistics, and systematic support is urgently needed to narrow the technological gap. Method: Establishes a quality-controlled data collection pipeline, builds a large-scale multi-modal speech and text dataset covering 40 African languages, validates it experimentally through model fine-tuning, and runs a capacity-building program for early-career researchers. Result: Yields a high-quality dataset with 19 billion tokens of text and 12,628 hours of aligned speech; average gains of +23.69 ChrF++, +0.33 COMET, and +15.34 BLEU across 31 languages; performance in some languages is competitive with Google Translate. Conclusion: Through a three-pronged approach of data, models, and talent development, All Lab effectively advances NLP for African languages and has the potential to be sustained and extended. Abstract: Despite representing nearly one-third of the world's languages, African languages remain critically underserved by modern NLP technologies, with 88% classified as severely underrepresented or completely ignored in computational linguistics. We present the African Languages Lab (All Lab), a comprehensive research initiative that addresses this technological gap through systematic data collection, model development, and capacity building. Our contributions include: (1) a quality-controlled data collection pipeline, yielding the largest validated African multi-modal speech and text dataset spanning 40 languages with 19 billion tokens of monolingual text and 12,628 hours of aligned speech data; (2) extensive experimental validation demonstrating that our dataset, combined with fine-tuning, achieves substantial improvements over baseline models, averaging +23.69 ChrF++, +0.33 COMET, and +15.34 BLEU points across 31 evaluated languages; and (3) a structured research program that has successfully mentored fifteen early-career researchers, establishing sustainable local capacity. Our comparative evaluation against Google Translate reveals competitive performance in several languages while identifying areas that require continued development.

[57] Code-Switching In-Context Learning for Cross-Lingual Transfer of Large Language Models

Haneul Yoo,Jiho Jin,Kyunghyun Cho,Alice Oh

Main category: cs.CL

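TL;DR: Proposes code-switching in-context learning (CSICL), a new method that progressively switches from the target language to English within the prompt, mitigating the translation barrier of large language models in non-English languages and improving cross-lingual reasoning performance.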

Details

Motivation: Large language models rely on English as their latent representation, which degrades performance in non-English languages, and existing cross-lingual in-context learning methods struggle to overcome this translation barrier. Method: Introduces code-switching in-context learning (CSICL), which transitions progressively from the target language to English in the instruction and demonstrations, explicitly guiding the model's latent reasoning. Result: Experiments on 4 LLMs, 6 datasets, and 10 languages show that CSICL improves results by 3.1%p on target languages and 1.9%p on unseen languages, with larger gains in low-resource settings of 14.7% and 5.3%. Conclusion: CSICL is an effective and robust method for overcoming the translation barrier at inference time, moving toward more equitable and effective multilingual systems. Abstract: While large language models (LLMs) exhibit strong multilingual abilities, their reliance on English as latent representations creates a translation barrier, where reasoning implicitly depends on internal translation into English. When this process fails, performance in non-English languages deteriorates sharply, limiting the inclusiveness of LLM-based applications. Existing cross-lingual in-context learning (X-ICL) methods primarily leverage monolingual demonstrations, often failing to mitigate this barrier and instead reinforcing it. In this work, we introduce code-switching in-context learning (CSICL), a simple yet effective prompting strategy that progressively transitions from a target language to English within demonstrations and instruction to facilitate their latent reasoning in English. By explicitly scaffolding the reasoning process through controlled code-switching, CSICL acts as an implicit linguistic bridge that enhances cross-lingual alignment and reduces reliance on the translation barrier. We conduct extensive experiments across 4 LLMs, 6 datasets, and 10 languages, spanning both knowledge-intensive and reasoning-oriented domains. Our results demonstrate that CSICL consistently outperforms X-ICL baselines, achieving gains of 3.1%p and 1.9%p in target and unseen languages, respectively. The improvement is even more pronounced in low-resource settings, with gains of 14.7% in target and 5.3% in unseen languages. These findings establish code-switching as a principled and robust approach for overcoming the translation barrier during inference, moving LLMs toward more equitable and effective multilingual systems.
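
The progressive transition described for CSICL can be pictured with a small prompt-building sketch. Everything here is an assumption made for illustration: the abstract does not say how the switching schedule is defined, so this version simply renders a growing share of each demonstration's trailing sentences in English, and the helper names (`english_fraction`, `code_switch`, `build_prompt`) are hypothetical.

```python
def english_fraction(index: int, total: int) -> float:
    """Share of a demonstration rendered in English (assumed linear schedule)."""
    if total <= 1:
        return 1.0
    return index / (total - 1)


def code_switch(sentences, fraction):
    """Render the trailing `fraction` of sentences in English.

    `sentences` is a list of (target_language, english) sentence pairs.
    """
    n_english = int(fraction * len(sentences) + 0.5)  # round half up
    cut = len(sentences) - n_english
    return " ".join(
        target if i < cut else english
        for i, (target, english) in enumerate(sentences)
    )


def build_prompt(demonstrations):
    """Join demonstrations so that later ones contain more English."""
    total = len(demonstrations)
    return "\n".join(
        code_switch(demo, english_fraction(i, total))
        for i, demo in enumerate(demonstrations)
    )
```

With three two-sentence demonstrations, the first stays entirely in the target language, the second switches its last sentence to English, and the third is fully English.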

[58] DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision

Yongqi Leng,Yikun Lei,Xikai Liu,Meizhi Zhong,Bojian Xiong,Yurong Zhang,Yan Gao,Yi Wu,Yao Hu,Deyi Xiong

Main category: cs.CL

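TL;DR: Proposes DecEx-RAG, which models RAG as a Markov Decision Process comprising decision-making and execution and introduces an efficient pruning strategy, significantly improving LLM capabilities in task decomposition, dynamic retrieval, and answer generation.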

Details

Motivation: Existing outcome-supervised reinforcement learning methods suffer from inefficient exploration, sparse reward signals, and ambiguous global feedback. Method: Models RAG as a Markov Decision Process (MDP) with decision-making and execution stages, uses an efficient pruning strategy to optimize data expansion, and performs process-level policy optimization. Result: Average absolute performance improves by 6.2% across six datasets, and data construction efficiency improves by nearly 6x. Conclusion: DecEx-RAG addresses the limitations of existing RAG methods in exploration efficiency and reward feedback, providing an efficient solution for process-supervised RAG training. Abstract: Agentic Retrieval-Augmented Generation (Agentic RAG) enhances the processing capability for complex tasks through dynamic retrieval and adaptive workflows. Recent advances (e.g., Search-R1) have shown that outcome-supervised reinforcement learning demonstrates strong performance. However, this approach still suffers from inefficient exploration, sparse reward signals, and ambiguous global reward feedback. To address these challenges, we propose DecEx-RAG, which models RAG as a Markov Decision Process (MDP) incorporating decision-making and execution, while introducing an efficient pruning strategy to optimize data expansion. Through comprehensive process-level policy optimization, DecEx-RAG significantly enhances the autonomous task decomposition, dynamic retrieval, and high-quality answer generation capabilities of large language models (LLMs). Experiments show that DecEx-RAG achieves an average absolute performance improvement of 6.2% across six datasets, significantly outperforming existing baselines. Moreover, the pruning strategy improves data construction efficiency by nearly 6x, providing an efficient solution for process-supervised RAG training. The code is available at https://github.com/sdsxdxl/DecEx-RAG.

[59] Adaptive and Multi-Source Entity Matching for Name Standardization of Astronomical Observation Facilities

Liza Fretel,Baptiste Cecconi,Laura Debisschop

Main category: cs.CL

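TL;DR: This paper proposes a method for generating a multi-source mapping of astronomical observation facilities. It computes entity-matching scores using NLP techniques and adjustable criteria, uses a large language model to validate the plausibility of each mapping, and produces standardized synonym sets to support Virtual Observatory vocabularies.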

Details

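Motivation: To integrate information on astronomical observation facilities scattered across multiple semantic resources, align and standardize entities across data sources, and improve the interoperability and FAIRness (findable, accessible, interoperable, reusable) of the data. Method: Entities are extracted from Wikidata and astronomy-oriented resources; matching scores are computed with NLP methods (Bag-of-Words, sequential, and surface approaches) over labels, definitions, descriptions, external identifiers, and domain-specific properties such as observation wavebands, launch dates, and funding agencies; a large language model then automatically reviews and justifies each mapping suggestion. Result: Produces a mapping made of multi-source synonym sets with a single standardized label per entity; it supports a name resolution service and will be integrated into the IVOA Vocabularies and the OntoPortal-Astro platform. Conclusion: The method automates the mapping of astronomical facilities across sources, and combining NLP with an LLM improves the accuracy and trustworthiness of the mappings, helping to build a unified, standardized astronomical knowledge infrastructure. Abstract: This ongoing work focuses on the development of a methodology for generating a multi-source mapping of astronomical observation facilities. To compare two entities, we compute scores with adaptable criteria and Natural Language Processing (NLP) techniques (Bag-of-Words approaches, sequential approaches, and surface approaches) to map entities extracted from eight semantic artifacts, including Wikidata and astronomy-oriented resources. We utilize every property available, such as labels, definitions, descriptions, external identifiers, and more domain-specific properties, such as the observation wavebands, spacecraft launch dates, and funding agencies. Finally, we use a Large Language Model (LLM) to accept or reject a mapping suggestion and provide a justification, ensuring the plausibility and FAIRness of the validated synonym pairs. The resulting mapping is composed of multi-source synonym sets providing only one standardized label per entity. Those mappings will be used to feed our Name Resolver API and will be integrated into the International Virtual Observatory Alliance (IVOA) Vocabularies and the OntoPortal-Astro platform.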
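
As a rough illustration of scoring two facility labels with "adaptable criteria", the sketch below combines a bag-of-words score (token-set Jaccard overlap) with a sequential score (character-level similarity from the standard library's `difflib.SequenceMatcher`). The choice of these two measures, the weights, and all function names are assumptions for illustration; the abstract does not specify the actual scoring functions.

```python
from difflib import SequenceMatcher


def bag_of_words_score(a: str, b: str) -> float:
    """Jaccard overlap of lowercase token sets; ignores word order."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def sequence_score(a: str, b: str) -> float:
    """Order-sensitive character similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(a: str, b: str, w_bow: float = 0.5, w_seq: float = 0.5) -> float:
    """Weighted combination; the weights stand in for adjustable criteria."""
    return w_bow * bag_of_words_score(a, b) + w_seq * sequence_score(a, b)
```

A pair such as "Hubble Space Telescope" and "Space Telescope Hubble" gets a perfect bag-of-words score but a lower sequential score, which is why combining several measures can be useful before handing borderline pairs to a reviewer.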

[60] Diversity Is All You Need for Contrastive Learning: Spectral Bounds on Gradient Magnitudes

Peter Ochieng

Main category: cs.CL

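TL;DR: Proposes spectrum-aware batch selection that accelerates contrastive learning by controlling the spectral properties of InfoNCE gradients, validated on ImageNet and CIFAR.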

Details

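Motivation: Accelerating contrastive learning requires a better understanding of, and control over, the gradient behavior of the InfoNCE loss, in particular its relationship to the batch data distribution. Method: Derives non-asymptotic spectral bands that bound the InfoNCE gradient norm, introduces effective rank as a proxy for anisotropy, designs spectrum-aware batch selection strategies (such as a fast greedy builder), and studies the effect of in-batch whitening. Result: On ImageNet-100, Greedy-64 reduces the time to reach 67.5% top-1 accuracy by 15% versus random selection (24% versus Pool–P3); CIFAR-10 shows similar gains; in-batch whitening reduces 50-step gradient variance by 1.37x, matching the theoretical upper bound. Conclusion: Analyzing and controlling the batch spectrum can markedly improve the training efficiency of contrastive learning; spectrum-aware batch selection and whitening are effective ways to speed it up. Abstract: We derive non-asymptotic spectral bands that bound the squared InfoNCE gradient norm via alignment, temperature, and batch spectrum, recovering the $1/\tau^{2}$ law and closely tracking batch-mean gradients on synthetic data and ImageNet. Using effective rank $R_{\mathrm{eff}}$ as an anisotropy proxy, we design spectrum-aware batch selection, including a fast greedy builder. On ImageNet-100, Greedy-64 cuts time-to-67.5% top-1 by 15% vs. random (24% vs. Pool–P3) at equal accuracy; CIFAR-10 shows similar gains. In-batch whitening promotes isotropy and reduces 50-step gradient variance by $1.37\times$, matching our theoretical upper bound.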
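
To make the role of effective rank as an anisotropy proxy concrete, here is a minimal sketch. It assumes the common entropy-based definition (the exponential of the Shannon entropy of the normalized eigenvalue spectrum); the abstract does not state which definition of $R_{\mathrm{eff}}$ is used, so treat this as one plausible reading rather than the paper's formula.

```python
import math


def effective_rank(eigenvalues):
    """Entropy-based effective rank of a non-negative spectrum.

    Returns a value between 1 (one dominant direction) and the number of
    positive eigenvalues (perfectly isotropic spectrum).
    """
    positive = [v for v in eigenvalues if v > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    probs = [v / total for v in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Under this definition, a flat spectrum such as `[1, 1, 1, 1]` has effective rank 4, while a spectrum dominated by one eigenvalue scores much closer to 1, so a batch selector that prefers higher values would be favoring more isotropic batches.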

[61] InforME: Improving Informativeness of Abstractive Text Summarization With Informative Attention Guided by Named Entity Salience

Jianbin Shen,Christy Jie Liang,Junyu Xuan

Main category: cs.CL

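TL;DR: This paper proposes a novel learning approach based on optimal transport and accumulative joint entropy reduction to improve the informativeness of abstractive text summarization, achieving better ROUGE scores than prior work on CNN/Daily Mail and competitive results on XSum.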

Details

Motivation: Existing abstractive summarization methods still leave room for improvement in informativeness, especially as the Big Data era demands more efficient generation of concise yet informative summaries. Method: Proposes two methods: an optimal transport-based informative attention mechanism that better attends to focal information in reference summaries, and an accumulative joint entropy reduction method on named entities that enhances informative salience. Result: ROUGE scores on CNN/Daily Mail exceed those of prior methods, results on XSum are competitive, and human evaluation shows more informative summaries. Conclusion: The proposed model effectively improves the informativeness and quality of summaries, confirming the importance of attending to key information and controlling named-entity redundancy in summary generation. Abstract: Abstractive text summarization is integral to the Big Data era, which demands advanced methods to turn voluminous and often long text data into concise but coherent and informative summaries for efficient human consumption. Despite significant progress, there is still room for improvement in various aspects. One such aspect is to improve informativeness. Hence, this paper proposes a novel learning approach consisting of two methods: an optimal transport-based informative attention method to improve learning focal information in reference summaries and an accumulative joint entropy reduction method on named entities to enhance informative salience. Experimental results show that our approach achieves better ROUGE scores compared to prior work on CNN/Daily Mail while having competitive results on XSum. Human evaluation of informativeness also demonstrates the better performance of our approach over a strong baseline. Further analysis gives insight into the plausible reasons underlying the evaluation results.

[62] Mixture of Neuron Experts

Runxi Cheng,Yuchen Guan,Yucheng Ding,Qingguo Hu,Yongxian Wei,Chun Yuan,Yelong Shen,Weizhu Chen,Yeyun Gong

Main category: cs.CL

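TL;DR: This paper proposes Mixture of Neuron Experts (MoNE), a new mixture-of-experts model that activates only high-activation neuron experts, significantly improving parameter utilization and inference efficiency while maintaining performance.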

Details

Motivation: Most neuron activations in traditional MoE models are observed to be near zero, which indicates many redundant parameters; a more efficient expert selection mechanism is therefore sought to improve parameter utilization and inference efficiency. Method: Decomposes experts into a neuron-granular MoE and proposes MoNE, which achieves neuron-granular expert selection through a simple top-k selection within each expert, with no additional routing parameters or inter-expert communication. Result: MoNE matches traditional MoE performance while activating only 50% of the MoE-layer parameters, and consistently outperforms traditional MoE at equal numbers of activated parameters. Conclusion: MoNE is a practical and efficient approach to improving parameter utilization and inference efficiency in MoE-like models. Abstract: In this work, we first explore whether the parameters activated by the MoE layer remain highly sparse at inference. We perform a sparsification study on several representative MoE models. For each expert, we rank parameters by the magnitude of their activations from the gate projection and progressively prune the activated subset. Pruning up to 60% of parameters within that subset causes only negligible task-performance degradation; substantial drops occur only after more than 90% are removed. We further decompose experts into neuron-granular MoE and visualize their activation values, finding that most neuron activations are near zero. This observation motivates us to select only high-activation neuron experts during pretraining. Based on this insight, we propose Mixture of Neuron Experts (MoNE). MoNE achieves neuron-granular expert selection by only applying a simple top-k selection within each expert, incurs negligible latency, and requires no additional routing parameters or inter-expert communication. Extensive experiments demonstrate that MoNE matches traditional MoE performance while activating only 50% of the MoE-layer parameters, and it consistently outperforms traditional MoE when compared at equal numbers of activated parameters. These results suggest that MoNE is a practical approach to improving parameter utilization and inference efficiency in MoE-like models.
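
The top-k selection within an expert can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: it assumes a gated feed-forward expert whose hidden value for each neuron is the product of a gate activation and an up-projection activation, ranks neurons by the magnitude of the gate activation as in the sparsification study, and zeroes everything outside the top k. The function names are hypothetical.

```python
def top_k_neuron_mask(gate_activations, k):
    """Boolean mask keeping the k neurons with the largest |gate activation|."""
    order = sorted(
        range(len(gate_activations)),
        key=lambda i: abs(gate_activations[i]),
        reverse=True,
    )
    keep = set(order[:k])
    return [i in keep for i in range(len(gate_activations))]


def sparse_hidden(gate_activations, up_activations, k):
    """Gated hidden vector with all but the top-k neurons zeroed out."""
    mask = top_k_neuron_mask(gate_activations, k)
    return [
        g * u if keep else 0.0
        for g, u, keep in zip(gate_activations, up_activations, mask)
    ]
```

Because the mask is computed from activations the expert already produces, a selection of this kind needs no extra routing parameters, which is consistent with the property the abstract highlights.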

[63] Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech

Rikuto Kotoge,Yuichi Sasaki

Main category: cs.CL

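TL;DR: Proposes TKTO, a token-level text-to-speech alignment method that needs no paired data and significantly improves accuracy and pronunciation correctness in Japanese TTS.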

Details

Motivation: Existing preference-optimization-based TTS systems depend on paired desirable and undesirable samples and operate only at the utterance level, which makes fine-grained pronunciation alignment difficult and limits performance gains. Method: Proposes TKTO, which requires no paired data, optimizes at the token level, and automatically produces fine-grained alignment signals without token-level annotations. Result: On Japanese TTS, accuracy improves by 39%, CER drops by 54%, and targeted tokens automatically receive 12.8 times stronger reward. Conclusion: TKTO enables more efficient, fine-grained preference optimization and substantially improves TTS pronunciation accuracy, particularly for languages with scarce annotated data. Abstract: Aligning text-to-speech (TTS) system outputs with human feedback through preference optimization has been shown to effectively improve the robustness and naturalness of language model-based TTS models. Current approaches primarily require paired desirable and undesirable samples at the utterance level. However, such pairs are often limited in TTS output data, and the utterance-level formulation prevents the fine-grained token-level optimization needed for accurate pronunciation alignment. In this study, we propose TKTO, which eliminates the need for paired data, enabling a more data-efficient training paradigm, and directly targets token-level units, automatically providing fine-grained alignment signals without token-level annotations. TKTO improves the challenging Japanese TTS accuracy by 39% and reduces CER by 54%, automatically assigning 12.8 times stronger reward to targeted tokens.

[64] EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

Liang Chen,Xueting Han,Qizhou Wang,Bo Han,Jing Bai,Hinrich Schutze,Kam-Fai Wong

Main category: cs.CL

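TL;DR: This paper proposes EEPO, an exploration-enhanced policy optimization framework that uses two-stage rollouts and adaptive unlearning to improve the exploration-exploitation balance in reinforcement learning for large language models.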

Details

Motivation: Existing RLVR methods overemphasize exploitation, leading to entropy collapse and reduced exploration; they struggle to escape dominant behavioral modes, which limits performance gains. Method: EEPO uses two-stage rollouts: after the first stage generates part of the trajectories, a lightweight unlearning step temporarily suppresses the sampled responses; the second stage then continues generation, forcing the model to explore new regions of the output space. This breaks the self-reinforcing loop of repeatedly sampling and rewarding dominant modes. Result: On five reasoning benchmarks, EEPO improves markedly over GRPO, with average relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base. Conclusion: Through its sample-then-forget mechanism, EEPO strengthens exploration and mitigates entropy collapse, delivering consistent gains across several LLMs and offering a new angle on the exploration-exploitation trade-off in RLVR. Abstract: Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize exploitation, leading to entropy collapse, diminished exploratory capacity, and ultimately limited performance gains. Although techniques that increase policy stochasticity can promote exploration, they frequently fail to escape dominant behavioral modes. This creates a self-reinforcing loop (repeatedly sampling and rewarding dominant modes) that further erodes exploration. We introduce Exploration-Enhanced Policy Optimization (EEPO), a framework that promotes exploration via two-stage rollouts with adaptive unlearning. In the first stage, the model generates half of the trajectories; it then undergoes a lightweight unlearning step to temporarily suppress these sampled responses, forcing the second stage to explore different regions of the output space. This sample-then-forget mechanism disrupts the self-reinforcing loop and promotes wider exploration during rollouts. Across five reasoning benchmarks, EEPO outperforms GRPO, achieving average relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.
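
The sample-then-forget idea can be mimicked on a toy categorical policy. This is only an analogy under stated assumptions: the real method applies an unlearning step to model parameters, whereas this sketch stands in for that step by subtracting a hypothetical `penalty` from the logits of responses sampled in the first stage. The policy, the penalty, and the function names are all illustrative.

```python
import math
import random


def softmax(logits):
    """Convert a {response: logit} mapping into probabilities."""
    peak = max(logits.values())
    exps = {r: math.exp(v - peak) for r, v in logits.items()}
    norm = sum(exps.values())
    return {r: e / norm for r, e in exps.items()}


def sample(logits, n, rng):
    """Draw n responses from the toy policy."""
    probs = softmax(logits)
    responses = list(probs)
    return rng.choices(responses, weights=[probs[r] for r in responses], k=n)


def two_stage_rollout(logits, n_rollouts, penalty, rng):
    """Stage 1 samples half the rollouts; stage 2 samples from a policy
    in which the stage-1 responses have been temporarily suppressed."""
    first = sample(logits, n_rollouts // 2, rng)
    suppressed = {
        r: (v - penalty if r in first else v) for r, v in logits.items()
    }
    second = sample(suppressed, n_rollouts - len(first), rng)
    return first, second
```

With a dominant response such as `{"a": 3.0, "b": 0.0, "c": 0.0, "d": 0.0}`, plain sampling returns "a" most of the time, while the second stage of this sketch is pushed toward the remaining responses; the original `logits` mapping is left unchanged, mirroring the temporary nature of the suppression.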

[65] Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer

Maxence Lasbordes,Sinoué Gad

Main category: cs.CL

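TL;DR: This paper presents Luth, a family of small language models specialized for French. Through targeted post-training on high-quality French data, the models outperform open-source models of comparable size on multiple French benchmarks while retaining their original English capabilities.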

Details

Motivation: Existing multilingual models perform noticeably worse in French than in English, and research on efficient adaptation methods for French is limited, so small language models optimized specifically for French are needed. Method: Applies targeted post-training on curated, high-quality French data, combined with strategic model merging, to improve performance in both French and English. Result: Luth outperforms all open-source models of comparable size on multiple French benchmarks while retaining its original English capabilities, setting a new state of the art among French small language models. Conclusion: Luth sets a new reference point for French small language models and is a strong baseline for future French-language research. Abstract: The landscape of Large Language Models (LLMs) remains predominantly English-centric, resulting in a significant performance gap for other major languages, such as French, especially in the context of Small Language Models (SLMs). Existing multilingual models demonstrate considerably lower performance in French compared to English, and research on efficient adaptation methods for French remains limited. To address this, we introduce Luth, a family of French-specialized SLMs: through targeted post-training on curated, high-quality French data, our models outperform all open-source counterparts of comparable size on multiple French benchmarks while retaining their original English capabilities. We further show that strategic model merging enhances performance in both languages, establishing Luth as a new state of the art for French SLMs and a robust baseline for future French-language research.

[66] DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization

Xue-Yong Fu,Elena Khasanova,Md Tahmid Rahman Laskar,Harsh Saini,Shashi Bhushan TN

Main category: cs.CL

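TL;DR: This paper explores continual pre-training as a way to adapt large language models (LLMs) to real-world conversation summarization. Experiments on large-scale unlabeled business conversation data show that the approach yields substantial gains on both in-domain and out-of-domain tasks while maintaining good generalization and robustness.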

Details

Motivation: LLMs perform well at general text summarization but fall short on specialized domains or conversational data that differ from their pre-training distribution, and fine-tuning depends on costly, scarce labeled data; a scalable, self-supervised adaptation method is therefore needed. Method: Uses continual pre-training on large-scale unlabeled business conversation data to further pre-train LLMs, and systematically evaluates how different data selection strategies affect downstream conversation summarization. Result: Continual pre-training brings substantial gains on multiple in-domain and out-of-domain summarization benchmarks; the models retain good generalization and robustness to noise, and the study shows the importance of effective data selection strategies. Conclusion: Continual pre-training is an efficient, scalable, self-supervised way to adapt LLMs to domain-specific summarization (especially of noisy real-world conversations) and has practical value in industrial applications. Abstract: Large language models (LLMs) have achieved impressive performance in text summarization, yet their performance often falls short when applied to specialized domains or conversational data that differ from their original pre-training distribution. While fine-tuning can improve summarization quality, it typically relies on costly and scarce high-quality labeled data. In this work, we explore continual pre-training as a scalable, self-supervised approach to adapt LLMs for downstream summarization tasks, particularly in the context of noisy real-world conversation transcripts. We conduct extensive experiments using large-scale, unlabeled business conversation data to investigate whether continual pre-training enhances model capabilities in conversational summarization. Our results demonstrate that continual pre-training yields substantial gains in both in-domain and out-of-domain summarization benchmarks, while maintaining strong generalization and robustness. We also analyze the effects of data selection strategies, providing practical guidelines for applying continual pre-training in summarization-focused industrial applications.

[67] Automated Boilerplate: Prevalence and Quality of Contract Generators in the Context of Swiss Privacy Policies

Luka Nenadic,David Rodriguez

Main category: cs.CL

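TL;DR: This study assesses the impact of the 2023 Swiss privacy law revision on compliance. Using a multilingual benchmark dataset and a GPT-5-based method to analyze privacy policies, it finds that websites using automated contract generators show substantially higher compliance (by up to 15 percentage points), which suggests that automated tools play an important role in improving legal compliance for small and medium-sized businesses.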

Details

Motivation: Digital regulation is increasingly complex, and small businesses with limited resources can hardly afford costly legal services, so the prevalence and output quality of low-cost alternatives such as automated contract generators need to be assessed. Method: Builds and annotates a multilingual benchmark dataset covering key compliance obligations under Swiss and EU privacy law, applies a novel GPT-5-based method for large-scale compliance assessment of privacy policies, and analyzes the effects of the law revision and of generator use. Result: Overall compliance increased after the revision; 18% of local websites explicitly reference an automated generator; privacy policies produced with generators show substantially higher compliance, by up to 15 percentage points. Conclusion: Automated contract generators can effectively raise the compliance level of privacy policies. The findings support the use of LLMs for cross-lingual legal analysis, are consistent with the Brussels Effect of EU regulation, and underline the role of automated tools in improving contractual quality and regulatory compliance. Abstract: It has become increasingly challenging for firms to comply with a plethora of novel digital regulations. This is especially true for smaller businesses that often lack both the resources and know-how to draft complex legal documents. Instead of seeking costly legal advice from attorneys, firms may turn to cheaper alternative legal service providers such as automated contract generators. While these services have a long-standing presence, there is little empirical evidence on their prevalence and output quality. We address this gap in the context of a 2023 Swiss privacy law revision. To enable a systematic evaluation, we create and annotate a multilingual benchmark dataset that captures key compliance obligations under Swiss and EU privacy law. Using this dataset, we validate a novel GPT-5-based method for large-scale compliance assessment of privacy policies, allowing us to measure the impact of the revision. We observe compliance increases indicating an effect of the revision. Generators, explicitly referenced by 18% of local websites, are associated with substantially higher levels of compliance, with increases of up to 15 percentage points compared to privacy policies without generator use. These findings contribute to three debates: the potential of LLMs for cross-lingual legal analysis, the Brussels Effect of EU regulations, and, crucially, the role of automated tools in improving compliance and contractual quality.

[68] Revisiting Long-context Modeling from Context Denoising Perspective

Zecheng Tang,Baibei Ji,Juntao Li,Lijun Wu,Haijia Gui,Min Zhang

Main category: cs.CL

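TL;DR: This paper proposes Context Denoising Training (CDT), which builds on the Integrated Gradient (IG) score to detect and mitigate noise in long-context models, significantly improving attention to critical information and prediction performance.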

Details

Motivation: Long-context models are easily disturbed by irrelevant noise in the context, which harms attention and prediction, so a fine-grained method to identify and mitigate such noise is needed. Method: Uses the Integrated Gradient (IG) score as a metric for context noise and, on that basis, designs Context Denoising Training (CDT), which strengthens the model's attention to critical tokens and their influence on predictions during training. Result: Experiments across four tasks and different long-context settings show that CDT significantly improves performance; an open-source 8B model trained with CDT performs close to GPT-4o (50.92 vs. 51.00). Conclusion: CDT is a simple yet effective training strategy that mitigates context noise and strengthens long-context models' handling of critical information. Abstract: Long-context models (LCMs) have demonstrated great potential in processing long sequences, facilitating many real-world applications. The success of LCMs can be attributed to their ability to locate implicit critical information within the context for further prediction. However, recent research reveals that LCMs are often susceptible to contextual noise, i.e., irrelevant tokens, that can mislead model attention. In this paper, we conduct a fine-grained analysis of the context noise and propose an effective metric, the Integrated Gradient (IG) score, to detect and quantify the noise information within the context. Our findings reveal that even simple mitigation of detected context noise can substantially boost the model's attention on critical tokens and benefit subsequent predictions. Building on this insight, we propose Context Denoising Training (CDT), a straightforward yet effective training strategy that improves attention on critical tokens while reinforcing their influence on model predictions. Extensive experiments across four tasks, under both context window scaling and long-context alignment settings, demonstrate the superiority of CDT. Notably, when trained with CDT, an open-source 8B model can achieve performance (50.92) comparable to GPT-4o (51.00).
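
For readers unfamiliar with integrated gradients, the sketch below shows the standard attribution in its generic form: the gradient is averaged along a straight path from a baseline to the input and scaled by the input difference. How the paper turns this into a per-token noise score is not described in the abstract, so the function below is a textbook approximation with a caller-supplied gradient function (`grad_fn` is a hypothetical name), not the paper's metric.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann approximation of integrated gradients.

    grad_fn(point) must return the gradient of the model output with
    respect to each input coordinate at `point`.
    """
    averaged = [0.0] * len(x)
    for step in range(steps):
        alpha = (step + 0.5) / steps  # midpoint of each sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            averaged[i] += g / steps
    # Scale the path-averaged gradient by the distance from the baseline.
    return [(xi - b) * a for xi, b, a in zip(x, baseline, averaged)]
```

For the function F(x) = sum of x_i squared with a zero baseline, each attribution equals x_i squared, so the attributions add up to F(x) minus F(baseline), which is the completeness property that makes the scores interpretable as shares of the prediction.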

[69] Evaluating the Sensitivity of LLMs to Harmful Contents in Long Input

Faeze Ghorbanpour,Alexander Fraser

Main category: cs.CL

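TL;DR: This study systematically evaluates the sensitivity of large language models to harmful content in long contexts. Detection works best at a moderate proportion of harmful content, and performance declines as the context grows or when harmful content is very sparse or very dense.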

Details

Motivation: Prior work focuses on long-context capabilities for reasoning and retrieval, while behavior in safety-critical scenarios remains unclear; a systematic evaluation is needed of how well models recognize harmful content across types, positions, prevalence levels, and context lengths. Method: Varies the type of harmful content (explicit vs. implicit), its position (beginning, middle, end), its prevalence (0.01-0.50), and the context length (600-6000 tokens), and evaluates detection of toxic, offensive, and hate speech content with LLaMA-3, Qwen-2.5, and Mistral. Result: Models perform best at moderate harmful prevalence (0.25); recall decreases as context grows; harmful sentences at the beginning are detected more easily; explicit content is recognized more easily than implicit content. Different models show similar patterns across harmful content categories. Conclusion: LLMs have clear limitations in recognizing harmful content in long contexts, especially when the content is sparse or implicit or the context is long, and further work is needed to meet the demands of safety-critical applications. Abstract: Large language models (LLMs) increasingly support applications that rely on extended context, from document processing to retrieval-augmented generation. While their long-context capabilities are well studied for reasoning and retrieval, little is known about their behavior in safety-critical scenarios. We evaluate LLMs' sensitivity to harmful content under extended context, varying type (explicit vs. implicit), position (beginning, middle, end), prevalence (0.01-0.50 of the prompt), and context length (600-6000 tokens). Across harmful content categories such as toxic, offensive, and hate speech, with LLaMA-3, Qwen-2.5, and Mistral, we observe similar patterns: performance peaks at moderate harmful prevalence (0.25) but declines when content is very sparse or dominant; recall decreases with increasing context length; harmful sentences at the beginning are generally detected more reliably; and explicit content is more consistently recognized than implicit. These findings provide the first systematic view of how LLMs prioritize and calibrate harmful content in long contexts, highlighting both their emerging strengths and the challenges that remain for safety-critical use.

[70] The fragility of "cultural tendencies" in LLMs

Kun Sun,Rong Wang

Main category: cs.CL

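TL;DR: This paper critiques the conclusion of Lu, Song, and Zhang (2025) that large language models show culturally specific tendencies when prompted in different languages, arguing that the reported tendencies are not stable traits but fragile artifacts of specific models and task design. Replications with a broader set of models and more test items show that prompt language has minimal effect on outputs, challenging the original claim that the models encode deep-seated cultural beliefs.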

Details

Motivation: To question the earlier conclusion that LLMs display culturally specific tendencies depending on prompt language, and to re-evaluate the validity of its methodology, theoretical framing, and conclusions. Method: Conducts targeted replications with a broader set of LLMs and a larger number of test items to test whether prompt language actually induces shifts in cultural tendencies. Result: Prompt language has minimal effect on model outputs, indicating that the so-called "cultural tendencies" are unstable and depend heavily on the specific model and task design. Conclusion: Behavioral differences of LLMs under different prompt languages do not stem from deep-seated cultural beliefs but are products of experimental design and model characteristics, so they cannot simply be attributed to culturally driven response patterns. Abstract: In a recent study, Lu, Song, and Zhang (2025) (LSZ) propose that large language models (LLMs), when prompted in different languages, display culturally specific tendencies. They report that the two models (i.e., GPT and ERNIE) respond in more interdependent and holistic ways when prompted in Chinese, and more independent and analytic ways when prompted in English. LSZ attribute these differences to deep-seated cultural patterns in the models, claiming that prompt language alone can induce substantial cultural shifts. While we acknowledge the empirical patterns they observed, we find their experiments, methods, and interpretations problematic. In this paper, we critically re-evaluate the methodology, theoretical framing, and conclusions of LSZ. We argue that the reported "cultural tendencies" are not stable traits but fragile artifacts of specific models and task design. To test this, we conducted targeted replications using a broader set of LLMs and a larger number of test items. Our results show that prompt language has minimal effect on outputs, challenging LSZ's claim that these models encode grounded cultural beliefs.

[71] Prompt reinforcing for long-term planning of large language models

Hsien-Chin Lin,Benjamin Matthias Ruppik,Carel van Niekerk,Chia-Hao Shen,Michael Heck,Nurul Lubis,Renato Vukovic,Shutong Feng,Milica Gašić

Main category: cs.CL

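TL;DR: Proposes a prompt optimisation framework inspired by reinforcement learning that enables long-term planning by rewriting the task instruction prompt, significantly improving LLM performance on multi-turn tasks.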

Details

Motivation: LLMs perform poorly in multi-turn interactions, often relying on incorrect early assumptions and failing to keep track of user goals. Method: A prompt optimisation framework based on reinforcement learning ideas, which generates turn-by-turn feedback and uses experience replay to rewrite the prompt. Result: Significant performance gains on multi-turn tasks such as text-to-SQL and task-oriented dialogue; the method generalises across different LLM-based agents and can use a variety of LLMs as meta-prompting agents. Conclusion: The method opens a new direction for reinforcement learning-inspired optimisation that requires no parameter updates and merits further research. Abstract: Large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior works in dialogue systems have shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning to take place by only modifying the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our proposed method shows significant improvement in multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research in reinforcement learning-inspired parameter-free optimisation methods.

[72] Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens

Mai AlKhamissi,Yunze Xiao,Badr AlKhamissi,Mona Diab

Main category: cs.CL

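TL;DR: This paper proposes a four-part framework for categorizing how cultural benchmarks frame culture, uses it to analyze 20 existing cultural benchmarks, identifies six recurring methodological issues, and proposes improvements drawing on anthropological methods.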

Details

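Motivation: Current cultural evaluation benchmarks often reduce culture to static facts or homogeneous values, overlooking its dynamic and practice-based nature, which conflicts with anthropological accounts; evaluation methods that reflect cultural complexity more accurately are therefore needed. Method: A four-dimension framework (knowledge, preference, performance, bias) for categorizing how culture is framed, used as the basis for a qualitative analysis of 20 cultural benchmarks that identifies their methodological flaws; improvement strategies are then proposed drawing on anthropological methods. Result: Six widespread methodological issues are identified, such as equating countries with cultures, overlooking within-culture diversity, and relying on oversimplified survey formats; concrete improvements are proposed, including incorporating real-world narratives, involving cultural communities in design, and evaluating models in real contexts. Conclusion: Cultural benchmarks should go beyond static knowledge-recall tasks and adopt more contextualized and participatory methods to assess more accurately how LLMs respond to complex cultural situations. Abstract: Cultural evaluation of large language models has become increasingly important, yet current benchmarks often reduce culture to static facts or homogeneous values. This view conflicts with anthropological accounts that emphasize culture as dynamic, historically situated, and enacted in practice. To analyze this gap, we introduce a four-part framework that categorizes how benchmarks frame culture, such as knowledge, preference, performance, or bias. Using this lens, we qualitatively examine 20 cultural benchmarks and identify six recurring methodological issues, including treating countries as cultures, overlooking within-culture diversity, and relying on oversimplified survey formats. Drawing on established anthropological methods, we propose concrete improvements: incorporating real-world narratives and scenarios, involving cultural communities in design and validation, and evaluating models in context rather than isolation. Our aim is to guide the development of cultural benchmarks that go beyond static recall tasks and more accurately capture the responses of the models to complex cultural situations.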

[73] EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

Hadi Mohammadi,Anastasia Giachanou,Ayoub Bagheri

Main category: cs.CL

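TL;DR: Proposes the EvalMORAAL framework, which uses two scoring methods and a model-as-judge peer review to evaluate moral alignment in 20 large language models, and finds a marked gap between Western and non-Western regions, revealing cultural bias in cross-regional use.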

Details

Motivation: To evaluate the moral alignment of LLMs across cultural contexts more transparently and fairly, addressing the lack of comparability and cultural sensitivity in existing evaluation methods. Method: A chain-of-thought (CoT) framework combining two scoring methods, log-probabilities and direct ratings, with a model-as-judge peer review mechanism, validated on data from the World Values Survey and the PEW Global Attitudes Survey. Result: Top models align closely with survey results overall (Pearson's r ≈ 0.90 on WVS), but there is a clear regional difference: Western regions average r=0.82 while non-Western regions average r=0.61; the peer review flags 348 conflicts, and peer agreement correlates significantly with survey alignment. Conclusion: EvalMORAAL enables comparable, transparent evaluation of moral alignment and shows progress toward culture-aware AI, but it also exposes weaker alignment in non-Western regions, indicating that cross-regional bias still needs to be addressed. Abstract: We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's r approximately 0.90 on WVS). Yet we find a clear regional difference: Western regions average r=0.82 while non-Western regions average r=0.61 (a 0.21 absolute gap), indicating consistent regional bias. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured chain-of-thought protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to survey alignment (WVS r=0.74, PEW r=0.39, both p<.001), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.

[74] Probing the Difficulty Perception Mechanism of Large Language Models

Sunbowen Lee,Qingyu Yin,Chak Tou Leong,Jialiang Zhang,Yicheng Gong,Xiaoyu Shen

Main category: cs.CL

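TL;DR: The study finds that large language models (LLMs) implicitly encode problem difficulty in their internal representations: the difficulty of math problems can be identified with a linear probe and specific attention heads. This reveals a structured difficulty-perception ability that can be used for automatic difficulty annotation, reducing reliance on human labeling.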

Details

Motivation: To investigate whether LLMs can implicitly perceive problem difficulty, in support of adaptive reasoning and efficient resource allocation. Method: A linear probe is fitted on the final-token representations of LLMs, specific attention heads in the final Transformer layer associated with difficulty perception are located, and ablation experiments verify the accuracy of this localization. Result: The difficulty of math problems is successfully predicted with a linear model; specific attention heads show opposite activation patterns for simple and difficult problems; and at the token level there is a significant difference between entropy and difficulty perception. Conclusion: LLMs not only possess an internal ability to perceive difficulty, but that ability is structurally organized, offering new directions for building automatic difficulty annotation systems and for future theoretical research. Abstract: Large language models (LLMs) are increasingly deployed on complex reasoning tasks, yet little is known about their ability to internally evaluate problem difficulty, which is an essential capability for adaptive reasoning and efficient resource allocation. In this work, we investigate whether LLMs implicitly encode problem difficulty in their internal representations. Using a linear probe on the final-token representations of LLMs, we demonstrate that the difficulty level of math problems can be linearly modeled. We further locate the specific attention heads of the final Transformer layer: these attention heads have opposite activation patterns for simple and difficult problems, thus achieving perception of difficulty. Our ablation experiments prove the accuracy of the location. Crucially, our experiments provide practical support for using LLMs as automatic difficulty annotators, potentially substantially reducing reliance on costly human labeling in benchmark construction and curriculum learning. We also uncover that there is a significant difference in entropy and difficulty perception at the token level. Our study reveals that difficulty perception in LLMs is not only present but also structurally organized, offering new theoretical insights and practical directions for future research.

[75] LexiCon: a Benchmark for Planning under Temporal Constraints in Natural Language

Periklis Mantenoglou,Rishi Hazra,Pedro Zuidberg Dos Martires,Luc De Raedt

Main category: cs.CL

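TL;DR: This paper introduces LexiCon, a natural language-based constrained planning benchmark for evaluating large language models on planning tasks with temporal constraints; it is extensible and future-proof.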

Details

Motivation: Deploying LLMs in real-world settings that demand strict adherence to constraints such as safety requires a systematic evaluation of their constrained planning ability, yet existing work has focused mostly on unconstrained domains. Method: Automatically constructed temporal constraints are imposed on existing planning environments and translated into natural language descriptions, forming the new benchmark LexiCon, which can be extended with new environments. Result: Experiments show that the performance of state-of-the-art LLMs, including GPT-5, o3, and R1, drops markedly as the degree of constrainedness of the planning tasks increases. Conclusion: LexiCon provides an effective tool for evaluating LLMs on constrained planning, reveals the shortcomings of current models in handling complex constraints, and points to directions for future improvement. Abstract: Owing to their reasoning capabilities, large language models (LLMs) have been evaluated on planning tasks described in natural language. However, LLMs have largely been tested on planning domains without constraints. In order to deploy them in real-world settings where adherence to constraints, in particular safety constraints, is critical, we need to evaluate their performance on constrained planning tasks. We introduce LexiCon -- a natural language-based (Lexi) constrained (Con) planning benchmark, consisting of a suite of environments, that can be used to evaluate the planning capabilities of LLMs in a principled fashion. The core idea behind LexiCon is to take existing planning environments and impose temporal constraints on the states. These constrained problems are then translated into natural language and given to an LLM to solve. A key feature of LexiCon is its extensibility. That is, the set of supported environments can be extended with new (unconstrained) environment generators, for which temporal constraints are constructed automatically. This renders LexiCon future-proof: the hardness of the generated planning problems can be increased as the planning capabilities of LLMs improve. Our experiments reveal that the performance of state-of-the-art LLMs, including reasoning models like GPT-5, o3, and R1, deteriorates as the degree of constrainedness of the planning tasks increases.

[76] Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments

Timothy Pistotti,Jason Brown,Michael Witbrock

Main category: cs.CL

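TL;DR: This paper examines different metrics for evaluating the syntactic acquisition of large language models (LLMs) and finds that the "wh-effect", based on direct minimal pair comparison, offers greater diagnostic transparency than the Difference-in-Differences (DiD) metric. A systematic analysis of GPT-2 in parasitic gap (PG) environments shows that it succeeds on filler-gap dependencies across all tested conditions, indicating robust syntactic knowledge.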

Details

Motivation: Studies using different metrics (such as the wh-effect and DiD) have reached conflicting conclusions about the syntactic acquisition of LLMs, so a clearer and more reliable evaluation method is needed to judge a model's syntactic competence accurately. Method: A full 8-permutation paradigm of refined parasitic gap stimuli is constructed, and GPT-2 is evaluated systematically with the wh-effect method of Wilcox et al., in comparison with results from the previously used DiD metric. Result: GPT-2 succeeds on filler-gap dependencies in all four tested conditions, showing a good grasp of complex parasitic gap structures, whereas under the DiD metric this outcome had been considered ambiguous or a failure. Conclusion: The choice of evaluation metric strongly affects judgments of LLM syntactic competence; direct minimal pair comparison (the wh-effect) is more diagnostic than DiD and should be preferred for probing syntactic knowledge. Abstract: Recent studies probing the Argument from the Poverty of the Stimulus (APS) have applied Large Language Models (LLMs) to test the learnability of complex syntax through surprisal-based metrics. However, divergent conclusions raise questions concerning the insights these metrics offer. While Wilcox et al. (2024) used direct minimal pair comparisons (the "wh-effect") to demonstrate that models successfully generalise knowledge of filler-gap dependencies, Lan et al. (2024) used a Difference-in-Differences (DiD) metric and found that models largely fail on parasitic gaps (PGs). This paper argues that the direct minimal pair approach offers greater diagnostic transparency. We demonstrate this by generating a full 8-permutation paradigm of refined PG stimuli and evaluating the GPT-2 model used in previous studies with a systematic Wilcox-style wh-effect analysis. Our results show that GPT-2 succeeds across all four tested conditions, indicating robust knowledge of filler-gap licensing principles even in complex PG environments. This finding, which contrasts with the more ambiguous results from DiD-style metrics, suggests that the choice of evaluation metric is critical for assessing an LLM's syntactic competence.
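
A minimal sketch of a surprisal-based minimal pair comparison, for illustration only. It assumes the wh-effect is the surprisal at a critical region in a sentence with a wh-filler minus the surprisal at the same region in its minimal pair without one; the sign convention, the function names, and the probabilities are assumptions made here, not values or code from the paper.

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits of an event the model assigns probability `prob`."""
    return -math.log2(prob)

def wh_effect(p_with_filler: float, p_without_filler: float) -> float:
    """Surprisal difference (+filler minus -filler) at the critical region."""
    return surprisal(p_with_filler) - surprisal(p_without_filler)

# Hypothetical model probabilities for the critical region (not from the paper).
# In a sentence with a gap, a filler should make the gap less surprising.
gap_effect = wh_effect(p_with_filler=0.25, p_without_filler=0.0625)
# In a sentence without a gap, a filler should make the filled position more surprising.
no_gap_effect = wh_effect(p_with_filler=0.0625, p_without_filler=0.25)

print(gap_effect, no_gap_effect)
```

Under this convention a negative value in the gap condition and a positive value in the no-gap condition would both count as evidence that the model tracks the dependency, and each condition can be inspected on its own.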

[77] MASA: Rethinking the Representational Bottleneck in LoRA with Multi-A Shared Adaptation

Qin Dong,Yuntian Tang,Heming Jia,Yunhang Shen,Bohan Jia,Wenxuan Huang,Lianyue Zhang,Jiao Xie,Shaohui Lin

Main category: cs.CL

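TL;DR: Proposes MASA (Multi-A Shared Adaptation), an architecture with a multi-A, single-B structure and experts shared asymmetrically across layers, which improves feature representation while remaining parameter-efficient and outperforms standard LoRA on tasks such as MMLU.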

Details

Motivation: The single down-projection matrix A in LoRA creates a representational bottleneck that makes it hard to capture the diverse signals complex tasks require, so feature adaptation needs to be enriched to improve downstream performance. Method: MASA uses a multi-A, single-B structure: multiple A matrices act as specialized experts that extract diverse features and are shared asymmetrically across layers for parameter efficiency, while a single layer-specific B matrix integrates the features. Result: Experiments on multi-domain generalization, single-domain specialization, and multi-task reasoning validate MASA; on the MMLU benchmark it reaches an average accuracy of 59.62%, 1.08 points above standard LoRA (a relative improvement of 1.84%), with only 0.52% learnable parameters. Conclusion: By enriching feature adaptation, MASA effectively alleviates LoRA's representational bottleneck and clearly improves performance at a comparable parameter count, showing stronger adaptability and application potential. Abstract: Low-Rank Adaptation (LoRA) has emerged as a dominant method in Parameter-Efficient Fine-Tuning (PEFT) for large language models, which augments the transformer layer with one down-projection $A$ and one up-projection $B$. However, LoRA's reliance on a single down-projection matrix ($A$) creates a representational bottleneck, as this solitary feature extractor is inherently insufficient for capturing the diverse signals required by complex tasks. This motivates our architectural shift to focus on enriching the feature adaptation to improve the downstream task adaptation ability. We propose MASA (Multi-$A$ Shared Adaptation), an architecture that implements a multi-$A$, single-$B$ structure where the multi-$A$ expert ensemble is asymmetrically shared across layers to ensure parameter efficiency. In MASA, these specialized experts capture diverse features, which are then integrated by a single, layer-specific $B$-matrix. The effectiveness and versatility of our method are validated through a comprehensive suite of experiments spanning multi-domain generalization, single-domain specialization, and multi-task reasoning. For example, on the MMLU benchmark, MASA achieves an average accuracy of 59.62%, outperforming the standard LoRA by 1.08 points (a relative improvement of 1.84%) with comparable learnable parameters of 0.52%.
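
A rough sketch of what a multi-A, single-B low-rank update could look like, for orientation only. The abstract does not specify how the experts' features are combined or how the asymmetric sharing is arranged, so this sketch assumes the simplest reading: one pool of A matrices shared by every layer, features concatenated, and one B per layer. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, n_experts, n_layers = 16, 16, 4, 3, 2

# Pool of down-projection experts A_k shared by every layer (assumption:
# the paper's asymmetric sharing scheme may differ from full sharing).
shared_A = [rng.normal(size=(rank, d_in)) * 0.01 for _ in range(n_experts)]

# One up-projection B per layer, zero-initialised as in standard LoRA so the
# adapted model starts out identical to the frozen base model.
layer_B = [np.zeros((d_out, n_experts * rank)) for _ in range(n_layers)]

def masa_delta(x: np.ndarray, layer: int) -> np.ndarray:
    """Low-rank update for one layer: B_layer @ concat_k(A_k @ x)."""
    # Concatenating expert features is an assumption of this sketch.
    features = np.concatenate([A @ x for A in shared_A])  # (n_experts * rank,)
    return layer_B[layer] @ features                      # (d_out,)

x = rng.normal(size=d_in)
delta = masa_delta(x, layer=0)
print(delta.shape)
```

Sharing the A pool means its parameters are paid for once rather than once per layer, which is one way the expert count can grow without the trainable parameter budget growing in proportion.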

[78] UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG

Xiangyu Peng,Cab Qin,Zeyuan Chen,Ran Xu,Caiming Xiong,Chien-Sheng Wu

Main category: cs.CL

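TL;DR: This paper introduces UniDoc-Bench, the first large-scale, realistic benchmark for multimodal retrieval-augmented generation (MM-RAG). It is built from 70,000 real-world PDF pages across eight domains and includes 1,600 multimodal QA pairs, supporting unified evaluation of text, image, and joint retrieval.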

Details

Motivation: Existing MM-RAG evaluations are fragmented: they focus on text or images in isolation or use simplified setups that fail to reflect document-centric multimodal use cases, so a unified and reliable benchmark is lacking. Method: The UniDoc-Bench construction pipeline extracts and links information from text, tables, and figures in real PDFs and generates multimodal QA pairs covering factual retrieval, comparison, summarization, and logical reasoning; 20% of the samples are validated by multiple annotators and expert adjudication to ensure quality; and four paradigms can be compared fairly under a unified protocol. Result: Experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal retrieval and current joint-embedding-based multimodal retrieval; the analysis shows how visual context complements textual evidence and uncovers systematic failure modes. Conclusion: Neither text nor images alone are sufficient for complex MM-RAG tasks, and current multimodal embeddings remain inadequate; UniDoc-Bench offers a reliable benchmark and practical guidance for evaluating and improving MM-RAG systems. Abstract: Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models (LLMs) and agents to real-world knowledge bases, yet current evaluations are fragmented, focusing on either text or images in isolation or on simplified multimodal setups that fail to capture document-centric multimodal use cases. In this paper, we introduce UniDoc-Bench, the first large-scale, realistic benchmark for MM-RAG built from 70k real-world PDF pages across eight domains. Our pipeline extracts and links evidence from text, tables, and figures, then generates 1,600 multimodal QA pairs spanning factual retrieval, comparison, summarization, and logical reasoning queries. To ensure reliability, 20% of QA pairs are validated by multiple annotators and expert adjudication. UniDoc-Bench supports apples-to-apples comparison across four paradigms: (1) text-only, (2) image-only, (3) multimodal text-image fusion, and (4) multimodal joint retrieval -- under a unified protocol with standardized candidate pools, prompts, and evaluation metrics. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval, indicating that neither text nor images alone are sufficient and that current multimodal embeddings remain inadequate. Beyond benchmarking, our analysis reveals when and how visual context complements textual evidence, uncovers systematic failure modes, and offers actionable guidance for developing more robust MM-RAG pipelines.

[79] Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance

Timothy Pistotti,Jason Brown,Michael Witbrock

Main category: cs.CL

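TL;DR: This study re-evaluates the performance of large language models such as GPT-2 on syntactic prediction, proposing that the stimuli used in earlier studies contain confounds such as lexical ambiguity and structural complexity. With a refined, higher-quality dataset (generated by the SOTA generative model Gemini 2.5 Pro from linguistically informed templates), GPT-2 performs notably better, indicating that stimulus quality strongly affects assessments of LLM syntactic competence.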

Details

Motivation: Recent studies using large language models (LLMs) to test the Argument from the Poverty of the Stimulus (APS) have produced conflicting results across syntactic phenomena, possibly because of confounds in the stimuli such as lexical ambiguity and structural complexity; a more rigorous method is therefore needed to assess the true syntactic competence of LLMs. Method: 1) Establish a GPT-2 baseline on existing (filtered and unfiltered) stimuli; 2) generate a new, higher-quality stimulus dataset (PG stimuli) with the state-of-the-art generative model Gemini 2.5 Pro Preview, guided by linguistically informed templates, to reduce confounds; 3) compare GPT-2's performance on the old and new data. Result: Preliminary results show that GPT-2 performs notably better on the refined stimuli than on the baselines, particularly in surprisal-based syntactic prediction. Conclusion: Stimulus quality strongly influences assessments of LLM syntactic competence; better stimulus design helps measure LLM syntactic knowledge more accurately, and future work should pay closer attention to controlling and refining experimental materials. Abstract: Recent studies employing Large Language Models (LLMs) to test the Argument from the Poverty of the Stimulus (APS) have yielded contrasting results across syntactic phenomena. This paper investigates the hypothesis that characteristics of the stimuli used in recent studies, including lexical ambiguities and structural complexities, may confound model performance. A methodology is proposed for re-evaluating LLM competence on syntactic prediction, focusing on GPT-2. This involves: 1) establishing a baseline on previously used (both filtered and unfiltered) stimuli, and 2) generating a new, refined dataset using a state-of-the-art (SOTA) generative LLM (Gemini 2.5 Pro Preview) guided by linguistically-informed templates designed to mitigate identified confounds. Our preliminary findings indicate that GPT-2 demonstrates notably improved performance on these refined PG stimuli compared to baselines, suggesting that stimulus quality significantly influences outcomes in surprisal-based evaluations of LLM syntactic competency.

[80] CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs

Chengwei Wu,Jiapu Wang,Mingyang Gao,Xingrui Zhuo,Jipeng Guo,Runlin Lei,Haoran Luo,Tianyu Chen,Haoyi Zhou,Shirui Pan,Zechao Li

Main category: cs.CL

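TL;DR: This paper presents CB-ECLLM, a comprehensive benchmark for evaluating Chinese large language models based on the newly constructed Chinese Data-Text Pair (CDTP) dataset, which contains over 7 million text pairs and 15 million triples spanning four critical domains. It aims to address the lack of structured data and targeted evaluation for Chinese LLMs.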

Details

Motivation: Chinese LLMs suffer from a lack of structured representations in their corpora, and existing benchmarks are predominantly English-centric and cannot adequately assess the linguistic characteristics of Chinese; a high-quality benchmark dedicated to Chinese and supporting knowledge-driven tasks is therefore needed. Method: The large-scale Chinese Data-Text Pair (CDTP) dataset, containing over 7 million aligned text-triple pairs, is constructed, and the comprehensive benchmark CB-ECLLM is built on it, supporting multi-task evaluation on Knowledge Graph Completion, Triple-to-Text generation, and Question Answering; its effectiveness is verified through supervised fine-tuning and ablation experiments. Result: CDTP substantially enriches Chinese corpora with structured information; CB-ECLLM enables fine-grained evaluation of Chinese LLMs on knowledge-driven tasks and supports multi-task fine-tuning; and experiments show the benchmark to be effective and robust. Conclusion: CB-ECLLM gives Chinese LLMs the first comprehensive benchmark based on structured data and advances evaluable and reproducible research on Chinese LLMs in knowledge understanding and generation. Abstract: Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks. However, Chinese LLMs face unique challenges, primarily due to the dominance of unstructured free text and the lack of structured representations in Chinese corpora. While existing benchmarks for LLMs partially assess Chinese LLMs, they are still predominantly English-centric and fail to address the unique linguistic characteristics of Chinese, lacking structured datasets essential for robust evaluation. To address these challenges, we present a Comprehensive Benchmark for Evaluating Chinese Large Language Models (CB-ECLLM) based on the newly constructed Chinese Data-Text Pair (CDTP) dataset. Specifically, CDTP comprises over 7 million aligned text pairs, each consisting of unstructured text coupled with one or more corresponding triples, alongside a total of 15 million triples spanning four critical domains. The core contributions of CDTP are threefold: (i) enriching Chinese corpora with high-quality structured information; (ii) enabling fine-grained evaluation tailored to knowledge-driven tasks; and (iii) supporting multi-task fine-tuning to assess generalization and robustness across scenarios, including Knowledge Graph Completion, Triple-to-Text generation, and Question Answering. Furthermore, we conduct rigorous evaluations through extensive experiments and ablation studies to assess the effectiveness, Supervised Fine-Tuning (SFT), and robustness of the benchmark. To support reproducible research, we offer an open-source codebase and outline potential directions for future investigations based on our insights.

[81] ASPO: Asymmetric Importance Sampling Policy Optimization

Jiakang Wang,Runze Liu,Lei Lin,Wenping Hu,Xiu Li,Fuzheng Zhang,Guorui Zhou,Kun Gai

Main category: cs.CL

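TL;DR: Proposes ASPO, which corrects the importance sampling ratios and introduces a soft dual-clipping mechanism to resolve the imbalance between updates of positive and negative tokens in LLM reinforcement learning, improving training stability and performance.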

Details

Motivation: Existing outcome-supervised reinforcement learning methods apply importance sampling at the token level in a way that mismatches the weights of positive- and negative-advantage tokens, suppressing updates of low-probability tokens while over-amplifying high-probability ones. Method: Asymmetric Importance Sampling Policy Optimization (ASPO) flips the importance sampling ratios of positive-advantage tokens and introduces a soft dual-clipping mechanism that stabilizes extreme updates while maintaining gradient flow. Result: Experiments on coding and mathematical reasoning benchmarks show that ASPO significantly mitigates premature convergence, improves training stability, and outperforms strong baselines such as GRPO. Conclusion: Handling importance sampling ratios correctly is critical for LLM reinforcement learning; ASPO offers a new perspective on token-level weighting and effectively improves training dynamics. Abstract: Recent Large Language Model (LLM) post-training methods rely on token-level clipping mechanisms during Reinforcement Learning (RL). However, we identify a fundamental flaw in this Outcome-Supervised RL (OSRL) paradigm: the Importance Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to unbalanced token weighting for positive and negative tokens. This mismatch suppresses the update of low-probability tokens while over-amplifying already high-probability ones. To address this, we propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy that flips the IS ratios of positive-advantage tokens, aligning their update direction with the learning dynamics of negative ones. AIS further incorporates a soft dual-clipping mechanism to stabilize extreme updates while maintaining gradient flow. Comprehensive experiments on coding and mathematical reasoning benchmarks demonstrate that ASPO significantly mitigates premature convergence, improves training stability, and enhances final performance over strong GRPO-based baselines. Our analysis provides new insights into the role of token-level weighting in OSRL and highlights the critical importance of correcting IS in LLM RL. The code and models of ASPO are available at https://github.com/wizard-III/Archer2.0.

[82] Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability

Taylor Sorensen,Benjamin Newman,Jared Moore,Chan Park,Jillian Fisher,Niloofar Mireshghallah,Liwei Jiang,Yejin Choi

Main category: cs.CL

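TL;DR: This paper studies how language model post-training, while improving instruction following, harms conditional distributional modeling on tasks with many valid answers. It sets out three key desiderata (in-context steerability, valid output space coverage, and distributional alignment) and finds that existing post-training methods weaken these properties. The authors build the Spectrum Suite evaluation resource and propose Spectrum Tuning to improve steerability and distributional modeling.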

Details

Motivation: Existing post-training improves performance but can damage a model's ability to adapt to different distributions on tasks that call for flexible, diverse answers, and conditional distributional modeling has lacked systematic evaluation and optimization. Method: Three evaluation dimensions for conditional distributional modeling are defined; Spectrum Suite, a large-scale evaluation resource spanning more than 90 tasks, is built; and Spectrum Tuning, a new post-training method, is proposed to improve distributional diversity and alignment. Result: Experiments show that current post-training methods help elicit a model's existing knowledge but reduce its in-context steerability, whereas Spectrum Tuning often improves output diversity, distributional alignment, and steerability over pretrained and instruction-tuned models. Conclusion: The distributional modeling ability of language models on tasks with many valid answers deserves attention, and Spectrum Tuning offers an effective route to greater model flexibility and closer matching of real distributions. Abstract: Language model post-training has enhanced instruction-following and performance on many downstream tasks, but also comes with an often-overlooked cost on tasks with many possible valid answers. We characterize three desiderata for conditional distributional modeling: in-context steerability, valid output space coverage, and distributional alignment, and document across three model families how current post-training can reduce these properties. In particular, we disambiguate between two kinds of in-context learning: ICL for eliciting existing underlying knowledge or capabilities, and in-context steerability, where a model must use in-context information to override its priors and steer to a novel data generating distribution. To better evaluate and improve these desiderata, we introduce Spectrum Suite, a large-scale resource compiled from >40 data sources and spanning >90 tasks requiring models to steer to and match diverse distributions ranging from varied human preferences to numerical distributions and more. We find that while current post-training techniques help elicit underlying capabilities and knowledge, they hurt models' ability to flexibly steer in-context. To mitigate these issues, we propose Spectrum Tuning, a post-training method using Spectrum Suite to improve steerability and distributional coverage. We find that Spectrum Tuning often improves over pretrained models and their instruction-tuned counterparts, enhancing steerability, spanning more of the output space, and improving distributional alignment on held-out datasets.

[83] The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models

Muyu He,Muhammad Ali Shafique,Anand Kumar,Tsach Mackey,Nazneen Rajani

Main category: cs.CL

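TL;DR: This paper studies how small-model performance changes with the amount of distillation data when distilling competitive programming skills, and finds a "valley of code reasoning": performance first drops and then rises. It also examines how easy questions and output correctness affect distillation.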

Details

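Motivation: To explore how the performance of small language models changes with the amount of distillation data when distilling code reasoning ability, filling a gap in research on scaling trends in this area. Method: Experiments on two small non-reasoning LLMs analyze distillation outcomes at different data quantities, with further fine-tuning at different stages to probe the learning process; the effects of easier versus harder coding questions and of correct versus incorrect outputs are also compared. Result: A "valley of code reasoning" is found, in which performance first drops and then rises sharply as data increases; in the low and medium-low data regimes small models benefit more from easier questions; and the correctness of outputs in the training data has no significant effect on distillation outcomes. Conclusion: The learning process of small models in code reasoning distillation proceeds in distinct phases, and some intuitive assumptions (such as the need for correct outputs) may not hold, offering a new view of the training dynamics of code reasoning distillation. Abstract: Distilling the thinking traces of a Large Language Model (LLM) with reasoning capabilities into a smaller model has been proven effective. Yet, there is a scarcity of work done on how model performances scale with the quantity of distillation data. In this work, we study the scaling trend of distilling competitive coding skills on two small non-reasoning LLMs. We validate the hypothesis that there is a "valley of code reasoning": downstream performance on competitive coding first drops as data quantity increases, then it steadily increases in a sharper-than-log-linear fashion. Having identified the trend, we further fine-tune the models at two different distillation stages on the same data to ground conclusions on their respective learning phases. We learn that across stages in the low and medium-low data regimes, small models benefit significantly from easier coding questions than from harder ones. We also find that, surprisingly, the correctness of outputs in training data makes no difference to distillation outcomes. Our work represents a step forward in understanding the training dynamics of code reasoning distillation outside intuition.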

[84] Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

Gagan Bhatia,Somayajulu G Sripada,Kevin Allan,Jacobo Azcona

Main category: cs.CL

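TL;DR: This paper proposes Distributional Semantics Tracing (DST), a unified framework for analyzing the internal mechanisms of hallucination in large language models. It identifies the "commitment layer" at which a hallucination becomes inevitable and shows that a conflict between a fast associative pathway and a slow contextual pathway is the root cause of hallucination.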

Details

Motivation: LLMs are prone to hallucination, that is, generating plausible but factually incorrect content, which limits their use in critical settings; this paper seeks the root causes of hallucination inside the model architecture. Method: The Distributional Semantics Tracing (DST) framework combines interpretability techniques to build a causal map of the model's reasoning; it tracks changes in semantic representations to locate the "commitment layer" where hallucination sets in; and it adopts a dual-process theory lens to analyze the competition between the associative and contextual pathways and to quantify the coherence of the contextual pathway. Result: A specific "commitment layer" is found after which the semantics diverge from factuality; predictable failure modes such as "Reasoning Shortcut Hijacks" are identified; and contextual pathway coherence correlates strongly and negatively with hallucination rates (ρ = -0.863), indicating that hallucinations result from internal semantic weakness. Conclusion: Hallucination is a predictable consequence of an imbalance in semantic processing within the Transformer architecture, arising from the conflict between fast associative and slow reasoning pathways; DST provides new tools and a theoretical basis for understanding, detecting, and mitigating hallucination. Abstract: Large Language Models (LLMs) are prone to hallucination, the generation of plausible yet factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions. First, to enable the reliable tracing of internal semantic failures, we propose Distributional Semantics Tracing (DST), a unified framework that integrates established interpretability techniques to produce a causal map of a model's reasoning, treating meaning as a function of context (distributional semantics). Second, we pinpoint the model's layer at which a hallucination becomes inevitable, identifying a specific commitment layer where a model's internal representations irreversibly diverge from factuality. Third, we identify the underlying mechanism for these failures. We observe a conflict between distinct computational pathways, which we interpret using the lens of dual-process theory: a fast, heuristic associative pathway (akin to System 1) and a slow, deliberate contextual pathway (akin to System 2), leading to predictable failure modes such as Reasoning Shortcut Hijacks. Our framework's ability to quantify the coherence of the contextual pathway reveals a strong negative correlation ($\rho = -0.863$) with hallucination rates, implying that these failures are predictable consequences of internal semantic weakness. The result is a mechanistic account of how, when, and why hallucinations occur within the Transformer architecture.

[85] Parallel Tokenizers: Rethinking Vocabulary Design for Cross-Lingual Transfer

Muhammad Dehan Al Kautsar,Fajri Koto

Main category: cs.CL

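TL;DR: This paper proposes a new framework called "parallel tokenizers", which aligns the vocabularies of monolingually trained tokenizers using bilingual dictionaries so that semantically equivalent words share a representation across languages, improving cross-lingual transfer for low-resource languages.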

Details

Motivation: Existing tokenization methods often assign different embeddings to semantically equivalent words in different languages, which hinders cross-lingual transfer and particularly hurts low-resource languages; a tokenization method that establishes a shared semantic space is therefore needed. Method: A tokenizer is trained separately for each language, and the vocabularies of the different tokenizers are then aligned exhaustively using bilingual dictionaries or word-to-word translation, ensuring that semantically equivalent words receive the same vocabulary index. Result: A Transformer encoder pretrained from scratch on 13 low-resource languages outperforms conventional multilingual baselines on sentiment analysis, hate speech detection, emotion classification, and sentence similarity. Conclusion: Rethinking tokenization is essential for advancing multilingual representation learning; parallel tokenizers in particular promote cross-lingual generalization and improve model performance on low-resource languages. Abstract: Tokenization defines the foundation of multilingual language models by determining how words are represented and shared across languages. However, existing methods often fail to support effective cross-lingual transfer because semantically equivalent words are assigned distinct embeddings. For example, "I eat rice" in English and "Ina cin shinkafa" in Hausa are typically mapped to different vocabulary indices, preventing shared representations and limiting cross-lingual generalization. We introduce parallel tokenizers. This new framework trains tokenizers monolingually and then aligns their vocabularies exhaustively using bilingual dictionaries or word-to-word translation, ensuring consistent indices for semantically equivalent words. This alignment enforces a shared semantic space across languages while naturally improving fertility balance. To assess their effectiveness, we pretrain a transformer encoder from scratch on thirteen low-resource languages and evaluate it on sentiment analysis, hate speech detection, emotion classification, and sentence embedding similarity. Across all tasks, models trained with parallel tokenizers outperform conventional multilingual baselines, confirming that rethinking tokenization is essential for advancing multilingual representation learning--especially in low-resource settings.
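
The alignment step can be pictured with a small word-level sketch. It assumes whole-word vocabularies and a toy word-to-word dictionary built around the abstract's example sentence; the function name `align_vocab`, the dictionary entries, and the handling of unmatched words are illustrative assumptions, not the paper's implementation (which operates on trained tokenizers).

```python
def align_vocab(pivot_vocab, other_words, dictionary):
    """Give each word in `other_words` the index of its pivot-language translation.

    Words with no dictionary entry, or whose translation is missing from the
    pivot vocabulary, receive fresh indices after the pivot range.
    """
    aligned = {}
    next_free = max(pivot_vocab.values()) + 1
    for word in other_words:
        translation = dictionary.get(word)
        if translation in pivot_vocab:
            aligned[word] = pivot_vocab[translation]
        else:
            aligned[word] = next_free
            next_free += 1
    return aligned

english_vocab = {"I": 0, "eat": 1, "rice": 2}
# Toy word-to-word dictionary, for illustration only.
hausa_to_english = {"Ina": "I", "cin": "eat", "shinkafa": "rice"}
hausa_vocab = align_vocab(
    english_vocab, ["Ina", "cin", "shinkafa", "ruwa"], hausa_to_english
)

print(hausa_vocab)
```

With indices aligned this way, the two example sentences map to the same index sequence, so they would read the same rows of a shared embedding table.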

[86] CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credits

Kangyu Wang,Zhiyun Jiang,Haibo Feng,Weijia Zhao,Lin Liu,Jianguo Li,Zhenzhong Lan,Weiyao Lin

Main category: cs.CL

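TL;DR: This paper proposes Trace Credit and the CreditDecoding algorithm, which use historical logits to accelerate parallel decoding in diffusion large language models, substantially reducing redundant iterations and delivering notable speedups and performance gains on multiple benchmarks.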

Details

Motivation: Existing diffusion LLMs repeatedly remask tokens during decoding because of low initial confidence, causing redundant iterations that limit decoding speed; a method that exploits historical information to speed up convergence is therefore needed. Method: Trace Credit quantifies each token's convergence potential by accumulating historical logits, and the training-free parallel decoding algorithm CreditDecoding fuses current logits with Trace Credit to accelerate the confidence convergence of correct but underconfident tokens. Result: On eight benchmarks, CreditDecoding achieves a 5.48 times speedup and a 0.48 performance improvement over LLaDA-8B-Instruct, and a 4.11 times speedup and a 0.15 performance improvement over LLaDA-MoE-Instruct, and it scales well to long sequences. Conclusion: CreditDecoding is an efficient, training-free parallel decoding method that can be combined with mainstream inference optimizations and substantially improves the decoding efficiency and robustness of diffusion LLMs. Abstract: Diffusion large language models (dLLMs) generate text through iterative denoising steps, achieving parallel decoding by denoising only high-confidence positions at each step. However, existing approaches often repetitively remask tokens due to initially low confidence scores, leading to redundant iterations and limiting overall acceleration. Through the analysis of dLLM decoding traces, we observe that the model often determines the final prediction for a token several steps before the decoding step. To leverage this historical information and avoid redundant steps, we introduce the concept of Trace Credit, which quantifies each token's convergence potential by accumulating historical logits. Furthermore, we propose CreditDecoding, a training-free parallel decoding algorithm that accelerates the confidence convergence of correct but underconfident tokens by fusing current logits with Trace Credit. This process significantly reduces redundant iterations and enhances decoding robustness. On eight benchmarks, CreditDecoding achieves a 5.48 times speedup and a 0.48 performance improvement over LLaDA-8B-Instruct, and a 4.11 times speedup with a 0.15 performance improvement over LLaDA-MoE-Instruct. Importantly, CreditDecoding scales effectively to long sequences and is orthogonal to mainstream inference optimizations, making it a readily integrable and versatile solution.

[87] RoSE: Round-robin Synthetic Data Evaluation for Selecting LLM Generators without Human Test Sets

Jan Cegin,Branislav Pecher,Ivan Srba,Jakub Simko

Main category: cs.CL

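TL;DR: Proposes RoSE, a proxy metric for selecting the LLM best suited to generating synthetic data when no human-annotated test set is available; across multiple languages and tasks it outperforms traditional intrinsic metrics and correlates positively with downstream performance.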

Details

Motivation: Because low-resource languages lack human-annotated data, it is hard to select the best LLM generator through conventional extrinsic evaluation, and existing intrinsic metrics correlate poorly with downstream performance; an effective proxy metric that needs no human test set is therefore required. Method: Round robin Synthetic data Evaluation (RoSE) trains a small model on data generated by a candidate LLM and evaluates it on synthetic data generated by the other candidate LLMs; the final score is the small model's mean performance across these candidates. Result: Across six LLMs, eleven languages, and three tasks, RoSE identifies the optimal generator more often than other intrinsic heuristics, comes within 0.76 percentage points of the optimal generator baseline in downstream performance, and is the only metric that correlates positively with performance on human test sets. Conclusion: RoSE is an effective and reliable proxy metric that can select an LLM suited to generating training data when human-annotated data is lacking, making it especially suitable for low-resource languages. Abstract: LLMs are powerful generators of synthetic data, which are used for training smaller, specific models. This is especially valuable for low-resource languages, where human-labelled data is scarce but LLMs can still produce high-quality text. However, LLMs differ in how useful their outputs are for training. Selecting the best LLM as a generator is challenging because extrinsic evaluation requires costly human annotations (which are often unavailable for low-resource languages), while intrinsic metrics correlate poorly with downstream performance. We introduce Round robin Synthetic data Evaluation (RoSE), a proxy metric for selecting the best LLM generator without human test sets. RoSE trains a small model on the outputs of a candidate generator (LLM) and then evaluates it on generated synthetic examples from all other candidate LLMs. The final RoSE score is the mean performance of this small model. Across six LLMs, eleven languages, and three tasks (sentiment, topic, intent), RoSE identifies the optimal generator more often than any other intrinsic heuristics. RoSE outperforms intrinsic heuristics and comes within 0.76 percentage points of the optimal generator baseline. This result is measured in terms of downstream performance, obtained by training a small model on the chosen generator's outputs (optimal vs. proxy metric selected) and evaluating it on human-labelled test data. Additionally, RoSE is the only metric to achieve a positive correlation with performance on human test data.
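
The round-robin procedure described above can be sketched as a short loop. This is an illustrative reading of the abstract, not the authors' code: `rose_scores`, `train`, and `evaluate` are hypothetical names, and the "small model" here is a toy majority-label classifier standing in for a real fine-tuned model.

```python
from statistics import mean

def rose_scores(synthetic, train, evaluate):
    """Round-robin score per generator.

    synthetic: dict mapping generator name -> its synthetic dataset
    train:     callable(dataset) -> model
    evaluate:  callable(model, dataset) -> float (higher is better)
    """
    scores = {}
    for name, data in synthetic.items():
        model = train(data)
        # Evaluate only on data produced by the *other* candidate generators.
        others = [d for other, d in synthetic.items() if other != name]
        scores[name] = mean(evaluate(model, d) for d in others)
    return scores

# Toy stand-ins: the "model" is just the majority label of its training data.
def train(data):
    labels = [y for _, y in data]
    return max(set(labels), key=labels.count)

def evaluate(model, data):
    return sum(y == model for _, y in data) / len(data)

synthetic = {
    "gen_a": [("t1", 1), ("t2", 1), ("t3", 0)],
    "gen_b": [("t4", 1), ("t5", 1), ("t6", 1)],
    "gen_c": [("t7", 0), ("t8", 0), ("t9", 1)],
}
scores = rose_scores(synthetic, train, evaluate)
best = max(scores, key=scores.get)
print(scores, best)
```

Because each candidate is scored only on data from its peers, no human-labelled test set enters the loop; the generator with the highest mean score is the one selected.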

[88] VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization

Dingyu Yao,Chenxu Yang,Zhengyang Tong,Zheng Lin,Wei Liu,Jian Luan,Weiping Wang

Main category: cs.CL

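TL;DR: VecInfer proposes a new vector quantization method that suppresses key-cache outliers with smooth and Hadamard transformations, enabling efficient KV cache compression and low-bit inference; with 2-bit quantization its performance is close to full precision and inference is substantially faster.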

Details

Motivation: At ultra-low bit-widths, existing vector quantization methods suffer from poor codebook utilization caused by key-cache outliers, which degrades performance severely and makes effective KV cache compression difficult. Method: Applies smooth and Hadamard transformations to suppress key-cache outliers and improve codebook coverage; designs an optimized CUDA kernel that fuses computation with dequantization to reduce memory access overhead. Result: On Llama-3.1-8B, 2-bit quantization performs close to full precision, with up to 2.7× speedup in large-batch self-attention computation and an 8.3× reduction in single-batch end-to-end latency (sequence length 196k). Conclusion: VecInfer addresses the outlier problem in low-bit KV cache quantization, substantially improving inference efficiency and memory compression while preserving model performance. Abstract: The Key-Value (KV) cache introduces substantial memory overhead during large language model (LLM) inference. Although existing vector quantization (VQ) methods reduce KV cache usage and provide flexible representational capacity across bit-widths, they suffer severe performance degradation at ultra-low bit-widths due to key cache outliers that hinder effective codebook utilization. To address this challenge, we propose VecInfer, a novel VQ method for aggressive KV cache compression while enabling efficient inference. By applying smooth and Hadamard transformations, VecInfer suppresses outliers in the key cache, enabling the codebook to comprehensively cover the original data distribution and thereby reducing quantization difficulty. To facilitate efficient deployment, we design an optimized CUDA kernel that fuses computation with dequantization to minimize memory access overhead. Extensive evaluations demonstrate that VecInfer consistently outperforms existing quantization baselines across both long-context understanding and mathematical reasoning tasks. With only 2-bit quantization, VecInfer achieves performance comparable to full precision, while delivering up to $\mathbf{2.7\times}$ speedup in large-batch self-attention computation and $\mathbf{8.3\times}$ reduction in single-batch end-to-end latency on Llama-3.1-8B with a 196k sequence length.
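To see why a Hadamard transformation helps with outliers, the sketch below applies an orthonormal fast Walsh-Hadamard transform to a toy key vector with one large channel. It illustrates the general idea only: the vector values are made up, the smoothing step and the codebook are omitted, and nothing here reflects VecInfer's actual kernel.

```python
import math

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform norm-preserving
    return [x * scale for x in out]

# Toy key vector (hypothetical values) with a single outlier channel.
key = [0.1, -0.2, 0.05, 8.0, 0.0, 0.15, -0.1, 0.2]
rotated = fwht(key)       # the outlier's energy is spread over all channels
restored = fwht(rotated)  # the orthonormal transform is its own inverse
```

Because the transform is orthonormal, the vector's norm is unchanged while its largest entry shrinks, so a shared codebook has a narrower range to cover; applying the same transform again recovers the original vector.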

[89] Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context

Yoav Gur-Arieh,Mor Geva,Atticus Geiger

Main category: cs.CL

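TL;DR: The study finds that language models bind and retrieve entities in context through three mechanisms (positional, lexical, and reflexive) and proposes a causal model combining all three that predicts next-token distributions with 95% agreement and generalizes well to longer, more natural text inputs.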

Details

Motivation: To understand how language models bind and retrieve entities effectively in complex contexts, particularly given that the known positional mechanism performs poorly as the number of entities grows. Method: Through extensive experiments on nine models and ten binding tasks, analyzes model behavior under the positional, lexical, and reflexive mechanisms, and builds a causal model combining all three to estimate next-token distributions. Result: Uncovers a consistent pattern in how language models mix the three mechanisms for entity binding; the proposed causal model reaches 95% agreement in the standard setting and generalizes well to longer, open-ended text. Conclusion: Language models do not rely on the positional mechanism alone for entity binding and retrieval; they supplement it with lexical and reflexive mechanisms. The study gives a more complete picture of in-context reasoning mechanisms in language models. Abstract: A key component of in-context reasoning is the ability of language models (LMs) to bind entities for later retrieval. For example, an LM might represent "Ann loves pie" by binding "Ann" to "pie", allowing it to later retrieve "Ann" when asked "Who loves pie?" Prior research on short lists of bound entities found strong evidence that LMs implement such retrieval via a positional mechanism, where "Ann" is retrieved based on its position in context. In this work, we find that this mechanism generalizes poorly to more complex settings; as the number of bound entities in context increases, the positional mechanism becomes noisy and unreliable in middle positions. To compensate for this, we find that LMs supplement the positional mechanism with a lexical mechanism (retrieving "Ann" using its bound counterpart "pie") and a reflexive mechanism (retrieving "Ann" through a direct pointer). Through extensive experiments on nine models and ten binding tasks, we uncover a consistent pattern in how LMs mix these mechanisms to drive model behavior. We leverage these insights to develop a causal model combining all three mechanisms that estimates next token distributions with 95% agreement. Finally, we show that our model generalizes to substantially longer inputs of open-ended text interleaved with entity groups, further demonstrating the robustness of our findings in more natural settings. Overall, our study establishes a more complete picture of how LMs bind and retrieve entities in-context.

[90] RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback

Chunyu Miao,Henry Peng Zou,Yangning Li,Yankai Chen,Yibo Wang,Fangxin Wang,Yifan Li,Wooseong Yang,Bowei He,Xinni Zhang,Dianzhi Yu,Hanchen Yang,Hoang H Nguyen,Yue Zhou,Jie Yang,Jizhou Guo,Wenzhe Fan,Chin-Yuan Yeh,Panpan Meng,Liancheng Fang,Jinhu Qi,Wei-Chieh Huang,Zhengyao Gu,Yuwei Han,Langzhou He,Yuyao Yang,Xue Liu,Irwin King,Philip S. Yu

Main category: cs.CL

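TL;DR: RECODE-H is a new benchmark that evaluates how well LLM agents generate research code over multi-turn interactions, using a feedback mechanism to improve code generation.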

Details

Motivation: Existing LLMs have limited ability to generate correct, executable research code, and most work uses one-shot settings that ignore the iterative, feedback-driven workflow of real research development. Method: Proposes the RECODE-H benchmark of 102 tasks drawn from papers and code repositories, with structured instructions, unit tests, and a five-level feedback hierarchy; also proposes ReCodeAgent, a framework that integrates feedback into iterative code generation. Result: Experiments show that leading LLMs, including GPT-5 and Claude-Sonnet-4, improve substantially with richer feedback, while complex research code generation remains challenging. Conclusion: RECODE-H lays a foundation for developing adaptive, feedback-driven LLM agents for scientific research implementation. Abstract: Large language models (LLMs) show promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing works largely adopt one-shot settings, ignoring the iterative and feedback-driven nature of realistic workflows of scientific research development. To address this gap, we present RECODE-H, a benchmark of 102 tasks from research papers and repositories that evaluates LLM agents through multi-turn interactions with LLM-simulated human feedback. It includes structured instructions, unit tests, and a five-level feedback hierarchy to reflect realistic researcher-agent collaboration. We further present ReCodeAgent, a framework that integrates feedback into iterative code generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4, DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer feedback, while also highlighting ongoing challenges in the generation of complex research code. RECODE-H establishes a foundation for developing adaptive, feedback-driven LLM agents in scientific research implementation.

[91] BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects

Jakir Hasan,Shubhashis Roy Dipta

Main category: cs.CL

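TL;DR: This paper presents BanglaTalk, the first real-time speech assistance system supporting Bengali regional dialects. It uses a client-server architecture and the RTP protocol for low-latency communication, and builds the dialect-aware ASR system BRDialect by fine-tuning IndicWav2Vec; BRDialect clearly outperforms baseline models on ten dialects, and the system averages an end-to-end delay of only 4.9 seconds at a low bandwidth of 24 kbps, making it cost-effective and interactive.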

Details

Motivation: Bengali is a low-resource language with large dialectal variation; existing speech assistants mostly cover only the standard language and are not suited to real-time use, and the lack of support for regional dialects limits the accessibility and inclusiveness of the technology. Method: Proposes BanglaTalk, which uses a client-server architecture and RTP for low latency; builds the dialect-aware ASR system BRDialect by fine-tuning IndicWav2Vec on data from ten Bengali regional dialects; evaluates on the RegSpeech12 dataset. Result: BRDialect outperforms baseline ASR models by 12.41-33.98% on RegSpeech12; the system runs at a low bandwidth of 24 kbps with an average end-to-end delay of 4.9 seconds, showing good real-time behavior and network adaptability. Conclusion: BanglaTalk is the first real-time speech assistance system for Bengali regional dialects; it addresses the challenges of dialectal diversity and real-time operation, improves the accessibility and inclusiveness of speech technology for a low-resource language, and has practical value. Abstract: Real-time speech assistants are becoming increasingly popular for ensuring improved accessibility to information. Bengali, being a low-resource language with a high regional dialectal diversity, has seen limited progress in developing such systems. Existing systems are not optimized for real-time use and focus only on standard Bengali. In this work, we present BanglaTalk, the first real-time speech assistance system for Bengali regional dialects. BanglaTalk follows the client-server architecture and uses the Real-time Transport Protocol (RTP) to ensure low-latency communication. To address dialectal variation, we introduce a dialect-aware ASR system, BRDialect, developed by fine-tuning the IndicWav2Vec model on ten Bengali regional dialects. It outperforms the baseline ASR models by 12.41-33.98% on the RegSpeech12 dataset. Furthermore, BanglaTalk can operate at a low bandwidth of 24 kbps while maintaining an average end-to-end delay of 4.9 seconds. Low bandwidth usage and minimal end-to-end delay make the system both cost-effective and interactive for real-time use cases, enabling inclusive and accessible speech technology for the diverse community of Bengali speakers.

[92] Latent Speech-Text Transformer

Yen-Ju Lu,Yashesh Gaur,Wei Zhou,Benjamin Muller,Jesus Villalba,Najim Dehak,Luke Zettlemoyer,Gargi Ghosh,Mike Lewis,Srinivasan Iyer,Duc Le

Main category: cs.CL

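TL;DR: Proposes the Latent Speech-Text Transformer (LST), which dynamically aggregates speech tokens into latent speech patches to improve pre-training efficiency and representational alignment in speech-text models, substantially improving compute and data efficiency and scaling behavior.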

Details

Motivation: In existing auto-regressive speech-text models, speech token sequences are much longer than text token sequences, causing a compute imbalance between modalities that hurts alignment and slows scaling. Method: Introduces LST, which dynamically aggregates speech tokens into latent speech patches that act as higher-level units, either aligning with text or encapsulating common speech sequences such as silences, improving compute and data efficiency. Result: On speech-to-speech and text-to-text tasks, LST outperforms baselines in both compute- and data-controlled settings; on HellaSwag, speech accuracy improves by 6.5% (compute-controlled) and 5.3% (data-controlled), and text performance improves as well. Conclusion: Through latent speech patches, LST mitigates the compute imbalance in speech-text models, strengthens representational alignment, and achieves faster scaling and higher training efficiency. Abstract: Auto-regressive speech-text models are typically pre-trained on a large number of interleaved sequences of text tokens and raw speech encoded as speech tokens using vector quantization. These models have demonstrated state-of-the-art performance in speech-to-speech understanding and generation benchmarks, together with promising scaling laws, primarily enabled by the representational alignment between text and speech. Nevertheless, they suffer from shortcomings, partly owing to the disproportionately longer sequences of speech tokens in contrast to textual tokens. This results in a large compute imbalance between modalities during pre-training as well as during inference, and a potential hindrance to effectively aligning speech and text, ultimately translating to several orders of magnitude slower scaling laws. We introduce the Latent Speech-Text Transformer (LST), which makes pre-training speech-text models more data-efficient by dynamically and inexpensively aggregating speech tokens into latent speech patches. These patches serve as higher-level units that can either align with corresponding textual units to aid capability transfer or even encapsulate common speech sequences like silences to be more compute-efficient. We show that LST outperforms vanilla approaches on speech-to-speech as well as text-to-text benchmarks in both data- and compute-controlled settings, the former indicating more effective representational alignment and the latter indicating steeper scaling laws for speech-text models. On HellaSwag story completion, LST achieves 6.5% absolute gain in speech accuracy under compute-controlled training and 5.3% under data-controlled training, while also improving text performance. We will release our models, code, and the evaluation data to facilitate further research.

[93] Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction

Xinyu Guo,Zhengliang Shi,Minglai Yang,Mahdi Rahimi,Mihai Surdeanu

Main category: cs.CL

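TL;DR: This paper proposes CogRE, a relation extraction framework that combines a cognitive-science-inspired reasoning mechanism with reinforcement learning optimization, substantially improving accuracy and explainability.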

Details

Motivation: Traditional relation extraction lacks supervision for language-based explanations, which leads to poor attention focus and limited few-shot learning ability; the paper aims to improve explanation quality and performance through structured reasoning and a reward mechanism. Method: CogRE uses a stepwise text-processing reasoning mechanism grounded in cognitive science and optimizes it with reinforcement learning and a novel reward function; an LLM automatically builds a high-quality dictionary from which key relation words are drawn as the basis for explanations. Result: On One-shot NYT29, CogRE with Qwen2.5-15B-Instruct reaches 24.65% F1, and RL optimization adds a further 23.46% absolute; human evaluation shows the generated relation keywords align closely with gold labels, with a 54% relative gain in explanation quality. Conclusion: Through structured reasoning and reinforcement learning, CogRE improves the accuracy and explainability of relation extraction, clearly outperforming existing methods in few-shot settings. Abstract: This paper introduces a framework for relation extraction (RE) that enhances both accuracy and explainability. The framework has two key components: (i) a reasoning mechanism that formulates relation extraction as a series of text-processing steps inspired by cognitive science, and (ii) an optimization process driven by reinforcement learning (RL) with a novel reward function designed to improve both task accuracy and explanation quality. We call our approach CogRE. Our framework addresses the lack of supervision for language-based explanations in traditional RE by promoting outputs that include important relation keywords. These keywords are drawn from a high-quality dictionary that is automatically constructed using an LLM. We evaluate our approach for the task of one-shot RE using two LLMs and two RE datasets. Our experiments show that CogRE improves explanation quality by addressing two common failure patterns in one-shot RE: poor attention focus and limited one-shot learning capability. For example, our cognitive-structured reasoning with Qwen2.5-15B-Instruct on One-shot NYT29 achieves 24.65% F1, surpassing prior reasoning-based designs. Optimizing this approach with RL using our reward further improves performance by +23.46% (absolute). Finally, human evaluation shows that our best model generates relational keywords closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).

cs.CV [Back]

[94] Attention-Enhanced Prototypical Learning for Few-Shot Infrastructure Defect Segmentation

Christina Thrainer,Md Meftahul Ferdaus,Mahdi Abdelguerfi,Christian Guetl,Steven Sloan,Kendall N. Niles,Ken Pathak

Main category: cs.CV

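TL;DR: This paper proposes an Enhanced Feature Pyramid Network (E-FPN) framework for few-shot semantic segmentation of culvert and sewer defects; combining prototypical learning with attention mechanisms, it identifies defects efficiently and accurately when labeled data is scarce.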

Details

Motivation: Existing deep learning methods need large amounts of labeled data and struggle to learn new defect categories from few samples, which limits their use in infrastructure inspection. Method: Proposes the E-FPN framework with three core components: (1) an adaptive encoder using InceptionSepConv blocks and depth-wise separable convolutions; (2) prototypical learning with masked average pooling to generate effective prototypes; (3) a feature representation combining global self-attention, local self-attention, and cross-attention. Result: On real infrastructure inspection datasets, the 8-way 5-shot training configuration reaches 82.55% F1-score and 72.26% mIoU; self-attention brings the largest improvement, a gain of 2.57% F1-score and 2.9% mIoU over baselines. Conclusion: The framework meets the need to recognize newly emerging defect types in infrastructure inspection with only a few new samples, helping make maintenance of critical infrastructure more efficient and economical. Abstract: Few-shot semantic segmentation is vital for deep learning-based infrastructure inspection applications, where labeled training examples are scarce and expensive. Although existing deep learning frameworks perform well, the need for extensive labeled datasets and the inability to learn new defect categories with little data are problematic. We present our Enhanced Feature Pyramid Network (E-FPN) framework for few-shot semantic segmentation of culvert and sewer defect categories using a prototypical learning framework. Our approach has three main contributions: (1) adaptive E-FPN encoder using InceptionSepConv blocks and depth-wise separable convolutions for efficient multi-scale feature extraction; (2) prototypical learning with masked average pooling for powerful prototype generation from small support examples; and (3) attention-based feature representation through global self-attention, local self-attention and cross-attention. Comprehensive experimentation on challenging infrastructure inspection datasets illustrates that the method achieves excellent few-shot performance, with the best configuration being 8-way 5-shot training configuration at 82.55% F1-score and 72.26% mIoU in 2-way classification testing. The self-attention method had the most significant performance improvements, providing 2.57% F1-score and 2.9% mIoU gain over baselines. Our framework addresses the critical need to rapidly respond to new defect types in infrastructure inspection systems with limited new training data, leading to more efficient and economical maintenance plans for critical infrastructure systems.
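Masked average pooling, the prototype step named in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions: the two-channel per-pixel features and the masks are made-up toy values standing in for encoder outputs, and classifying a query pixel by cosine similarity to each prototype is the common prototypical-learning choice rather than something the abstract specifies.

```python
import math

def masked_average_pooling(features, mask, eps=1e-8):
    """Average the feature vectors of the pixels where mask == 1."""
    dim = len(features[0])
    total = [0.0] * dim
    count = 0.0
    for feat, m in zip(features, mask):
        for c in range(dim):
            total[c] += m * feat[c]
        count += m
    return [t / (count + eps) for t in total]

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

# Toy support image: four pixels, two feature channels (hypothetical values).
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
defect_mask = [1, 1, 0, 0]
background_mask = [0, 0, 1, 1]

defect_proto = masked_average_pooling(support, defect_mask)
background_proto = masked_average_pooling(support, background_mask)

def label_pixel(feat):
    """Assign a query pixel to the class whose prototype is most similar."""
    if cosine(feat, defect_proto) > cosine(feat, background_proto):
        return "defect"
    return "background"
```

Each class needs only one averaged vector per support image, which is why the approach suits few-shot settings: adding a new defect category means computing a new prototype rather than retraining a classification head.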

[95] SkinMap: Weighted Full-Body Skin Segmentation for Robust Remote Photoplethysmography

Zahra Maleki,Amirhossein Akbari,Amirhossein Binesh,Babak Khalaj

Main category: cs.CV

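TL;DR: Proposes a new skin segmentation technique that improves remote photoplethysmography (rPPG) signal quality under challenging conditions, with good robustness to motion and broad applicability across skin tones.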

Details

Motivation: Traditional rPPG methods are sensitive to lighting and movement, and interference from non-skin regions often degrades signal quality, so a more robust way of selecting skin regions is needed. Method: Proposes a new skin segmentation technique that prioritizes skin regions across the whole body while excluding interfering areas such as the mouth, eyes, and hair, and evaluates it on public datasets and the newly built SYNC-rPPG dataset. Result: The method captures heartbeats accurately under challenging conditions such as talking and head rotation, keeps the mean absolute error (MAE) low, and achieves high detection accuracy across a range of skin tones. Conclusion: The proposed skin segmentation technique improves the stability and accuracy of rPPG signals and is more robust in real-world scenarios. Abstract: Remote photoplethysmography (rPPG) is an innovative method for monitoring heart rate and vital signs by using a simple camera to record a person, as long as any part of their skin is visible. This low-cost, contactless approach helps in remote patient monitoring, emotion analysis, smart vehicle utilization, and more. Over the years, various techniques have been proposed to improve the accuracy of this technology, especially given its sensitivity to lighting and movement. In the unsupervised pipeline, it is necessary to first select skin regions from the video to extract the rPPG signal from the skin color changes. We introduce a novel skin segmentation technique that prioritizes skin regions to enhance the quality of the extracted signal. It can detect areas of skin all over the body, making it more resistant to movement, while removing areas such as the mouth, eyes, and hair that may cause interference. Our model is evaluated on publicly available datasets, and we also present a new dataset, called SYNC-rPPG, to better represent real-world conditions. The results indicate that our model demonstrates a superior ability to capture heartbeats in challenging conditions, such as talking and head rotation, and maintains a low mean absolute error (MAE) between predicted and actual heart rates, while other methods fail to do so. In addition, we demonstrate high accuracy in detecting a diverse range of skin tones, making this technique a promising option for real-world applications.

[96] DeepAf: One-Shot Spatiospectral Auto-Focus Model for Digital Pathology

Yousef Yeganeh,Maximilian Frantzen,Michael Lee,Kun-Hsing Yu,Nassir Navab,Azade Farshad

Main category: cs.CV

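TL;DR: Proposes DeepAf, a new auto-focus framework that combines spatial and spectral features to predict focus from a single shot, cutting focusing time substantially while keeping high accuracy; it turns conventional microscopes into efficient slide scanners, suited to real-time digital pathology in resource-constrained settings.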

Details

Motivation: Whole Slide Imaging (WSI) scanners are expensive, which limits their availability in many healthcare settings; existing low-cost options are limited in generalization across tissue types and staining protocols, focus consistency, or speed. Method: Develops DeepAf, a hybrid-architecture deep learning model that fuses spatial and spectral features, regresses the distance to the optimal focal point from a single image, dynamically adjusts control parameters, and is integrated into an automated microscope system. Result: Focusing time is reduced by 80% compared with stack-based methods; focus accuracy is 0.18 $\mu$m on same-lab samples; in cross-lab use, 90% of predictions fall within the depth of field with only 0.72% false focus predictions; in a clinical study of 536 brain tissue samples, cancer classification reaches 0.90 AUC at 4x magnification. Conclusion: Through combined hardware-software design, the system lowers the barrier to digital pathology while maintaining diagnostic accuracy, enabling efficient real-time pathology imaging in resource-constrained settings. Abstract: While Whole Slide Imaging (WSI) scanners remain the gold standard for digitizing pathology samples, their high cost limits accessibility in many healthcare settings. Other low-cost solutions also face critical limitations: automated microscopes struggle with consistent focus across varying tissue morphology, traditional auto-focus methods require time-consuming focal stacks, and existing deep-learning approaches either need multiple input images or lack generalization capability across tissue types and staining protocols. We introduce a novel automated microscopic system powered by DeepAf, a novel auto-focus framework that uniquely combines spatial and spectral features through a hybrid architecture for single-shot focus prediction. The proposed network automatically regresses the distance to the optimal focal point using the extracted spatiospectral features and adjusts the control parameters for optimal image outcomes. Our system transforms conventional microscopes into efficient slide scanners, reducing focusing time by 80% compared to stack-based methods while achieving focus accuracy of 0.18 $\mu$m on the same-lab samples, matching the performance of dual-image methods (0.19 $\mu$m) with half the input requirements. DeepAf demonstrates robust cross-lab generalization with only 0.72% false focus predictions and 90% of predictions within the depth of field. Through an extensive clinical study of 536 brain tissue samples, our system achieves 0.90 AUC in cancer classification at 4x magnification, a significant achievement at lower magnification than typical 20x WSI scans. This results in a comprehensive hardware-software design enabling accessible, real-time digital pathology in resource-constrained settings while maintaining diagnostic accuracy.

[97] Fine-Tuned CNN-Based Approach for Multi-Class Mango Leaf Disease Detection

Jalal Ahmmed,Faruk Ahmed,Rashedul Hasan Shohan,Md. Mahabub Rana,Mahdi Hasan

Main category: cs.CV

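TL;DR: This study evaluates five pre-trained convolutional neural networks for multi-class mango leaf disease identification; DenseNet201 performs best, with 99.33% accuracy.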

Details

Motivation: Mango cultivation is frequently affected by leaf diseases, so efficient and accurate disease identification is needed to protect yield and quality. Method: Fine-tunes five models (DenseNet201, InceptionV3, ResNet152V2, SeResNet152, and Xception) with a transfer learning strategy to identify eight classes of mango leaf disease, evaluated with accuracy, precision, recall, F1-score, and confusion matrices. Result: DenseNet201 performs best at 99.33% accuracy and is especially strong on Cutting Weevil and Bacterial Canker; ResNet152V2 and SeResNet152 also perform well, while InceptionV3 and Xception do worse on visually similar classes. Conclusion: Fine-tuned transfer learning models, DenseNet201 in particular, enable precise and reliable multi-class mango leaf disease detection for intelligent agricultural applications. Abstract: Mango is an important fruit crop in South Asia, but its cultivation is frequently hampered by leaf diseases that greatly impact yield and quality. This research examines the performance of five pre-trained convolutional neural networks, DenseNet201, InceptionV3, ResNet152V2, SeResNet152, and Xception, for multi-class identification of mango leaf diseases across eight classes using a transfer learning strategy with fine-tuning. The models were assessed through standard evaluation metrics, such as accuracy, precision, recall, F1-score, and confusion matrices. Among the architectures tested, DenseNet201 delivered the best results, achieving 99.33% accuracy with consistently strong metrics for individual classes, particularly excelling in identifying Cutting Weevil and Bacterial Canker. Moreover, ResNet152V2 and SeResNet152 provided strong outcomes, whereas InceptionV3 and Xception exhibited lower performance in visually similar categories like Sooty Mould and Powdery Mildew. The training and validation plots demonstrated stable convergence for the highest-performing models. These results demonstrate the capability of fine-tuned transfer learning models for precise and dependable multi-class mango leaf disease detection in intelligent agricultural applications.

[98] Mitigating Diffusion Model Hallucinations with Dynamic Guidance

Kostas Triaridis,Alexandros Graikos,Aggelina Chatziagapi,Grigorios G. Chrysos,Dimitris Samaras

Main category: cs.CV

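TL;DR: This paper proposes Dynamic Guidance, which reduces hallucinations in diffusion models by selectively sharpening the score function at generation time along directions that cause artifacts, while preserving valid semantic variations.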

Details

Motivation: Despite their strong performance, diffusion models often produce hallucinated samples with structural inconsistencies, caused by excessive smoothing between modes of the data distribution. Semantic interpolation remains valuable, however, so a more nuanced solution is needed. Method: Introduces Dynamic Guidance, which selectively sharpens the score function along pre-determined artifact-prone directions to reduce hallucinations while retaining meaningful semantic interpolation. Result: Dynamic Guidance substantially reduces hallucinations on both controlled and natural image datasets and clearly outperforms baselines. According to the authors, it is the first approach to address hallucinations at generation time rather than through post-hoc filtering. Conclusion: Dynamic Guidance balances generation diversity against structural fidelity and offers a new route to improving diffusion model output quality. Abstract: Diffusion models, despite their impressive demos, often produce hallucinatory samples with structural inconsistencies that lie outside of the support of the true data distribution. Such hallucinations can be attributed to excessive smoothing between modes of the data distribution. However, semantic interpolations are often desirable and can lead to generation diversity, thus we believe a more nuanced solution is required. In this work, we introduce Dynamic Guidance, which tackles this issue. Dynamic Guidance mitigates hallucinations by selectively sharpening the score function only along the pre-determined directions known to cause artifacts, while preserving valid semantic variations. To our knowledge, this is the first approach that addresses hallucinations at generation time rather than through post-hoc filtering. Dynamic Guidance substantially reduces hallucinations on both controlled and natural image datasets, significantly outperforming baselines.

[99] LightCache: Memory-Efficient, Training-Free Acceleration for Video Generation

Yang Xiao,Gen Li,Kaiyuan Deng,Yushu Wu,Zheng Zhan,Yanzhi Wang,Xiaolong Ma,Bo Hui

Main category: cs.CV

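TL;DR: This paper proposes a memory optimization scheme for training-free acceleration of diffusion-based video generation; stage-specific strategies reduce memory consumption while retaining the inference speedup and keeping quality degradation within an acceptable range.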

Details

Motivation: Existing cache-based acceleration methods cause substantial memory surges in the denoising and decoding stages, limiting their use in resource-constrained environments. Method: Decomposes inference into encoding, denoising, and decoding stages and proposes three stage-specific strategies to reduce memory use: asynchronous cache swapping, feature chunking, and sliced decoding. Result: Compared with the baseline, the method achieves faster inference with lower memory usage, with quality degradation within an acceptable range. Conclusion: The proposed stage-specific memory optimizations balance acceleration against memory consumption and suit training-free video generation with diffusion models. Abstract: Training-free acceleration has emerged as an advanced research area in video generation based on diffusion models. The redundancy of latents in diffusion model inference provides a natural entry point for acceleration. In this paper, we decompose the inference process into the encoding, denoising, and decoding stages, and observe that cache-based acceleration methods often lead to substantial memory surges in the latter two stages. To address this problem, we analyze the characteristics of inference across different stages and propose stage-specific strategies for reducing memory consumption: 1) Asynchronous Cache Swapping. 2) Feature chunk. 3) Slicing latents to decode. At the same time, we ensure that the time overhead introduced by these three strategies remains lower than the acceleration gains themselves. Compared with the baseline, our approach achieves faster inference speed and lower memory usage, while maintaining quality degradation within an acceptable range. The code is available at https://github.com/NKUShaw/LightCache.
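The third strategy in the LightCache abstract, slicing latents before decoding, trades one large decoder call for several small ones so that peak memory is bounded by the slice size. The sketch below shows only that control flow: `decode_in_slices`, `toy_decoder`, and the latent values are hypothetical stand-ins, not code from the LightCache repository.

```python
def decode_in_slices(latents, decode_fn, slice_size):
    """Decode a sequence of frame latents a few frames at a time."""
    if slice_size < 1:
        raise ValueError("slice_size must be at least 1")
    frames = []
    for start in range(0, len(latents), slice_size):
        chunk = latents[start:start + slice_size]
        # Only one slice is handed to the decoder at once, so the decoder's
        # working memory scales with slice_size rather than the video length.
        frames.extend(decode_fn(chunk))
    return frames

# Stand-in decoder that records the largest batch it was ever given.
peak = {"frames": 0}

def toy_decoder(chunk):
    peak["frames"] = max(peak["frames"], len(chunk))
    return [value * 2.0 for value in chunk]

latents = [0.5, 1.0, 1.5, 2.0, 2.5]
video = decode_in_slices(latents, toy_decoder, slice_size=2)
```

With this frame-independent toy decoder the sliced output matches a single full decode exactly. A real video decoder with temporal context may produce small differences at slice boundaries, which is one plausible source of the quality degradation the abstract mentions.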

[100] See the past: Time-Reversed Scene Reconstruction from Thermal Traces Using Visual Language Models

Kebin Contreras,Luis Toscano-Palomino,Mauro Dalla Mura,Jorge Bacca

Main category: cs.CV

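TL;DR: Proposes a time-reversed reconstruction framework based on paired RGB and thermal images that combines vision-language models with a constrained diffusion process to recover scene states from seconds earlier, up to 120 seconds back, as a first exploration of time-reversed imaging from thermal traces.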

Details

Motivation: Because human body temperature is usually higher than the surroundings, contact with objects leaves gradually fading heat traces that carry information about recent activity; conventional RGB cameras cannot capture this, so the work aims to recover past events through thermal imaging. Method: Couples vision-language models (VLMs) with a constrained diffusion process using paired RGB and thermal images: one VLM generates scene descriptions and another guides image reconstruction, ensuring semantic and structural consistency when inferring the past scene state. Result: The method is validated in three controlled scenarios and reconstructs plausible past frames from up to 120 seconds earlier, demonstrating the feasibility of time-reversed imaging from thermal traces. Conclusion: The method offers a feasible approach to time-reversed scene reconstruction with thermal imaging and is a first step toward inferring past events from passive heat traces. Abstract: Recovering the past from present observations is an intriguing challenge with potential applications in forensics and scene analysis. Thermal imaging, operating in the infrared range, provides access to otherwise invisible information. Since humans are typically warmer (about 37 °C, or 98.6 °F) than their surroundings, interactions such as sitting, touching, or leaning leave residual heat traces. These fading imprints serve as passive temporal codes, allowing for the inference of recent events that exceed the capabilities of RGB cameras. This work proposes a time-reversed reconstruction framework that uses paired RGB and thermal images to recover scene states from a few seconds earlier. The proposed approach couples Visual-Language Models (VLMs) with a constrained diffusion process, where one VLM generates scene descriptions and another guides image reconstruction, ensuring semantic and structural consistency. The method is evaluated in three controlled scenarios, demonstrating the feasibility of reconstructing plausible past frames up to 120 seconds earlier, providing a first step toward time-reversed imaging from thermal traces.

[101] Personalizing Retrieval using Joint Embeddings or "the Return of Fluffy"

Bruno Korbar,Andrew Zisserman

Main category: cs.CV

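TL;DR: This paper proposes pi-map, a trainable mapping network that converts a local image embedding into a text token and combines it with a natural language query to enable personalized image retrieval from compound queries.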

Details

Motivation: To enable image retrieval from compound queries that combine object instance information from an image with a natural language description, addressing the shortcomings of existing methods in personalized retrieval. Method: Designs a mapping network, pi-map, that converts the local image embedding of an object instance into a text token suitable for CLIP text encoding and combines it with a natural language query for retrieval; only one simple training run is needed per instance. Result: On two benchmarks for personalized retrieval, the method with frozen CLIP encoders improves on the state of the art. Conclusion: pi-map improves compound-query image retrieval based on instances and text descriptions, and is efficient and scalable. Abstract: The goal of this paper is to be able to retrieve images using a compound query that combines object instance information from an image, with a natural text description of what that object is doing or where it is. For example, to retrieve an image of "Fluffy the unicorn (specified by an image) on someone's head". To achieve this we design a mapping network that can "translate" from a local image embedding (of the object instance) to a text token, such that the combination of the token and a natural language query is suitable for CLIP style text encoding, and image retrieval. Generating a text token in this manner involves a simple training procedure, that only needs to be performed once for each object instance. We show that our approach of using a trainable mapping network, termed pi-map, together with frozen CLIP text and image encoders, improves the state of the art on two benchmarks designed to assess personalized retrieval.

[102] ArchitectHead: Continuous Level of Detail Control for 3D Gaussian Head Avatars

Peizhi Yan,Rabab Ward,Qiang Tang,Shan Du

Main category: cs.CV

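TL;DR: This paper proposes ArchitectHead, the first 3D Gaussian head avatar framework that supports continuous level-of-detail (LOD) control; it parameterizes the Gaussians in a 2D UV feature space and uses multi-level learnable feature maps to achieve efficient LOD control without retraining.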

Details

Motivation: Existing 3D Gaussian head avatars lack flexible control over level of detail and cannot easily balance rendering efficiency against visual quality, which limits practical deployment. Method: Parameterizes the Gaussians in a 2D UV feature space and builds a UV feature field of multi-level learnable feature maps to encode latent features; a lightweight neural decoder converts them into 3D Gaussian attributes; the number of Gaussians is adjusted by dynamically resampling feature maps at different resolutions, giving continuous LOD control. Result: ArchitectHead achieves state-of-the-art quality in self and cross-identity reenactment at the highest LOD and near state-of-the-art performance at lower LODs; at the lowest LOD it uses only 6.2% of the Gaussians with moderate quality degradation, and rendering speed nearly doubles. Conclusion: ArchitectHead is the first to provide continuous LOD control without retraining, balancing rendering efficiency and visual quality for a range of practical applications. Abstract: 3D Gaussian Splatting (3DGS) has enabled photorealistic and real-time rendering of 3D head avatars. Existing 3DGS-based avatars typically rely on tens of thousands of 3D Gaussian points (Gaussians), with the number of Gaussians fixed after training. However, many practical applications require adjustable levels of detail (LOD) to balance rendering efficiency and visual quality. In this work, we propose "ArchitectHead", the first framework for creating 3D Gaussian head avatars that support continuous control over LOD. Our key idea is to parameterize the Gaussians in a 2D UV feature space and propose a UV feature field composed of multi-level learnable feature maps to encode their latent features. A lightweight neural network-based decoder then transforms these latent features into 3D Gaussian attributes for rendering. ArchitectHead controls the number of Gaussians by dynamically resampling feature maps from the UV feature field at the desired resolutions. This method enables efficient and continuous control of LOD without retraining. Experimental results show that ArchitectHead achieves state-of-the-art (SOTA) quality in self and cross-identity reenactment tasks at the highest LOD, while maintaining near SOTA performance at lower LODs. At the lowest LOD, our method uses only 6.2% of the Gaussians while the quality degrades moderately (L1 Loss +7.9%, PSNR -0.97%, SSIM -0.6%, LPIPS Loss +24.1%), and the rendering speed nearly doubles.

[103] Human Action Recognition from Point Clouds over Time

James Dickens

Main category: cs.CV

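TL;DR: This paper proposes a new method for human action recognition from 3D point cloud videos that combines point cloud segmentation, tracking, and body part segmentation, and fuses point-based techniques with sparse convolutional networks; on NTU RGB-D 120 it reaches 89.3% accuracy, outperforming previous point cloud methods.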

Details

Motivation: As depth sensors and Lidar devices become widespread, dense 3D data can be used for action recognition, but existing research focuses mainly on skeletal or video-based methods and makes little use of raw point cloud data. Method: Proposes a pipeline of human point cloud segmentation, cross-frame tracking, and body part segmentation; uses a novel backbone combining point-based techniques with sparse convolutional networks on voxelized point cloud sequences, and adds auxiliary features such as surface normals, color, infrared intensity, and body part parsing labels to improve performance. Result: On NTU RGB-D 120, the method reaches 89.3% accuracy in the cross-subject setting, outperforming previous point cloud action recognition methods and competitive with existing skeletal recognition algorithms. Conclusion: The proposed 3D point cloud framework makes effective use of multimodal 3D data and, by combining an advanced network design with rich point cloud features, offers a viable new route to non-skeletal action recognition. Abstract: Recent research into human action recognition (HAR) has focused predominantly on skeletal action recognition and video-based methods. With the increasing availability of consumer-grade depth sensors and Lidar instruments, there is a growing opportunity to leverage dense 3D data for action recognition, to develop a third way. This paper presents a novel approach for recognizing actions from 3D videos by introducing a pipeline that segments human point clouds from the background of a scene, tracks individuals over time, and performs body part segmentation. The method supports point clouds from both depth sensors and monocular depth estimation. At the core of the proposed HAR framework is a novel backbone for 3D action recognition, which combines point-based techniques with sparse convolutional networks applied to voxel-mapped point cloud sequences. Experiments incorporate auxiliary point features including surface normals, color, infrared intensity, and body part parsing labels, to enhance recognition accuracy. Evaluation on the NTU RGB-D 120 dataset demonstrates that the method is competitive with existing skeletal action recognition algorithms. Moreover, combining both sensor-based and estimated depth inputs in an ensemble setup, this approach achieves 89.3% accuracy when different human subjects are considered for training and testing, outperforming previous point cloud action recognition methods.

[104] Be Tangential to Manifold: Discovering Riemannian Metric for Diffusion Models

Shinnosuke Saito,Takashi Matsubara

Main category: cs.CV

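TL;DR: Proposes a noise-space interpolation method based on a Riemannian metric that uses the Jacobian of the score function to capture the data manifold structure, keeping diffusion model generation on the data manifold and yielding more natural, faithful image interpolation.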

Details

Motivation: Diffusion models lack an explicit low-dimensional latent space, and existing interpolation methods tend to drift off the data manifold, producing unnatural transitions; a method that interpolates along the data manifold is needed. Method: Proposes a new Riemannian metric on the noise space that uses the Jacobian of the score function to estimate the tangent spaces of the local data manifold, guiding geodesics in the noise space along the manifold. Result: In image interpolation, the method produces perceptually more natural transitions that are more faithful to the data manifold than density-based and naive baselines. Conclusion: By introducing a Riemannian metric built on the geometry of the score function, the method exploits the data manifold structure learned by diffusion models and improves interpolation quality and semantic plausibility. Abstract: Diffusion models are powerful deep generative models (DGMs) that generate high-fidelity, diverse content. However, unlike classical DGMs, they lack an explicit, tractable low-dimensional latent space that parameterizes the data manifold. This absence limits manifold-aware analysis and operations, such as interpolation and editing. Existing interpolation methods for diffusion models typically follow paths through high-density regions, which are not necessarily aligned with the data manifold and can yield perceptually unnatural transitions. To exploit the data manifold learned by diffusion models, we propose a novel Riemannian metric on the noise space, inspired by recent findings that the Jacobian of the score function captures the tangent spaces to the local data manifold. This metric encourages geodesics in the noise space to stay within or run parallel to the learned data manifold. Experiments on image interpolation show that our metric produces perceptually more natural and faithful transitions than existing density-based and naive baselines.

[105] Teamwork: Collaborative Diffusion with Low-rank Coordination and Adaptation

Sam Sartor,Pieter Peers

Main category: cs.CV

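TL;DR: This paper proposes Teamwork, a flexible and efficient unified method that coordinates multiple instances of a pretrained diffusion model (i.e., "teammates") to achieve channel expansion and task adaptation without modifying the original model architecture.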

Details

Motivation: Existing channel expansion methods are usually application specific and hard to adapt to different diffusion models or new tasks, so a general and flexible solution is needed. Method: A variant of low-rank adaptation (LoRA) jointly optimizes coordination and adaptation across multiple diffusion model instances, and teammates can be dynamically activated and deactivated. Result: The method's effectiveness and flexibility are demonstrated on a range of generative and inverse graphics tasks, including inpainting, single-image SVBRDF estimation, intrinsic decomposition, neural shading, and intrinsic image synthesis. Conclusion: Teamwork provides a general framework for expanding input and output channels and adapting to new tasks without modifying the pretrained model architecture, with good flexibility and efficiency. Abstract: Large pretrained diffusion models can provide strong priors beneficial for many graphics applications. However, generative applications such as neural rendering and inverse methods such as SVBRDF estimation and intrinsic image decomposition require additional input or output channels. Current solutions for channel expansion are often application specific and these solutions can be difficult to adapt to different diffusion models or new tasks. This paper introduces Teamwork: a flexible and efficient unified solution for jointly increasing the number of input and output channels as well as adapting a pretrained diffusion model to new tasks. Teamwork achieves channel expansion without altering the pretrained diffusion model architecture by coordinating and adapting multiple instances of the base diffusion model (i.e., teammates). We employ a novel variation of Low-Rank Adaptation (LoRA) to jointly address both adaptation and coordination between the different teammates. Furthermore, Teamwork supports dynamic (de)activation of teammates. We demonstrate the flexibility and efficiency of Teamwork on a variety of generative and inverse graphics tasks such as inpainting, single image SVBRDF estimation, intrinsic decomposition, neural shading, and intrinsic image synthesis.
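
For readers unfamiliar with LoRA, the following is a minimal sketch of the standard low-rank update only, in which a frozen weight matrix W is adjusted by a scaled product of two small matrices B and A. It does not reproduce Teamwork's own variation, which the abstract does not specify; the function names and the alpha/rank scaling convention are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A for the standard LoRA formulation.

    w:      frozen pretrained weight, shape (d_out, d_in)
    lora_b: trainable matrix, shape (d_out, r)
    lora_a: trainable matrix, shape (r, d_in)
    alpha:  scaling hyperparameter (assumed convention: scale = alpha / r)
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # low-rank update, shape (d_out, d_in)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Rank-1 update of a 2x2 identity weight.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[3.0, 4.0]]
print(lora_effective_weight(w, b, a, alpha=1.0))
```

Because only B and A are trained, the number of trainable parameters grows with the rank r rather than with the full size of W, which is the usual reason LoRA is chosen for adapting large pretrained models.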

[106] Seeing the Big Picture: Evaluating Multimodal LLMs' Ability to Interpret and Grade Handwritten Student Work

Owen Henkel,Bill Roberts,Doug Jaffe,Laurence Holt

Main category: cs.CV

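TL;DR: This paper examines the potential of multimodal large language models (MLLMs) for grading and analyzing handwritten student mathematics work. Experiments show that MLLMs perform near human level on objective arithmetic problems but poorly on mathematical illustration tasks that require both visual and pedagogical judgment; performance improves markedly when human descriptions are supplied.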

Details

Motivation: Most elementary and middle-school mathematics work is handwritten; grading it by hand is time-consuming but offers insight into students' learning processes, so whether MLLMs can handle such tasks effectively matters for education. Method: Two experiments were run: Experiment A assesses how well MLLMs score handwritten arithmetic answers from Ghanaian middle school students; Experiment B tests the models' analysis of mathematical illustrations by American elementary students, comparing grading directly from the image with grading based on human descriptions. Result: In Experiment A the models reached 95% accuracy (k = 0.90), close to human level; in Experiment B agreement was low when analyzing the drawings directly (k = 0.20) but rose markedly to k = 0.47 once human descriptions were provided, in line with human-to-human agreement. Conclusion: MLLMs can "see" and interpret arithmetic work fairly well, but their understanding of mathematical illustrations remains limited; they cannot yet independently handle tasks requiring sophisticated visual and pedagogical judgment and need human descriptions as support. Abstract: Recent advances in multimodal large language models (MLLMs) raise the question of their potential for grading, analyzing, and offering feedback on handwritten student classwork. This capability would be particularly beneficial in elementary and middle-school mathematics education, where most work remains handwritten, because seeing students' full working of a problem provides valuable insights into their learning processes, but is extremely time-consuming to grade. We present two experiments investigating MLLM performance on handwritten student mathematics classwork. Experiment A examines 288 handwritten responses from Ghanaian middle school students solving arithmetic problems with objective answers. In this context, models achieved near-human accuracy (95%, k = 0.90) but exhibited occasional errors that human educators would be unlikely to make. Experiment B evaluates 150 mathematical illustrations from American elementary students, where the drawings are the answer to the question. These tasks lack single objective answers and require sophisticated visual interpretation as well as pedagogical judgment in order to analyze and evaluate them. We attempted to separate MLLMs' visual capabilities from their pedagogical abilities by first asking them to grade the student illustrations directly, and then by augmenting the image with a detailed human description of the illustration. 
We found that when the models had to analyze the student illustrations directly, they struggled, achieving only k = 0.20 with ground truth scores, but when given human descriptions, their agreement levels improved dramatically to k = 0.47, which was in line with human-to-human agreement levels. This gap suggests MLLMs can "see" and interpret arithmetic work relatively well, but still struggle to "see" student mathematical illustrations.
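
The agreement figures above (k = 0.90, 0.20, 0.47) read like a chance-corrected agreement statistic. As a rough illustration, and assuming that "k" denotes unweighted Cohen's kappa (the abstract does not say whether a weighted variant was used), the statistic can be sketched as follows. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items that received the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # degenerate case: both raters always used the same single label
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example: human grades versus model grades on four responses.
human = [1, 1, 0, 0]
model = [1, 1, 0, 1]
print(cohen_kappa(human, model))
```

Unlike raw accuracy, kappa discounts agreement that would occur by chance, which is why a model can look accurate on an imbalanced grading task and still score a low kappa.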

[107] Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics

Christopher Hoang,Mengye Ren

Main category: cs.CV

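TL;DR: This paper proposes Midway Network, a new self-supervised learning architecture that is the first to learn strong visual representations for both object recognition and motion understanding from natural videos.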

Details

Motivation: Existing self-supervised methods focus mainly on either object recognition or motion understanding, and effective methods for learning both jointly are lacking. Method: Latent dynamics modeling is extended with a midway top-down path that infers motion latents between frames, together with a dense forward prediction objective and a hierarchical structure to handle complex multi-object scenes. Result: After pretraining on two large-scale natural video datasets, Midway Network achieves strong performance on both semantic segmentation and optical flow relative to prior self-supervised methods, and an analysis based on forward feature perturbation shows that it captures high-level correspondence. Conclusion: Midway Network learns object recognition and motion understanding jointly and offers a new direction for self-supervised visual representation learning. Abstract: Object recognition and motion understanding are key components of perception that complement each other. While self-supervised learning methods have shown promise in their ability to learn from unlabeled data, they have primarily focused on obtaining rich representations for either recognition or motion rather than both in tandem. On the other hand, latent dynamics modeling has been used in decision making to learn latent representations of observations and their transformations over time for control and planning tasks. In this work, we present Midway Network, a new self-supervised learning architecture that is the first to learn strong visual representations for both object recognition and motion understanding solely from natural videos, by extending latent dynamics modeling to this domain. Midway Network leverages a midway top-down path to infer motion latents between video frames, as well as a dense forward prediction objective and hierarchical structure to tackle the complex, multi-object scenes of natural videos. We demonstrate that after pretraining on two large-scale natural video datasets, Midway Network achieves strong performance on both semantic segmentation and optical flow tasks relative to prior self-supervised learning methods. We also show that Midway Network's learned dynamics can capture high-level correspondence via a novel analysis method based on forward feature perturbation.

[108] HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video

Hongchi Xia,Chih-Hao Lin,Hao-Yu Hsu,Quentin Leboutet,Katelyn Gao,Michael Paulitsch,Benjamin Ummenhofer,Shenlong Wang

Main category: cs.CV

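TL;DR: HoloScene is a new interactive 3D reconstruction framework that uses a comprehensive scene-graph representation and energy-based optimization to build virtual environments that are geometrically complete, physically plausible, and interactive.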

Details

Motivation: Existing 3D reconstruction and scene-understanding methods fall short in geometry completeness, object interactivity, physical plausibility, photorealistic rendering, or dynamic simulation. Method: HoloScene adopts a scene-graph representation covering geometry, appearance, and physical properties, formulates reconstruction as an energy-based optimization problem combining observational data, physical constraints, and generative priors, and solves it efficiently with a hybrid of sampling-based exploration and gradient-based refinement. Result: The framework performs strongly on multiple benchmark datasets, and the resulting digital twins have complete geometry, physical stability, and realistic rendering from novel viewpoints. Conclusion: HoloScene meets several key requirements for building high-quality virtual environments at once and is applicable to augmented reality, gaming, and robotics. Abstract: Digitizing the physical world into accurate simulation-ready virtual environments offers significant opportunities in a variety of fields such as augmented and virtual reality, gaming, and robotics. However, current 3D reconstruction and scene-understanding methods commonly fall short in one or more critical aspects, such as geometry completeness, object interactivity, physical plausibility, photorealistic rendering, or realistic physical properties for reliable dynamic simulation. To address these limitations, we introduce HoloScene, a novel interactive 3D reconstruction framework that simultaneously achieves these requirements. HoloScene leverages a comprehensive interactive scene-graph representation, encoding object geometry, appearance, and physical properties alongside hierarchical and inter-object relationships. Reconstruction is formulated as an energy-based optimization problem, integrating observational data, physical constraints, and generative priors into a unified, coherent objective. Optimization is efficiently performed via a hybrid approach combining sampling-based exploration with gradient-based refinement. The resulting digital twins exhibit complete and precise geometry, physical stability, and realistic rendering from novel viewpoints. Evaluations conducted on multiple benchmark datasets demonstrate superior performance, while practical use-cases in interactive gaming and real-time digital-twin manipulation illustrate HoloScene's broad applicability and effectiveness. Project page: https://xiahongchi.github.io/HoloScene.

[109] CalibCLIP: Contextual Calibration of Dominant Semantics for Text-Driven Image Retrieval

Bin Kang,Bin Chen,Junjie Wang,Yulin Li,Junzhi Zhao,Zhuotao Tian

Main category: cs.CV

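TL;DR: Proposes CalibCLIP, a training-free calibration method for vision-language models that suppresses dominant tokens in the visual space and enhances discriminative concepts in the textual space, improving text-driven image retrieval.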

Details

Motivation: In existing vision-language models, a few low-contribution tokens may capture global semantics excessively and suppress discriminative features, degrading image retrieval. Method: In the visual space, a Contrastive Visual Enhancer (CVE) decouples visual features and dynamically suppresses dominant tokens; in the textual space, a Discriminative Concept Calibrator (DCC) distinguishes general from discriminative concepts and refines their representations. Result: Extensive experiments on seven benchmarks covering three image retrieval tasks show consistent improvements. Conclusion: CalibCLIP mitigates the suppressive effect of dominant tokens, strengthens fine-grained discrimination, and improves text-driven image retrieval. Abstract: Existing Visual Language Models (VLMs) suffer structural limitations where a few low contribution tokens may excessively capture global semantics, dominating the information aggregation process and suppressing the discriminative features in text-driven image retrieval tasks. To address this, we introduce CalibCLIP, a training-free method designed to calibrate the suppressive effect of dominant tokens. Specifically, in the visual space, we propose the Contrastive Visual Enhancer (CVE), which decouples visual features into target and low information regions. Subsequently, it identifies dominant tokens and dynamically suppresses their representations. In the textual space, we introduce the Discriminative Concept Calibrator (DCC), which aims to differentiate between general and discriminative concepts within the text query. By mitigating the challenges posed by generic concepts and improving the representations of discriminative concepts, DCC strengthens the differentiation among similar samples. Finally, extensive experiments demonstrate consistent improvements across seven benchmarks spanning three image retrieval tasks, underscoring the effectiveness of CalibCLIP. Code is available at: https://github.com/kangbin98/CalibCLIP

[110] Improving Chain-of-Thought Efficiency for Autoregressive Image Generation

Zeqi Gu,Markos Georgopoulos,Xiaoliang Dai,Marjan Ghazvininejad,Chu Wang,Felix Juefei-Xu,Kunpeng Li,Yujun Shi,Zecheng He,Zijian He,Jiawei Zhou,Abe Davis,Jialiang Wang

Main category: cs.CV

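TL;DR: This paper proposes ShortCoTI, a lightweight optimization framework that generates more concise chain-of-thought (CoT) sequences to make image generation more efficient while preserving image quality.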

Details

Motivation: Existing multimodal generation models that rely on CoT reasoning tend to introduce redundant information (visual overthinking), which raises computational cost and can add details that contradict the original prompt. Method: ShortCoTI uses an adaptive reward function within reinforcement learning to encourage more concise CoT prompts; the function scales with the estimated difficulty of each task. Result: Across multiple benchmarks (T2I-CompBench, GenEval), prompt reasoning length falls by 54% while image quality is maintained or slightly improved; qualitative analysis shows that verbose explanations and repetitive refinements are eliminated. Conclusion: ShortCoTI improves the computational efficiency of the reasoning process without sacrificing the fidelity or visual appeal of the generated images. Abstract: Autoregressive multimodal large language models have recently gained popularity for image generation, driven by advances in foundation models. To enhance alignment and detail, newer approaches employ chain-of-thought (CoT) reasoning, expanding user inputs into elaborated prompts prior to image synthesis. However, this strategy can introduce unnecessary redundancy -- a phenomenon we call visual overthinking -- which increases computational costs and can introduce details that contradict the original prompt. In this work, we explore how to generate more concise CoT sequences for more efficient image generation. We introduce ShortCoTI, a lightweight optimization framework that encourages more concise CoT while preserving output image quality. ShortCoTI rewards more concise prompts with an adaptive function that scales according to an estimated difficulty for each task. Incorporating this reward into a reinforcement learning paradigm reduces prompt reasoning length by 54% while maintaining or slightly improving quality metrics across multiple benchmarks (T2I-CompBench, GenEval). Qualitative analysis shows that our method eliminates verbose explanations and repetitive refinements, producing reasoning prompts that are both concise and semantically rich. As a result, ShortCoTI improves computational efficiency without compromising the fidelity or visual appeal of generated images.

[111] HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection

Junwen Chen,Peilin Xiong,Keiji Yanai

Main category: cs.CV

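TL;DR: This paper proposes HOI-R1, the first exploration of using a language model (an MLLM) alone, without any detection module, for human-object interaction detection (HOID); by introducing an HOI reasoning process and reward functions trained with reinforcement learning, it reaches twice the baseline's accuracy on the HICO-DET dataset.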

Details

Motivation: Existing HOID methods depend on prior knowledge from vision-language models (VLMs) and require complex architectures and training strategies, which limits scalability; meanwhile, the reasoning abilities of multimodal large language models (MLLMs) on the HOID task remain under-explored. Method: HOI-R1 draws on the inherent reasoning ability of an MLLM, with a text-based HOI reasoning process and dedicated HOID reward functions, and is trained with reinforcement learning, performing interaction recognition without any conventional detection module. Result: On the HICO-DET dataset, HOI-R1 reaches twice the accuracy of the baseline and shows strong generalization ability. Conclusion: A language model combined with reinforcement learning can address human-object interaction detection without a complex detection architecture, offering a new paradigm for HOID. Abstract: Recent human-object interaction detection (HOID) methods rely heavily on prior knowledge from VLMs to enhance their interaction recognition capabilities. The training strategies and model architectures for connecting the knowledge from VLMs to the HOI instance representations from the object detector are challenging, and the whole framework is complex for further development or application. On the other hand, the inherent reasoning abilities of MLLMs on human-object interaction detection are under-explored. Inspired by the recent success of training MLLMs with reinforcement learning (RL) methods, we propose HOI-R1 and are the first to explore the potential of the language model on the HOID task without any additional detection modules. We introduce an HOI reasoning process and HOID reward functions to solve the HOID task by pure text. The results on the HICO-DET dataset show that HOI-R1 achieves 2x the accuracy of the baseline with great generalization ability. The source code is available at https://github.com/cjw2021/HOI-R1.

[112] Efficient Conditional Generation on Scale-based Visual Autoregressive Models

Jiaqi Liu,Tao Huang,Chang Xu

Main category: cs.CV

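TL;DR: This paper proposes ECM, an efficient plug-and-play control framework that improves the controllability of autoregressive models in complex spatially-conditioned image generation while significantly reducing training and inference costs.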

Details

Motivation: Existing autoregressive models rely on fine-tuning for complex spatially-conditioned generation, which makes training costly, and efficient, flexible control methods are lacking. Method: The ECM framework comprises context-aware attention layers and a shared gated feed-forward network, and uses an early-centric sampling strategy together with inference-time temperature scheduling to improve training efficiency and generation quality. Result: Experiments on scale-based autoregressive models show that ECM surpasses existing baselines in generation fidelity and diversity while significantly improving training and inference efficiency. Conclusion: ECM offers an efficient, scalable conditional control scheme for autoregressive image generation that achieves high-quality controllable generation without fine-tuning the pre-trained model. Abstract: Recent advances in autoregressive (AR) models have demonstrated their potential to rival diffusion models in image synthesis. However, for complex spatially-conditioned generation, current AR approaches rely on fine-tuning the pre-trained model, leading to significant training costs. In this paper, we propose the Efficient Control Model (ECM), a plug-and-play framework featuring a lightweight control module that introduces control signals via a distributed architecture. This architecture consists of context-aware attention layers that refine conditional features using real-time generated tokens, and a shared gated feed-forward network (FFN) designed to maximize the utilization of its limited capacity and ensure coherent control feature learning. Furthermore, recognizing the critical role of early-stage generation in determining semantic structure, we introduce an early-centric sampling strategy that prioritizes learning early control sequences. This approach reduces computational cost by lowering the number of training tokens per iteration, while a complementary temperature scheduling during inference compensates for the resulting insufficient training of late-stage tokens. Extensive experiments on scale-based AR models validate that our method achieves high-fidelity and diverse control over image generation, surpassing existing baselines while significantly improving both training and inference efficiency.
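
The abstract mentions temperature scheduling at inference time but gives no schedule. The sketch below only illustrates the general mechanism: logits are divided by a temperature before the softmax, and the temperature changes across generation steps. The linear schedule, its start and end values, and both function names are hypothetical and are not taken from ECM.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature gives a sharper distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def linear_temperature(step, num_steps, start, end):
    """Hypothetical schedule: interpolate linearly from `start` to `end` over the steps."""
    if num_steps <= 1:
        return start
    fraction = step / (num_steps - 1)
    return start + fraction * (end - start)


# Sampling distributions for the same logits at each step of a five-step schedule.
logits = [2.0, 1.0, 0.0]
for step in range(5):
    temperature = linear_temperature(step, 5, start=1.0, end=0.5)
    print(step, temperature, softmax_with_temperature(logits, temperature))
```

Whether a schedule should sharpen or flatten the distribution for late-stage tokens depends on the model; the direction used here is arbitrary.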

[113] PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction

Ziqiao Meng,Qichao Wang,Zhiyang Dou,Zixing Song,Zhipeng Zhou,Irwin King,Peilin Zhao

Main category: cs.CV

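TL;DR: Proposes PointNSP, an autoregressive point cloud generation method built on a coarse-to-fine framework; it is the first to reach SOTA within the autoregressive paradigm and surpasses diffusion models in efficiency and in dense generation with 8,192 points.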

Details

Motivation: Autoregressive models impose an ordering on unordered point sets and struggle to capture long-range dependencies, so they perform poorly on global structure (such as symmetry and topological consistency) and lag behind diffusion models. Method: Inspired by the level-of-detail (LOD) principle in shape modeling, the method adopts a next-scale prediction paradigm with a multi-scale factorization: it preserves global structure at low resolution and progressively refines geometric detail at higher resolutions, avoiding the fixed-ordering problem. Result: On ShapeNet, PointNSP achieves SOTA generation quality among autoregressive methods and surpasses strong diffusion baselines in parameter, training, and inference efficiency; its advantage is more pronounced in dense generation with 8,192 points. Conclusion: Through its multi-scale autoregressive design, PointNSP addresses the sequential bias in point cloud generation, leads in both quality and efficiency, and shows good scalability. Abstract: Autoregressive point cloud generation has long lagged behind diffusion-based approaches in quality. The performance gap stems from the fact that autoregressive models impose an artificial ordering on inherently unordered point sets, forcing shape generation to proceed as a sequence of local predictions. This sequential bias emphasizes short-range continuity but undermines the model's capacity to capture long-range dependencies, hindering its ability to enforce global structural properties such as symmetry, consistent topology, and large-scale geometric regularities. Inspired by the level-of-detail (LOD) principle in shape modeling, we propose PointNSP, a coarse-to-fine generative framework that preserves global shape structure at low resolutions and progressively refines fine-grained geometry at higher scales through a next-scale prediction paradigm. This multi-scale factorization aligns the autoregressive objective with the permutation-invariant nature of point sets, enabling rich intra-scale interactions while avoiding brittle fixed orderings. Experiments on ShapeNet show that PointNSP establishes state-of-the-art (SOTA) generation quality for the first time within the autoregressive paradigm. In addition, it surpasses strong diffusion-based baselines in parameter, training, and inference efficiency. Finally, in dense generation with 8,192 points, PointNSP's advantages become even more pronounced, underscoring its scalability potential.

[114] TFM Dataset: A Novel Multi-task Dataset and Integrated Pipeline for Automated Tear Film Break-Up Segmentation

Guangrong Wan,Jun liu,Tang tang,Lianghao Shi,Wenjun Luo,TingTing Xu

Main category: cs.CV

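TL;DR: This paper presents the TFM dataset, the first for multi-task tear film analysis, and designs the TF-Net model and the integrated TF-Collab pipeline, enabling automated real-time analysis of tear film break-up.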

Details

Motivation: Automated segmentation of tear film break-up (TFBU) is challenging because annotated datasets and integrated solutions are lacking; a dedicated dataset and an efficient, clinically applicable automated analysis method are therefore needed. Method: The TFM dataset contains 15 high-resolution videos totaling 6,247 frames, annotated for three tasks: frame classification, Placido Ring detection, and TFBU area segmentation. The TF-Net model is built on a MobileOne-mini backbone and an enhanced feature pyramid network, and TF-Collab is a real-time analysis pipeline in which multiple models cooperate. Result: TF-Net strikes a favorable balance between accuracy and computational efficiency and is suited to real-time use; benchmark performance is established on the TFM segmentation subset; TF-Collab automates the whole process from input standardization to BUT computation. Conclusion: The effectiveness of TF-Net and TF-Collab demonstrates the approach's potential for tear film analysis and provides a foundation and open-source resources for future research on ocular surface diagnostics. Abstract: Tear film break-up (TFBU) analysis is critical for diagnosing dry eye syndrome, but automated TFBU segmentation remains challenging due to the lack of annotated datasets and integrated solutions. This paper introduces the Tear Film Multi-task (TFM) Dataset, the first comprehensive dataset for multi-task tear film analysis, comprising 15 high-resolution videos (totaling 6,247 frames) annotated with three vision tasks: frame-level classification ('clear', 'closed', 'broken', 'blur'), Placido Ring detection, and pixel-wise TFBU area segmentation. Leveraging this dataset, we first propose TF-Net, a novel and efficient baseline segmentation model. TF-Net incorporates a MobileOne-mini backbone with re-parameterization techniques and an enhanced feature pyramid network to achieve a favorable balance between accuracy and computational efficiency for real-time clinical applications. We further establish benchmark performance on the TFM segmentation subset by comparing TF-Net against several state-of-the-art medical image segmentation models. Furthermore, we design TF-Collab, a novel integrated real-time pipeline that synergistically leverages models trained on all three tasks of the TFM dataset. By sequentially orchestrating frame classification for BUT determination, pupil region localization for input standardization, and TFBU segmentation, TF-Collab fully automates the analysis. Experimental results demonstrate the effectiveness of the proposed TF-Net and TF-Collab, providing a foundation for future research in ocular surface diagnostics. 
Our code and the TFM datasets are available at https://github.com/glory-wan/TF-Net

[115] InstaGeo: Compute-Efficient Geospatial Machine Learning from Data to Deployment

Ibrahim Salihu Yusuf,Iffanice Houndayi,Rym Oualha,Mohamed Aziz Cherif,Kobby Panford-Quainoo,Arnu Pretorius

Main category: cs.CV

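TL;DR: InstaGeo is an open-source, end-to-end framework that addresses the missing data pipelines and oversized models that hinder practical use of geospatial foundation models, through automated data processing, task-specific model distillation, and seamless deployment.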

Details

Motivation: Existing geospatial foundation models lack automated workflows for processing raw satellite imagery, and fine-tuned models are large, which limits deployment in humanitarian and environmental applications. Method: InstaGeo integrates three core components: automated data curation, task-specific knowledge distillation to produce lightweight models, and seamless deployment of models as interactive web-map applications. Result: Across several tasks the reproduced results are close to the original studies (small mIoU differences); distilled models are up to 8x smaller, with lower FLOPs and CO2 emissions; and on crop segmentation the framework achieves 60.65% mIoU, 12 percentage points above prior baselines. Conclusion: InstaGeo turns research-grade geospatial models into practical, low-carbon tools and shifts geospatial AI toward data quality and application-driven innovation. Abstract: Open-access multispectral imagery from missions like Landsat 8-9 and Sentinel-2 has fueled the development of geospatial foundation models (GFMs) for humanitarian and environmental applications. Yet, their deployment remains limited by (i) the absence of automated geospatial data pipelines and (ii) the large size of fine-tuned models. Existing GFMs lack workflows for processing raw satellite imagery, and downstream adaptations often retain the full complexity of the original encoder. We present InstaGeo, an open-source, end-to-end framework that addresses these challenges by integrating: (1) automated data curation to transform raw imagery into model-ready datasets; (2) task-specific model distillation to derive compact, compute-efficient models; and (3) seamless deployment as interactive web-map applications. Using InstaGeo, we reproduced datasets from three published studies and trained models with marginal mIoU differences of -0.73 pp for flood mapping, -0.20 pp for crop segmentation, and +1.79 pp for desert locust prediction. The distilled models are up to 8x smaller than standard fine-tuned counterparts, reducing FLOPs and CO2 emissions with minimal accuracy loss. Leveraging InstaGeo's streamlined data pipeline, we also curated a larger crop segmentation dataset, achieving a state-of-the-art mIoU of 60.65%, a 12 pp improvement over prior baselines. Moreover, InstaGeo enables users to progress from raw data to model deployment within a single working day. By unifying data preparation, model compression, and deployment, InstaGeo transforms research-grade GFMs into practical, low-carbon tools for real-time, large-scale Earth observation. 
This approach shifts geospatial AI toward data quality and application-driven innovation. Source code, datasets, and model checkpoints are available at: https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML.git
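
The mIoU figures quoted above are averages of per-class intersection-over-union, and "pp" denotes a difference in percentage points. A minimal sketch of the metric on flattened label lists follows; the function name is hypothetical, and skipping classes absent from both prediction and ground truth is one common convention, not necessarily the one InstaGeo uses.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes, for flat lists of integer labels."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes that appear in neither prediction nor ground truth
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)


# Four pixels, two classes: IoU is 1/2 for class 0 and 2/3 for class 1.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(mean_iou(pred, target, num_classes=2))
```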

[116] Beyond Spectral Peaks: Interpreting the Cues Behind Synthetic Image Detection

Sara Mandelli,Diego Vila-Portela,David Vázquez-Padín,Paolo Bestagini,Fernando Pérez-González

Main category: cs.CV

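TL;DR: This paper systematically studies whether deep learning-based synthetic image detectors truly rely on periodic peaks in the frequency domain and proposes a method to remove those peaks; it finds that most detectors do not fundamentally depend on them, challenging a widespread assumption in the field.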

Details

Motivation: State-of-the-art detectors are mostly black-box models, and it is unclear whether they truly rely on periodic peaks in the frequency domain (regarded as an important marker of synthetic images), which limits their interpretability and trustworthiness. Method: A strategy for removing spectral peaks from images is proposed and its effect on several detectors is analyzed; in addition, a linear detector that relies only on frequency peaks is designed as an interpretable baseline. Result: Experiments indicate that most existing detectors are not fundamentally dependent on spectral peaks, whereas the proposed linear detector bases its decisions entirely on them. Conclusion: Spectral peaks are not the main cue for most advanced detectors; this finding challenges a common belief in the field and supports the development of more transparent and reliable forensic tools. Abstract: Over the years, the forensics community has proposed several deep learning-based detectors to mitigate the risks of generative AI. Recently, frequency-domain artifacts (particularly periodic peaks in the magnitude spectrum) have received significant attention, as they have often been considered a strong indicator of synthetic image generation. However, state-of-the-art detectors are typically used as black boxes, and it still remains unclear whether they truly rely on these peaks. This limits their interpretability and trust. In this work, we conduct a systematic study to address this question. We propose a strategy to remove spectral peaks from images and analyze the impact of this operation on several detectors. In addition, we introduce a simple linear detector that relies exclusively on frequency peaks, providing a fully interpretable baseline free from the confounding influence of deep learning. Our findings reveal that most detectors are not fundamentally dependent on spectral peaks, challenging a widespread assumption in the field and paving the way for more transparent and reliable forensic tools.

[117] Combined Hyperbolic and Euclidean Soft Triple Loss Beyond the Single Space Deep Metric Learning

Shozo Saeki,Minoru Kawahara,Hirohisa Aman

Main category: cs.CV

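TL;DR: This paper proposes CHEST, a proxy-based loss function combining hyperbolic and Euclidean spaces for deep metric learning, which achieves state-of-the-art performance on four benchmark datasets.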

Details

Motivation: Existing deep metric learning in hyperbolic space relies mainly on pair-based losses or unsupervised regularization losses and lacks an effective supervised proxy-based loss; since proxy-based losses have lower training complexity on large-scale data, a proxy-based loss suited to hyperbolic space is needed. Method: A new loss function, the CHEST loss, combines proxy-based losses in hyperbolic and Euclidean spaces with a regularization loss based on hyperbolic hierarchical clustering, jointly optimizing the representations in both spaces. Result: Experiments show that the CHEST loss outperforms existing methods on four benchmark datasets and improves the accuracy and training stability of deep metric learning in both hyperbolic and Euclidean spaces. Conclusion: Combining proxy-based losses in hyperbolic and Euclidean spaces improves deep metric learning performance; the CHEST loss offers a workable way to apply proxy-based losses in hyperbolic space. Abstract: Deep metric learning (DML) aims to learn a neural network mapping data to an embedding space, which can represent semantic similarity between data points. Hyperbolic space is attractive for DML since it can represent richer structures, such as tree structures. DML in hyperbolic space is based on pair-based loss or unsupervised regularization loss. On the other hand, supervised proxy-based losses in hyperbolic space have not been reported yet due to some issues in applying proxy-based losses in a hyperbolic space. However, proxy-based losses are attractive for large-scale datasets since they have less training complexity. To address these issues, this paper proposes the Combined Hyperbolic and Euclidean Soft Triple (CHEST) loss. CHEST loss is composed of the proxy-based losses in hyperbolic and Euclidean spaces and the regularization loss based on hyperbolic hierarchical clustering. We find that the combination of hyperbolic and Euclidean spaces improves DML accuracy and learning stability for both spaces. Finally, we evaluate the CHEST loss on four benchmark datasets, achieving a new state-of-the-art performance.
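
To make the contrast between the two spaces concrete, the sketch below computes the geodesic distance in the Poincaré ball model of hyperbolic space with curvature -1, alongside the ordinary Euclidean distance. The abstract does not state which hyperbolic model or curvature the CHEST loss uses, so the choice of the Poincaré ball here is an assumption, and the function names are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Ordinary Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def poincare_distance(u, v):
    """Geodesic distance in the Poincare ball (curvature -1).

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    Both points must lie strictly inside the unit ball.
    """
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# The same Euclidean step of 0.4 costs more hyperbolic distance near the boundary.
print(poincare_distance([0.0, 0.0], [0.4, 0.0]), euclidean_distance([0.0, 0.0], [0.4, 0.0]))
print(poincare_distance([0.5, 0.0], [0.9, 0.0]), euclidean_distance([0.5, 0.0], [0.9, 0.0]))
```

Distances grow rapidly toward the boundary of the ball, which is what lets hyperbolic embeddings accommodate tree-like structures with many leaves.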

[118] Ocular-Induced Abnormal Head Posture: Diagnosis and Missing Data Imputation

Saja Al-Dabet,Sherzod Turaev,Nazar Zaki,Arif O. Khan,Luai Eldweik

Main category: cs.CV

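TL;DR: This study proposes two deep learning frameworks (AHP-CADNet and a curriculum learning-based imputation framework) for automated diagnosis of ocular-induced abnormal head posture and for handling missing clinical data, improving diagnostic accuracy and robustness.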

Details

Motivation: Current clinical assessments are highly subjective and often constrained by incomplete medical records, which makes early, accurate diagnosis of ocular-induced abnormal head posture difficult and can lead to complications. Method: AHP-CADNet, a multi-level attention fusion framework, combines ocular landmarks, head pose features, and clinical attributes to produce interpretable predictions; a curriculum learning-based imputation framework recovers missing data from structured variables and unstructured clinical notes. Result: On the PoseGaze-AHP dataset, AHP-CADNet reaches 96.9-99.0% classification accuracy, with MAE of 0.103-0.199 and R² above 0.93 for continuous variables; the imputation framework reaches 93.46-99.78% accuracy (with PubMedBERT), and clinical dependency modeling yields significant improvements (p < 0.001). Conclusion: The two frameworks support automated diagnosis of ocular-induced abnormal head posture and remain robust when real clinical data are missing, indicating practical potential. Abstract: Ocular-induced abnormal head posture (AHP) is a compensatory mechanism that arises from ocular misalignment conditions, such as strabismus, enabling patients to reduce diplopia and preserve binocular vision. Early diagnosis minimizes morbidity and secondary complications such as facial asymmetry; however, current clinical assessments remain largely subjective and are further complicated by incomplete medical records. This study addresses both challenges through two complementary deep learning frameworks. First, AHP-CADNet is a multi-level attention fusion framework for automated diagnosis that integrates ocular landmarks, head pose features, and structured clinical attributes to generate interpretable predictions. Second, a curriculum learning-based imputation framework is designed to mitigate missing data by progressively leveraging structured variables and unstructured clinical notes to enhance diagnostic robustness under realistic data conditions. Evaluation on the PoseGaze-AHP dataset demonstrates robust diagnostic performance. AHP-CADNet achieves 96.9-99.0 percent accuracy across classification tasks and low prediction errors for continuous variables, with MAE ranging from 0.103 to 0.199 and R² exceeding 0.93. The imputation framework maintains high accuracy across all clinical variables (93.46-99.78 percent with PubMedBERT), with clinical dependency modeling yielding significant improvements (p < 0.001). These findings confirm the effectiveness of both frameworks for automated diagnosis and recovery from missing data in clinical settings.
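
The continuous-variable results above are reported as mean absolute error (MAE) and the coefficient of determination (R²). For reference, a minimal sketch of both metrics under their standard definitions is given below; the function names and toy values are hypothetical and unrelated to the PoseGaze-AHP data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Toy example with three measurements.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
print(mean_absolute_error(y_true, y_pred), r2_score(y_true, y_pred))
```

MAE is expressed in the units of the predicted variable, while R² is unitless, so the two are usually read together.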

[119] EduVerse: A User-Defined Multi-Agent Simulation Space for Education Scenario

Yiping Ma,Shiyu Hu,Buyuan Zhu,Yipei Wang,Yaxuan Kang,Shiqing Liu,Kang Hao Cheong

Main category: cs.CV

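TL;DR: EduVerse is a user-defined multi-agent simulation space that supports customization of environments, agents, and sessions, lets real users take part through a human-in-the-loop interface, and reproduces classroom cognition, interaction, and long-term evolution on the basis of its CIE architecture.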

Details

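Motivation: Existing educational AI approaches mostly target short-term or single-agent settings and cannot systematically study the open-ended cognition, social interaction, affective factors, and long-term development found in real classrooms; a reproducible platform that models these elements together is needed. Method: EduVerse builds a multi-agent simulation environment on a layered CIE (Cognition-Interaction-Evolution) architecture, with support for custom settings and multiple sessions; a human-in-the-loop mechanism integrates real users with agents, and the system is validated in middle-school Chinese classes across text genres, environments, and multiple sessions. Result: (1) Instructional alignment is high, with simulated IRF rates close to those of real classrooms; (2) group interaction and role differentiation are evident, network density is plausible, and individual variability coexists with instructional stability; (3) behavior, emotion, and cognition evolve clearly across sessions, with the positive transition rate rising by 11.7% on average, revealing structured learning trajectories. Conclusion: EduVerse balances realism, reproducibility, and interpretability, provides a scalable research platform for educational AI, and will be open-sourced to foster cross-disciplinary research. Abstract: Reproducing cognitive development, group interaction, and long-term evolution in virtual classrooms remains a core challenge for educational AI, as real classrooms integrate open-ended cognition, dynamic social interaction, affective factors, and multi-session development rarely captured together. Existing approaches mostly focus on short-term or single-agent settings, limiting systematic study of classroom complexity and cross-task reuse. We present EduVerse, the first user-defined multi-agent simulation space that supports environment, agent, and session customization. A distinctive human-in-the-loop interface further allows real users to join the space. Built on a layered CIE (Cognition-Interaction-Evolution) architecture, EduVerse ensures individual consistency, authentic interaction, and longitudinal adaptation in cognition, emotion, and behavior, reproducing realistic classroom dynamics with seamless human-agent integration. We validate EduVerse in middle-school Chinese classes across three text genres, environments, and multiple sessions. 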
Results show: (1) Instructional alignment: simulated IRF rates (0.28-0.64) closely match real classrooms (0.37-0.49), indicating pedagogical realism; (2) Group interaction and role differentiation: network density (0.27-0.40) with about one-third of peer links realized, while human-agent tasks indicate a balance between individual variability and instructional stability; (3) Cross-session evolution: the positive transition rate R+ increases by 11.7% on average, capturing longitudinal shifts in behavior, emotion, and cognition and revealing structured learning trajectories. Overall, EduVerse balances realism, reproducibility, and interpretability, providing a scalable platform for educational AI. The system will be open-sourced to foster cross-disciplinary research.
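
The phrase "about one-third of peer links realized" matches the usual definition of network density as realized links divided by possible links. A minimal sketch for an undirected interaction graph follows; whether EduVerse treats its interaction network as directed or undirected is not stated, so the undirected form, the function name, and the toy edge list are assumptions.

```python
def network_density(num_nodes, edges):
    """Density of an undirected simple graph: realized links / possible links."""
    if num_nodes < 2:
        raise ValueError("density needs at least two nodes")
    # Treat (a, b) and (b, a) as the same link and ignore self-loops.
    unique_links = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    possible_links = num_nodes * (num_nodes - 1) / 2
    return len(unique_links) / possible_links


# Four students with two distinct peer links out of six possible.
interactions = [(0, 1), (1, 2), (1, 0)]
print(network_density(4, interactions))
```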

[120] SD-MVSum: Script-Driven Multimodal Video Summarization Method and Datasets

Manolis Mylonas,Charalampia Zerva,Evlampios Apostolidis,Vasileios Mezaris

Main category: cs.CV

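TL;DR: This paper proposes SD-MVSum, a new script-driven multimodal video summarization method that combines a video's visual and spoken content and introduces a weighted cross-modal attention mechanism to promote the video segments most relevant to the user's script. Two large-scale datasets are extended to support the task, and experiments show the method is competitive with existing SOTA methods.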

Details

Motivation: Existing script-driven video summarization methods focus mainly on visual content and ignore the relevance of the spoken content to the script, which limits summary quality; a method is needed that fuses multimodal information (visual and spoken) and makes better use of the user's script. Method: SD-MVSum uses a weighted cross-modal attention mechanism to model the script-video and script-transcript relationships, exploiting semantic similarity to highlight the parts of the video most relevant to the script. The S-VideoXum and MrHiSum datasets are also extended to support multimodal training and evaluation. Result: On the extended datasets, SD-MVSum is competitive with other state-of-the-art methods on both script-driven and generic video summarization. Conclusion: By fusing the visual and spoken modalities through weighted cross-modal attention, SD-MVSum improves script-driven video summarization, and the extended datasets provide a useful resource for future research. Abstract: In this work, we extend a recent method for script-driven video summarization, originally considering just the visual content of the video, to take into account the relevance of the user-provided script also with the video's spoken content. In the proposed method, SD-MVSum, the dependence between each considered pair of data modalities, i.e., script-video and script-transcript, is modeled using a new weighted cross-modal attention mechanism. This explicitly exploits the semantic similarity between the paired modalities in order to promote the parts of the full-length video with the highest relevance to the user-provided script. Furthermore, we extend two large-scale datasets for video summarization (S-VideoXum, MrHiSum), to make them suitable for training and evaluation of script-driven multimodal video summarization methods. Experimental comparisons document the competitiveness of our SD-MVSum method against other SOTA approaches for script-driven and generic video summarization. Our new method and extended datasets are available at: https://github.com/IDT-ITI/SD-MVSum.

[121] A Hierarchical Geometry-guided Transformer for Histological Subtyping of Primary Liver Cancer

Anwen Lu,Mingxin Liu,Yiping Jiao,Hongyi Gong,Geyang Xu,Jun Chen,Jun Xu

Main category: cs.CV

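TL;DR: Proposes ARGUS, a model that improves histological subtyping of liver cancer by capturing macro-meso-micro hierarchical information in the tumor microenvironment.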

Details

Motivation: Existing methods do not fully exploit the hierarchical structure, tumor microenvironment, and geometric features contained in whole slide images, which limits liver cancer subtyping performance. Method: A micro-geometry feature is constructed to represent cell-level patterns, a Hierarchical Field-of-Views Alignment module models macro- and meso-level interactions, and a geometry-prior-guided fusion strategy fuses the multi-scale features. Result: Experiments on public and private datasets show that ARGUS achieves state-of-the-art performance in histological subtyping of liver cancer. Conclusion: ARGUS effectively integrates multi-scale pathological features and offers a strong tool for precise liver cancer diagnosis. Abstract: Primary liver malignancies are widely recognized as the most heterogeneous and prognostically diverse cancers of the digestive system. Among these, hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (ICC) emerge as the two principal histological subtypes, demonstrating significantly greater complexity in tissue morphology and cellular architecture than other common tumors. The intricate representation of features in Whole Slide Images (WSIs) encompasses abundant crucial information for liver cancer histological subtyping, regarding hierarchical pyramid structure, tumor microenvironment (TME), and geometric representation. However, recent approaches have not adequately exploited these indispensable descriptors, resulting in a limited understanding of histological representation and suboptimal subtyping performance. To mitigate these limitations, ARGUS is proposed to advance histological subtyping in liver cancer by capturing the macro-meso-micro hierarchical information within the TME. Specifically, we first construct a micro-geometry feature to represent fine-grained cell-level patterns via a geometric structure across nuclei, thereby providing a more refined and precise perspective for delineating pathological images. Then, a Hierarchical Field-of-Views (FoVs) Alignment module is designed to model macro- and meso-level hierarchical interactions inherent in WSIs. Finally, the augmented micro-geometry and FoVs features are fused into a joint representation via the proposed Geometry Prior Guided Fusion strategy for modeling holistic phenotype interactions. Extensive experiments on public and private cohorts demonstrate that ARGUS achieves state-of-the-art (SOTA) performance in histological subtyping of liver cancer, providing an effective diagnostic tool for primary liver malignancies in clinical practice.

[122] Teleportraits: Training-Free People Insertion into Any Scene

Jialu Gao,K J Joseph,Fernando De La Torre

Main category: cs.CV

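TL;DR: This paper proposes a unified training-free framework that uses pre-trained text-to-image diffusion models to insert a person from a reference image naturally into complex scenes while preserving the person's identity and appearance.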

Details

Motivation: Existing methods usually treat determining a person's location and pose and performing personalized generation as separate problems and rely on training, overlooking the link between the two and making high-quality, training-free human insertion difficult. Method: Image inversion is combined with classifier-free guidance, and a mask-guided self-attention mechanism is introduced into the diffusion model, giving background-aware global editing and high-fidelity person insertion. Result: The method achieves state-of-the-art visual results on diverse composite scene images, places people plausibly according to scene affordances, and preserves identity, clothing, and body features from a single reference image. Conclusion: According to the authors, this is the first training-free approach to realistic human insertion, and it demonstrates the potential of pre-trained diffusion models for person-scene composition. Abstract: The task of realistically inserting a human from a reference image into a background scene is highly challenging, requiring the model to (1) determine the correct location and pose of the person and (2) perform high-quality personalization conditioned on the background. Previous approaches often treat these as separate problems, overlooking their interconnections, and typically rely on training to achieve high performance. In this work, we introduce a unified training-free pipeline that leverages pre-trained text-to-image diffusion models. We show that diffusion models inherently possess the knowledge to place people in complex scenes without requiring task-specific training. By combining inversion techniques with classifier-free guidance, our method achieves affordance-aware global editing, seamlessly inserting people into scenes. Furthermore, our proposed mask-guided self-attention mechanism ensures high-quality personalization, preserving the subject's identity, clothing, and body features from just a single reference image. To the best of our knowledge, we are the first to perform realistic human insertions into scenes in a training-free manner and achieve state-of-the-art results in diverse composite scene images with excellent identity preservation in backgrounds and subjects.

[123] When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach

Daniel Gonzálbez-Biosca,Josep Cabacas-Maso,Carles Ventura,Ismael Benito-Altamirano

Main category: cs.CV

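TL;DR: This paper proposes a new multimodal architecture for automatically editing multicamera recordings of classical concerts. The problem is split into two sub-tasks, "when to cut" and "how to cut"; by combining audio, image embeddings, and temporal features, the approach outperforms previous baselines at cut-point detection and is competitive at visual shot selection.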

Details

Motivation: Automated video editing is underexplored in computer vision and multimedia, especially compared with the rapid progress in video generation and scene understanding, so more effective multimodal methods are needed for complex multicamera editing. Method: A lightweight convolutional-transformer model fuses log-mel spectrograms, optional image embeddings, and scalar temporal features to decide when to cut; for how to cut, a CLIP encoder replaces the traditional ResNet and distractors are restricted to segments from the same concert. The dataset is built with pseudo-labeling, automatically clustering raw video into shot segments. Result: The model outperforms existing baselines at detecting cut points and is competitive at visual shot selection. Conclusion: The proposed multimodal architecture addresses key problems in editing multicamera concert video and advances the field. Abstract: Automated video editing remains an underexplored task in the computer vision and multimedia domains, especially when contrasted with the growing interest in video generation and scene understanding. In this work, we address the specific challenge of editing multicamera recordings of classical music concerts by decomposing the problem into two key sub-tasks: when to cut and how to cut. Building on recent literature, we propose a novel multimodal architecture for the temporal segmentation task (when to cut), which integrates log-mel spectrograms from the audio signals, an optional image embedding, and scalar temporal features through a lightweight convolutional-transformer pipeline. For the spatial selection task (how to cut), we improve on the literature by replacing older backbones, e.g. ResNet, with a CLIP-based encoder and constraining distractor selection to segments from the same concert. Our dataset was constructed following a pseudo-labeling approach, in which raw video data was automatically clustered into coherent shot segments. We show that our models outperform previous baselines in detecting cut points and provide competitive visual shot selection, advancing the state of the art in multimodal automated video editing.

[124] Development and Validation of a Low-Cost Imaging System for Seedling Germination Kinetics through Time-Cumulative Analysis

M. Torrente,A. Follador,A. Calcante,P. Casati,R. Oberti

Main category: cs.CV

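TL;DR: This study develops an image-based analysis method that incorporates time-series information to monitor the effect of Rhizoctonia solani (R. solani) infection on lettuce seed germination and early growth, accurately identifying and quantifying seedlings under complex growing conditions.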

Details

Motivation: To assess the effect of the pathogen R. solani on the germination and early development of Lactuca sativa L. seeds, and to overcome the failure of traditional image analysis when seedlings overlap. Method: A low-cost multi-camera system continuously captures germination images of infected and control groups; a new image analysis pipeline combines morphological and spatial features with temporal information to identify and count overlapping seedlings. Result: The method counts seedlings and assesses vigor accurately (R² = 0.98, RMSE = 1.12) and is especially effective in later stages with dense or intertwined growth, where conventional segmentation fails; R. solani significantly reduces germination rate and seedling vigor. Conclusion: Combining low-cost imaging hardware with advanced computational methods enables non-destructive, scalable plant phenotyping, and the proposed temporal integration improves the accuracy and robustness of image analysis in complex scenes. Abstract: The study investigates the effects of R. solani inoculation on the germination and early development of Lactuca sativa L. seeds using a low-cost, image-based monitoring system. Multiple cameras were deployed to continuously capture images of the germination process in both infected and control groups. The objective was to assess the impact of the pathogen by analyzing germination dynamics and growth over time. To achieve this, a novel image analysis pipeline was developed. The algorithm integrates both morphological and spatial features to identify and quantify individual seedlings, even under complex conditions where traditional image analysis fails. A key innovation of the method lies in its temporal integration: each analysis step considers not only the current status of the seedlings but also their development across prior time points. This approach enables robust discrimination of individual seedlings, especially when overlapping leaves significantly hinder object separation. The method demonstrated high accuracy in seedling counting and vigor assessment, even in challenging scenarios characterized by dense and intertwined growth. Results confirm that R. solani infection significantly reduces germination rates and early seedling vigor. The study also validates the feasibility of combining low-cost imaging hardware with advanced computational tools to obtain phenotyping data in a non-destructive and scalable manner. The temporal integration enabled accurate quantification of germinated seeds and precise determination of seedling emergence timing. This approach proved particularly effective in later stages of the experiment, where conventional segmentation techniques failed due to overlapping or intertwined seedlings, making accurate counting difficult. The method achieved a coefficient of determination of 0.98 and a root mean square error (RMSE) of 1.12, demonstrating its robustness and reliability.
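
The two figures quoted above (coefficient of determination and RMSE) can be reproduced for any pair of manual and automatic counts. The sketch below is a minimal illustration under stated assumptions: the count lists are hypothetical and are not data from the study, and the abstract does not say whether R² was computed this way or as a squared correlation.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted counts."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical seedling counts per time point (illustrative only).
manual_counts = [0, 3, 8, 15, 21, 24]
automatic_counts = [0, 4, 8, 14, 22, 24]

print("RMSE:", rmse(manual_counts, automatic_counts))
print("R^2 :", r_squared(manual_counts, automatic_counts))
```

An RMSE near 1 therefore means the automatic count is typically off by about one seedling per image, in the units of the counts themselves.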

[125] Context Matters: Learning Global Semantics for Visual Reasoning and Comprehension

Jike Zhong,Yuxiang Lai,Xiaofeng Yang,Konstantinos Psounis

Main category: cs.CV

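TL;DR: This paper proposes an object-level semantic visual tokenization approach: in masked image modeling, masks are applied to visual objects rather than random image patches, which strengthens the semantic and contextual learning of vision models and improves their reasoning on tasks such as visual question answering.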

Details

Motivation: Vision models lag behind language models in reasoning and in-context learning, mainly because current ViT training lacks semantic and contextual guidance. The paper aims to narrow this gap with a semantically grounded objective. Method: "Objects" are modeled as the visual counterpart of "words": within the masked image modeling (MIM) framework, whole visual objects are masked, pushing the model to learn global context and semantic relations among visual elements. Result: Experiments show that object-level representation helps learn a real-world distribution and avoids pixel-averaging shortcuts; combined with MLLMs, it gives stronger reasoning and contextual understanding on multimodal tasks such as VQA, GQA, and ScienceQA. Conclusion: Object-level encoding improves the semantic learning and reasoning of vision models and offers a plausible direction for building stronger vision encoders and tokenizers. Abstract: Recent advances in language modeling have witnessed the rise of highly desirable emergent capabilities, such as reasoning and in-context learning. However, vision models have yet to exhibit comparable progress in these areas. In this paper, we argue that this gap could stem from the lack of semantic and contextual guidance in current vision transformer (ViT) training schemes, and such a gap can be narrowed through the design of a semantic-grounded objective. Specifically, we notice that individual words in natural language are inherently semantic, and modeling directly on word tokens naturally learns a realistic distribution. In contrast, ViTs rely on spatial patchification, which inevitably lacks semantic information. To bridge this gap, we propose to directly model "object" as the visual equivalent of "word," pushing the model to learn the global context and semantics among visual elements. We investigate our hypotheses via masked image modeling (MIM), a framework where our approach can be readily tested by applying masks to visual objects rather than random patches. Considerable evidence from qualitative and quantitative evaluations reveals a key finding: object-level representation alone helps to learn a real-world distribution, whereas pixel-averaging shortcuts are often learned without it. Moreover, further evaluations with multimodal LLMs (MLLM) on visual question answering (VQA, GQA, ScienceQA) tasks demonstrate the strong reasoning and contextual understanding gained with this simple objective. We hope our study highlights the effectiveness of object-level encoding and provides a plausible direction for developing stronger vision encoders and tokenizers. Code and model will be publicly released. Keywords: Semantic Visual Tokenizer, Vision Reasoning, In-context Learning, Multimodal Reasoning
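
The difference between masking objects and masking random patches can be made concrete on a small patch grid. The following is a sketch under assumptions: the grid of per-patch object ids is hypothetical (in practice it would come from some segmentation step that the abstract does not specify), and the function names are illustrative rather than taken from the authors' code.

```python
import random


def object_level_mask(patch_object_ids, objects_to_mask):
    """Mask (True) every patch that belongs to one of the chosen objects."""
    chosen = set(objects_to_mask)
    return [[obj in chosen for obj in row] for row in patch_object_ids]


def random_patch_mask(height, width, num_masked, seed=0):
    """Baseline: mask the same number of patches at random positions."""
    rng = random.Random(seed)
    flat = [True] * num_masked + [False] * (height * width - num_masked)
    rng.shuffle(flat)
    return [flat[r * width:(r + 1) * width] for r in range(height)]


# Hypothetical 4x4 patch grid; each integer is an object id (0 = background).
grid = [
    [0, 0, 1, 1],
    [0, 2, 1, 1],
    [2, 2, 2, 0],
    [0, 2, 0, 0],
]

obj_mask = object_level_mask(grid, objects_to_mask=[1])
num_masked = sum(sum(row) for row in obj_mask)
rand_mask = random_patch_mask(4, 4, num_masked)

for row in obj_mask:
    print("".join("#" if m else "." for m in row))
```

With the object mask, the whole of object 1 disappears, so the model cannot reconstruct it by averaging neighbouring patches of the same object; the random mask hides the same number of patches but usually leaves part of every object visible.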

[126] AgeBooth: Controllable Facial Aging and Rejuvenation via Diffusion Models

Shihao Zhu,Bohan Cao,Ziheng Ouyang,Zhen Li,Peng-Tao Jiang,Qibin Hou

Main category: cs.CV

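TL;DR: Proposes AgeBooth, an age-specific fine-tuning method that needs no large paired cross-age dataset; age-conditioned prompt blending and a LoRA fusion strategy enable high-fidelity, identity-consistent face generation across ages from a single reference image.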

Details

Motivation: Existing diffusion models struggle to control age precisely while preserving identity, and fine-tuning usually depends on costly paired cross-age datasets. Method: Age-conditioned prompt blending and an age-specific LoRA fusion strategy based on SVDMix are introduced, with age-specific fine-tuning applied to adapter-based identity personalization models. Result: Experiments show that AgeBooth surpasses existing editing-based methods in age control accuracy and image quality, producing high-quality portraits across ages from a single reference image. Conclusion: AgeBooth improves the age control of diffusion models without paired data and yields realistic, identity-consistent faces across ages. Abstract: Recent diffusion model research focuses on generating identity-consistent images from a reference photo, but such models struggle to control age accurately while preserving identity, and fine-tuning them often requires costly paired images across ages. In this paper, we propose AgeBooth, a novel age-specific fine-tuning approach that can effectively enhance the age control capability of adapter-based identity personalization models without the need for expensive age-varied datasets. To reduce dependence on a large amount of age-labeled data, we exploit the linear nature of aging by introducing age-conditioned prompt blending and an age-specific LoRA fusion strategy that leverages SVDMix, a matrix fusion technique. These techniques enable high-quality generation of intermediate-age portraits. Our AgeBooth produces realistic and identity-consistent face images across different ages from a single reference image. Experiments show that AgeBooth achieves superior age control and visual quality compared to previous state-of-the-art editing-based methods.

[127] Data Factory with Minimal Human Effort Using VLMs

Jiaojiao Ye,Jiaxing Zhong,Qian Xie,Yuzhou Zhou,Niki Trigoni,Andrew Markham

Main category: cs.CV

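TL;DR: Proposes a training-free diffusion pipeline that combines ControlNet with vision-language models to generate synthetic images with pixel-level labels, improving semantic segmentation performance.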

Details

Motivation: Traditional data augmentation struggles to manipulate high-level semantic attributes such as materials and textures, while existing diffusion-based methods are either computationally expensive or underperform. Method: Pretrained ControlNet is combined with vision-language models (VLMs), together with a Multi-way Prompt Generator, a Mask Generator, and a High-quality Image Selection module, to form a training-free data augmentation pipeline. Result: The method performs well on one-shot semantic segmentation on PASCAL-5i and COCO-20i and outperforms concurrent work. Conclusion: The method generates high-fidelity, diverse labeled images and improves downstream tasks without additional training. Abstract: Generating sufficient and diverse data through augmentation offers an efficient solution to the time-consuming and labour-intensive process of collecting images and annotating them at the pixel level. Traditional data augmentation techniques often face challenges in manipulating high-level semantic attributes, such as materials and textures. In contrast, diffusion models offer a robust alternative by effectively utilizing text-to-image or image-to-image transformation. However, existing diffusion-based methods are either computationally expensive or compromise on performance. To address this issue, we introduce a novel training-free pipeline that integrates pretrained ControlNet and Vision-Language Models (VLMs) to generate synthetic images paired with pixel-level labels. This approach eliminates the need for manual annotations and significantly improves downstream tasks. To improve fidelity and diversity, we add a Multi-way Prompt Generator, a Mask Generator, and a High-quality Image Selection module. Our results on PASCAL-5i and COCO-20i show promising performance and outperform concurrent work on one-shot semantic segmentation.

[128] Redefining Generalization in Visual Domains: A Two-Axis Framework for Fake Image Detection with FusionDetect

Amirtaha Amanzadi,Zahra Dehghanian,Hamid Beigy,Hamid R. Rabiee

Main category: cs.CV

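TL;DR: This paper presents the OmniGen benchmark and the FusionDetect method to improve the generalization of generated-image detection across generators and across visual domains.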

Details

Motivation: Existing work focuses mostly on cross-generator generalization and neglects the challenge of generalizing across visual domains; a more realistic evaluation setting is needed. Method: FusionDetect fuses features from two frozen foundation models, CLIP and DINOv2, into a unified feature space that adapts to changes in both content and generator design. Result: On established benchmarks FusionDetect is on average 3.87% more accurate and 6.13% more precise than its closest competitor; on OmniGen it improves accuracy by 4.48% and is robust to common image perturbations. Conclusion: Together with the new OmniGen benchmark, FusionDetect provides a high-performing detector, an evaluation framework, and data for universal AI image detection. Abstract: The rapid development of generative models has made it increasingly crucial to develop detectors that can reliably detect synthetic images. Although most work so far has focused on cross-generator generalization, we argue that this viewpoint is too limited. Detecting synthetic images involves another equally important challenge: generalization across visual domains. To bridge this gap, we present the OmniGen Benchmark. This comprehensive evaluation dataset incorporates 12 state-of-the-art generators, providing a more realistic way of evaluating detector performance. In addition, we introduce a new method, FusionDetect, aimed at addressing both axes of generalization. FusionDetect draws on the benefits of two frozen foundation models: CLIP and DINOv2. By deriving features from both complementary models, we develop a cohesive feature space that naturally adapts to changes in both the content and the design of the generator. Our extensive experiments demonstrate that FusionDetect not only delivers a new state of the art, being 3.87% more accurate than its closest competitor and 6.13% more precise on average on established benchmarks, but also achieves a 4.48% increase in accuracy on OmniGen, along with exceptional robustness to common image perturbations. We introduce not only a top-performing detector, but also a new benchmark and framework for furthering universal AI image detection. The code and dataset are available at http://github.com/amir-aman/FusionDetect

[129] ALISE: Annotation-Free LiDAR Instance Segmentation for Autonomous Driving

Yongxuan Lyu,Guangfeng Jiang,Hongsi Liu,Jun Liu

Main category: cs.CV

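TL;DR: This paper proposes ALISE, a framework that performs LiDAR point-cloud instance segmentation with no human annotation. Vision foundation models and a spatio-temporal voting module produce high-quality pseudo-labels, and 2D prior-based losses plus a prototype-based contrastive loss improve feature learning. ALISE sets a new SOTA for unsupervised 3D instance segmentation and even surpasses a method supervised with GT 2D bounding boxes.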

Details

Motivation: Manual annotation of outdoor LiDAR point clouds for instance segmentation is extremely expensive, and existing methods still depend on some human labels, so a fully annotation-free solution is needed. Method: Vision foundation models (VFMs) guided by text and images generate initial pseudo-labels, a spatio-temporal voting module that combines 2D and 3D semantics refines them, and 2D prior-based losses and a prototype-based contrastive loss strengthen 3D feature learning. Result: The method sets a new SOTA for unsupervised 3D instance segmentation with 50.95% mAP, 2.53% higher than MWSIS, which uses ground-truth 2D box supervision. Conclusion: ALISE performs LiDAR instance segmentation without any annotation; multimodal guidance and semantic-consistency modeling substantially improve unsupervised performance and show potential to replace manual labeling. Abstract: The manual annotation of outdoor LiDAR point clouds for instance segmentation is extremely costly and time-consuming. Current methods attempt to reduce this burden but still rely on some form of human labeling. To completely eliminate this dependency, we introduce ALISE, a novel framework that performs LiDAR instance segmentation without any annotations. The central challenge is to generate high-quality pseudo-labels in a fully unsupervised manner. Our approach starts by employing Vision Foundation Models (VFMs), guided by text and images, to produce initial pseudo-labels. We then refine these labels through a dedicated spatio-temporal voting module, which combines 2D and 3D semantics for both offline and online optimization. To achieve superior feature learning, we further introduce two forms of semantic supervision: a set of 2D prior-based losses that inject visual knowledge into the 3D network, and a novel prototype-based contrastive loss that builds a discriminative feature space by exploiting 3D semantic consistency. This comprehensive design results in significant performance gains, establishing a new state of the art for unsupervised 3D instance segmentation. Remarkably, our approach even outperforms MWSIS, a method that operates with supervision from ground-truth (GT) 2D bounding boxes, by a margin of 2.53% in mAP (50.95% vs. 48.42%).

[130] OneVision: An End-to-End Generative Framework for Multi-view E-commerce Vision Search

Zexin Zheng,Huangyu Dai,Lingtao Mao,Xinyu Sun,Zihan Liang,Ben Chen,Yuqing Ding,Chenyi Lei,Wenwu Ou,Han Li,Kun Gai

Main category: cs.CV

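TL;DR: This paper proposes OneVision, an end-to-end generative framework that addresses the representation discrepancy and conflicting optimization objectives of the traditional multi-stage cascading architecture for vision search.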

Details

Motivation: In the traditional multi-stage vision search architecture, representations of the query object differ across views and the stages optimize inconsistent objectives, making it hard to optimize user experience and conversion at once. Method: OneVision builds on vision-aligned residual quantization (VRQ) encoding to align multi-view representations and adopts a multi-stage semantic alignment scheme to incorporate user-specific information. Result: In offline evaluation it performs on par with the online MCA while improving inference efficiency by 21%; in A/B tests it raises item CTR by 2.15%, CVR by 2.27%, and order volume by 3.12%. Conclusion: A generative architecture centered on semantic IDs can unify retrieval and personalization and simplify serving while improving efficiency and business metrics. Abstract: Traditional vision search, similar to search and recommendation systems, follows the multi-stage cascading architecture (MCA) paradigm to balance efficiency and conversion. Specifically, the query image undergoes feature extraction, recall, pre-ranking, and ranking stages, ultimately presenting the user with semantically similar products that meet their preferences. However, the multi-view representation discrepancy of the same object in the query and the conflicting optimization objectives across these stages make it difficult to achieve Pareto optimality in both user experience and conversion. In this paper, an end-to-end generative framework, OneVision, is proposed to address these problems. OneVision builds on VRQ, a vision-aligned residual quantization encoding, which can align the vastly different representations of an object across multiple viewpoints while preserving the distinctive features of each product as much as possible. Then a multi-stage semantic alignment scheme is adopted to maintain strong visual similarity priors while effectively incorporating user-specific information for personalized preference generation. In offline evaluations, OneVision performs on par with online MCA, while improving inference efficiency by 21% through dynamic pruning. In A/B tests, it achieves significant online improvements: +2.15% item CTR, +2.27% CVR, and +3.12% order volume. These results demonstrate that a semantic-ID-centric, generative architecture can unify retrieval and personalization while simplifying the serving pathway.
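
For readers unfamiliar with the residual quantization that VRQ builds on, the sketch below shows the generic technique only: each level picks the nearest codeword and passes the remaining residual to the next level, yielding a short tuple of ids (a "semantic ID"). The codebooks and the input vector are hypothetical, and the vision-alignment part of VRQ is not described in the abstract and is not reproduced here.

```python
def nearest(codebook, vec):
    """Index of the codeword with the smallest squared Euclidean distance."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def residual_quantize(vec, codebooks):
    """Quantize level by level; each level encodes what the previous ones missed."""
    residual = list(vec)
    ids = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        ids.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return ids, residual


def reconstruct(ids, codebooks):
    """Sum the selected codewords to approximate the original vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(ids, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-level codebooks for 2-D embeddings (illustrative only).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # coarse level
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine level
]

ids, residual = residual_quantize([1.2, 0.9], codebooks)
print("ids:", ids, "reconstruction:", reconstruct(ids, codebooks))
```

Items whose embeddings are close share their leading ids, which is what lets a generative model retrieve by decoding an id sequence instead of scoring every candidate.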

[131] A Novel Technique for Robust Training of Deep Networks With Multisource Weak Labeled Remote Sensing Data

Gianmarco Perantoni,Lorenzo Bruzzone

Main category: cs.CV

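TL;DR: Proposes a deep learning training method that combines multisource weakly labeled data with a small reliable dataset; transition matrices describing each source's error statistics are used to weight labels from different sources at the gradient level, improving remote sensing image scene classification.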

Details

Motivation: Deep neural networks need large amounts of high-quality labeled data, but in remote sensing highly reliable labels are costly and scarce, while plenty of lower-quality labeled data (such as obsolete maps) is easy to obtain; this weak supervision should be exploited to improve model performance. Method: A small, highly reliable labeled dataset is combined with one or more weakly labeled sources to build a multisource labeled dataset. A new training strategy models each source's labeling errors with a transition matrix and embeds that matrix into the labels during training, so that samples from different sources are weighted differently per class in the gradient update. Result: Experiments on several datasets validate the method and show that it is robust and can exploit unreliable labels to improve model performance. Conclusion: By modeling the differing reliability of multisource labels, the method improves deep models for remote sensing scene classification without extra annotation cost. Abstract: Deep learning has gained broad interest in remote sensing image scene classification thanks to the effectiveness of deep neural networks in extracting the semantics from complex data. However, deep networks require large amounts of training samples to obtain good generalization capabilities and are sensitive to errors in the training labels. This is a problem in remote sensing since highly reliable labels can be obtained only at high cost and in limited amounts. However, many sources of less reliable labeled data are available, e.g., obsolete digital maps. In order to train deep networks with larger datasets, we propose both the combination of single or multiple weak sources of labeled data with a small but reliable dataset to generate multisource labeled datasets and a novel training strategy where the reliability of each source is taken into consideration. This is done by exploiting the transition matrices describing the statistics of the errors of each source. The transition matrices are embedded into the labels and used during the training process to weigh each label according to the related source. The proposed method acts as a weighting scheme at gradient level, where each instance contributes with different weights to the optimization of different classes. The effectiveness of the proposed method is validated by experiments on different datasets. The results demonstrate the robustness of the proposed method and its capability to leverage unreliable sources of labels.
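
One plausible reading of "embedding the transition matrices into the labels" is to turn each observed label into a soft label, namely the posterior over true classes given the source's error statistics. The sketch below follows that reading and is an assumption, not the authors' formulation: the convention `transition[true][observed]`, the uniform prior, and the matrices themselves are all hypothetical. With a cross-entropy loss, the gradient with respect to the logits is `softmax(logits) - soft_label`, which is how each instance ends up contributing with different weights to different classes.

```python
import math


def soft_label(observed, transition, prior):
    """Posterior over true classes given the observed label from one source.

    transition[i][j] is assumed to be P(observed = j | true = i) for that source.
    """
    joint = [prior[i] * transition[i][observed] for i in range(len(prior))]
    total = sum(joint)
    return [j / total for j in joint]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def grad_wrt_logits(logits, target):
    """Gradient of cross-entropy with a soft target with respect to the logits."""
    return [p - t for p, t in zip(softmax(logits), target)]


# Hypothetical two-class example: a reliable source and a noisy one.
reliable = [[1.0, 0.0], [0.0, 1.0]]
weak = [[0.8, 0.2], [0.4, 0.6]]
prior = [0.5, 0.5]
logits = [0.0, 0.0]

g_reliable = grad_wrt_logits(logits, soft_label(0, reliable, prior))
g_weak = grad_wrt_logits(logits, soft_label(0, weak, prior))
print("reliable source gradient:", g_reliable)
print("weak source gradient    :", g_weak)
```

The same observed label pushes the network harder when it comes from the reliable source than when it comes from the noisy one, and the size of the push differs per class.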

[132] Mysteries of the Deep: Role of Intermediate Representations in Out of Distribution Detection

I. M. De la Jara,C. Rodriguez-Opazo,D. Teney,D. Ranasinghe,E. Abbasnejad

Main category: cs.CV

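TL;DR: Proposes a training-free OOD detection method based on intermediate-layer representations; an entropy-based criterion automatically selects the layers with the most complementary information, improving far-OOD and near-OOD detection.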

Details

Motivation: Most existing OOD detection methods rely on the final-layer representation of pre-trained models and ignore the rich distribution-shift signals that intermediate layers may contain, which limits detection performance. Method: The method exploits the diversity of intermediate representations created by residual connections and introduces an entropy-based criterion that identifies the most complementary intermediate layers without access to OOD data, then incorporates those layers' representations into OOD detection. Result: Across model architectures and training objectives, detection accuracy improves by up to 10% on far-OOD and by over 7% on near-OOD benchmarks compared with state-of-the-art training-free methods. Conclusion: Intermediate layers carry rich OOD signals, and using them selectively improves detection; the work opens a new direction for OOD detection and reveals how training objectives and architectures affect confidence-based OOD methods. Abstract: Out-of-distribution (OOD) detection is essential for reliably deploying machine learning models in the wild. Yet, most methods treat large pre-trained models as monolithic encoders and rely solely on their final-layer representations for detection. We challenge this wisdom. We reveal that the intermediate layers of pre-trained models, shaped by residual connections that subtly transform input projections, can encode surprisingly rich and diverse signals for detecting distributional shifts. Importantly, to exploit latent representation diversity across layers, we introduce an entropy-based criterion to automatically identify layers offering the most complementary information in a training-free setting, without access to OOD data. We show that selectively incorporating these intermediate representations can increase the accuracy of OOD detection by up to 10% in far-OOD and over 7% in near-OOD benchmarks compared to state-of-the-art training-free methods across various model architectures and training objectives. Our findings reveal a new avenue for OOD detection research and uncover the impact of various training objectives and model architectures on confidence-based OOD detection methods.

[133] Rasterized Steered Mixture of Experts for Efficient 2D Image Regression

Yi-Hsin Li,Thomas Sikora,Sebastian Knorr,Mårten Sjöström

Main category: cs.CV

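TL;DR: Proposes a rasterization-based optimization strategy that combines the edge-aware gating of the Steered Mixture of Experts with the efficiency of rasterized Gaussian kernel rendering, accelerating 2D image regression while preserving model sparsity and reconstruction quality.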

Details

Motivation: The Steered Mixture of Experts framework performs well on tasks such as image reconstruction, but its high computational cost limits practical use. Method: A rasterization-based optimization strategy replaces global iterative optimization with a rasterized formulation, giving faster parameter updates and a more memory-efficient representation. Result: The method markedly improves computational efficiency and memory use, and supports applications such as native super-resolution and image denoising that are not directly achievable with standard rasterized Gaussian kernel approaches. Conclusion: Combining rasterized optimization with the edge-aware structure of the Steered Mixture of Experts strikes a new balance between computational efficiency and reconstruction fidelity. Abstract: The Steered Mixture of Experts regression framework has demonstrated strong performance in image reconstruction, compression, denoising, and super-resolution. However, its high computational cost limits practical applications. This work introduces a rasterization-based optimization strategy that combines the efficiency of rasterized Gaussian kernel rendering with the edge-aware gating mechanism of the Steered Mixture of Experts. The proposed method is designed to accelerate two-dimensional image regression while maintaining the model's inherent sparsity and reconstruction quality. By replacing global iterative optimization with a rasterized formulation, the method achieves significantly faster parameter updates and more memory-efficient model representations. In addition, the proposed framework supports applications such as native super-resolution and image denoising, which are not directly achievable with standard rasterized Gaussian kernel approaches. The combination of fast rasterized optimization with the edge-aware structure of the Steered Mixture of Experts provides a new balance between computational efficiency and reconstruction fidelity for two-dimensional image processing tasks.
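
The "edge-aware gating" mentioned above comes from how a mixture-of-experts regressor normalizes its kernel responses. The sketch below is a one-dimensional toy under assumptions: it uses unnormalized Gaussian kernels with constant experts, the kernel parameters are hypothetical, and it shows only the gating idea, not the rasterized optimization the paper contributes.

```python
import math


def gaussian(x, mu, sigma):
    """Unnormalized Gaussian kernel response."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def smoe_predict(x, kernels):
    """Gated regression: softmax-like normalization of kernel responses.

    kernels: list of (mu, sigma, mixing_weight, expert_value) tuples.
    """
    weights = [pi * gaussian(x, mu, sigma) for mu, sigma, pi, _ in kernels]
    total = sum(weights)
    return sum(w / total * value for w, (_, _, _, value) in zip(weights, kernels))


# Hypothetical kernels: a dark region centred at 0 and a bright one at 1.
kernels = [
    (0.0, 0.1, 0.5, 0.0),
    (1.0, 0.1, 0.5, 1.0),
]

for x in (0.2, 0.5, 0.8):
    print(x, round(smoe_predict(x, kernels), 6))
```

Because the gate is a ratio of kernel responses, the prediction stays flat near each centre and switches sharply where the responses cross, which is why a few kernels can represent an edge. Far from all centres the responses underflow to zero, so a practical implementation would guard the division.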

[134] Deformable Image Registration for Self-supervised Cardiac Phase Detection in Multi-View Multi-Disease Cardiac Magnetic Resonance Images

Sven Koehler,Sarah Kaye Mueller,Jonathan Kiekenap,Gerald Greil,Tarique Hussain,Samir Sarikouch,Florian André,Norbert Frey,Sandy Engelhardt

Main category: cs.CV

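TL;DR: Proposes a self-supervised deep learning method that detects multiple keyframes in short-axis and four-chamber long-axis cine cardiac magnetic resonance images using image registration and a 1D motion descriptor, with markedly better detection accuracy than a volume-based approach.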

Details

Motivation: Traditional automatic methods detect cardiac-cycle keyframes (end-systole and end-diastole) only from left ventricular volume curves, which gives little insight into myocardial motion and limits fine-grained analysis of cardiac dynamics. Method: Dense deformable registration fields are extracted from the images and used to compute a 1D motion descriptor that captures global contraction and relaxation patterns; keyframes are then determined from this curve with a simple set of rules. The self-supervised framework is evaluated independently on three public, multicentre, multidisease datasets, and generalisation is tested on a dataset of rare congenital heart defects. Result: For ED and ES detection, accuracy measured by cyclic frame difference (cFD) improves over the volume-based approach by 30%-51% for short-axis views and 11%-47% for four-chamber long-axis views; mean cFD is below 1.31 frames for SAX and 1.73 for LAX; ED, ES, and three additional keyframes are detected. Conclusion: The self-supervised method detects multiple keyframes in cine CMR more accurately and supports temporally aligned inter- and intra-patient analysis of cardiac dynamics regardless of cycle or phase length. Abstract: Cardiovascular magnetic resonance (CMR) is the gold standard for assessing cardiac function, but individual cardiac cycles complicate automatic temporal comparison or sub-phase analysis. Accurate cardiac keyframe detection can eliminate this problem. However, automatic methods solely derive end-systole (ES) and end-diastole (ED) frames from left ventricular volume curves, which do not provide a deeper insight into myocardial motion. We propose a self-supervised deep learning method detecting five keyframes in short-axis (SAX) and four-chamber long-axis (4CH) cine CMR. Initially, dense deformable registration fields are derived from the images and used to compute a 1D motion descriptor, which provides valuable insights into global cardiac contraction and relaxation patterns. From these characteristic curves, keyframes are determined using a simple set of rules. The method was independently evaluated for both views using three public, multicentre, multidisease datasets. The M&Ms-2 dataset (n=360) was used for training and evaluation, and the M&Ms (n=345) and ACDC (n=100) datasets for repeatability control. Furthermore, generalisability to patients with rare congenital heart defects was tested using the German Competence Network (GCN) dataset. Compared with the volume-based approach, our self-supervised approach improved detection accuracy by 30%-51% for SAX and 11%-47% for 4CH in ED and ES, as measured by cyclic frame difference (cFD). We can detect ED and ES, as well as three additional keyframes throughout the cardiac cycle, with a mean cFD below 1.31 frames for SAX and 1.73 for LAX. Our approach enables temporally aligned inter- and intra-patient analysis of cardiac dynamics, irrespective of cycle or phase lengths. GitHub repository: https://github.com/Cardio-AI/cmr-multi-view-phase-detection.git
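
The cyclic frame difference (cFD) used above is not defined in the abstract. A common way to compare frame indices on a periodic sequence is to take the shorter way around the cycle, and the sketch below assumes that definition; the function name and the 30-frame cycle are illustrative.

```python
def cyclic_frame_difference(predicted, reference, num_frames):
    """Distance between two frame indices on a cycle of num_frames frames.

    Assumed definition: min(|p - r|, T - |p - r|), so that the last and the
    first frame of a cine loop count as neighbours.
    """
    diff = abs(predicted - reference) % num_frames
    return min(diff, num_frames - diff)


# Hypothetical 30-frame cine sequence.
print(cyclic_frame_difference(1, 29, 30))   # wraps around the cycle
print(cyclic_frame_difference(10, 12, 30))  # ordinary difference
```

Under this definition a prediction at frame 1 for a reference at frame 29 is penalized by 2 frames rather than 28, which matters for end-diastole because it often sits at the boundary of the recorded cycle.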

[135] Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow

Ruyang Liu,Shangkun Sun,Haoran Tang,Ge Li,Wei Gao

Main category: cs.CV

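TL;DR: This paper proposes Flow4Agent, a framework that uses motion priors from optical flow to improve long-video understanding in multimodal large language models; Temporal Granularity Optimization and Motion Token Pruning reduce spatio-temporal redundancy, and the method reports leading results on several long-video benchmarks.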

Details

Motivation: Long-video understanding is hampered by spatio-temporal redundancy and by the limited context length of multimodal large language models; existing methods extract key information with semantic priors (such as CLIP) and make little use of motion information. Method: Flow4Agent has two core modules: (1) Temporal Granularity Optimization (TGO), which groups similar frames using coarse optical-flow priors and filters irrelevant scenes using semantic priors; (2) Motion Token Pruning (MTP), which prunes highly redundant intra-frame visual tokens using fine-grained optical flow. Result: The method outperforms existing approaches on several long-video benchmarks, reaching 64.7% on Video-MME, 71.4% on MLVU, and 60.4% on LongVideoBench, with the largest gains on hour-level videos. Conclusion: By introducing optical-flow motion priors, Flow4Agent reduces spatio-temporal redundancy in long videos and offers a new approach to LLM-based long-video understanding. Abstract: Long-form video understanding has always been a challenging problem due to the significant redundancy in both temporal and spatial contents. This challenge is further exacerbated by the limited context length of Multimodal Large Language Models (MLLMs). To address this issue, many previous works have attempted to extract key video information, where the "key" is typically semantic-aware and heavily dependent on the CLIP model as prior. In this paper, we propose Flow4Agent, a novel framework that pioneers the use of motion priors from optical flow to facilitate LLM-based long video understanding. Flow4Agent mitigates the redundancy in long videos at both temporal and spatial levels through two core modules. Temporal Granularity Optimization (TGO) adaptively refines frame-level hierarchies, first leveraging coarse flow priors to group similar visual contents and then applying semantic priors to filter out highly irrelevant scene information. Motion Token Pruning (MTP) further refines the intra-frame visual representations, pruning high-redundancy video tokens using fine-grained optical flow information. Extensive experiments demonstrate that Flow4Agent outperforms existing methods across a wide range of video MLLM benchmarks, especially for hour-level video understanding tasks, achieving 64.7% on Video-MME, 71.4% on MLVU and 60.4% on LongVideoBench.
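
The abstract does not say how Motion Token Pruning scores tokens. One simple reading, sketched below purely as an assumption, is that tokens whose patch shows little optical-flow motion are treated as redundant and dropped, keeping a fixed fraction per frame. The flow vectors, the keep ratio, and the function name are hypothetical.

```python
import math


def prune_tokens_by_motion(flow, keep_ratio):
    """Return the sorted indices of tokens to keep within one frame.

    flow: list of (dx, dy) mean optical-flow vectors, one per visual token.
    Tokens are ranked by flow magnitude and the top fraction is kept
    (assumption: low-motion tokens are the redundant ones).
    """
    magnitudes = [math.hypot(dx, dy) for dx, dy in flow]
    num_keep = max(1, round(len(flow) * keep_ratio))
    ranked = sorted(range(len(flow)), key=lambda i: magnitudes[i], reverse=True)
    return sorted(ranked[:num_keep])


# Hypothetical per-token flow for a frame with six visual tokens.
frame_flow = [(0.0, 0.0), (3.0, 4.0), (0.1, 0.0), (1.0, 1.0), (0.0, 0.05), (2.0, 0.0)]
kept = prune_tokens_by_motion(frame_flow, keep_ratio=0.5)
print("kept token indices:", kept)
```

Returning the indices in their original order keeps the surviving tokens aligned with their positional information when they are passed on to the language model.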

[136] acia-workflows: Automated Single-cell Imaging Analysis for Scalable and Deep Learning-based Live-cell Imaging Analysis Workflows

Johannes Seiffarth,Keitaro Kasahara,Michelle Bund,Benita Lückel,Richard D. Paul,Mathias Pesch,Lennart Witting,Michael Bott,Dietrich Kohlheyer,Katharina Nöh

Main category: cs.CV

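TL;DR: This paper presents acia-workflows, an open-source platform that integrates deep learning-based cell segmentation and tracking tools and supports automated, scalable analysis of high-throughput live-cell imaging data through modular, reusable Jupyter Notebook workflows.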

Details

Motivation: High-throughput live-cell imaging produces large volumes of data that traditional analysis cannot process efficiently; advanced deep learning methods need to be integrated into user-friendly, reusable workflows to support routine use in biological research. Method: The acia Python library integrates eight deep learning segmentation and tracking approaches, and Jupyter Notebook workflows bundle the analysis pipeline, dependencies, documentation, and visualizations into a modular, scalable analysis platform. Result: More than ten open-source application workflows are provided, supporting the analysis of various microfluidic live-cell imaging experiments, such as growth rate comparisons and minute-resolution quantification of dynamic responses, which demonstrates the platform's practicality and flexibility. Conclusion: The acia-workflows platform addresses the complexity of analyzing high-throughput live-cell imaging data and improves the accessibility, reproducibility, and scalability of the analysis, promoting single-cell dynamics studies in the life sciences. Abstract: Live-cell imaging (LCI) technology enables the detailed spatio-temporal characterization of living cells at the single-cell level, which is critical for advancing research in the life sciences, from biomedical applications to bioprocessing. High-throughput setups with tens to hundreds of parallel cell cultivations offer the potential for robust and reproducible insights. However, these insights are obscured by the large amount of LCI data recorded per experiment. Recent advances in state-of-the-art deep learning methods for cell segmentation and tracking now enable the automated analysis of such large data volumes, offering unprecedented opportunities to systematically study single-cell dynamics. The next key challenge lies in integrating these powerful tools into accessible, flexible, and user-friendly workflows that support routine application in biological research. In this work, we present acia-workflows, a platform that combines three key components: (1) the Automated live-Cell Imaging Analysis (acia) Python library, which supports the modular design of image analysis pipelines offering eight deep learning segmentation and tracking approaches; (2) workflows that assemble the image analysis pipeline, its software dependencies, documentation, and visualizations into a single Jupyter Notebook, leading to accessible, reproducible and scalable analysis workflows; and (3) a collection of application workflows showcasing the analysis and customization capabilities in real-world applications. Specifically, we present three workflows to investigate various types of microfluidic LCI experiments ranging from growth rate comparisons to precise, minute-resolution quantitative analyses of individual cells' dynamic responses to changing oxygen conditions. Our collection of more than ten application workflows is open source and publicly available at https://github.com/JuBiotech/acia-workflows.

[137] BioAutoML-NAS: An End-to-End AutoML Framework for Multimodal Insect Classification via Neural Architecture Search on Large-Scale Biodiversity Data

Arefin Ittesafun Abian,Debopom Sutradhar,Md Rafi Ur Rashid,Reem E. Mohamed,Md Rafiqul Islam,Asif Karim,Kheng Cher Yeo,Sami Azam

Main category: cs.CV

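TL;DR: This paper proposes BioAutoML-NAS, a new automated machine learning model for insect classification that performs multimodal learning over images and metadata and uses neural architecture search (NAS) to optimize the network structure automatically, outperforming existing methods on large-scale datasets.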

Details

Motivation: Insect classification is essential for agricultural management and ecological research, but complex insect characteristics, class imbalance, and large data volumes challenge existing methods, so an efficient and accurate automated classification model is needed. Method: Proposes BioAutoML-NAS, which fuses images and metadata multimodally; uses neural architecture search (NAS) to learn the best network connection structure automatically; applies an alternating bi-level optimization strategy to jointly update network weights and architecture parameters; and introduces zero operations to prune unimportant connections, yielding a sparse, efficient network. Result: On BIOSCAN-5M the model reaches 96.81% accuracy, 97.46% precision, 96.81% recall, and a 97.05% F1 score, outperforming existing transfer learning, transformer, AutoML, and NAS methods by roughly 8% to 16%; on Insects-1M it obtains 93.25% accuracy and a 93.22% F1 score. Conclusion: Through multimodal fusion and automated neural architecture search, BioAutoML-NAS achieves efficient, accurate insect classification that is suitable for large-scale practical use and helps advance sustainable agriculture. Abstract: Insect classification is important for agricultural management and ecological research, as it directly affects crop health and production. However, this task remains challenging due to the complex characteristics of insects, class imbalance, and large-scale datasets. To address these issues, we propose BioAutoML-NAS, the first BioAutoML model using multimodal data, including images and metadata, which applies neural architecture search (NAS) for images to automatically learn the best operations for each connection within each cell. Multiple cells are stacked to form the full network, each extracting detailed image feature representations. A multimodal fusion module combines image embeddings with metadata, allowing the model to use both visual and categorical biological information to classify insects. An alternating bi-level optimization training strategy jointly updates network weights and architecture parameters, while zero operations remove less important connections, producing sparse, efficient, and high-performing architectures. Extensive evaluation on the BIOSCAN-5M dataset demonstrates that BioAutoML-NAS achieves 96.81% accuracy, 97.46% precision, 96.81% recall, and a 97.05% F1 score, outperforming state-of-the-art transfer learning, transformer, AutoML, and NAS methods by approximately 16%, 10%, and 8% respectively. Further validation on the Insects-1M dataset obtains 93.25% accuracy, 93.71% precision, 92.74% recall, and a 93.22% F1 score. These results demonstrate that BioAutoML-NAS provides accurate, confident insect classification that supports modern sustainable farming.
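
The alternating bi-level optimization mentioned in the BioAutoML-NAS abstract can be illustrated with a toy sketch. This is not the authors' implementation: the single mixed edge with two candidate operations (a zero operation and a scalar multiply), the first-order alternation, the learning rate, the data, and the names `loss_and_grads` and `alternating_search` are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grads(w, alpha, data):
    """Mean squared error of a two-op mixed edge, plus its gradients.

    Candidate ops (hypothetical): op0 is the zero op (outputs 0), op1 is w * x.
    The edge output is softmax(alpha)[1] * w * x.
    """
    p0, p1 = softmax(alpha)
    n = len(data)
    loss = grad_w = grad_p1 = 0.0
    for x, y in data:
        err = p1 * w * x - y
        loss += err * err / n
        grad_w += 2.0 * err * p1 * x / n
        grad_p1 += 2.0 * err * w * x / n
    # Chain rule through the softmax: dp1/da0 = -p0*p1, dp1/da1 = p1*(1-p1).
    grad_alpha = [-grad_p1 * p0 * p1, grad_p1 * p1 * (1.0 - p1)]
    return loss, grad_w, grad_alpha

def alternating_search(train, val, steps=300, lr=0.1):
    """Alternate weight updates (train split) and architecture updates (val split)."""
    w, alpha = 0.5, [0.0, 0.0]
    for _ in range(steps):
        # Step 1: update the network weight with the architecture held fixed.
        _, grad_w, _ = loss_and_grads(w, alpha, train)
        w -= lr * grad_w
        # Step 2: update the architecture parameters with the weight held fixed.
        _, _, grad_alpha = loss_and_grads(w, alpha, val)
        alpha = [a - lr * g for a, g in zip(alpha, grad_alpha)]
    return w, alpha

# Toy data following y = 2x; both splits are made up for this sketch.
train = [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0)]
val = [(0.75, 1.5), (1.25, 2.5)]
w, alpha = alternating_search(train, val)
final_val_loss, _, _ = loss_and_grads(w, alpha, val)
probs = softmax(alpha)
```

In this toy setting the architecture weight of the zero operation shrinks relative to the useful operation, which loosely mirrors how an edge dominated by the zero operation could be pruned to obtain a sparse network.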

[138] $\bf{D^3}$QE: Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection

Yanran Zhang,Bingyao Yu,Yu Zheng,Wenzhao Zheng,Yueqi Duan,Lei Chen,Jie Zhou,Jiwen Lu

Main category: cs.CV

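TL;DR: This paper proposes a detection method for images generated by visual autoregressive models, based on Discrete Distribution Discrepancy-aware Quantization Error (D$^3$QE). By analyzing the codebook frequency distribution bias between real and fake images and combining dynamic statistics with a transformer architecture, it detects images generated by a range of AR models effectively, with good generalization and robustness.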

Details

Motivation: As visual autoregressive models achieve breakthroughs in image generation, detecting their outputs becomes a new challenge. Traditional detectors struggle with the distinctive properties of AR models, namely discrete token generation and vector-quantized representations, so a detection method designed around this generation mechanism is needed. Method: Proposes D$^3$QE, which exploits the difference in codebook frequency distributions between real and fake images, builds a discrete distribution discrepancy-aware transformer that integrates dynamic codebook frequency statistics into its attention mechanism, and fuses semantic features with the quantization error latent for detection. Result: Experiments on the ARForensics dataset, which covers 7 mainstream visual AR models, show strong detection accuracy, cross-model generalization, and robustness to real-world perturbations. Conclusion: D$^3$QE captures the distinctive patterns that autoregressive generation leaves in the quantization process, offers a new approach to detecting AR-generated images, and shows potential for practical use. Abstract: The emergence of visual autoregressive (AR) models has revolutionized image generation while presenting new challenges for synthetic image detection. Unlike previous GAN or diffusion-based methods, AR models generate images through discrete token prediction, exhibiting both marked improvements in image synthesis quality and unique characteristics in their vector-quantized representations. In this paper, we propose to leverage Discrete Distribution Discrepancy-aware Quantization Error (D$^3$QE) for autoregressive-generated image detection that exploits the distinctive patterns and the frequency distribution bias of the codebook existing in real and fake images. We introduce a discrete distribution discrepancy-aware transformer that integrates dynamic codebook frequency statistics into its attention mechanism, fusing semantic features and quantization error latent. To evaluate our method, we construct a comprehensive dataset termed ARForensics covering 7 mainstream visual AR models. Experiments demonstrate superior detection accuracy and strong generalization of D$^3$QE across different AR models, with robustness to real-world perturbations. Code is available at https://github.com/Zhangyr2022/D3QE.
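
The "frequency distribution bias of the codebook" can be made concrete with a small sketch that compares how often each codebook index is used in two sets of token sequences. This is only an illustration of the underlying statistic, not the D$^3$QE model: the total variation distance, the 8-entry codebook, the token lists, and the function names are assumptions chosen for the example.

```python
from collections import Counter

def codebook_frequencies(token_ids, codebook_size):
    """Normalized usage frequency of each codebook index."""
    counts = Counter(token_ids)
    total = len(token_ids)
    return [counts.get(i, 0) / total for i in range(codebook_size)]

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical token indices from a VQ tokenizer with an 8-entry codebook.
# "real" uses every entry evenly; "fake" concentrates on a few entries.
real_tokens = [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7]
fake_tokens = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 0, 1, 2, 3, 0]

real_freq = codebook_frequencies(real_tokens, 8)
fake_freq = codebook_frequencies(fake_tokens, 8)
discrepancy = total_variation(real_freq, fake_freq)
```

A larger distance means the two sets use the codebook more differently; a learned detector would presumably consume such statistics as features rather than threshold them directly.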

[139] Efficient Universal Models for Medical Image Segmentation via Weakly Supervised In-Context Learning

Jiesi Hu,Yanwu Yang,Zhiyu Ye,Jinyan Zhou,Jianfeng Cao,Hanyang Peng,Ting Ma

Main category: cs.CV

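TL;DR: Proposes Weakly Supervised In-Context Learning (WS-ICL), which replaces dense annotations with weak labels (e.g., bounding boxes or points), significantly reducing annotation cost in medical image segmentation while maintaining performance comparable to regular ICL models.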

Details

Motivation: Existing universal medical image segmentation models, such as interactive and in-context learning models, depend on large amounts of fine-grained annotation, which is costly, and require repeated user input, which limits practical use. Method: Proposes WS-ICL, which builds the context from weak prompts (e.g., bounding boxes or points) instead of dense pixel-level labels to reduce annotation burden, and evaluates it on three benchmark datasets. Result: WS-ICL performs comparably to conventional ICL models at a significantly lower annotation cost and is highly competitive in the interactive setting. Conclusion: WS-ICL is a promising approach to efficient, unified universal models for medical image segmentation and substantially reduces reliance on fine-grained annotation. Abstract: Universal models for medical image segmentation, such as interactive and in-context learning (ICL) models, offer strong generalization but require extensive annotations. Interactive models need repeated user prompts for each image, while ICL relies on dense, pixel-level labels. To address this, we propose Weakly Supervised In-Context Learning (WS-ICL), a new ICL paradigm that leverages weak prompts (e.g., bounding boxes or points) instead of dense labels for context. This approach significantly reduces annotation effort by eliminating the need for fine-grained masks and repeated user prompting for all images. We evaluated the proposed WS-ICL model on three held-out benchmarks. Experimental results demonstrate that WS-ICL achieves performance comparable to regular ICL models at a significantly lower annotation cost. In addition, WS-ICL is highly competitive even under the interactive paradigm. These findings establish WS-ICL as a promising step toward more efficient and unified universal models for medical image segmentation. Our code and model are publicly available at https://github.com/jiesihu/Weak-ICL.

[140] Kaputt: A Large-Scale Dataset for Visual Defect Detection

Sebastian Höfer,Dorian Henning,Artemij Amiranashvili,Douglas Morrison,Mariliza Tzes,Ingmar Posner,Marc Matvienko,Alessandro Rennola,Anton Milan

Main category: cs.CV

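TL;DR: This paper presents a new large-scale dataset for defect detection in logistics. Compared with existing manufacturing datasets such as MVTec-AD and VisA, it is larger, covers more diverse objects, and is more challenging. State-of-the-art anomaly detection methods degrade sharply on it (AUROC does not exceed 56.96%), showing that current methods struggle with large pose and appearance variation. The authors validate the difficulty of the problem through extensive experiments and aim to stimulate research on anomaly detection in retail logistics.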

Details

Motivation: Existing industrial anomaly detection datasets target manufacturing scenarios with highly controlled poses and a limited number of object categories, and they have become saturated. Anomaly detection in retail logistics involves far greater object diversity and pose variation, where existing methods perform poorly, so a more challenging benchmark dataset is needed. Method: Built a large-scale dataset with more than 230,000 images, more than 29,000 defective instances, and more than 48,000 distinct objects, and evaluated multiple state-of-the-art anomaly detection methods on it to expose their limitations in this setting. Result: State-of-the-art methods do not exceed 56.96% AUROC on the dataset, far below their performance on datasets such as MVTec-AD (up to 99.9%), which shows that the dataset is more challenging and that existing methods struggle with complex pose and appearance variation. Conclusion: The work fills the gap in anomaly detection benchmarks for retail logistics, provides a large-scale, highly diverse dataset, reveals the limitations of existing methods, and opens new directions and challenges for future research. Abstract: We present a novel large-scale dataset for defect detection in a logistics setting. Recent work on industrial anomaly detection has primarily focused on manufacturing scenarios with highly controlled poses and a limited number of object categories. Existing benchmarks like MVTec-AD [6] and VisA [33] have reached saturation, with state-of-the-art methods achieving up to 99.9% AUROC scores. In contrast to manufacturing, anomaly detection in retail logistics faces new challenges, particularly in the diversity and variability of object pose and appearance. Leading anomaly detection methods fall short when applied to this new setting. To bridge this gap, we introduce a new benchmark that overcomes the current limitations of existing datasets. With over 230,000 images (and more than 29,000 defective instances), it is 40 times larger than MVTec-AD and contains more than 48,000 distinct objects. To validate the difficulty of the problem, we conduct an extensive evaluation of multiple state-of-the-art anomaly detection methods, demonstrating that they do not surpass 56.96% AUROC on our dataset. Further qualitative analysis confirms that existing methods struggle to leverage normal samples under heavy pose and appearance variation. With our large-scale dataset, we set a new benchmark and encourage future research towards solving this challenging problem in retail logistics anomaly detection. The dataset is available for download at https://www.kaputt-dataset.com.
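
To put the reported AUROC figures in context, recall that an AUROC of 50% corresponds to a detector whose scores carry no ranking information, so 56.96% is close to chance while 99.9% is near-perfect separation. The sketch below computes AUROC with its pairwise-ranking definition; the scores and labels are invented for illustration and the quadratic loop is only meant for small inputs.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random defective sample scores
    higher than a random normal one (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical anomaly scores; label 1 = defective, 0 = normal.
labels = [1, 1, 1, 0, 0, 0]
separable = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]        # defects always rank higher
uninformative = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]    # scores carry no signal

perfect = auroc(separable, labels)
chance = auroc(uninformative, labels)
```

For real evaluations a library routine based on sorting would be preferable to this O(n^2) loop.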

[141] Shaken or Stirred? An Analysis of MetaFormer's Token Mixing for Medical Imaging

Ron Keuth,Paul Kaftan,Mattias P. Heinrich

Main category: cs.CV

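TL;DR: This paper presents the first comprehensive study of how different token mixers in the MetaFormer architecture perform on medical image classification and semantic segmentation. Low-complexity mixers (such as grouped convolution or pooling) are sufficient for classification, whereas for segmentation the local inductive bias of convolutional mixers is essential; grouped convolution is the recommended choice.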

Details

Motivation: Although MetaFormer has been studied extensively on natural images, it is rarely applied to medical imaging, and systematic comparisons of different token mixers are lacking, so better designs may be overlooked. Method: Within the MetaFormer framework, pooling-, convolution-, and attention-based token mixers are evaluated systematically on image classification and semantic segmentation across eight medical image datasets covering diverse modalities and challenges, and the effectiveness of transferring pretrained weights is analyzed. Result: For classification, low-complexity token mixers (e.g., grouped convolution, pooling) are sufficient, and pretrained weights remain useful when transferred across mixers. For segmentation, the local inductive bias of convolutional mixers is key; grouped convolution is preferred because it is efficient and has fewer parameters, while the channel-MLPs already handle cross-channel interactions. Conclusion: In medical image analysis, MetaFormer's success does not depend on complex token mixers; simple, efficient grouped convolution offers the best balance between performance and computational cost, particularly in resource-constrained medical settings. Abstract: The generalization of the Transformer architecture via MetaFormer has reshaped our understanding of its success in computer vision. By replacing self-attention with simpler token mixers, MetaFormer provides strong baselines for vision tasks. However, while extensively studied on natural image datasets, its use in medical imaging remains scarce, and existing works rarely compare different token mixers, potentially overlooking more suitable design choices. In this work, we present the first comprehensive study of token mixers for medical imaging. We systematically analyze pooling-, convolution-, and attention-based token mixers within the MetaFormer architecture on image classification (global prediction task) and semantic segmentation (dense prediction task). Our evaluation spans eight datasets covering diverse modalities and common challenges in the medical domain. Given the prevalence of pretraining from natural images to mitigate medical data scarcity, we also examine transferring pretrained weights to new token mixers. Our results show that, for classification, low-complexity token mixers (e.g. grouped convolution or pooling) are sufficient, aligning with findings on natural images. Pretrained weights remain useful despite the domain gap introduced by the new token mixer. For segmentation, we find that the local inductive bias of convolutional token mixers is essential. Grouped convolutions emerge as the preferred choice, as they reduce runtime and parameter count compared to standard convolutions, while the MetaFormer's channel-MLPs already provide the necessary cross-channel interactions. Our code is available on GitHub.

[142] Diffusion Models for Low-Light Image Enhancement: A Multi-Perspective Taxonomy and Performance Analysis

Eashan Adhikarla,Yixin Liu,Brian D. Davison

Main category: cs.CV

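TL;DR: This survey reviews diffusion models for low-light image enhancement (LLIE), proposes a multi-perspective taxonomy covering six categories of methods, provides an in-depth performance comparison against GAN- and Transformer-based state-of-the-art methods, and discusses practical deployment challenges and future directions.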

Details

Motivation: Degraded image quality in low-light conditions impairs safety-critical applications, so effective image enhancement techniques are needed to improve visibility. Method: Proposes a multi-perspective taxonomy with six categories (Intrinsic Decomposition, Spectral & Latent, Accelerated, Guided, Multimodal, and Autonomous) and uses it to analyze diffusion models for LLIE systematically. Result: Provides an up-to-date critical analysis of diffusion models for LLIE, including performance comparisons with other advanced methods, practical deployment challenges, and an outlook on future research directions. Conclusion: The survey aims to guide the next generation of diffusion-based LLIE research by highlighting trends and surfacing open research questions. Abstract: Low-light image enhancement (LLIE) is vital for safety-critical applications such as surveillance, autonomous navigation, and medical imaging, where visibility degradation can impair downstream task performance. Recently, diffusion models have emerged as a promising generative paradigm for LLIE due to their capacity to model complex image distributions via iterative denoising. This survey provides an up-to-date critical analysis of diffusion models for LLIE, distinctively featuring an in-depth comparative performance evaluation against Generative Adversarial Network and Transformer-based state-of-the-art methods, a thorough examination of practical deployment challenges, and a forward-looking perspective on the role of emerging paradigms like foundation models. We propose a multi-perspective taxonomy encompassing six categories (Intrinsic Decomposition, Spectral & Latent, Accelerated, Guided, Multimodal, and Autonomous) that map enhancement methods across physical priors, conditioning schemes, and computational efficiency. Our taxonomy is grounded in a hybrid view of both the model mechanism and the conditioning signals. We evaluate qualitative failure modes, benchmark inconsistencies, and trade-offs between interpretability, generalization, and inference efficiency. We also discuss real-world deployment constraints (e.g., memory, energy use) and ethical considerations. This survey aims to guide the next generation of diffusion-based LLIE research by highlighting trends and surfacing open research questions, including novel conditioning, real-time adaptation, and the potential of foundation models.

[143] A Dynamic Mode Decomposition Approach to Morphological Component Analysis

Owen T. Huber,Raghu G. Raj,Tianyu Chen,Zacharie I. Idriss

Main category: cs.CV

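TL;DR: Proposes an adaptive video representation driven by the dynamics of scene content, using dynamic morphological component analysis (DMCA) for signal separation and denoising.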

Details

Motivation: Traditional MCA separates signal sources with predefined dictionaries, which adapt poorly to complex dynamic content, so a data-driven adaptive method is needed. Method: Clusters dynamic mode decomposition eigenvalues to generate data-driven MCA dictionaries, yielding dynamic morphological component analysis (DMCA). Result: DMCA is shown to be effective for denoising Adobe 240fps videos, enhancing a faint target, and separating a bicycle from wind clutter in ISAR images. Conclusion: DMCA learns adaptive video representations effectively and delivers strong signal separation and denoising performance across several application scenarios. Abstract: This paper introduces a novel methodology of adapting the representation of videos based on the dynamics of their scene content variation. In particular, we demonstrate how the clustering of dynamic mode decomposition eigenvalues can be leveraged to learn an adaptive video representation for separating structurally distinct morphologies of a video. We extend the morphological component analysis (MCA) algorithm, which uses multiple predefined incoherent dictionaries and a sparsity prior to separate distinct sources in signals, by introducing our novel eigenspace clustering technique to obtain data-driven MCA dictionaries, which we call dynamic morphological component analysis (DMCA). After deriving our novel algorithm, we offer a motivational example of DMCA applied to a still image, then demonstrate DMCA's effectiveness in denoising applications on videos from the Adobe 240fps dataset. Afterwards, we provide an example of DMCA enhancing the signal-to-noise ratio of a faint target summed with a sea state, and conclude the paper by applying DMCA to separate a bicycle from wind clutter in inverse synthetic aperture radar images.

[144] Diffusion-Based Image Editing for Breaking Robust Watermarks

Yunyi Ni,Finn Carter,Ze Niu,Emily Davis,Bo Zhang

Main category: cs.CV

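TL;DR: This paper studies how diffusion-based image generation and editing can break robust invisible watermarks. It proposes a guided diffusion attack that removes watermarks while preserving visual quality, proves theoretically that diffusion-based transformation eliminates the mutual information between the watermark and the image, and shows experimentally that the attack is effective against several advanced watermarking methods.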

Details

Motivation: With the rise of generative AI techniques such as diffusion models, robust watermarks designed to resist conventional perturbations face a new threat; their security weaknesses need to be studied, and corresponding attacks proposed, to drive the development of new watermarking techniques. Method: Proposes a diffusion-based image regeneration method and a guided diffusion attack that explicitly suppresses the watermark signal during generation, and analyzes theoretically how diffusion-based transformation affects the mutual information with the watermark. Result: Achieves near-zero watermark recovery rates on several state-of-the-art watermarking schemes, including StegaStamp, TrustMark, and VINE, while maintaining high visual fidelity; the theory shows that watermark information vanishes after sufficient diffusion. Conclusion: Current robust watermarking techniques are fundamentally vulnerable to generative model-based attacks, and a new generation of watermarking methods that withstand such attacks is needed. Abstract: Robust invisible watermarking aims to embed hidden information into images such that the watermark can survive various image manipulations. However, the rise of powerful diffusion-based image generation and editing techniques poses a new threat to these watermarking schemes. In this paper, we present a theoretical study and method demonstrating that diffusion models can effectively break robust image watermarks that were designed to resist conventional perturbations. We show that a diffusion-driven "image regeneration" process can erase embedded watermarks while preserving perceptual image content. We further introduce a novel guided diffusion attack that explicitly targets the watermark signal during generation, significantly degrading watermark detectability. Theoretically, we prove that as an image undergoes sufficient diffusion-based transformation, the mutual information between the watermarked image and the embedded watermark payload vanishes, resulting in decoding failure. Experimentally, we evaluate our approach on multiple state-of-the-art watermarking schemes (including the deep learning-based methods StegaStamp, TrustMark, and VINE) and demonstrate near-zero watermark recovery rates after attack, while maintaining high visual fidelity of the regenerated images. Our findings highlight a fundamental vulnerability in current robust watermarking techniques against generative model-based attacks, underscoring the need for new watermarking strategies in the era of generative AI.

[145] Detection and Measurement of Hailstones with Multimodal Large Language Models

Moritz Alker,David C. Schedl,Andreas Stöckl

Main category: cs.CV

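TL;DR: This study uses pre-trained multimodal large language models to detect and measure hailstones in social media and news images. It relies on 474 crowdsourced images from hail events in Austria between 2022 and 2024, with hailstone diameters ranging from 2 to 11 cm. Four models are compared under one-stage and two-stage prompting strategies; the best model reaches a mean absolute error of 1.12 cm, and two-stage prompting improves the reliability of most models. The results indicate that off-the-shelf models, without fine-tuning, can extract meaningful information from images, complement traditional hail sensors, and enable fast, detailed assessment of severe weather events.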

Details

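Motivation: Traditional hail monitoring relies on sparse sensor networks and cannot readily provide spatially dense, real-time data. Social media and news images offer broad coverage and immediacy but lack automated analysis. Exploring how off-the-shelf multimodal large models can measure hailstone size from such images can improve the speed and accuracy of the response to extreme weather events. Method: Uses 474 crowdsourced images from documented hail events and compares four pre-trained multimodal large language models under one-stage and two-stage prompting strategies. The two-stage strategy uses reference objects in the image (such as human hands) as size cues to improve measurement accuracy. The evaluation metric is mean absolute error (MAE). Result: The best model achieves a mean absolute error of 1.12 cm, and two-stage prompting improves the reliability of most models compared with one-stage prompting. The results show that off-the-shelf multimodal models can already estimate hailstone size from images without fine-tuning. Conclusion: Pre-trained multimodal large language models can be used to measure hailstone size from social media images, complementing traditional sensors with denser spatial information and faster damage assessment. Combined with automated real-time image harvesting, the approach could be applied directly to future hail events. Abstract: This study examines the use of social media and news images to detect and measure hailstones, utilizing pre-trained multimodal large language models. The dataset for this study comprises 474 crowdsourced images of hailstones from documented hail events in Austria, which occurred between January 2022 and September 2024. These hailstones have maximum diameters ranging from 2 to 11 cm. We estimate the hail diameters and compare four different models utilizing one-stage and two-stage prompting strategies. The latter utilizes additional size cues from reference objects, such as human hands, within the image. Our results show that pretrained models already have the potential to measure hailstone diameters from images with a mean absolute error of 1.12 cm for the best model. In comparison to a single-stage prompt, two-stage prompting improves the reliability of most models. Our study suggests that these off-the-shelf models, even without fine-tuning, can complement traditional hail sensors by extracting meaningful and spatially dense information from social media imagery, enabling faster and more detailed assessments of severe weather events. The automated real-time image harvesting from social media and other sources remains an open task, but it will make our approach directly applicable to future hail events.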
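
The mean absolute error used to report the 1.12 cm figure is simply the average absolute difference between estimated and reference diameters. A minimal sketch follows; the diameters are invented for illustration and are not taken from the study.

```python
def mean_absolute_error(predicted_cm, reference_cm):
    """Average absolute difference between predictions and references, in cm."""
    if len(predicted_cm) != len(reference_cm) or not predicted_cm:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) for p, r in zip(predicted_cm, reference_cm))
    return total / len(predicted_cm)

# Hypothetical maximum hailstone diameters in centimeters.
reference = [2.0, 4.0, 6.0, 11.0]   # annotated ground truth
predicted = [3.0, 3.0, 8.0, 10.0]   # model estimates

mae = mean_absolute_error(predicted, reference)  # (1 + 1 + 2 + 1) / 4
```

Because MAE is expressed in the same unit as the measurement, an error of about 1 cm can be read directly against the 2 to 11 cm diameter range of the dataset.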

[146] Continual Learning for Image Captioning through Improved Image-Text Alignment

Bertram Taetz,Gal Bordelius

Main category: cs.CV

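TL;DR: Proposes a multi-loss framework for continual image captioning that uses prompt-based continual learning and contrastive alignment to mitigate catastrophic forgetting and improve semantic alignment.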

Details

Motivation: In continual learning settings, generating accurate and coherent image captions remains challenging because of catastrophic forgetting and the difficulty of keeping visual concepts aligned with language over time. Method: Builds on a pretrained ViT-GPT-2 model and combines cross-entropy loss with three additional losses: a prompt-based cosine similarity loss, a CLIP-style image-text alignment loss, and a language-guided contrastive loss (a triplet loss), achieving semantically guided continual learning. Result: The method mitigates catastrophic forgetting and achieves better semantic caption alignment than state-of-the-art methods, with no additional overhead at inference time and no prompts needed during caption generation. Conclusion: The proposed multi-loss framework performs strongly on continual image captioning, balances model stability with semantic consistency, and offers an effective approach for future research on continual learning and multimodal alignment. Abstract: Generating accurate and coherent image captions in a continual learning setting remains a major challenge due to catastrophic forgetting and the difficulty of aligning evolving visual concepts with language over time. In this work, we propose a novel multi-loss framework for continual image captioning that integrates semantic guidance through prompt-based continual learning and contrastive alignment. Built upon a pretrained ViT-GPT-2 backbone, our approach combines standard cross-entropy loss with three additional components: (1) a prompt-based cosine similarity loss that aligns image embeddings with synthetically constructed prompts encoding objects, attributes, and actions; (2) a CLIP-style loss that promotes alignment between image embeddings and target caption embeddings; and (3) a language-guided contrastive loss that employs a triplet loss to enhance class-level discriminability between tasks. Notably, our approach introduces no additional overhead at inference time and requires no prompts during caption generation. We find that this approach mitigates catastrophic forgetting, while achieving better semantic caption alignment compared to state-of-the-art methods. The code can be found via the following link: https://github.com/Gepardius/Taetz_Bordelius_Continual_ImageCaptioning.
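
The three auxiliary terms named in the abstract can be sketched in plain Python to show how they might combine with the cross-entropy loss. This is an assumption-laden illustration, not the authors' code: the temperature, margin, unit loss weights, cosine-distance form of the triplet loss, and all function names are hypothetical, and the embeddings are toy 2-D vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; matching pairs share the same index."""
    n = len(image_embs)
    sims = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]
    image_to_text = sum(cross_entropy_row(sims[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([sims[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on cosine distance: the positive should be closer by a margin."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def prompt_cosine_loss(image_emb, prompt_emb):
    """One minus cosine similarity between image and prompt embeddings."""
    return 1.0 - cosine(image_emb, prompt_emb)

def total_loss(ce, prompt, clip, triplet, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the captioning loss and the three auxiliary terms."""
    return ce + weights[0] * prompt + weights[1] * clip + weights[2] * triplet

# Toy embeddings: aligned pairs match by index, swapped pairs do not.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned = clip_style_loss(images, aligned_texts)
swapped = clip_style_loss(images, swapped_texts)
easy_triplet = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
hard_triplet = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Since all three terms only shape the training objective, a design of this kind is consistent with the abstract's statement that no overhead is added at inference time.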

[147] Emergent AI Surveillance: Overlearned Person Re-Identification and Its Mitigation in Law Enforcement Context

An Thi Nguyen,Radina Stoykova,Eric Arazo

Main category: cs.CV

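TL;DR: The study shows that generic instance search models can, through overlearning, single out specific individuals even when trained on datasets without human subjects, raising privacy and identification risks. Techniques such as index exclusion and confusion loss reduce re-identification accuracy to below 2% while retaining 82% of non-person retrieval performance, but they can be circumvented with partial person images, which exposes regulatory gaps in AI governance and data protection.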

Details

Motivation: Generic instance search models reduce the manual effort of analyzing surveillance footage in criminal investigations, but their unintended ability to identify individuals may infringe on privacy, and there is currently no effective de-identification standard, so the risks and mitigations need to be studied. Method: Evaluates two techniques for suppressing a model's person re-identification capability, index exclusion and confusion loss, measuring their effect on person identification accuracy and non-person retrieval performance, and examines whether partial person images can circumvent these safeguards. Result: Combining index exclusion and confusion loss reduces person re-identification accuracy to below 2% while maintaining 82% of retrieval performance for non-person objects, but partial person images may bypass these safeguards, exposing a security weakness. Conclusion: Generic instance search models may inadvertently acquire the ability to identify individuals, and existing technical mitigations have limits; technical standards and regulatory frameworks are needed to prevent seemingly benign AI applications from developing sensitive identification capabilities. Abstract: Generic instance search models can dramatically reduce the manual effort required to analyze vast surveillance footage during criminal investigations by retrieving specific objects of interest to law enforcement. However, our research reveals an unintended emergent capability: through overlearning, these models can single out specific individuals even when trained on datasets without human subjects. This capability raises concerns regarding identification and profiling of individuals based on their personal data, while there is currently no clear standard on how de-identification can be achieved. We evaluate two technical safeguards to curtail a model's person re-identification capacity: index exclusion and confusion loss. Our experiments demonstrate that combining these approaches can reduce person re-identification accuracy to below 2% while maintaining 82% of retrieval performance for non-person objects. However, we identify critical vulnerabilities in these mitigations, including potential circumvention using partial person images. These findings highlight urgent regulatory questions at the intersection of AI governance and data protection: How should we classify and regulate systems with emergent identification capabilities? And what technical standards should be required to prevent identification capabilities from developing in seemingly benign applications?

[148] Universal Neural Architecture Space: Covering ConvNets, Transformers and Everything in Between

Ondřej Týbl,Lukáš Neumann

Main category: cs.CV

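TL;DR: Proposes Universal Neural Architecture Space (UniNAS), which unifies convolutional networks, transformers, and their hybrids, supports both discovering new architectures and analyzing existing ones, finds architectures that outperform hand-crafted models with a new search algorithm, and provides a unified toolkit to promote reproducibility and fair comparison.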

Details

Motivation: Exploring the full spectrum of neural architectures systematically requires a flexible search space that unifies different network types (such as CNNs and transformers), to promote architectural innovation and fair evaluation. Method: Builds a universal search space named UniNAS that unifies convolutional networks, transformers, and their hybrid architectures in a single graph-based framework, designs a new search algorithm to traverse the space, and introduces a standardized training and evaluation protocol. Result: Under an identical training setup, architectures discovered in UniNAS outperform state-of-the-art hand-crafted models, which supports the effectiveness and potential of the search space. Conclusion: UniNAS provides a unified, flexible framework for neural architecture search and advances systematic exploration and reproducibility in NAS research. Abstract: We introduce Universal Neural Architecture Space (UniNAS), a generic search space for neural architecture search (NAS) which unifies convolutional networks, transformers, and their hybrid architectures under a single, flexible framework. Our approach enables discovery of novel architectures as well as analyzing existing architectures in a common framework. We also propose a new search algorithm that allows traversing the proposed search space, and demonstrate that the space contains interesting architectures, which, when using identical training setup, outperform state-of-the-art hand-crafted architectures. Finally, a unified toolkit including a standardized training and evaluation protocol is introduced to foster reproducibility and enable fair comparison in NAS research. Overall, this work opens a pathway towards systematically exploring the full spectrum of neural architectures with a unified graph-based NAS perspective.

[149] VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization

Xinye Cao,Hongcan Guo,Jiawen Qian,Guoshun Nan,Chao Wang,Yuqi Pan,Tianhao Hou,Xiaojuan Wang,Yutong Gao

Main category: cs.CV

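TL;DR: This paper proposes VideoMiner, which iteratively segments, captions, and clusters long videos into a hierarchical tree structure, and introduces tree-based group relative policy optimization (T-GRPO) to locate key frames in long videos precisely, achieving superior performance on long-video understanding tasks with multi-modal large language models.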

Details

Motivation: Existing long-video understanding methods struggle with redundant information and with adapting dynamically to complex hierarchical structures, which makes key frame extraction difficult. This paper aims to reduce interference from redundant information and improve the model's ability to adapt to hierarchical structure. Method: Proposes the VideoMiner framework, which progressively decomposes long videos into events and frames to form a tree structure, and designs T-GRPO, a reinforcement learning method that combines spatiotemporal information with question guidance and performs policy optimization over the tree to locate key frames precisely. Result: Achieves superior performance on all long-video understanding tasks; T-GRPO spontaneously incentivizes the model to generate a reasoning chain, and the designed tree growth auxin dynamically adjusts the expansion depth, balancing accuracy and efficiency. Conclusion: By combining hierarchical modeling with a reinforcement learning strategy, VideoMiner mitigates information redundancy in long videos, adapts to complex structures, and significantly improves the long-video understanding ability of multi-modal large language models. Abstract: Understanding hour-long videos with multi-modal large language models (MM-LLMs) enriches the landscape of human-centered AI applications. However, for end-to-end video understanding with LLMs, uniformly sampling video frames results in LLMs being overwhelmed by a vast amount of irrelevant information as video length increases. Existing hierarchical key frame extraction methods improve the accuracy of video understanding but still face two critical challenges. 1) How can the interference of extensive redundant information in long videos be mitigated? 2) How can a model dynamically adapt to complex hierarchical structures while accurately identifying key frames? To address these issues, we propose VideoMiner, which iteratively segments, captions, and clusters long videos, forming a hierarchical tree structure. The proposed VideoMiner progresses from long videos to events to frames while preserving temporal coherence, effectively addressing the first challenge. To precisely locate key frames, we introduce T-GRPO, a tree-based group relative policy optimization method in reinforcement learning that guides the exploration of the VideoMiner. The proposed T-GRPO is specifically designed for tree structures, integrating spatiotemporal information at the event level while being guided by the question, thus solving the second challenge. We achieve superior performance in all long-video understanding tasks and uncover several interesting insights. Our proposed T-GRPO surprisingly incentivizes the model to spontaneously generate a reasoning chain. Additionally, the designed tree growth auxin dynamically adjusts the expansion depth, obtaining accuracy and efficiency gains. The code is publicly available at https://github.com/caoxinye/VideoMiner.

[150] GLVD: Guided Learned Vertex Descent

Pol Caselles Rico,Francesc Moreno Noguer

Main category: cs.CV

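TL;DR: Proposes GLVD, which combines per-vertex neural field optimization with dynamically predicted 3D keypoints to achieve efficient, high-quality few-shot 3D face reconstruction.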

Details

Motivation: Existing 3D face modeling methods are constrained by fixed shape priors, while optimization-based methods deliver high quality at a high computational cost. Method: Extends Learned Vertex Descent (LVD) with per-vertex neural field optimization, adds global structural guidance from dynamically predicted 3D keypoints, and uses relative spatial encoding to refine mesh vertices iteratively. Result: Reaches state-of-the-art performance in single-view settings, remains competitive in multi-view scenarios, and substantially reduces inference time. Conclusion: GLVD achieves high-quality, flexible 3D face geometry reconstruction while remaining computationally efficient. Abstract: Existing 3D face modeling methods usually depend on 3D Morphable Models, which inherently constrain the representation capacity to fixed shape priors. Optimization-based approaches offer high-quality reconstructions but tend to be computationally expensive. In this work, we introduce GLVD, a hybrid method for 3D face reconstruction from few-shot images that extends Learned Vertex Descent (LVD) by integrating per-vertex neural field optimization with global structural guidance from dynamically predicted 3D keypoints. By incorporating relative spatial encoding, GLVD iteratively refines mesh vertices without requiring dense 3D supervision. This enables expressive and adaptable geometry reconstruction while maintaining computational efficiency. GLVD achieves state-of-the-art performance in single-view settings and remains highly competitive in multi-view scenarios, all while substantially reducing inference time.

[151] Medical Vision Language Models as Policies for Robotic Surgery

Akshay Muppidi,Martin Radfar

Main category: cs.CV

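TL;DR: Proposes a new method that combines MedFlamingo with PPO for vision-based robotic laparoscopic surgery tasks. On five LapGym tasks it significantly outperforms the baselines, with success rates above 70% in every task and improvements ranging from 66.67% to 1114.29%.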

Details

Motivation: Vision-based PPO is limited in laparoscopic surgery tasks by high-dimensional visual input, sparse rewards, and the difficulty of extracting task-relevant features. Method: Combines MedFlamingo, a medical domain-specific vision-language model, with PPO; MedFlamingo processes the task observations and instructions once per episode to generate high-level planning tokens that are fed into the reinforcement learning policy. Result: On five surgical tasks using only endoscopic visual observations, MedFlamingo PPO converges faster and performs better than standard vision-based PPO and OpenFlamingo PPO, with task success rates above 70% in all environments and relative improvements of 66.67% to 1114.29%. Conclusion: Introducing medical domain priors (such as MedFlamingo) improves the efficiency and performance of vision-based reinforcement learning on complex surgical tasks and highlights the value of specialized medical knowledge in robotic surgical decision-making. Abstract: Vision-based Proximal Policy Optimization (PPO) struggles with visual observation-based robotic laparoscopic surgical tasks due to the high-dimensional nature of visual input, the sparsity of rewards in surgical environments, and the difficulty of extracting task-relevant features from raw visual data. We introduce a simple approach integrating MedFlamingo, a medical domain-specific Vision-Language Model, with PPO. Our method is evaluated on five diverse laparoscopic surgery task environments in LapGym, using only endoscopic visual observations. MedFlamingo PPO outperforms and converges faster compared to both standard vision-based PPO and OpenFlamingo PPO baselines, achieving task success rates exceeding 70% across all environments, with improvements ranging from 66.67% to 1114.29% compared to baseline. By processing task observations and instructions once per episode to generate high-level planning tokens, our method efficiently combines medical expertise with real-time visual feedback. Our results highlight the value of specialized medical knowledge in robotic surgical planning and decision-making.

[152] Reasoning under Vision: Understanding Visual-Spatial Cognition in Vision-Language Models for CAPTCHA

Python Song,Luke Tenyi Chang,Yun-Yun Tsai,Penghui Li,Junfeng Yang

Main category: cs.CV

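TL;DR: This paper introduces CAPTCHA-X, the first real-world CAPTCHA benchmark with reasoning steps, for evaluating the spatial reasoning of vision-language models. The study shows that step-by-step reasoning significantly improves CAPTCHA-solving accuracy; the proposed method reaches an average accuracy of 83.9% on five high-difficulty CAPTCHA types, far above existing baselines.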

Details

Motivation: Current vision-language models perform poorly on high-difficulty spatial reasoning tasks such as CAPTCHAs and make little effective use of the reasoning process, so systematic evaluation and improved methods are needed. Method: Builds the CAPTCHA-X benchmark, covering seven categories of CAPTCHAs with step-by-step action annotations; defines five reasoning-oriented evaluation metrics; and proposes a general agentic VLM-based framework that explicitly incorporates step-by-step reasoning to improve performance. Result: Commercial VLMs reach only about 21.9% accuracy without reasoning; with step-by-step reasoning, the proposed method reaches an average accuracy of 83.9% on five high-difficulty CAPTCHA types, significantly outperforming existing methods. Conclusion: Step-by-step reasoning is essential for solving complex visual-spatial tasks; CAPTCHA-X provides an effective benchmark for evaluating and improving model reasoning and highlights visual reasoning as a priority for future work. Abstract: CAPTCHA, originally designed to distinguish humans from robots, has evolved into a real-world benchmark for assessing the spatial reasoning capabilities of vision-language models. In this work, we first show that step-by-step reasoning is crucial for vision-language models (VLMs) to solve CAPTCHAs, which represent high-difficulty spatial reasoning tasks, and that current commercial vision-language models still struggle with such reasoning. In particular, we observe that most commercial VLMs (e.g., Gemini, Claude, GPT, etc.) fail to effectively solve CAPTCHAs and thus achieve low accuracy (around 21.9 percent). However, our findings indicate that requiring the model to perform step-by-step reasoning before generating the final coordinates can significantly enhance its solving accuracy, underscoring the severity of the gap. To systematically study this issue, we introduce CAPTCHA-X, the first real-world CAPTCHA benchmark with reasoning, covering seven categories of CAPTCHAs (such as Gobang, hCaptcha, etc.) with step-by-step action solutions and grounding annotations. We further define five reasoning-oriented metrics that enable a comprehensive evaluation of models' reasoning capabilities. To validate the effectiveness of reasoning, we also propose a general agentic VLM-based framework that incorporates the model's inherent reasoning abilities. Our method achieves state-of-the-art performance across five high-difficulty CAPTCHA types, with an average solving accuracy of 83.9 percent, substantially surpassing existing baselines. These results reveal the limitations of current models and highlight the importance of reasoning in advancing visual-spatial challenges in the future.

[153] There is More to Attention: Statistical Filtering Enhances Explanations in Vision Transformers

Meghna P Ayyar,Jenny Benois-Pineau,Akka Zemmari

Main category: cs.CV

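TL;DR: Proposes a method that combines attention maps with statistical filtering to produce sharper, more interpretable explanations for Vision Transformers, outperforming or matching existing methods on multiple datasets.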

Details

Motivation: Existing Vision Transformer explanation methods rely on attention weights, which tend to produce noisy maps, and explanation methods designed for CNNs are hard to transfer, so more effective means of explanation are needed. Method: Combines attention maps with a statistical filtering technique originally proposed for CNNs to remove noisy or uninformative patterns, and extends the approach with a class-specific variant that yields discriminative explanations. Result: The generated explanation maps are sharper and more readable, perform well on both perturbation-based faithfulness metrics and human gaze data, and are computationally efficient. Conclusion: The method effectively improves the interpretability of Vision Transformers, balancing model faithfulness with human understandability, and is a promising solution in XAI. Abstract: Explainable AI (XAI) has become increasingly important with the rise of large transformer models, yet many explanation methods designed for CNNs transfer poorly to Vision Transformers (ViTs). Existing ViT explanations often rely on attention weights, which tend to yield noisy maps as they capture token-to-token interactions within each layer. While attribution methods incorporating MLP blocks have been proposed, we argue that attention remains a valuable and interpretable signal when properly filtered. We propose a method that combines attention maps with statistical filtering, initially proposed for CNNs, to remove noisy or uninformative patterns and produce more faithful explanations. We further extend our approach with a class-specific variant that yields discriminative explanations. Evaluation against popular state-of-the-art methods demonstrates that our approach produces sharper and more interpretable maps. In addition to perturbation-based faithfulness metrics, we incorporate human gaze data to assess alignment with human perception, arguing that human interpretability remains essential for XAI. Across multiple datasets, our approach consistently outperforms or is comparable to the SOTA methods while remaining efficient and human plausible.
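
As a rough illustration of what statistically filtering an attention map could look like, the sketch below keeps only the entries that exceed the map's mean by `k` standard deviations. The thresholding rule, the function name `ksigma_filter`, the value of `k`, and the toy map are all assumptions made for illustration; the abstract does not specify the filter the authors use.

```python
from statistics import mean, pstdev


def ksigma_filter(attn, k=1.0):
    """Zero out entries that do not exceed mean + k * std (hypothetical rule)."""
    flat = [value for row in attn for value in row]
    threshold = mean(flat) + k * pstdev(flat)
    return [[value if value > threshold else 0.0 for value in row] for row in attn]


# Toy 3x3 attention map with one dominant patch (made-up values).
attn_map = [
    [0.01, 0.02, 0.01],
    [0.02, 0.90, 0.01],
    [0.01, 0.01, 0.01],
]
filtered = ksigma_filter(attn_map, k=1.0)
```

On this toy map only the dominant central entry survives the filter, which is the intended effect of suppressing low, uninformative attention values.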

[154] When Thinking Drifts: Evidential Grounding for Robust Video Reasoning

Mi Luo,Zihui Xue,Alex Dimakis,Kristen Grauman

Main category: cs.CV

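TL;DR: This paper studies the limitations of Chain-of-Thought (CoT) in video reasoning, identifies a phenomenon it calls "visual thinking drift", and introduces a reinforcement learning framework based on Visual Evidence Reward (VER) to improve the accuracy of multi-step visual reasoning.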

Details

Motivation: Although Chain-of-Thought (CoT) works well for text reasoning, in video understanding it often degrades performance and produces misleading reasoning and hallucinated details, so the problem needs systematic analysis and a targeted solution. Method: Analyzes the "visual thinking drift" of CoT in video reasoning through a Bayesian lens and proposes the Visual Evidence Reward (VER) framework, which uses reinforcement learning to encourage reasoning traces that are consistent with visual evidence. Result: The effectiveness of Video-VER is validated on 10 diverse video understanding benchmarks, where it consistently achieves state-of-the-art performance. Conclusion: Video reasoning requires models to stay grounded in visual evidence throughout their thinking; the VER framework effectively mitigates the drift introduced by CoT and advances multimodal AI with genuine visual perception. Abstract: Video reasoning, the task of enabling machines to infer from dynamic visual content through multi-step logic, is crucial for advanced AI. While the Chain-of-Thought (CoT) mechanism has enhanced reasoning in text-based tasks, its application to video understanding remains underexplored. This paper presents a systematic analysis revealing that CoT often degrades performance in video reasoning, generating verbose but misleading internal monologues, and leading to hallucinated visual details and overridden correct intuitions - a phenomenon we term "visual thinking drift". We explain this drift through a Bayesian lens, positing that CoT traces often diverge from actual visual evidence, instead amplifying internal biases or language priors, causing models to storytell rather than engage in grounded reasoning. To counteract this, we introduce Visual Evidence Reward (VER), a novel reinforcement learning framework that explicitly rewards the generation of reasoning traces that are verifiably grounded in visual evidence. Comprehensive evaluation across 10 diverse video understanding benchmarks demonstrates that our Video-VER consistently achieves top performance. Our work sheds light on the distinct challenges of video-centric reasoning and encourages the development of AI that robustly grounds its inferences in visual evidence - for large multimodal models that not only "think before answering", but also "see while thinking".

[155] A public cardiac CT dataset featuring the left atrial appendage

Bjoern Hansen,Jonas Pedersen,Klaus F. Kofoed,Oscar Camara,Rasmus R. Paulsen,Kristine Soerensen

Main category: cs.CV

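TL;DR: This paper presents the first open-source, anatomically coherent dataset of high-resolution segmentations of the left atrial appendage (LAA), coronary arteries (CAs), and pulmonary veins (PVs), based on 1000 cardiac computed tomography angiography (CCTA) scans, with the aim of advancing research on LAA morphology.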

Details

Motivation: Despite the success of existing segmentation frameworks such as TotalSegmentator, accurate segmentation of the LAA, CAs, and PVs in medical images remains challenging, which limits morphological research on these structures. Method: Uses a state-of-the-art segmentation framework designed for high-resolution LAA segmentation, trains the model on a large private dataset with manual annotations, and transfers it to the public ImageCAS dataset; improves the original CA annotations and refines the PV segmentations produced by TS; and provides a list of scans with data flaws. Result: Releases a dataset with high-quality LAA, CA, and PV segmentation labels, supplemented with whole-heart labels from TotalSegmentator, and identifies the ImageCAS scans that show common flaws such as step artefacts and LAAs extending beyond the field of view. Conclusion: The dataset is a reliable resource for LAA morphology analysis and cardiovascular disease research, and should help drive new segmentation and analysis methods for complex cardiac structures. Abstract: Despite the success of advanced segmentation frameworks such as TotalSegmentator (TS), accurate segmentations of the left atrial appendage (LAA), coronary arteries (CAs), and pulmonary veins (PVs) remain a significant challenge in medical imaging. In this work, we present the first open-source, anatomically coherent dataset of curated, high-resolution segmentations for these structures, supplemented with whole-heart labels produced by TS on the publicly available ImageCAS dataset consisting of 1000 cardiac computed tomography angiography (CCTA) scans. One purpose of the dataset is to foster novel approaches to the analysis of LAA morphology. LAA segmentations on ImageCAS were generated using a state-of-the-art segmentation framework developed specifically for high-resolution LAA segmentation. We trained the network on a large private dataset with manual annotations provided by medical readers guided by a trained cardiologist and transferred the model to ImageCAS data. CA labels were improved from the original ImageCAS annotations, while PV segmentations were refined from TS outputs. In addition, we provide a list of scans from ImageCAS that contain common data flaws such as step artefacts, LAAs extending beyond the scanner's field of view, and other types of data defects.

[156] Compact Multi-level-prior Tensor Representation for Hyperspectral Image Super-resolution

Yinjian Wang,Wei Li,Yuanyuan Gui,Gemine Vivone

Main category: cs.CV

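TL;DR: Proposes a new tensor-based hyperspectral super-resolution model that effectively fuses multi-level priors through block term decomposition and a non-convex mode-shuffled tensor correlated total variation, with efficient optimization and validation on multiple datasets.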

Details

Motivation: Existing tensor-based fusion models struggle to exploit multi-level priors simultaneously, which leads to high model complexity and difficult weight balancing, so a new model is needed that expresses multi-level priors compactly and is easy to optimize. Method: Decouples spectral low-rankness from spatial priors by applying block term decomposition to the latent high-resolution image; builds a spatial tensor that encodes high-order spatial low-rankness and smoothness, which are jointly modeled by the proposed non-convex mode-shuffled tensor correlated total variation; and uses a linearized ADMM algorithm for efficient optimization. Result: Experiments on multiple datasets show that the proposed method outperforms existing methods in fusion performance and that the algorithm converges well. Conclusion: The method successfully integrates multi-level priors, improving the accuracy and efficiency of hyperspectral image super-resolution while keeping the model compact, and shows good application prospects. Abstract: Fusing a hyperspectral image with a multispectral image acquired over the same scene, i.e., hyperspectral image super-resolution, has become a popular computational way to access the latent high-spatial-spectral-resolution image. To date, a variety of fusion methods have been proposed, among which the tensor-based ones have testified that multiple priors, such as multidimensional low-rankness and spatial total variation at multiple levels, effectively drive the fusion process. However, existing tensor-based models can only effectively leverage one or two priors at one or two levels, since simultaneously incorporating multi-level priors inevitably increases model complexity. This introduces challenges in both balancing the weights of different priors and optimizing multi-block structures. To address this, we present a novel hyperspectral super-resolution model compactly characterizing these multi-level priors of hyperspectral images within the tensor framework. Firstly, the proposed model decouples the spectral low-rankness and spatial priors by casting the latent high-spatial-spectral-resolution image into spectral subspace and spatial maps via block term decomposition. Secondly, these spatial maps are stacked as the spatial tensor encoding the high-order spatial low-rankness and smoothness priors, which are co-modeled via the proposed non-convex mode-shuffled tensor correlated total variation. Finally, we draw inspiration from the linearized alternating direction method of multipliers to design an efficient algorithm to optimize the resulting model, theoretically proving its Karush-Kuhn-Tucker convergence under mild conditions. Experiments on multiple datasets demonstrate the effectiveness of the proposed algorithm. The code implementation will be available from https://github.com/WongYinJ.

[157] Multimodal Feature Prototype Learning for Interpretable and Discriminative Cancer Survival Prediction

Shuo Jiang,Zhuwen Chen,Liaoman Xu,Yanming Zhu,Changmiao Wang,Jiong Zhang,Feiwei Qin,Yifei Chen,Zhu Zhu

Main category: cs.CV

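TL;DR: Proposes FeatProto, a prototype-based multimodal framework for cancer survival prediction that integrates global and local features of whole slide images with genomic data to improve interpretability and accuracy.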

Details

Motivation: Existing survival analysis models are hard to interpret, and traditional prototype learning methods ignore the broader tumor context and lack semantic alignment with genomic data. Method: Builds a unified feature prototype space that fuses global and local WSI features with genomic data, and adopts an Exponential Prototype Update Strategy (EMA ProtoUp) together with a hierarchical prototype matching mechanism. Result: Surpasses current state-of-the-art unimodal and multimodal survival prediction methods on four public cancer datasets, with better accuracy and interpretability. Conclusion: FeatProto effectively improves both the accuracy and the interpretability of cancer survival prediction and offers a new perspective on prototype learning in medical applications. Abstract: Survival analysis plays a vital role in making clinical decisions. However, the models currently in use are often difficult to interpret, which reduces their usefulness in clinical settings. Prototype learning presents a potential solution, yet traditional methods focus on local similarities and static matching, neglecting the broader tumor context and lacking strong semantic alignment with genomic data. To overcome these issues, we introduce an innovative prototype-based multimodal framework, FeatProto, aimed at enhancing cancer survival prediction by addressing significant limitations in current prototype learning methodologies within pathology. Our framework establishes a unified feature prototype space that integrates both global and local features of whole slide images (WSI) with genomic profiles. This integration facilitates traceable and interpretable decision-making processes. Our approach includes three main innovations: (1) A robust phenotype representation that merges critical patches with global context, harmonized with genomic data to minimize local bias. (2) An Exponential Prototype Update Strategy (EMA ProtoUp) that sustains stable cross-modal associations and employs a wandering mechanism to adapt prototypes flexibly to tumor heterogeneity. (3) A hierarchical prototype matching scheme designed to capture global centrality, local typicality, and cohort-level trends, thereby refining prototype inference. Comprehensive evaluations on four publicly available cancer datasets indicate that our method surpasses current leading unimodal and multimodal survival prediction techniques in both accuracy and interpretability, providing a new perspective on prototype learning for critical medical applications. Our source code is available at https://github.com/JSLiam94/FeatProto.
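
To make the idea of an exponentially averaged prototype concrete, here is a minimal sketch of a generic exponential moving average (EMA) update. It is not taken from the FeatProto code: the function name `ema_update`, the momentum value, and the toy vectors are assumptions, and the "wandering mechanism" mentioned in the abstract is not modeled.

```python
def ema_update(prototype, feature, momentum=0.9):
    """Blend a prototype toward a new feature vector (generic EMA, hypothetical helper)."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(prototype, feature)]


# Toy example: the prototype drifts toward repeated observations of [1.0, 1.0].
prototype = [0.0, 1.0]
for feature in ([1.0, 1.0], [1.0, 1.0]):
    prototype = ema_update(prototype, feature, momentum=0.5)
```

A higher momentum keeps the prototype more stable across updates, while a lower one lets it follow new features more quickly.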

[158] Towards Data-Efficient Medical Imaging: A Generative and Semi-Supervised Framework

Mosong Ma,Tania Stathaki,Michalis Lazarou

Main category: cs.CV

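TL;DR: SSGNet is a unified framework that combines class-specific generative modeling with semi-supervised pseudo-labeling to enhance medical image classification and segmentation and ease the shortage of annotated data.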

Details

Motivation: Deep learning in medical imaging is often limited by scarce and imbalanced annotated data, so effective ways to improve model performance are needed. Method: Proposes the SSGNet framework, which expands the training data with class-specific images generated by StyleGAN3 and refines label quality through iterative semi-supervised pseudo-labeling, thereby augmenting existing models. Result: Experiments on multiple medical imaging benchmarks show consistent gains in classification and segmentation, and FID analysis indicates that the generated samples are of high quality. Conclusion: SSGNet is a practical and effective strategy for easing the annotation bottleneck in medical image analysis and improving model robustness. Abstract: Deep learning in medical imaging is often limited by scarce and imbalanced annotated data. We present SSGNet, a unified framework that combines class-specific generative modeling with iterative semi-supervised pseudo-labeling to enhance both classification and segmentation. Rather than functioning as a standalone model, SSGNet augments existing baselines by expanding training data with StyleGAN3-generated images and refining labels through iterative pseudo-labeling. Experiments across multiple medical imaging benchmarks demonstrate consistent gains in classification and segmentation performance, while Fréchet Inception Distance analysis confirms the high quality of generated samples. These results highlight SSGNet as a practical strategy to mitigate annotation bottlenecks and improve robustness in medical image analysis.
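
One common way to realize pseudo-labeling is to keep only the unlabeled samples on which the current model is confident, then retrain and repeat. The sketch below shows a single selection step under that assumption; the function name `select_pseudo_labels`, the 0.9 threshold, and the toy probabilities are illustrative and not taken from SSGNet.

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Return (sample_index, predicted_class) pairs for confident predictions."""
    selected = []
    for index, probs in enumerate(probabilities):
        confidence = max(probs)
        if confidence >= threshold:
            selected.append((index, probs.index(confidence)))
    return selected


# Toy class probabilities for three unlabeled images (made-up values).
unlabeled_probs = [[0.95, 0.05], [0.55, 0.45], [0.08, 0.92]]
pseudo = select_pseudo_labels(unlabeled_probs, threshold=0.9)
```

In an iterative scheme, the selected pairs would be added to the training set and the selection repeated with the retrained model.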

[159] Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation

Jiawei Mao,Yuhan Wang,Lifeng Chen,Can Zhao,Yucheng Tang,Dong Yang,Liangqiong Qu,Daguang Xu,Yuyin Zhou

Main category: cs.CV

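TL;DR: MeDiM is the first cross-modal medical discrete diffusion model; it unifies the generation of medical images and reports without modality-specific components and supports high-fidelity, coordinated multimodal outputs.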

Details

Motivation: Existing generative medical models are confined to isolated modalities and struggle to integrate complementary information from imaging, pathology, and clinical text, which hinders progress toward general medical foundation models. Method: Proposes MeDiM, built on a discrete diffusion framework with a multimodal large language model (MLLM) as the diffusion backbone; the causal attention mask is removed to allow bidirectional context, and continuous timestep embeddings are injected for diffusion awareness, enabling shared distribution modeling across modalities. Result: Achieves generation quality of FID 16.60 on MIMIC-CXR and FID 24.19 on PathGen, and a METEOR of 0.2650 on report generation; jointly generated image-report pairs substantially improve downstream performance (e.g., +31.58% BLEU-3). Conclusion: MeDiM achieves unified medical generation without modality-specific design, supports coherent and clinically grounded cross-modal generation, and advances medical foundation models. Abstract: Recent advances in generative medical models are constrained by modality-specific scenarios that hinder the integration of complementary evidence from imaging, pathology, and clinical notes. This fragmentation limits their evolution into foundation models that can learn and reason across the full spectrum of biomedical data. We propose MeDiM, the first medical discrete diffusion model that learns shared distributions across modalities without modality-specific components. MeDiM unifies multiple generative tasks: translating between images and text, and jointly producing image-report pairs across domains in response to prompts. Built on a discrete diffusion framework, MeDiM bridges vision and language representations through a shared probabilistic space. To enable unified and flexible medical generation, we employ a multimodal large language model (MLLM) as the diffusion backbone, leveraging its prior knowledge and cross-modal reasoning. Two key designs are introduced: (1) removing the causal attention mask for bidirectional context, and (2) injecting continuous timestep embeddings for diffusion awareness. Experiments demonstrate high-fidelity medical generation (FID 16.60 on MIMIC-CXR and FID 24.19 on PathGen) and accurate report generation (METEOR 0.2650 and 0.2580). Jointly generated image-report pairs further enhance downstream performance (+6.43% BLEU-1, +18.57% BLEU-2, +31.58% BLEU-3, +4.80% METEOR), showing that MeDiM supports coherent and clinically grounded multimodal outputs.
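
The first design choice, dropping the causal attention mask, can be pictured with a small boolean mask. The sketch below is a generic illustration rather than MeDiM's implementation; the helper name `attention_mask` is hypothetical. Entry `[i][j]` is `True` when token `i` may attend to token `j`.

```python
def attention_mask(seq_len, causal):
    """Build a boolean attention mask; causal=True hides future positions."""
    return [[(j <= i) or not causal for j in range(seq_len)] for i in range(seq_len)]


causal_mask = attention_mask(3, causal=True)          # lower-triangular: past and self only
bidirectional_mask = attention_mask(3, causal=False)  # every token sees every token
```

With the causal mask the first token sees only itself, whereas without it every position can use context from both directions, which suits denoising all tokens jointly.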

[160] Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

Zanyi Wang,Dengyang Jiang,Liuzhuozheng Li,Sizhe Dang,Chengzu Li,Harry Yang,Guang Dai,Mengmeng Wang,Jingdong Wang

Main category: cs.CV

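TL;DR: This paper proposes FlowRVS, a new framework that recasts Referring Video Object Segmentation (RVOS) as a conditional continuous flow problem; by learning a direct, language-guided deformation from a video's holistic representation to its target mask, it achieves state-of-the-art performance on multiple benchmarks.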

Details

Motivation: Most existing RVOS methods use a cascaded "locate-then-segment" pipeline, which suffers from a semantic information bottleneck and temporal inconsistency, making it hard to align language descriptions with video pixels while keeping temporal coherence. Method: Proposes FlowRVS, which treats RVOS as a conditional continuous flow problem and uses pretrained T2V models to learn a direct, language-guided deformation from a video's holistic representation to its target mask, providing fine-grained pixel control, text-video semantic alignment, and temporal consistency. Result: Reaches a J&F score of 51.1 on MeViS (+1.6 over the prior SOTA) and 73.3 on zero-shot Ref-DAVIS17 (+2.7), clearly outperforming existing methods. Conclusion: Modeling video understanding tasks as continuous deformation processes shows great potential; with a one-stage generative approach, FlowRVS addresses the information bottleneck and temporal decoupling of traditional methods and advances RVOS. Abstract: Referring Video Object Segmentation (RVOS) requires segmenting specific objects in a video guided by a natural language description. The core challenge of RVOS is to anchor abstract linguistic concepts onto a specific set of pixels and continuously segment them through the complex dynamics of a video. Faced with this difficulty, prior work has often decomposed the task into a pragmatic 'locate-then-segment' pipeline. However, this cascaded design creates an information bottleneck by simplifying semantics into coarse geometric prompts (e.g., points), and struggles to maintain temporal consistency as the segmenting process is often decoupled from the initial language grounding. To overcome these fundamental limitations, we propose FlowRVS, a novel framework that reconceptualizes RVOS as a conditional continuous flow problem. This allows us to harness the inherent strengths of pretrained T2V models: fine-grained pixel control, text-video semantic alignment, and temporal coherence. Instead of the conventional approaches of generating a mask from noise or directly predicting the mask, we reformulate the task by learning a direct, language-guided deformation from a video's holistic representation to its target mask. Our one-stage, generative approach achieves new state-of-the-art results across all major RVOS benchmarks. Specifically, it achieves a $\mathcal{J}\&\mathcal{F}$ of 51.1 on MeViS (+1.6 over prior SOTA) and 73.3 on the zero-shot Ref-DAVIS17 (+2.7), demonstrating the significant potential of modeling video understanding tasks as continuous deformation processes.
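
The notion of a continuous flow from a source representation to a target can be sketched with the simplest flow-matching setup: a straight-line path between the two, whose velocity is their difference. The code below assumes this linear path and uses an oracle velocity in place of a trained, language-conditioned network; the names `video_latent`, `mask_latent`, and the helper functions are hypothetical and do not come from FlowRVS.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for a velocity network on the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_integrate(x0, velocity_fn, steps=4):
    """Follow the velocity field from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for step in range(steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


video_latent = [0.2, 0.8, 0.5]  # stand-in for a video representation
mask_latent = [0.0, 1.0, 1.0]   # stand-in for the target mask


def oracle(x, t):
    # A trained model would predict this velocity; here it is given exactly.
    return target_velocity(video_latent, mask_latent)


result = euler_integrate(video_latent, oracle, steps=4)
```

With the exact velocity, integration lands on the target up to floating-point error; a learned model would only approximate this field.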

[161] Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images

Aditya Prakash,David Forsyth,Saurabh Gupta

Main category: cs.CV

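TL;DR: This paper proposes a method for forecasting bimanual 3D hand motion and articulation from a single image in everyday scenes; it uses a diffusion model to lift 2D keypoint sequences to 4D hand motion and adopts a diffusion loss to model the multimodal distribution of hand motion.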

Details

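Motivation: The lack of 3D hand annotations in diverse settings limits how well existing methods generalize to everyday images. Method: Designs a diffusion-based annotation pipeline that lifts 2D hand keypoint sequences to 4D hand motion, and trains the forecasting model with a diffusion loss to capture the multimodality of hand motion. Result: Experiments on 6 datasets show that training on data with imputed labels improves performance by 14%; the proposed lifting model is 42% better than the best baseline and the forecasting model gains 16.4%, most notably in zero-shot generalization to everyday images. Conclusion: Generating high-quality 4D hand motion labels with a diffusion model and forecasting with a diffusion loss substantially improves bimanual 3D motion forecasting from a single image, particularly generalization to unseen scenes. Abstract: We tackle the problem of forecasting bimanual 3D hand motion and articulation from a single image in everyday settings. To address the lack of 3D hand annotations in diverse settings, we design an annotation pipeline consisting of a diffusion model to lift 2D hand keypoint sequences to 4D hand motion. For the forecasting model, we adopt a diffusion loss to account for the multimodality in hand motion distribution. Extensive experiments across 6 datasets show the benefits of training on diverse data with imputed labels (14% improvement) and the effectiveness of our lifting (42% better) and forecasting (16.4% gain) models over the best baselines, especially in zero-shot generalization to everyday images.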

[162] ShapeGen4D: Towards High Quality 4D Shape Generation from Videos

Jiraphon Yenphraphai,Ashkan Mirzaei,Jianqi Chen,Jiaxu Zou,Sergey Tulyakov,Raymond A. Yeh,Peter Wonka,Chaoyang Wang

Main category: cs.CV

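TL;DR: This paper proposes an end-to-end video-to-4D shape generation framework built on pre-trained 3D models; by introducing temporal attention, time-aware sampling with 4D latent anchoring, and noise sharing across frames, it accurately models non-rigid motion, volume changes, and topological transitions.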

Details

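Motivation: Existing methods struggle to generate temporally consistent, detailed dynamic 3D shapes directly from a single video and usually rely on per-frame optimization, which causes temporal instability and incoherent geometry; a framework that generates high-quality 4D shapes end-to-end is therefore needed. Method: Builds on large-scale pre-trained 3D models and proposes three key components: (i) a temporal attention mechanism that conditions generation on all video frames to produce a time-indexed representation; (ii) time-aware point sampling and 4D latent anchoring, which improve the temporal consistency of geometry and texture; and (iii) a cross-frame noise sharing strategy that improves temporal stability. Result: Validated on diverse in-the-wild videos, the method substantially improves robustness and perceptual quality and reduces failure cases relative to the baselines, and it accurately captures non-rigid motion, volume changes, and topological transitions. Conclusion: The proposed framework generates high-quality dynamic 3D shapes from video end-to-end, without per-frame optimization, and outperforms existing methods in temporal consistency and detail fidelity. Abstract: Video-conditioned 4D shape generation aims to recover time-varying 3D geometry and view-consistent appearance directly from an input video. In this work, we introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from the video. Our framework introduces three key components based on large-scale pre-trained 3D models: (i) a temporal attention that conditions generation on all frames while producing a time-indexed dynamic representation; (ii) time-aware point sampling and 4D latent anchoring that promote temporally consistent geometry and texture; and (iii) noise sharing across frames to enhance temporal stability. Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization. Across diverse in-the-wild videos, our method improves robustness and perceptual fidelity and reduces failure modes compared with the baselines.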

[163] Drive&Gen: Co-Evaluating End-to-End Driving and Video Generation Models

Jiahao Wang,Zhenpei Yang,Yijing Bai,Yingwei Li,Yuliang Zou,Bo Sun,Abhijit Kundu,Jose Lezama,Luna Yue Huang,Zehao Zhu,Jyh-Jing Hwang,Dragomir Anguelov,Mingxing Tan,Chiyu Max Jiang

Main category: cs.CV

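TL;DR: This paper proposes the Drive&Gen framework, which couples generative video models with end-to-end driving models, evaluates the realism of generated videos in a controllable virtual environment, and uses synthetic data to improve the driving model's generalization to out-of-distribution scenarios.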

Details

Motivation: It is unclear whether existing generative models can produce videos that are realistic and controllable enough for evaluating end-to-end autonomous driving systems, and the biases and generalization of end-to-end driving models are not well understood. Method: Proposes new statistical measures that use end-to-end driving models to evaluate the realism of generated videos; exploits the controllability of the video generation model to run targeted experiments on the distribution gaps that affect driving model performance; and augments training with synthetic data from the generative model to improve generalization. Result: Shows that the realism of generated videos under specified conditions can be quantified; reveals the key distribution gaps that affect end-to-end driving model performance; and demonstrates that synthetic data effectively improves generalization to new operational scenarios. Conclusion: The Drive&Gen framework connects generative world models with autonomous driving models, providing an efficient, low-cost approach to testing, analyzing, and training autonomous driving systems and helping their deployment in new environments. Abstract: Recent advances in generative models have sparked exciting new possibilities in the field of autonomous vehicles. Specifically, video generation models are now being explored as controllable virtual testing environments. Simultaneously, end-to-end (E2E) driving models have emerged as a streamlined alternative to conventional modular autonomous driving systems, gaining popularity for their simplicity and scalability. However, the application of these techniques to simulation and planning raises important questions. First, while video generation models can generate increasingly realistic videos, can these videos faithfully adhere to the specified conditions and be realistic enough for E2E autonomous planner evaluation? Second, given that data is crucial for understanding and controlling E2E planners, how can we gain deeper insights into their biases and improve their ability to generalize to out-of-distribution scenarios? In this work, we bridge the gap between the driving models and generative world models (Drive&Gen) to address these questions. We propose novel statistical measures leveraging E2E drivers to evaluate the realism of generated videos. By exploiting the controllability of the video generation model, we conduct targeted experiments to investigate distribution gaps affecting E2E planner performance. Finally, we show that synthetic data produced by the video generation model offers a cost-effective alternative to real-world data collection. This synthetic data effectively improves E2E model generalization beyond existing Operational Design Domains, facilitating the expansion of autonomous vehicle services into new operational contexts.

[164] Fine-grained Defocus Blur Control for Generative Image Models

Ayush Shrivastava,Connelly Barnes,Xuaner Zhang,Lingzhi Zhang,Andrew Owens,Sohrab Amirghodsi,Eli Shechtman

Main category: cs.CV

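TL;DR: Proposes a text-to-image diffusion model conditioned on EXIF data that achieves fine-grained control over lens blur by mimicking the physical image formation process.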

Details

Motivation: Existing text-to-image diffusion models struggle to incorporate fine-grained camera metadata such as aperture settings and lack precise control over depth-of-field effects. Method: First generates an all-in-focus image, estimates its monocular depth, predicts a plausible focus distance with a novel focus distance transformer, and then forms a defocused image with a differentiable lens blur model; the whole process is trained end-to-end so that gradients flow through every stage. Result: Without explicit supervision, the model learns to generate defocus effects from the content and the EXIF data, and at inference time the amount of blur can be controlled precisely and interactively without changing the scene content. Conclusion: The method offers finer control over defocus blur than existing diffusion models while keeping the generated scene consistent and of high quality. Abstract: Current text-to-image diffusion models excel at generating diverse, high-quality images, yet they struggle to incorporate fine-grained camera metadata such as precise aperture settings. In this work, we introduce a novel text-to-image diffusion framework that leverages camera metadata, or EXIF data, which is often embedded in image files, with an emphasis on generating controllable lens blur. Our method mimics the physical image formation process by first generating an all-in-focus image, estimating its monocular depth, predicting a plausible focus distance with a novel focus distance transformer, and then forming a defocused image with an existing differentiable lens blur model. Gradients flow backwards through this whole process, allowing us to learn without explicit supervision to generate defocus effects based on content elements and the provided EXIF data. At inference time, this enables precise interactive user control over defocus effects while preserving scene contents, which is not achievable with existing diffusion models. Experimental results demonstrate that our model enables superior fine-grained control without altering the depicted scene.
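
To show how aperture, focus distance, and depth interact in a physically motivated blur model, the sketch below evaluates the standard thin-lens circle-of-confusion formula. This is a textbook relation used here for illustration only; the abstract does not say which lens blur model the authors use, and the function name and example camera settings are assumptions.

```python
def circle_of_confusion(focal_length_mm, f_number, focus_dist_mm, subject_dist_mm):
    """Thin-lens blur-circle diameter (mm) on the sensor for a point at subject_dist_mm."""
    aperture_diameter = focal_length_mm / f_number
    magnification = focal_length_mm / (focus_dist_mm - focal_length_mm)
    defocus = abs(subject_dist_mm - focus_dist_mm) / subject_dist_mm
    return aperture_diameter * magnification * defocus


# Hypothetical 50 mm lens focused at 2 m, with a background point at 4 m.
wide_open = circle_of_confusion(50.0, 1.8, 2000.0, 4000.0)
stopped_down = circle_of_confusion(50.0, 8.0, 2000.0, 4000.0)
in_focus = circle_of_confusion(50.0, 1.8, 2000.0, 2000.0)
```

Points at the focus distance get zero blur, and a smaller f-number (wider aperture) produces a larger blur circle for the same depth, which is the kind of behavior an EXIF-conditioned model would need to reproduce.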

[165] Dropping the D: RGB-D SLAM Without the Depth Sensor

Mert Kiray,Alican Karaomer,Benjamin Busam

Main category: cs.CV

TL;DR: DropD-SLAM is a real-time monocular SLAM system that replaces the depth sensor with three pretrained vision modules and achieves RGB-D-level accuracy.

Details

Motivation: To achieve accurate, real-time, metric-scale SLAM without an active depth sensor, reducing cost and system complexity. Method: Uses a monocular metric depth estimator, a learned keypoint detector, and an instance segmentation network; dynamic objects are suppressed with dilated instance masks, and static keypoints are combined with predicted depth and fed to a standard RGB-D SLAM back end for tracking and mapping. Result: On the TUM RGB-D dataset, the mean ATE is 7.4 cm on static sequences and 1.8 cm on dynamic sequences, matching or surpassing state-of-the-art RGB-D methods while running at 22 FPS on a single GPU. Conclusion: Modern pretrained vision models can effectively replace active depth sensors as a reliable source of metric scale, supporting simpler and lower-cost SLAM systems. Abstract: We present DropD-SLAM, a real-time monocular SLAM system that achieves RGB-D-level accuracy without relying on depth sensors. The system replaces active depth input with three pretrained vision modules: a monocular metric depth estimator, a learned keypoint detector, and an instance segmentation network. Dynamic objects are suppressed using dilated instance masks, while static keypoints are assigned predicted depth values and backprojected into 3D to form metrically scaled features. These are processed by an unmodified RGB-D SLAM back end for tracking and mapping. On the TUM RGB-D benchmark, DropD-SLAM attains 7.4 cm mean ATE on static sequences and 1.8 cm on dynamic sequences, matching or surpassing state-of-the-art RGB-D methods while operating at 22 FPS on a single GPU. These results suggest that modern pretrained vision models can replace active depth sensors as reliable, real-time sources of metric scale, marking a step toward simpler and more cost-effective SLAM systems.
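
The backprojection step mentioned in the abstract can be sketched with the usual pinhole camera model: a keypoint at pixel `(u, v)` with predicted metric depth becomes a 3D point in the camera frame. The function name and the intrinsics below are hypothetical values chosen for illustration, not parameters from DropD-SLAM.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Hypothetical intrinsics for a 640x480 camera; depth in meters.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
center_point = backproject(320.0, 240.0, 2.0, fx, fy, cx, cy)
offset_point = backproject(420.0, 240.0, 2.0, fx, fy, cx, cy)
```

A keypoint at the principal point lies on the optical axis, while one 100 pixels to the right at 2 m depth lands 0.4 m to the side, so any error in predicted depth scales the reconstructed position proportionally.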

[166] EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

Deheng Zhang,Yuqian Fu,Runyi Yang,Yang Miao,Tianwen Qian,Xu Zheng,Guolei Sun,Ajad Chhatkuli,Xuanjing Huang,Yu-Gang Jiang,Luc Van Gool,Danda Pani Paudel

Main category: cs.CV

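TL;DR: This paper presents EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task; it introduces day-night aligned videos to improve the quality of night annotations and reveals how lighting conditions affect model performance.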

Details

Motivation: Existing benchmarks for egocentric vision understanding mostly focus on daytime scenes and overlook the low-light conditions that are unavoidable in practice, so nighttime visual understanding lacks effective evaluation. Method: Collects day-night aligned videos, both synthetic videos rendered with Blender and real-world recordings, builds the EgoNight-VQA dataset, and develops a day-augmented night auto-labeling engine whose annotations are refined through extensive human verification. Result: EgoNight-VQA contains 3658 QA pairs across 90 videos, covering 12 QA types; evaluations show that state-of-the-art multimodal large language models drop substantially in performance when moving from day to night. Two auxiliary tasks are also introduced: day-night correspondence retrieval and nighttime depth estimation. Conclusion: EgoNight-VQA provides a solid foundation for application-driven egocentric vision research that generalizes across illumination domains. Abstract: Most existing benchmarks for egocentric vision understanding focus primarily on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered by Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refinement through extensive human verification. Each QA pair is double-checked by annotators for reliability. In total, EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types, with more than 300 hours of human work. Evaluations of state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and egocentric depth estimation at night, that further explore the boundaries of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. All the data and code will be made available upon acceptance.

[167] Human3R: Everyone Everywhere All at Once

Yue Chen,Xingyu Chen,Yuxuan Xue,Anpei Chen,Yuliang Xiu,Gerard Pons-Moll

Main category: cs.CV

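TL;DR: Human3R is a unified, feed-forward framework for real-time 4D human-scene reconstruction from monocular video; in a single forward pass it jointly recovers multi-person SMPL-X bodies, the 3D scene, and camera trajectories, with high efficiency and a low memory footprint.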

Details

Motivation: Existing methods rely on multi-stage pipelines, iterative optimization, and many pre-processing modules (such as human detection, depth estimation, and SLAM), which makes them complex and inefficient; Human3R aims to provide a unified, efficient end-to-end model that removes these heavy dependencies. Method: Builds on the CUT3R model and applies parameter-efficient visual prompt tuning, preserving CUT3R's spatiotemporal priors while directly reading out the parameters of multiple SMPL-X bodies and reconstructing the global scene and humans jointly. Result: After only one day of training on the small synthetic dataset BEDLAM, Human3R runs in real time at 15 FPS with 8 GB of GPU memory and achieves state-of-the-art or competitive results on several tasks, including human motion estimation, human mesh recovery, depth estimation, and camera pose estimation. Conclusion: Human3R offers a simple yet strong unified baseline that performs online 4D human-scene reconstruction efficiently and is easy to extend to downstream applications. Abstract: We present Human3R, a unified, feed-forward framework for online 4D human-scene reconstruction, in the world frame, from casually captured monocular videos. Unlike previous approaches that rely on multi-stage pipelines, iterative contact-aware refinement between humans and scenes, and heavy dependencies, e.g., human detection, depth estimation, and SLAM pre-processing, Human3R jointly recovers global multi-person SMPL-X bodies ("everyone"), dense 3D scene ("everywhere"), and camera trajectories in a single forward pass ("all-at-once"). Our method builds upon the 4D online reconstruction model CUT3R and uses parameter-efficient visual prompt tuning to preserve CUT3R's rich spatiotemporal priors while enabling direct readout of multiple SMPL-X bodies. Human3R is a unified model that eliminates heavy dependencies and iterative refinement. After being trained on the relatively small-scale synthetic dataset BEDLAM for just one day on one GPU, it achieves superior performance with remarkable efficiency: it reconstructs multiple humans in a one-shot manner, along with 3D scenes, in one stage, at real-time speed (15 FPS) with a low memory footprint (8 GB). Extensive experiments demonstrate that Human3R delivers state-of-the-art or competitive performance across tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation, with a single unified model. We hope that Human3R will serve as a simple yet strong baseline and be easily extended for downstream applications. Code is available at https://fanegg.github.io/Human3R

Table of Contents

cs.CL [Back]

[1] Collaborative and Proactive Management of Task-Oriented Conversations

[2] Trainable Reference-Based Evaluation Metric for Identifying Quality of English-Gujarati Machine Translation System

[3] Hallucination is Inevitable for LLMs with the Open World Assumption

[4] Towards Structured Knowledge: Advancing Triple Extraction from Regional Trade Agreements using Large Language Models

[5] CARE: Cognitive-reasoning Augmented Reinforcement for Emotional Support Conversation

[6] MADS: Multi-Agent Dialogue Simulation for Diverse Persuasion Data Generation

[7] Catalog-Native LLM: Speaking Item-ID Dialect with Less Entanglement for Recommendation

[8] Improving Metacognition and Uncertainty Communication in Language Models

[9] Advancing Automated Spatio-Semantic Analysis in Picture Description Using Language Models

[10] Automated Alignment of Math Items to Content Standards in Large-Scale Assessments Using Language Models

[11] Submodular Context Partitioning and Compression for In-Context Learning-short paper

[12] Rationale-Augmented Retrieval with Constrained LLM Re-Ranking for Task Discovery

[13] Training Large Language Models To Reason In Parallel With Global Forking Tokens

[14] Characterizing Model Behavior Under Synthetic Data Training: An Empirical Study Across Scales and Mixing Ratios

[15] Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment

[16] Linguistic Characteristics of AI-Generated Text: A Survey

[17] Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics

[18] LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review Generation

[19] NLD-LLM: A systematic framework for evaluating small language transformer models on natural language description

[20] To model human linguistic prediction, make LLMs less superhuman

[21] Reliable End-to-End Material Information Extraction from the Literature with Source-Tracked Multi-Stage Large Language Models

[22] SynCED-EnDe 2025: A Synthetic and Curated English - German Dataset for Critical Error Detection in Machine Translation

[23] Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs

[24] Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

[25] Exploring Large Language Models for Financial Applications: Techniques, Performance, and Challenges with FinMA

[26] A Single Character can Make or Break Your LLM Evals

[27] Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs

[28] A novel hallucination classification framework

[29] Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning

[30] Camellia: Benchmarking Cultural Biases in LLMs for Asian Languages

[31] RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts

[32] WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives

[33] Residualized Similarity for Faithfully Explainable Authorship Verification

[34] The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures

[35] Context Length Alone Hurts LLM Performance Despite Perfect Retrieval

[36] Cross-Lingual Mental Health Ontologies for Indian Languages: Bridging Patient Expression and Clinical Understanding through Explainable AI and Human-in-the-Loop Validation

[37] Aligning Language Models with Clinical Expertise: DPO for Heart Failure Nursing Documentation in Critical Care

[38] A Lightweight Large Language Model-Based Multi-Agent System for 2D Frame Structural Analysis

[39] Self-Filtered Distillation with LLMs-generated Trust Indicators for Reliable Patent Classification

[40] SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?

[41] AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering

[42] SocialNLI: A Dialogue-Centric Social Inference Dataset

[43] TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation

[44] Language Model as Planner and Formalizer under Constraints

[45] LANTERN: Scalable Distillation of Large Language Models for Job-Person Fit and Explanation

[46] Prototype-Based Dynamic Steering for Large Language Models

[47] CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension

[48] KEO: Knowledge Extraction on OMIn via Knowledge Graphs and RAG for Safety-Critical Aviation Maintenance

[49] H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference

[50] On the Role of Difficult Prompts in Self-Play Preference Optimization

[51] Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM

[52] Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations

[53] Mission Impossible: Feedback-Guided Dynamic Interactive Planning for Improving Reasoning on LLMs

[54] A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks

[55] MADIAVE: Multi-Agent Debate for Implicit Attribute Value Extraction

[56] The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP

[57] Code-Switching In-Context Learning for Cross-Lingual Transfer of Large Language Models

[58] DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision

[59] Adaptive and Multi-Source Entity Matching for Name Standardization of Astronomical Observation Facilities

[60] Diversity Is All You Need for Contrastive Learning: Spectral Bounds on Gradient Magnitudes

[61] InforME: Improving Informativeness of Abstractive Text Summarization With Informative Attention Guided by Named Entity Salience

[62] Mixture of Neuron Experts

[63] Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech

[64] EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

[65] Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer

[66] DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization

[67] Automated Boilerplate: Prevalence and Quality of Contract Generators in the Context of Swiss Privacy Policies

[68] Revisiting Long-context Modeling from Context Denoising Perspective

[69] Evaluating the Sensitivity of LLMs to Harmful Contents in Long Input

[70] The fragility of "cultural tendencies" in LLMs

[71] Prompt reinforcing for long-term planning of large language models

[72] Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens

[73] EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

[74] Probing the Difficulty Perception Mechanism of Large Language Models

[75] LexiCon: a Benchmark for Planning under Temporal Constraints in Natural Language

[76] Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments

[77] MASA: Rethinking the Representational Bottleneck in LoRA with Multi-A Shared Adaptation

[78] UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG