cs.CL

[1] Talking to Yourself: Defying Forgetting in Large Language Models

Yutao Sun,Mingshuai Chen,Tiancheng Zhao,Phillip Miao,Zilun Zhang,Haozhan Shen,Ruizhe Zhu,Jianwei Yin

Main category: cs.CL

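TL;DR: This paper proposes SA-SFT, a lightweight self-augmentation method in which a large language model generates self-dialogue data before fine-tuning and then trains on a mix of that data and the task data. Without extra data or changes to the optimization strategy, it mitigates catastrophic forgetting and improves in-domain performance.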

Details

Motivation: Address catastrophic forgetting during task-specific fine-tuning of large language models, where a model gains new task abilities but loses its original general knowledge and reasoning ability.

Method: SA-SFT: the model first generates dialogue data on its own (self-dialogues), then these self-generated data are mixed with the target task data for supervised fine-tuning, with no change to the optimizer or training schedule.

Result: Across 50 evaluation scenarios, SA-SFT achieves the best results in 40, stays close to the original model's performance, and outperforms baselines such as layer freezing and external data mixing.

Conclusion: Self-augmentation is a simple and effective mechanism for robust LLM adaptation without incurring catastrophic forgetting.

Abstract: Catastrophic forgetting remains a major challenge when fine-tuning large language models (LLMs) on narrow, task-specific data, often degrading their general knowledge and reasoning abilities. We propose SA-SFT, a lightweight self-augmentation routine in which an LLM generates self-dialogues prior to fine-tuning, and the resulting self-authored data are mixed with task data without modifying optimization or training schedules. Despite requiring no external data or additional tuning, SA-SFT consistently mitigates catastrophic forgetting while improving in-domain performance. Across 50 evaluation scenarios, it maintains performance comparable to the original model and achieves the best results in 40 cases, outperforming common baselines such as layer freezing and external data mixing. Guided by these empirical findings, we further present a theoretical analysis suggesting that forgetting can partly stem from style-induced parameter drift, and that self-alignment through self-generated data provides an effective means to counteract this effect. Overall, our results indicate that self-augmentation offers a simple and effective mechanism for robust LLM adaptation without incurring catastrophic forgetting.
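
The data-mixing step described for SA-SFT can be pictured with a small sketch. This is not the authors' pipeline: the function name `build_sa_sft_dataset`, the `generate` callable (standing in for the model before fine-tuning), the seed prompts, and the plain concatenate-and-shuffle mix are all assumptions made for illustration, since the abstract does not say how self-dialogues are prompted or in what proportion they are mixed.

```python
import random


def build_sa_sft_dataset(task_data, seed_prompts, generate, seed=0):
    """Mix task examples with self-generated examples (hypothetical helper).

    task_data    : list of {"prompt": str, "response": str} dicts
    seed_prompts : prompts the not-yet-fine-tuned model answers itself (assumption)
    generate     : callable prompt -> response, standing in for the base model
    """
    # Self-authored data: the model's own responses, produced before fine-tuning.
    self_data = [
        {"prompt": p, "response": generate(p), "source": "self"}
        for p in seed_prompts
    ]
    # Task data is tagged so the two sources can be told apart later.
    tagged_task = [{**ex, "source": "task"} for ex in task_data]
    # Assumption: a simple concatenate-and-shuffle mix; the ratio is not from the paper.
    mixed = tagged_task + self_data
    random.Random(seed).shuffle(mixed)
    return mixed


task = [{"prompt": "Classify: great movie", "response": "positive"}]
mixed = build_sa_sft_dataset(task, ["Tell me about rivers."], lambda p: "echo: " + p)
```

The mixed list would then be fed to an otherwise unchanged supervised fine-tuning loop, which matches the abstract's statement that optimization and schedules are left as they are.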

[2] Benchmarking Distilled Language Models: Performance and Efficiency in Resource-Constrained Settings

Sachin Gopal Wani,Eric Page,Ajay Dholakia,David Ellison

Main category: cs.CL

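TL;DR: Through benchmarking, this paper shows that knowledge distillation substantially improves the performance-to-compute ratio of small language models: creating a distilled 8B model is over 2,000 times more compute-efficient than training its vanilla counterpart, yet it matches or exceeds the reasoning ability of standard models ten times its size.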

Details

Motivation: Develop efficient small language models (SLMs) suitable for resource-constrained environments.

Method: Benchmark the performance and computational cost of distilled models against vanilla and proprietary models, and quantitatively analyze their efficiency.

Result: Distilled models show a superior performance-to-compute curve; creating a distilled 8B model is over 2,000 times more compute-efficient than training its vanilla counterpart, with reasoning ability on par with or exceeding standard models ten times its size.

Conclusion: Knowledge distillation is not only a model compression technique but also a primary strategy for building state-of-the-art, accessible AI.

Abstract: Knowledge distillation offers a transformative pathway to developing powerful, yet efficient, small language models (SLMs) suitable for resource-constrained environments. In this paper, we benchmark the performance and computational cost of distilled models against their vanilla and proprietary counterparts, providing a quantitative analysis of their efficiency. Our results demonstrate that distillation creates a superior performance-to-compute curve. We find that creating a distilled 8B model is over 2,000 times more compute-efficient than training its vanilla counterpart, while achieving reasoning capabilities on par with, or even exceeding, standard models ten times its size. These findings validate distillation not just as a compression technique, but as a primary strategy for building state-of-the-art, accessible AI.

[3] ConceptRM: The Quest to Mitigate Alert Fatigue through Consensus-Based Purity-Driven Data Cleaning for Reflection Modelling

Yongda Yu,Lei Zhang,Xinxin Guo,Minghui Yu,Zhengqi Zhuang,Guoping Rong,Haifeng Shen,Zhengfeng Li,Boge Wang,Guoan Zhang,Bangyu Xiang,Xiaobin Xu

Main category: cs.CL

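TL;DR: This paper proposes ConceptRM, which uses a small number of expert annotations as anchors, trains multiple models through data perturbation and co-teaching, and identifies reliable negative samples in noisy data. This builds a high-quality corpus for training a reflection model at low cost and markedly improves false-alert interception.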

Details

Motivation: Address the "alert fatigue" caused by the large volume of (mostly false) alerts in intelligent agent systems, along with the challenge that labelled data from user feedback is noisy in production environments and costly to clean manually.

Method: ConceptRM: with a small number of expert annotations as anchors, generate perturbed datasets with different noise ratios, train multiple distinct models with co-teaching, and identify reliable negative samples in the noisy data by analyzing the models' consensus decisions.

Result: ConceptRM improves false-alert interception over several state-of-the-art LLM baselines by up to 53.31% on in-domain datasets and up to 41.67% on out-of-domain datasets, with minimal annotation cost.

Conclusion: ConceptRM is an efficient, low-cost, noise-robust method that makes effective use of noisy user feedback to build a high-quality training corpus and markedly improves the reflection model's ability to intercept false alerts.

Abstract: In many applications involving intelligent agents, the overwhelming volume of alerts (mostly false) generated by the agents may desensitize users and cause them to overlook critical issues, leading to the so-called "alert fatigue". A common strategy is to train a reflection model as a filter to intercept false alerts with labelled data collected from user verification feedback. However, a key challenge is the noisy nature of such data as it is often collected in production environments. As cleaning noise via manual annotation incurs high costs, this paper proposes a novel method ConceptRM for constructing a high-quality corpus to train a reflection model capable of effectively intercepting false alerts. With only a small amount of expert annotations as anchors, ConceptRM creates perturbed datasets with varying noise ratios and utilizes co-teaching to train multiple distinct models for collaborative learning. By analyzing the consensus decisions of these models, it effectively identifies reliable negative samples from a noisy dataset. Experimental results demonstrate that ConceptRM significantly enhances the interception of false alerts with minimal annotation cost, outperforming several state-of-the-art LLM baselines by up to 53.31% on in-domain datasets and 41.67% on out-of-domain datasets.

[4] InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation

Yu Li,Pranav Narayanan Venkit,Yada Pruksachatkun,Chien-Sheng Wu

Main category: cs.CL

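TL;DR: This paper proposes a large-scale evaluation framework for personality simulation grounded in real interview data. Using over 671,000 question-answer pairs, it evaluates content similarity, factual consistency, personality alignment, and knowledge retention, and reveals a trade-off between retrieval-augmented and chronological methods across these dimensions.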

Details

Motivation: Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews, and lack direct assessment against what individuals actually said.

Method: Build a dataset of over 671,000 question-answer pairs from 23,000 verified interview transcripts covering 1,000 public personalities, each with an average of 11.5 hours of interview content; propose four evaluation metrics (content similarity, factual consistency, personality alignment, knowledge retention); and compare retrieval-augmented and chronological methods.

Result: Methods grounded in real interview data substantially outperform those relying only on biographical profiles or the model's parametric knowledge; retrieval-augmented methods are better at capturing personality style and response quality, while chronological methods better preserve factual consistency and knowledge retention.

Conclusion: The evaluation framework supports method selection based on application requirements, and the empirical findings provide actionable insights for advancing personality simulation research.

Abstract: Simulating real personalities with large language models requires grounding generation in authentic personal data. Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said. We address this gap with an interview-grounded evaluation framework for personality simulation at a large scale. We extract over 671,000 question-answer pairs from 23,000 verified interview transcripts across 1,000 public personalities, each with an average of 11.5 hours of interview content. We propose a multi-dimensional evaluation framework with four complementary metrics measuring content similarity, factual consistency, personality alignment, and factual knowledge retention. Through systematic comparison, we demonstrate that methods grounded in real interview data substantially outperform those relying solely on biographical profiles or the model's parametric knowledge. We further reveal a trade-off in how interview data is best utilized: retrieval-augmented methods excel at capturing personality style and response quality, while chronological-based methods better preserve factual consistency and knowledge retention. Our evaluation framework enables principled method selection based on application requirements, and our empirical findings provide actionable insights for advancing personality simulation research.

[5] What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance

William Watson,Nicole Cho,Sumitra Ganesh,Manuela Veloso

Main category: cs.CL

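TL;DR: This paper argues that a query's form affects hallucination in large language models (LLMs). It builds a 22-dimension query feature vector and, using 369,837 real-world queries, finds that certain features (such as deep clause nesting and underspecification) align with higher hallucination risk, while clear intention grounding and answerability align with lower hallucination rates.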

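Details

Motivation: LLM hallucinations are usually attributed to defects of the model or its decoding strategy. Drawing on classical linguistics, this paper argues that a query's form (its linguistic structure) also shapes the model's response and thus the occurrence of hallucinations.

Method: Construct a 22-dimension query feature vector covering clause complexity, lexical rarity, anaphora, negation, answerability, and intention grounding; run a large-scale analysis on 369,837 real-world queries to examine how each feature relates to hallucination rate.

Result: Features such as deep clause nesting and underspecification consistently align with higher hallucination rates; clear intention grounding and answerability align with lower hallucination rates; other features, including domain specificity, show mixed effects that depend on the dataset and model.

Conclusion: The linguistic features of a query form an observable, quantifiable "risk landscape", providing an empirical basis for query-rewriting approaches to hallucination mitigation and for future intervention studies.

Abstract: Large Language Model (LLM) hallucinations are usually treated as defects of the model or its decoding strategy. Drawing on classical linguistics, we argue that a query's form can also shape a listener's (and model's) response. We operationalize this insight by constructing a 22-dimension query feature vector covering clause complexity, lexical rarity, anaphora, negation, answerability, and intention grounding, all known to affect human comprehension. Using 369,837 real-world queries, we ask: Are there certain types of queries that make hallucination more likely? A large-scale analysis reveals a consistent "risk landscape": certain features such as deep clause nesting and underspecification align with higher hallucination propensity. In contrast, clear intention grounding and answerability align with lower hallucination rates. Others, including domain specificity, show mixed, dataset- and model-dependent effects. Thus, these findings establish an empirically observable query-feature representation correlated with hallucination risk, paving the way for guided query rewriting and future intervention studies.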

[6] No One Size Fits All: QueryBandits for Hallucination Mitigation

Nicole Cho,William Watson,Alec Koppel,Sumitra Ganesh,Manuela Veloso

Main category: cs.CL

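TL;DR: This paper proposes QueryBandits, a model-agnostic contextual bandit framework that adaptively selects the best query-rewrite strategy online to mitigate hallucinations in large language models, especially closed-source ones. Across 16 QA scenarios it outperforms the No-Rewrite baseline and static rewrite policies, and it shows that no single rewrite policy is optimal for all queries and that an ill-suited static policy can worsen hallucinations.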

Details

Motivation: Most hallucination mitigation work focuses on post-hoc detection or parameter editing for open-source models and neglects the closed-source models that dominate institutional deployments; an online intervention that does not depend on model internals and works with closed-source models is needed.

Method: QueryBandits: a model-agnostic framework based on contextual bandits that uses an empirically validated and calibrated reward function to learn online and select the best query-rewrite strategy (e.g., Paraphrase, Expand); arms are selected with algorithms such as Thompson Sampling, and the context is modeled with semantic features.

Result: Across 16 QA scenarios, the top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over the No-Rewrite baseline and outperforms the zero-shot static policies Paraphrase and Expand by 42.6% and 60.3%, respectively; all contextual bandits outperform vanilla bandits; certain static policies incur higher cumulative regret than No-Rewrite, supporting the finding that no single rewrite policy is optimal.

Conclusion: QueryBandits mitigates hallucinations in closed-source LLMs purely through forward-pass mechanisms, with no retraining or gradient updates, offering a new approach to using closed-source models safely and controllably in real deployments.

Abstract: Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations; yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing. The dearth of studies focusing on hallucinations in closed-source models is especially concerning, as they constitute the vast majority of models in institutional deployments. We introduce QueryBandits, a model-agnostic contextual bandit framework that adaptively learns online to select the optimal query-rewrite strategy by leveraging an empirically validated and calibrated reward function. Across 16 QA scenarios, our top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a No-Rewrite baseline and outperforms zero-shot static policies (e.g., Paraphrase or Expand) by 42.6% and 60.3%, respectively. Moreover, all contextual bandits outperform vanilla bandits across all datasets, with higher feature variance coinciding with greater variance in arm selection. This substantiates our finding that there is no single rewrite policy optimal for all queries. We also discover that certain static policies incur higher cumulative regret than No-Rewrite, indicating that an inflexible query-rewriting policy can worsen hallucinations.
Thus, learning an online policy over semantic features with QueryBandits can shift model behavior purely through forward-pass mechanisms, enabling its use with closed-source models and bypassing the need for retraining or gradient-based adaptation.
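
To make the arm-selection idea concrete, here is a minimal sketch of Thompson Sampling over rewrite strategies. It is deliberately simplified and is not the QueryBandits implementation: it is a non-contextual Beta-Bernoulli bandit (the paper's bandits condition on query features), the class and function names are hypothetical, and the per-arm success rates in the simulation are invented purely to exercise the code. Only the arm names No-Rewrite, Paraphrase, and Expand come from the abstract.

```python
import random


class BetaThompsonSampler:
    """Non-contextual Thompson Sampling with a Beta posterior per arm."""

    def __init__(self, arms, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) is a uniform prior over each arm's success rate.
        self.alpha = {arm: 1.0 for arm in arms}
        self.beta = {arm: 1.0 for arm in arms}

    def select(self):
        # Draw one plausible success rate per arm and play the best draw.
        draws = {
            arm: self.rng.betavariate(self.alpha[arm], self.beta[arm])
            for arm in self.alpha
        }
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        # reward is assumed to lie in [0, 1], e.g. 1.0 for a non-hallucinated answer.
        self.alpha[arm] += reward
        self.beta[arm] += 1.0 - reward


def simulate(true_rates, rounds, seed=0):
    """Run the sampler against made-up Bernoulli arms and count the pulls."""
    sampler = BetaThompsonSampler(list(true_rates), seed=seed)
    env = random.Random(seed + 1)
    pulls = {arm: 0 for arm in true_rates}
    for _ in range(rounds):
        arm = sampler.select()
        reward = 1.0 if env.random() < true_rates[arm] else 0.0
        sampler.update(arm, reward)
        pulls[arm] += 1
    return pulls


# Hypothetical success rates, not results from the paper.
pulls = simulate({"No-Rewrite": 0.3, "Paraphrase": 0.5, "Expand": 0.8}, rounds=2000)
```

In this toy setting one arm is always best, so the sampler converges on it. The abstract's point is that with real queries the best arm varies by query, which is why the contextual variants outperform vanilla bandits.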

[7] Natural Language Processing Models for Robust Document Categorization

Radoslaw Roszczyk,Pawel Tecza,Maciej Stodolski,Krzysztof Siwek

Main category: cs.CL

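TL;DR: This paper evaluates several machine learning methods for automated text classification, focusing on the trade-off between classification accuracy and computational efficiency. BERT achieves the highest accuracy (>99%) but at high cost; BiLSTM offers the best balance of accuracy (98.56%), speed, and robustness; Naive Bayes is the fastest to train (milliseconds) but the least accurate (about 94.5%). The study also builds a working demonstration system that validates practical use for automated routing of technical requests, and concludes that BiLSTM is the best choice for this scenario.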

Details

Motivation: Integrating AI into real-world automation pipelines requires balancing classification accuracy with computational efficiency, especially for document classification with imbalanced classes.

Method: Compare three models: Naive Bayes, a bidirectional LSTM (BiLSTM), and a fine-tuned BERT; design and implement an end-to-end demonstration system for imbalanced text classification.

Result: BERT achieves the highest accuracy (>99%) but needs long training times and substantial resources; BiLSTM reaches 98.56% accuracy with moderate training cost and good contextual understanding; Naive Bayes trains fastest (milliseconds) but has the lowest accuracy (about 94.5%); class imbalance affects all models in recognizing minority classes; the demonstration system achieves high-throughput automated routing of technical requests.

Conclusion: BiLSTM offers the best balance of accuracy, efficiency, and robustness and is the most practical choice for this task; future work could explore lightweight Transformer architectures and imbalanced-learning strategies.

Abstract: This article presents an evaluation of several machine learning methods applied to automated text classification, alongside the design of a demonstrative system for unbalanced document categorization and distribution. The study focuses on balancing classification accuracy with computational efficiency, a key consideration when integrating AI into real-world automation pipelines. Three models of varying complexity were examined: a Naive Bayes classifier, a bidirectional LSTM network, and a fine-tuned transformer-based BERT model. The experiments reveal substantial differences in performance. BERT achieved the highest accuracy, consistently exceeding 99%, but required significantly longer training times and greater computational resources. The BiLSTM model provided a strong compromise, reaching approximately 98.56% accuracy while maintaining moderate training costs and offering robust contextual understanding. Naive Bayes proved to be the fastest to train, on the order of milliseconds, yet delivered the lowest accuracy, averaging around 94.5%. Class imbalance influenced all methods, particularly in the recognition of minority categories. A fully functional demonstrative system was implemented to validate practical applicability, enabling automated routing of technical requests with throughput unattainable through manual processing. The study concludes that BiLSTM offers the most balanced solution for the examined scenario, while also outlining opportunities for future improvements and further exploration of transformer architectures.

[8] How communicatively optimal are exact numeral systems? Once more on lexicon size and morphosyntactic complexity

Chundra Cathcart,Arne Rubehn,Katja Bocklage,Luca Ciucci,Kellen Parker van Dam,Alžběta Kučerová,Jekaterina Mažara,Carlo Y. Meloni,David Snee,Johann-Mattis List

Main category: cs.CL

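TL;DR: By analyzing the numeral systems of 52 languages, this paper finds that many languages are far less communicatively efficient than expected in balancing lexicon size against morphosyntactic complexity, challenging earlier claims that recursive numeral systems optimize communicative efficiency.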

Details

Motivation: Previous studies did not adequately account for the degree of complexity that languages actually display; this paper re-evaluates the communicative efficiency of numeral systems.

Method: Analyze data from 52 genetically diverse languages using an annotation scheme that distinguishes predictable from unpredictable allomorphy.

Result: Many languages' numeral systems are decisively less efficient than one would expect.

Conclusion: The evolution of numeral systems may not be driven purely toward maximizing communicative efficiency, and the role of efficiency principles in linguistic evolution needs rethinking.

Abstract: Recent research argues that exact recursive numeral systems optimize communicative efficiency by balancing a tradeoff between the size of the numeral lexicon and the average morphosyntactic complexity (roughly length in morphemes) of numeral terms. We argue that previous studies have not characterized the data in a fashion that accounts for the degree of complexity languages display. Using data from 52 genetically diverse languages and an annotation scheme distinguishing between predictable and unpredictable allomorphy (formal variation), we show that many of the world's languages are decisively less efficient than one would expect. We discuss the implications of our findings for the study of numeral systems and linguistic evolution more generally.

[9] Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems

Mukul Chhabra,Luigi Medrano,Arush Verma

Main category: cs.CL

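TL;DR: This paper proposes a case-aware LLM-as-a-Judge evaluation framework for enterprise multi-turn RAG systems. It focuses on operational constraints, structured identifiers, and resolution workflows, and uses eight operationally grounded metrics with severity-aware scoring to improve diagnostic clarity and support scalable evaluation.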

Details

Motivation: Existing RAG evaluation frameworks mainly target single-turn or benchmark-style settings and struggle to capture key failure modes in enterprise multi-turn case workflows (such as case misidentification, workflow misalignment, and partial resolution across turns).

Method: Design a case-aware LLM-as-a-Judge evaluation framework that scores each turn with eight operationally grounded metrics (covering retrieval quality, grounding fidelity, answer utility, precision integrity, and case/workflow alignment), uses a severity-aware scoring protocol and deterministic prompting with strict JSON outputs, and supports batch evaluation, regression testing, and production monitoring.

Result: In a comparative study of two instruction-tuned models across short and long workflows, generic proxy metrics give ambiguous signals, while the framework exposes enterprise-critical trade-offs (e.g., precision vs. coverage, timeliness vs. completeness) and provides actionable signals for system improvement.

Conclusion: Evaluating enterprise multi-turn RAG requires embedding business context; the framework's structured, severity-weighted, deterministic-output design supports diagnosis, iteration, and operations in real settings.

Abstract: Enterprise Retrieval-Augmented Generation (RAG) assistants operate in multi-turn, case-based workflows such as technical support and IT operations, where evaluation must reflect operational constraints, structured identifiers (e.g., error codes, versions), and resolution workflows. Existing RAG evaluation frameworks are primarily designed for benchmark-style or single-turn settings and often fail to capture enterprise-specific failure modes such as case misidentification, workflow misalignment, and partial resolution across turns. We present a case-aware LLM-as-a-Judge evaluation framework for enterprise multi-turn RAG systems. The framework evaluates each turn using eight operationally grounded metrics that separate retrieval quality, grounding fidelity, answer utility, precision integrity, and case/workflow alignment. A severity-aware scoring protocol reduces score inflation and improves diagnostic clarity across heterogeneous enterprise cases. The system uses deterministic prompting with strict JSON outputs, enabling scalable batch evaluation, regression testing, and production monitoring. Through a comparative study of two instruction-tuned models across short and long workflows, we show that generic proxy metrics provide ambiguous signals, while the proposed framework exposes enterprise-critical tradeoffs that are actionable for system improvement.

[10] Disentangling Geometry, Performance, and Training in Language Models

Atharva Kulkarni,Jacob Mitchell Springer,Arjun Subramonian,Swabha Swayamdipta

Main category: cs.CL

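TL;DR: This paper systematically studies the relationship between the geometry of the unembedding matrix in Transformer models (in particular its effective rank) and downstream performance. It finds that effective rank mostly reflects training choices rather than model performance and cannot reliably predict downstream task results.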

Details

Motivation: Investigate whether geometric properties of Transformer weights, especially the unembedding matrix, can be used to estimate a language model's downstream performance.

Method: Run controlled experiments on 108 OLMo-style language models, analyzing how the effective rank of the unembedding matrix and other geometric metrics relate to performance, and examining the influence of pre-training hyperparameters (e.g., batch size, weight decay).

Result: Effective rank has no universal positive correlation with performance; low effective rank does not cause late-stage performance degradation in small models but merely co-occurs with it; there are counterexamples of low-rank models that do not saturate; effective rank is strongly influenced by pre-training hyperparameters; other geometric metrics also fail to predict downstream performance reliably.

Conclusion: Existing geometric metrics mainly reflect training configuration rather than a model's intrinsic capability, so they are unsuitable as proxies for downstream performance.

Abstract: Geometric properties of Transformer weights, particularly the unembedding matrix, have been widely useful in language model interpretability research. Yet, their utility for estimating downstream performance remains unclear. In this work, we systematically investigate the relationship between model performance and the unembedding matrix geometry, particularly its effective rank. Our experiments, involving a suite of 108 OLMo-style language models trained under controlled variation, reveal several key findings. While the best-performing models often exhibit a high effective rank, this trend is not universal across tasks and training setups. Contrary to prior work, we find that low effective rank does not cause late-stage performance degradation in small models, but instead co-occurs with it; we find adversarial cases where low-rank models do not exhibit saturation. Moreover, we show that effective rank is strongly influenced by pre-training hyperparameters, such as batch size and weight decay, which in turn affect the model's performance. Lastly, extending our analysis to other geometric metrics and final-layer representation, we find that these metrics are largely aligned, but none can reliably predict downstream performance. Overall, our findings suggest that the model's geometry, as captured by existing metrics, primarily reflects training choices rather than performance.
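
For readers unfamiliar with effective rank, the sketch below computes one common entropy-based version: the exponential of the Shannon entropy of the normalized singular values. The abstract does not say which definition the paper uses, so treating this as the paper's metric is an assumption. The function takes singular values as input (for example, those returned by an SVD of the unembedding matrix) so that it needs only the standard library.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank (assumed definition, see lead-in).

    Normalize the singular values into a probability distribution p,
    compute the Shannon entropy H(p), and return exp(H(p)).
    """
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    if total == 0:
        return 0.0  # convention chosen here for an all-zero matrix
    entropy = -sum((s / total) * math.log(s / total) for s in positive)
    return math.exp(entropy)


flat = effective_rank([1.0, 1.0, 1.0, 1.0])      # energy spread evenly over 4 directions
spiky = effective_rank([1.0, 0.0, 0.0, 0.0])     # all energy in one direction
skewed = effective_rank([10.0, 1.0, 1.0, 1.0])   # somewhere in between
```

A matrix whose singular values are all equal has effective rank equal to its true rank, and a matrix dominated by one direction has effective rank close to 1, which is why the measure is read as a soft count of the dimensions in use.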

[11] From Performance to Purpose: A Sociotechnical Taxonomy for Evaluating Large Language Model Utility

Gavin Levinson,Keith Feldman

Main category: cs.CL

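TL;DR: This paper introduces the Language Model Utility Taxonomy (LUX), a comprehensive framework spanning four domains (performance, interaction, operations, and governance) for systematically evaluating the utility of large language models in real applications, accompanied by a dynamic web tool for looking up and applying metrics.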

Details

Motivation: Existing LLM evaluation focuses mostly on task-level performance and overlooks the sociotechnical factors that affect utility in high-stakes applications; there is also no unified, comparable, structured taxonomy for evaluation.

Method: Build LUX, a hierarchical taxonomy with four domains (performance, interaction, operations, governance), each organized into thematically aligned dimensions and components tied to quantifiable metrics; develop an external dynamic web tool to support metric lookup and exploration of the framework.

Result: The LUX framework and its companion dynamic web tool give a structured, quantifiable, extensible common standard for evaluating LLM utility and selecting models across use cases.

Conclusion: LUX fills the gap left by the absence of a systematic, cross-domain taxonomy for practical LLM evaluation and promotes a shift from task performance alone to comprehensive sociotechnical utility evaluation.

Abstract: As large language models (LLMs) continue to improve at completing discrete tasks, they are being integrated into increasingly complex and diverse real-world systems. However, task-level success alone does not establish a model's fit for use in practice. In applied, high-stakes settings, LLM effectiveness is driven by a wider array of sociotechnical determinants that extend beyond conventional performance measures. Although a growing set of metrics capture many of these considerations, they are rarely organized in a way that supports consistent evaluation, leaving no unified taxonomy for assessing and comparing LLM utility across use cases. To address this gap, we introduce the Language Model Utility Taxonomy (LUX), a comprehensive framework that structures utility evaluation across four domains: performance, interaction, operations, and governance. Within each domain, LUX is organized hierarchically into thematically aligned dimensions and components, each grounded in metrics that enable quantitative comparison and alignment of model selection with intended use. In addition, an external dynamic web tool is provided to support exploration of the framework by connecting each component to a repository of relevant metrics (factors) for applied evaluation.

[12] Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning

Justin Lovelace,Christian Belardi,Sofian Zalouk,Adhitya Polavaram,Srivatsa Kundurthy,Kilian Q. Weinberger

Main category: cs.CL

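TL;DR: STAR-LDM is a language model that combines latent diffusion planning with autoregressive generation. By adding a "thinking" phase that performs global semantic planning in continuous space, it improves language understanding, narrative coherence, and commonsense reasoning, and supports fine-grained controllable generation without retraining.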

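Details

Motivation: Overcome the limitation that conventional autoregressive language models make only token-by-token decisions and lack global semantic planning, and improve generation quality and controllability.

Method: Propose the Stop-Think-AutoRegress framework: insert a "thinking" phase into autoregressive generation, in which a latent diffusion model iteratively refines a semantic plan in continuous latent space before generation continues with discrete tokens; add lightweight classifiers for attribute control.

Result: Significantly outperforms similar-sized models on language understanding benchmarks; achieves >70% win rates for narrative coherence and commonsense reasoning in LLM-as-judge comparisons; supports controllable generation with better fluency-control trade-offs than specialized approaches.

Conclusion: STAR-LDM shows that diffusion-based "thinking" in continuous latent space can strengthen a language model's planning ability and controllability, offering a new direction for next-generation generative architectures.

Abstract: The Stop-Think-AutoRegress Language Diffusion Model (STAR-LDM) integrates latent diffusion planning with autoregressive generation. Unlike conventional autoregressive language models limited to token-by-token decisions, STAR-LDM incorporates a "thinking" phase that pauses generation to refine a semantic plan through diffusion before continuing. This enables global planning in continuous space prior to committing to discrete tokens. Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves >70% win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning. The architecture also allows straightforward control through lightweight classifiers, enabling fine-grained steering of attributes without model retraining while maintaining better fluency-control trade-offs than specialized approaches.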

[13] Personal Information Parroting in Language Models

Nishant Subramani,Kshitish Ghate,Mona Diab

Main category: cs.CL

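TL;DR: This paper presents the regexes and rules (R&R) detector suite for detecting personal information (PI) such as email addresses, phone numbers, and IP addresses memorized by language models. It finds that both model size and number of pretraining steps are positively correlated with PI memorization and recommends stronger filtering and anonymization of pretraining data.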

Details

Motivation: Modern language models are trained on large-scale web data containing large amounts of personal information (PI), which models may memorize, creating privacy risks; effective ways to detect and reduce PI memorization are therefore needed.

Method: Develop the R&R detector suite, which uses regular expressions and rules to detect three types of PI; on 483 manually curated PI instances, evaluate memorization in the Pythia model suite (160M-6.9B parameters, 70k-143k training steps), using prefix prompting with greedy decoding to test whether the PI is reproduced verbatim (parroting).

Result: Pythia-6.9b parrots 13.6% of the PI instances verbatim; all models show memorization, and even the smallest, Pythia-160m, parrots 2.7%; memorization increases with model size and pretraining steps.

Conclusion: PI memorization is pervasive and grows with scale; pretraining data should be aggressively filtered and anonymized to reduce privacy risk.

Abstract: Modern language models (LMs) are trained on large scrapes of the Web, containing millions of personal information (PI) instances, many of which LMs memorize, increasing privacy risks. In this work, we develop the regexes and rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses, which outperforms the best regex-based PI detectors. On a manually curated set of 483 instances of PI, we measure memorization: finding that 13.6% are parroted verbatim by the Pythia-6.9b model, i.e., when the model is prompted with the tokens that precede the PI in the original document, greedy decoding generates the entire PI span exactly. We expand this analysis to study models of varying sizes (160M-6.9B) and pretraining time steps (70k-143k iterations) in the Pythia model suite and find that both model size and amount of pretraining are positively correlated with memorization. Even the smallest model, Pythia-160m, parrots 2.7% of the instances exactly. Consequently, we strongly recommend that pretraining datasets be aggressively filtered and anonymized to minimize PI parroting.
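
The parroting test described in the abstract (prompt with the text before the PI, decode greedily, check for an exact match) can be sketched as follows. This is an illustration only: the two regular expressions are simplified stand-ins and not the R&R suite, the check works on characters rather than tokens, and `generate` is a hypothetical callable standing in for greedy decoding with a real model.

```python
import re

# Simplified patterns for illustration; the paper's R&R detectors are not reproduced here.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b")


def find_pi(text):
    """Return (kind, start, end) spans for emails and IPv4 addresses, in document order."""
    spans = [("email", m.start(), m.end()) for m in EMAIL_RE.finditer(text)]
    spans += [("ipv4", m.start(), m.end()) for m in IPV4_RE.finditer(text)]
    return sorted(spans, key=lambda span: span[1])


def is_parroted(document, start, end, generate):
    """True if the continuation of the prefix reproduces the PI span exactly.

    generate: callable prefix -> continuation (assumed to be greedy decoding).
    """
    prefix = document[:start]
    target = document[start:end]
    return generate(prefix).startswith(target)


doc = "Contact alice@example.com from 192.168.0.1 today."
spans = find_pi(doc)
kind, start, end = spans[0]
leaky = is_parroted(doc, start, end, lambda prefix: "alice@example.com right away")
safe = is_parroted(doc, start, end, lambda prefix: "the support desk")
```

A memorization rate like the 13.6% reported for Pythia-6.9b would correspond to the fraction of curated PI instances for which such a check returns true.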

[14] Enhancing Hate Speech Detection on Social Media: A Comparative Analysis of Machine Learning Models and Text Transformation Approaches

Saurabh Mishra,Shivani Thakur,Radhika Mamidi

Main category: cs.CL

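TL;DR: This study evaluates a range of machine learning models for detecting and neutralizing hate speech. It finds that advanced models such as BERT achieve higher accuracy while hybrid models perform better in certain scenarios, and it proposes text transformation approaches that turn negative expressions into neutral ones.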

Details

Motivation: Hate speech on social media has surged, creating an urgent need for effective detection and intervention tools.

Method: Compare traditional models such as CNNs and LSTMs with BERT and its derivatives, explore hybrid architectures, and introduce new text transformation techniques to neutralize hateful content.

Result: BERT-type models show higher accuracy thanks to deep contextual understanding; hybrid models have advantages in certain scenarios; the proposed text transformation approaches can convert negative expressions into neutral ones.

Conclusion: Current models each have strengths and weaknesses and should be chosen according to the task; text neutralization is a new route to reducing harm; future work should build more robust, interpretable, and fair detection systems.

Abstract: The proliferation of hate speech on social media platforms has necessitated the development of effective detection and moderation tools. This study evaluates the efficacy of various machine learning models in identifying hate speech and offensive language and investigates the potential of text transformation techniques to neutralize such content. We compare traditional models like CNNs and LSTMs with advanced neural network models such as BERT and its derivatives, alongside exploring hybrid models that combine different architectural features. Our results indicate that while advanced models like BERT show superior accuracy due to their deep contextual understanding, hybrid models exhibit improved capabilities in certain scenarios. Furthermore, we introduce innovative text transformation approaches that convert negative expressions into neutral ones, thereby potentially mitigating the impact of harmful content. The implications of these findings are discussed, highlighting the strengths and limitations of current technologies and proposing future directions for more robust hate speech detection systems.

[15] Semantic Novelty at Scale: Narrative Shape Taxonomy and Readership Prediction in 28,606 Books

W. Frederick Zimmerman

Main category: cs.CL

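TL;DR: This paper proposes semantic novelty as an information-theoretic measure of narrative structure in large text collections and applies it to 28,606 books in the PG19 corpus (pre-1920 English literature). It identifies eight canonical narrative shapes and finds that information-density dynamics (such as "volume" and "speed") predict readership independently of text length and are strongly shaped by genre and historical period.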

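Details

Motivation: Traditional narrative analysis relies mostly on manual annotation or shallow features such as sentiment and topic, and lacks a quantitative account of how information density changes over a text; this paper aims to build a scalable, information-theoretic measure of narrative structure and to reveal its links to reader reception and the evolution of genres.

Method: Define semantic novelty as the cosine distance between each paragraph's sentence embedding and the running centroid of all preceding paragraphs; compute paragraph-level novelty curves on PG19 with 768-dimensional SBERT embeddings, reduce each curve with a 16-segment PAA, and apply Ward-linkage hierarchical clustering to identify narrative archetypes; use SAX symbolic analysis, partial correlation, and chi-squared tests to assess relationships among variables and historical trends.

Result: Eight narrative shape archetypes are found (e.g., Steep Descent, Steep Ascent); volume (variance of the novelty trajectory) is the strongest length-independent predictor of readership (partial rho = 0.32); genre strongly constrains narrative shape (p < 10⁻²⁴²); books became more predictable between 1840 and 1910 (declining T/I ratio); SAX analysis shows 85% signature uniqueness.

Conclusion: Information-density dynamics are a fundamental dimension of narrative distinct from sentiment and topic; their quantitative features are comparable across texts and have empirical explanatory power, offering a new approach for digital humanities and computational narratology.

Abstract: I introduce semantic novelty--cosine distance between each paragraph's sentence embedding and the running centroid of all preceding paragraphs--as an information-theoretic measure of narrative structure at corpus scale. Applying it to 28,606 books in PG19 (pre-1920 English literature), I compute paragraph-level novelty curves using 768-dimensional SBERT embeddings, then reduce each to a 16-segment Piecewise Aggregate Approximation (PAA). Ward-linkage clustering on PAA vectors reveals eight canonical narrative shape archetypes, from Steep Descent (rapid convergence) to Steep Ascent (escalating unpredictability). Volume--variance of the novelty trajectory--is the strongest length-independent predictor of readership (partial rho = 0.32), followed by speed (rho = 0.19) and Terminal/Initial ratio (rho = 0.19). Circuitousness shows strong raw correlation (rho = 0.41) but is 93% correlated with length; after control, partial rho drops to 0.11--demonstrating that naive correlations in corpus studies can be dominated by length confounds. Genre strongly constrains narrative shape (χ² = 2121.6, p < 10⁻²⁴²), with fiction maintaining plateau profiles while nonfiction front-loads information. Historical analysis shows books became progressively more predictable between 1840 and 1910 (T/I ratio trend r = -0.74, p = 0.037). SAX analysis reveals 85% signature uniqueness, suggesting each book traces a nearly unique path through semantic space.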
These findings demonstrate that information-density dynamics, distinct from sentiment or topic, constitute a fundamental dimension of narrative structure with measurable consequences for reader engagement. Dataset: https://huggingface.co/datasets/wfzimmerman/pg19-semantic-novelty
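
The novelty curve, its PAA reduction, and the "volume" feature can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's code: embeddings are plain lists of floats (the paper uses 768-dimensional SBERT vectors), the first paragraph gets no novelty value because it has no predecessors, the PAA segment boundaries use integer division, and volume is taken as population variance, since the abstract does not specify these details.

```python
import math
import statistics


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector carries no novelty
    return 1.0 - dot / (norm_u * norm_v)


def novelty_curve(embeddings):
    """Novelty of paragraph i (i >= 1): distance to the centroid of paragraphs 0..i-1."""
    running_sum = list(embeddings[0])
    curve = []
    for i, emb in enumerate(embeddings[1:], start=1):
        centroid = [s / i for s in running_sum]
        curve.append(cosine_distance(emb, centroid))
        running_sum = [s + e for s, e in zip(running_sum, emb)]
    return curve


def paa(series, segments):
    """Piecewise Aggregate Approximation: the mean of each of `segments` chunks."""
    n = len(series)
    if n < segments:
        raise ValueError("series is shorter than the number of segments")
    means = []
    for k in range(segments):
        chunk = series[k * n // segments:(k + 1) * n // segments]
        means.append(sum(chunk) / len(chunk))
    return means


# Toy 2-dimensional "embeddings": two identical paragraphs, then an orthogonal one.
curve = novelty_curve([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
volume = statistics.pvariance(curve)  # assumed reading of "variance of the trajectory"
reduced = paa([1.0, 2.0, 3.0, 4.0], 2)
```

In the paper's setting each book's curve would be reduced with `segments=16` before clustering, so that books of different lengths become comparable fixed-length vectors.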

[16] CARE: An Explainable Computational Framework for Assessing Client-Perceived Therapeutic Alliance Using Large Language Models

Anqi Li,Chenxiao Wang,Yu Lu,Renjun Xu,Lizhi Ma,Zhenzhong Lan

Main category: cs.CL

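TL;DR: This paper proposes CARE, a framework that uses a large language model (LLaMA-3.1-8B-Instruct) together with 9,516 expert-curated rationales to automatically predict multi-dimensional therapeutic alliance scores from counseling transcripts and generate interpretable rationales. It substantially improves correlation with client ratings (over 70% higher Pearson correlation) and is validated for practical utility on real-world Chinese online counseling sessions.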

Details

Motivation: Traditional questionnaires are burdensome and delayed; existing computational approaches produce coarse scores, lack interpretability, and ignore holistic session context, so they struggle to capture clients' perceptions of the therapeutic alliance accurately.

Method: Build the CARE framework on the CounselingWAI dataset: fine-tune LLaMA-3.1-8B-Instruct with rationale-augmented supervision to predict multi-dimensional alliance scores and generate interpretable rationales.

Result: CARE outperforms leading LLMs on multi-dimensional alliance prediction, achieving over 70% higher Pearson correlation with client ratings; its rationales are rated highly in both automatic and human evaluations; applied to real-world Chinese online counseling sessions, it identifies common alliance-building challenges and provides actionable insights.

Conclusion: CARE is an accurate, interpretable, and practical AI-assisted tool with the potential to support alliance monitoring and intervention in mental health care.

Abstract: Client perceptions of the therapeutic alliance are critical for counseling effectiveness. Accurately capturing these perceptions remains challenging, as traditional post-session questionnaires are burdensome and often delayed, while existing computational approaches produce coarse scores, lack interpretable rationales, and fail to model holistic session context. We present CARE, an LLM-based framework to automatically predict multi-dimensional alliance scores and generate interpretable rationales from counseling transcripts. Built on the CounselingWAI dataset and enriched with 9,516 expert-curated rationales, CARE is fine-tuned using rationale-augmented supervision with the LLaMA-3.1-8B-Instruct backbone. Experiments show that CARE outperforms leading LLMs and substantially reduces the gap between counselor evaluations and client-perceived alliance, achieving over 70% higher Pearson correlation with client ratings. Rationale-augmented supervision further improves predictive accuracy. CARE also produces high-quality, contextually grounded rationales, validated by both automatic and human evaluations. Applied to real-world Chinese online counseling sessions, CARE uncovers common alliance-building challenges, illustrates how interaction patterns shape alliance development, and provides actionable insights, demonstrating its potential as an AI-assisted tool for supporting mental health care.

[17] CAMEL: Confidence-Gated Reflection for Reward Modeling

Zirui Zhu,Hailun Xu,Yang Luo,Yong Liu,Kanchan Sarkar,Kun Xu,Yang You

Main category: cs.CL

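TL;DR: This paper proposes CAMEL, a framework that makes a lightweight single-token preference decision and invokes reflection selectively based on confidence, improving reward-model performance while remaining efficient and outperforming existing methods.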

Details

Motivation: Existing reward models struggle to combine efficiency with interpretability: discriminative models are efficient but not interpretable, while generative models are interpretable but computationally expensive; a new approach is needed that balances accuracy, efficiency, and interpretability.

Method: CAMEL, a confidence-gated reflection framework: first make a lightweight single-token preference decision, then trigger reflection only for low-confidence instances; train with reinforcement learning using counterfactual prefix augmentation so that the model genuinely revises incorrect initial verdicts.

Result: Reaches 82.9% average accuracy on three widely used reward-model benchmarks, surpassing the best prior model by 3.2%, outperforms 70B-parameter models with only 14B parameters, and establishes a strictly better accuracy-efficiency Pareto frontier.

Conclusion: CAMEL shows that confidence-driven selective reflection is an efficient way to improve reward-model performance, offering a new direction for building efficient, reliable, and interpretable alignment systems.

Abstract: Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance difficulty without additional inference cost. Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances. To induce effective self-correction, we train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision. Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters, while establishing a strictly better accuracy-efficiency Pareto frontier.
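
The gating logic in the abstract (decide from the verdict-token log-probabilities, reflect only when the margin is small) can be sketched as below. The function name `gated_verdict`, the verdict labels "A" and "B", the threshold value, and the `reflect` callable are all hypothetical; the abstract does not give the threshold or the interface, so this shows only the control flow, not CAMEL itself.

```python
def gated_verdict(logp_a, logp_b, threshold, reflect):
    """Return (verdict, reflected) for one preference pair.

    logp_a, logp_b : log-probabilities of the two verdict tokens
    threshold      : minimum margin treated as confident (assumed hyperparameter)
    reflect        : callable initial_verdict -> final verdict, the costly path
    """
    fast_verdict = "A" if logp_a >= logp_b else "B"
    margin = abs(logp_a - logp_b)
    if margin >= threshold:
        # Confident: keep the cheap single-token decision.
        return fast_verdict, False
    # Low confidence: spend extra compute on reflection, which may revise the verdict.
    return reflect(fast_verdict), True


def flip(initial_verdict):
    """Toy stand-in for reflection that always overturns the initial verdict."""
    return "B" if initial_verdict == "A" else "A"


confident = gated_verdict(-0.1, -3.0, threshold=1.0, reflect=flip)
uncertain = gated_verdict(-0.6, -0.8, threshold=1.0, reflect=flip)
```

Because the margin comes from the same forward pass that produces the fast verdict, the gate adds no inference cost for confident instances, which is the efficiency argument the abstract makes.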

[18] ID-LoRA: Efficient Low-Rank Adaptation Inspired by Matrix Interpolative Decomposition

Xindian Ma,Rundong Kong,Peng Zhang,Ruoxiang Huang,Yongyu Jiang

Main category: cs.CL

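TL;DR: ID-LoRA is a new parameter-efficient fine-tuning framework that reuses clustered parameter groups from the pretrained weight matrix to build multiple low-rank components sharing a single low-rank matrix, cutting trainable parameters (by up to 46%) while maintaining or improving performance, and outperforming LoRA and its variants especially in multi-task settings.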

Details Motivation: LoRA and its variants still add considerable trainable-parameter overhead on large models; lowering the rank reduces parameters but severely hurts multi-task performance, so the trade-off between parameter count and performance needs to be broken. Method: Proposes ID-LoRA, which extracts and reuses clustered parameter groups from the pretrained weight matrix to form multiple low-rank components, all sharing a single initialized trainable low-rank matrix. Result: Outperforms full fine-tuning and PEFT baselines such as LoRA, DoRA, and HydraLoRA on five benchmarks (mathematical reasoning, code generation, MMLU, CommonsenseQA, and safety alignment); uses up to 46% fewer trainable parameters than standard LoRA; in multi-task settings it surpasses LoRA and its variants on Code and MMLU with only 54% of LoRA's trainable parameters. Conclusion: ID-LoRA decouples parameter efficiency from model capacity, offering a new paradigm for efficient multi-task adaptation of large models. Abstract: LoRA has become a universal Parameter-Efficient Fine-Tuning (PEFT) technique that equips Large Language Models (LLMs) to adapt quickly to new tasks. However, when these models are scaled up, even the latest LoRA variants still introduce considerable overhead in trainable parameters. Conversely, aggressively lowering the rank to curb this overhead markedly degrades performance in complex multi-task settings. We propose ID-LoRA, a novel PEFT framework that breaks the trade-off. Its core innovation lies in extracting and reusing clustered parameter groups from the pretrained weight matrix. These groups are then used to form multiple low-rank components, all of which share only a single initialized trainable low-rank matrix. This approach cuts the number of trainable parameters while keeping the model's capacity intact. We evaluate ID-LoRA on five diverse benchmarks: Mathematical Reasoning, Code Generation, MMLU, CommonsenseQA, and Safety Alignment. ID-LoRA outperforms both full fine-tuning and existing PEFT baselines (e.g., LoRA, DoRA, HydraLoRA) while using up to 46% fewer trainable parameters than the standard LoRA. In multi-task scenarios, it surpasses LoRA and its recent variants (e.g., DoRA and HydraLoRA) on both Code and MMLU tasks, yet requires only 54% of the trainable parameters demanded by the conventional LoRA.

[19] Adaptive Text Anonymization: Learning Privacy-Utility Trade-offs via Prompt Optimization

Gabriel Loiseau,Damien Sileo,Damien Riquet,Maxime Meyer,Marc Tommasi

Main category: cs.CL

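TL;DR: This paper introduces the task of adaptive text anonymization and a task-specific prompt-optimization framework that automatically builds anonymization instructions for different privacy goals, domains, and downstream usage patterns, achieving a better privacy-utility trade-off than existing baselines on multiple datasets.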

Details Motivation: Existing text anonymization methods rely on static, manually designed strategies that cannot adjust to diverse requirements or generalize across domains. Method: Proposes a task-specific prompt-optimization framework that automatically constructs anonymization instructions for language models, adapting them to particular privacy-utility requirements. Result: On a benchmark of five datasets covering different domains, privacy constraints, and utility objectives, the method beats existing baselines in every setting, remains efficient and effective on open-source language models with performance comparable to larger closed-source models, and discovers new anonymization strategies that explore the privacy-utility frontier. Conclusion: Adaptive text anonymization is a more flexible, general, and practical paradigm for text privacy protection that dynamically balances privacy and data utility. Abstract: Anonymizing textual documents is a highly context-sensitive problem: the appropriate balance between privacy protection and utility preservation varies with the data domain, privacy objectives, and downstream application. However, existing anonymization methods rely on static, manually designed strategies that lack the flexibility to adjust to diverse requirements and often fail to generalize across domains. We introduce adaptive text anonymization, a new task formulation in which anonymization strategies are automatically adapted to specific privacy-utility requirements. We propose a framework for task-specific prompt optimization that automatically constructs anonymization instructions for language models, enabling adaptation to different privacy goals, domains, and downstream usage patterns. To evaluate our approach, we present a benchmark spanning five datasets with diverse domains, privacy constraints, and utility objectives. Across all evaluated settings, our framework consistently achieves a better privacy-utility trade-off than existing baselines, while remaining computationally efficient and effective on open-source language models, with performance comparable to larger closed-source models. Additionally, we show that our method can discover novel anonymization strategies that explore different points along the privacy-utility trade-off frontier.

[20] Explicit Grammar Semantic Feature Fusion for Robust Text Classification

Azrin Sultana,Firoz Ahmed

Main category: cs.CL

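TL;DR: This paper proposes a lightweight text classification model that explicitly encodes syntactic structure (such as phrase patterns and complexity indicators) into a compact grammar vector and fuses it with frozen contextual embeddings, avoiding full Transformers or heavy deep learning architectures and markedly improving performance on edge devices.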

Details Motivation: Transformer-based NLP models are computationally expensive and unsuitable for resource-constrained environments; a lightweight, efficient approach that captures both grammatical structure and semantics is needed. Method: Builds a "grammar vector" that explicitly encodes syntactic structure and fuses it with frozen contextual embeddings into a unified representation; evaluated with DBN, LSTM, BiLSTM, BERT, and XLNet models while varying the number of training epochs. Result: The proposed model outperforms baselines by 2%-15% across several benchmarks, models both structure and semantics, and has a small parameter count suited to edge deployment. Conclusion: Treating grammar as an explicit inductive bias rather than a learnable module is an effective way to build high-performing lightweight NLP models, and it outperforms syntax-aware Transformers that rely on extra attention layers or tree encoders. Abstract: Natural Language Processing enables computers to understand human language by analysing and classifying text efficiently with deep-level grammatical and semantic features. Existing models capture features by learning from large corpora with transformer models, which are computationally intensive and unsuitable for resource-constrained environments. Therefore, our proposed study incorporates comprehensive grammatical rules alongside semantic information to build a robust, lightweight classification model without resorting to fully parameterised transformer models or heavy deep learning architectures. The novelty of our approach lies in its explicit encoding of sentence-level grammatical structure, including syntactic composition, phrase patterns, and complexity indicators, into a compact grammar vector, which is then fused with frozen contextual embeddings. These heterogeneous elements are unified into a single representation that captures both the structural and semantic characteristics of the text. Deep learning models such as Deep Belief Networks (DBNs), Long Short-Term Memory networks (LSTMs), BiLSTMs, and transformer-based BERT and XLNet were used to train and evaluate the model, with the number of epochs varied. Based on experimental results, the unified feature representation model captures both the semantic and structural properties of text, outperforming baseline models by 2%-15%, enabling more effective learning across heterogeneous domains.
Unlike prior syntax-aware transformer models that inject grammatical structure through additional attention layers, tree encoders, or full fine-tuning, the proposed framework treats grammar as an explicit inductive bias rather than a learnable module, resulting in a very lightweight model that delivers better performance on edge devices.

[21] SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing

Yifei Xu,Guilherme Potje,Shivam Shandilya,Tiancheng Yuan,Leonardo de Oliveira Nunes,Rakshanda Agarwal,Saeid Asgari,Adam Atkinson,Emre Kıcıman,Songwu Lu,Ranveer Chandra,Tusher Chakraborty

Main category: cs.CL

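TL;DR: This paper proposes SibylSense, which adapts a frozen rubric generator at inference time through a tunable memory bank, improving the alignment and robustness of reward design for open-ended generation tasks.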

Details Motivation: Current ways of building rubrics are costly, inconsistent, or prone to saturation and drift, making them a weak basis for high-quality reward design in RL post-training. Method: SibylSense uses verifier-based item rewards, updating the memory bank from the discriminative gap between reference and candidate answers on a handful of examples; it alternates memory tuning with a rubric-adversarial policy update, pushing the rubrics to capture new quality dimensions. Result: Experiments on two open-ended generation tasks show that SibylSense produces more discriminative rubrics and improves downstream RL performance over static and non-adaptive baselines. Conclusion: SibylSense offers an efficient, scalable, and robust mechanism for adapting rubrics at inference time, easing a key bottleneck in reward design for open-ended generation. Abstract: Designing aligned and robust rewards for open-ended generation remains a key barrier to RL post-training. Rubrics provide structured, interpretable supervision, but scaling rubric construction is difficult: expert rubrics are costly, prompted rubrics are often superficial or inconsistent, and fixed-pool discriminative rubrics can saturate and drift, enabling reward hacking. We present SibylSense, an inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank of validated rubric items. Memory is updated via verifier-based item rewards measured by reference-candidate answer discriminative gaps from a handful of examples. SibylSense alternates memory tuning with a rubric-adversarial policy update that produces rubric-satisfying candidate answers, shrinking discriminative gaps and driving the rubric generator to capture new quality dimensions. Experiments on two open-ended tasks show that SibylSense yields more discriminative rubrics and improves downstream RL performance over static and non-adaptive baselines.

[22] Overton Pluralistic Reinforcement Learning for Large Language Models

Yu Fu,Seongho Son,Ilija Bogunovic

Main category: cs.CL

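TL;DR: This paper proposes OP-GRPO, a reinforcement learning framework that lets a single large language model implicitly generate pluralistic responses without explicit prompting or modular design, achieving broader coverage of human values with small models.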

Details Motivation: Existing alignment paradigms struggle to capture the plurality of human values; a method is needed that produces responses with diverse perspectives from a single query. Method: Proposes OP-GRPO, a reinforcement learning framework for implicit Overton Pluralism with two steps: 1) fine-tune a dedicated Sentence Transformer as a similarity estimator to evaluate response coverage accurately; 2) embed this estimator in a dual-reward system that balances breadth of perspectives with the uniqueness of each, promoting diversity. Result: The Qwen2.5-3B-Instruct model achieves a 37.4% relative accuracy gain over a 20B GPT-OSS baseline on an NLI benchmark and a 19.1% relative improvement over a modular architecture baseline; evaluation with GPT-4.1 as judge further confirms robustness. Conclusion: OP-GRPO delivers a "small models, big perspective coverage" effect, offering an efficient, lightweight, and implicit route to pluralistic response generation for value alignment. Abstract: Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a "small models, big perspective coverage" effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.

[23] Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation

Sayantan Dasgupta,Trevor Cohn,Timothy Baldwin

Main category: cs.CL

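TL;DR: This paper proposes a tail-aware divergence for language model distillation that decouples the contribution of the teacher's top-K high-probability predictions from that of its low-probability predictions, strengthening the influence of the distribution's tail and improving the student model while keeping computation efficient.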

Details Motivation: In language model distillation, the standard KL divergence is dominated by the teacher's highest-probability predictions (its modes), which weakens the role of low-probability but potentially informative tail predictions. Method: Proposes a new tail-aware divergence that decouples the contribution of the teacher's top-K predictions from that of the remaining low-probability predictions, at the same computational cost as KL divergence. Result: Pre-training and supervised distillation experiments on decoder models across several datasets show competitive performance, and the distillation process is efficient enough to handle large datasets on an academic compute budget. Conclusion: The tail-aware distillation method uses the teacher's full output distribution more evenly, in particular strengthening knowledge transfer from the tail, and balances performance with efficiency. Abstract: The core learning signal used in language model distillation is the standard Kullback-Leibler (KL) divergence between the student and teacher distributions. Traditional KL divergence tends to be dominated by the next tokens with the highest probabilities, i.e., the teacher's modes, thereby diminishing the influence of less probable yet potentially informative components of the output distribution. We propose a new tail-aware divergence that decouples the contribution of the teacher model's top-K predicted probabilities from that of lower-probability predictions, while maintaining the same computational profile as the KL divergence. Our decoupled approach reduces the impact of the teacher modes and, consequently, increases the contribution of the tail of the distribution. Experimental results demonstrate that our modified distillation method yields competitive performance in both pre-training and supervised distillation of decoder models across various datasets. Furthermore, the distillation process is efficient and can be performed with a modest academic budget for large datasets, eliminating the need for industry-scale computing.
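The abstract does not give the divergence itself, so the following is only a sketch of one way such a decoupling can work, resting on an exact identity: KL(p‖q) splits into a term for the top-K versus tail probability mass, plus KL terms inside the top-K set and inside the tail, weighted by the teacher's mass on each. Because the tail weight is small, the tail contributes little; replacing the weights with free parameters lets the tail count for more. The function names, the weights `w_top` and `w_tail`, and the choice of p as teacher and q as student are assumptions for illustration, not the paper's formulation.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def decoupled_kl(p, q, k, w_top=None, w_tail=None):
    """Top-K / tail decomposition of KL(p || q) with adjustable weights.

    With the default weights (the teacher's mass on each part) this equals
    kl(p, q) exactly; other weights rebalance the two parts.
    """
    if not 0 < k < len(p):
        raise ValueError("k must satisfy 0 < k < len(p)")
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    top, tail = order[:k], order[k:]
    p_top, q_top = sum(p[i] for i in top), sum(q[i] for i in top)
    p_tail, q_tail = sum(p[i] for i in tail), sum(q[i] for i in tail)

    # How well the student matches the teacher's top-K vs. tail mass split.
    mass_term = kl([p_top, p_tail], [q_top, q_tail])
    # Divergence between the renormalised distributions inside each part.
    top_term = kl([p[i] / p_top for i in top], [q[i] / q_top for i in top])
    tail_term = (
        kl([p[i] / p_tail for i in tail], [q[i] / q_tail for i in tail])
        if p_tail > 0
        else 0.0
    )

    w_top = p_top if w_top is None else w_top
    w_tail = p_tail if w_tail is None else w_tail
    return mass_term + w_top * top_term + w_tail * tail_term
```

All three terms are computed from the same probabilities a standard KL loss already needs, which is consistent with the abstract's claim of an unchanged computational profile.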

[24] The Art of Efficient Reasoning: Data, Reward, and Optimization

Taiqiang Wu,Zenan Zu,Bo Zhou,Ngai Wong

Main category: cs.CL

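TL;DR: This paper systematically studies the mechanics of efficient reasoning in large language models, proposes fine-grained evaluation metrics, finds that training proceeds in two stages (length adaptation and reasoning refinement), and recommends training on relatively easy prompts to avoid length collapse; the learned length bias generalizes across domains.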

Details Motivation: Chain-of-thought reasoning improves large language models but carries heavy computational overhead, so short yet accurate reasoning trajectories need to be incentivized. Method: Trains for efficient reasoning through reward shaping in reinforcement learning, designs fine-grained metrics (such as length distribution conditioned on correctness and performance under token budgets from 2k to 32k), and runs about 0.2 million GPU hours of experiments under a unified protocol, analyzing training prompts, rollouts, reward design, and optimization strategies. Result: Training follows two stages, length adaptation and reasoning refinement; training on easier prompts raises the density of positive rewards and prevents length collapse; the learned length bias generalizes across domains; all findings are validated on the Qwen3 series (0.6B to 30B). Conclusion: The paper distills key insights and practical guidelines for efficient-reasoning training and confirms their robustness and generalization across model scales. Abstract: Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead. To address this issue, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we systematically investigate the mechanics of efficient reasoning for LLMs. For comprehensive evaluation, we advocate for more fine-grained metrics, including length distribution conditioned on correctness and performance across a wide spectrum of token budgets ranging from 2k to 32k. First, we reveal that the training process follows a two-stage paradigm: length adaptation and reasoning refinement. After that, we conduct extensive experiments (about 0.2 million GPU hours) in a unified protocol, deconstructing training prompts and rollouts, reward shaping, and optimization strategies. In particular, a key finding is to train on relatively easier prompts, ensuring the density of positive reward signals and thus avoiding the length collapse. Meanwhile, the learned length bias can be generalized across domains. We distill all findings into valuable insights and practical guidelines, and further validate them across the Qwen3 series, ranging from 0.6B to 30B, demonstrating the robustness and generalization.

[25] On Data Engineering for Scaling LLM Terminal Capabilities

Renjie Pi,Grace Lam,Mohammad Shoeybi,Pooya Jannaty,Bryan Catanzaro,Wei Ping

Main category: cs.CL

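TL;DR: This paper presents the Terminal-Task-Gen synthetic task generation pipeline and the Terminal-Corpus dataset, trains the Nemotron-Terminal model family, achieves substantial gains on Terminal-Bench 2.0, and open-sources the models and data.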

Details Motivation: The terminal capabilities of large language models are advancing quickly, but the training-data strategies behind them are largely undisclosed; this paper systematically studies data engineering practices for terminal agents. Method: Proposes Terminal-Task-Gen, a lightweight synthetic task generation pipeline supporting seed-based and skill-based construction, and analyzes training strategies including data filtering, curriculum learning, long-context training, and scaling behavior; on this basis it builds the open-source Terminal-Corpus dataset and trains the Nemotron-Terminal family from Qwen3. Result: Nemotron-Terminal improves substantially on Terminal-Bench 2.0: the 8B, 14B, and 32B versions rise from 2.5%, 4.0%, and 3.4% to 13.0%, 20.2%, and 27.4% respectively, matching much larger models. Conclusion: Data engineering is critical to terminal-agent performance; Terminal-Task-Gen and Terminal-Corpus provide a reproducible, scalable foundation, and the open-source release should advance terminal-agent research. Abstract: Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collections/nvidia/nemotron-terminal.

cs.CV

[26] VISION-ICE: Video-based Interpretation and Spatial Identification of Arrhythmia Origins via Neural Networks in Intracardiac Echocardiography

Dorsa EPMoghaddam,Feng Gao,Drew Bernard,Kavya Sinha,Mehdi Razavi,Behnaam Aazhang

Main category: cs.CV

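TL;DR: This paper proposes an AI framework based on ICE video and a 3D convolutional neural network for automatically localizing arrhythmia origins; it reaches 66.2% three-class accuracy, well above the random baseline, showing clinical potential to shorten ablation procedures and enable more targeted intervention.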

Details Motivation: High-density mapping and preoperative CT/MRI are time- and resource-intensive for arrhythmia localization, whereas ICE is routine in electrophysiology procedures but underused for real-time guidance; a fast, accurate AI method that fits into the procedure is needed. Method: Formulates arrhythmia source localization as a three-class task (normal sinus rhythm, left-sided, and right-sided arrhythmia), trains a 3D convolutional neural network on ICE video, and evaluates with ten-fold cross-validation. Result: The model reaches a mean accuracy of 66.2% on four unseen patients, well above the 33.3% random baseline, demonstrating the feasibility of automated arrhythmia localization from ICE video with deep learning. Conclusion: The framework has potential for clinical translation, could speed up target identification, streamline electrophysiological interventions, and reduce the burden of catheter ablation; larger datasets are needed to improve robustness and generalization. Abstract: Contemporary high-density mapping techniques and preoperative CT/MRI remain time- and resource-intensive in localizing arrhythmias. AI has been validated as a clinical decision aid in providing accurate, rapid real-time analysis of echocardiographic images. Building on this, we propose an AI-enabled framework that leverages intracardiac echocardiography (ICE), a routine part of electrophysiology procedures, to guide clinicians toward areas of arrhythmogenesis and potentially reduce procedural time. Arrhythmia source localization is formulated as a three-class classification task, distinguishing normal sinus rhythm, left-sided, and right-sided arrhythmias, based on ICE video data. We developed a 3D Convolutional Neural Network trained to discriminate among the three aforementioned classes. In ten-fold cross-validation, the model achieved a mean accuracy of 66.2% when evaluated on four previously unseen patients (substantially outperforming the 33.3% random baseline). These results demonstrate the feasibility and clinical promise of using ICE videos combined with deep learning for automated arrhythmia localization. Leveraging ICE imaging could enable faster, more targeted electrophysiological interventions and reduce the procedural burden of cardiac ablation. Future work will focus on expanding the dataset to improve model robustness and generalizability across diverse patient populations.

[27] OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport

Xiwen Chen,Wenhui Zhu,Gen Li,Xuanzhao Dong,Yujian Xiong,Hao Wang,Peijie Qiu,Qingquan Song,Zhipeng Wang,Shao Tang,Yalin Wang,Abolfazl Razi

Main category: cs.CV

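TL;DR: This paper proposes OTPrune, a training-free visual token pruning framework that uses optimal transport to align the distributions of the full and pruned tokens, preserving both local diversity and global representativeness and improving inference efficiency for multi-modal large language models.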

Details Motivation: Multi-modal large language models (MLLMs) incur high inference cost from redundant visual tokens, and existing pruning methods overlook the distributional structure of visual representations. Method: Formulates pruning as distribution alignment via optimal transport (OT), minimizing the 2-Wasserstein distance; derives a tractable submodular objective and proves its monotonicity and submodularity. Result: Experiments on a wide range of benchmarks show that OTPrune achieves better performance-efficiency trade-offs than state-of-the-art methods. Conclusion: OTPrune offers a principled, stable, efficient, and semantically faithful approach to training-free visual token pruning. Abstract: Multi-modal large language models (MLLMs) achieve strong visual-language reasoning but suffer from high inference cost due to redundant visual tokens. Recent work explores visual token pruning to accelerate inference, but existing pruning methods overlook the underlying distributional structure of visual representations. We propose OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT). By minimizing the 2-Wasserstein distance between the full and pruned token distributions, OTPrune preserves both local diversity and global representativeness while reducing inference cost. Moreover, we derive a tractable submodular objective that enables efficient optimization, and theoretically prove its monotonicity and submodularity, providing a principled foundation for stable and efficient pruning. We further provide a comprehensive analysis that explains how distributional alignment contributes to stable and semantically faithful pruning. Comprehensive experiments on a wide range of benchmarks demonstrate that OTPrune achieves superior performance-efficiency tradeoffs compared to state-of-the-art methods. The code is available at https://github.com/xiwenc1/OTPrune.

[28] De-rendering, Reasoning, and Repairing Charts with Vision-Language Models

Valentin Bonas,Martin Sinnona,Viviana Siless,Emmanuel Iarussi

Main category: cs.CV

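TL;DR: This paper proposes a framework that combines chart de-rendering, automated analysis, and iterative improvement to give actionable, interpretable feedback on visualization design, improving chart quality and users' visualization literacy.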

Details Motivation: Rule-based visualization linters lack contextual understanding, while general-purpose LLMs give unreliable feedback because they are not trained on visualization design principles. Method: Reconstructs chart structure through de-rendering, identifies design flaws with vision-language reasoning, and proposes concrete modifications grounded in visualization research; users can selectively apply them and re-render, forming a feedback loop. Result: On 1,000 charts from the Chart2Code benchmark the system generated 10,452 recommendations, which cluster into 10 categories (for example axis formatting, color accessibility, and legend consistency). Conclusion: LLM-driven recommendation systems can provide structured, principle-based feedback on visualization design and may lead to more intelligent and accessible authoring tools. Abstract: Data visualizations are central to scientific communication, journalism, and everyday decision-making, yet they are frequently prone to errors that can distort interpretation or mislead audiences. Rule-based visualization linters can flag violations, but they miss context and do not suggest meaningful design changes. Directly querying general-purpose LLMs about visualization quality is unreliable: lacking training to follow visualization design principles, they often produce inconsistent or incorrect feedback. In this work, we introduce a framework that combines chart de-rendering, automated analysis, and iterative improvement to deliver actionable, interpretable feedback on visualization design. Our system reconstructs the structure of a chart from an image, identifies design flaws using vision-language reasoning, and proposes concrete modifications supported by established principles in visualization research. Users can selectively apply these improvements and re-render updated figures, creating a feedback loop that promotes both higher-quality visualizations and the development of visualization literacy. In our evaluation on 1,000 charts from the Chart2Code benchmark, the system generated 10,452 design recommendations, which clustered into 10 coherent categories (e.g., axis formatting, color accessibility, legend consistency). These results highlight the promise of LLM-driven recommendation systems for delivering structured, principle-based feedback on visualization design, opening the door to more intelligent and accessible authoring tools.

[29] N4MC: Neural 4D Mesh Compression

Guodong Chen,Huanshuo Dong,Mallesham Dasari

Main category: cs.CV

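TL;DR: N4MC is the first 4D neural compression framework for time-varying mesh sequences; it models temporal redundancy (through motion compensation and interpolation) to compress efficiently and outperforms existing methods in rate-distortion performance and real-time decoding.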

Details Motivation: Existing neural mesh compression methods process each frame independently and ignore temporal redundancy, although 4D mesh sequences are strongly correlated in time and call for inter-frame compression like that in video coding. Method: Converts irregular mesh frames into regular 4D tensors; uses an auto-decoder to model spatial and temporal correlations jointly and remove redundancy; adds a transformer-based interpolation model that predicts intermediate frames from latent embeddings of tracked volume centers to improve temporal coherence. Result: Surpasses the state of the art in rate-distortion performance and supports real-time decoding of 4D mesh sequences. Conclusion: N4MC brings inter-frame compression to neural mesh compression for the first time, showing that explicitly modeling spatiotemporal redundancy is effective and practical for compressing 4D geometry. Abstract: We present N4MC, the first 4D neural compression framework to efficiently compress time-varying mesh sequences by exploiting their temporal redundancy. Unlike prior neural mesh compression methods that treat each mesh frame independently, N4MC takes inspiration from inter-frame compression in 2D video codecs, and learns motion compensation in long mesh sequences. Specifically, N4MC converts consecutive irregular mesh frames into regular 4D tensors to provide a uniform and compact representation. These tensors are then condensed using an auto-decoder, which captures both spatial and temporal correlations for redundancy removal. To enhance temporal coherence, we introduce a transformer-based interpolation model that predicts intermediate mesh frames conditioned on latent embeddings derived from tracked volume centers, eliminating motion ambiguities. Extensive evaluations show that N4MC outperforms the state of the art in rate-distortion performance, while enabling real-time decoding of 4D mesh sequences. The implementation of our method is available at: https://github.com/frozzzen3/N4MC.

[30] GSNR: Graph Smooth Null-Space Representation for Inverse Problems

Romario Gualdrón-Hurtado,Roman Jacome,Rafael S. Suarez,Henry Arguello

Main category: cs.CV

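TL;DR: This paper proposes Graph-Smooth Null-Space Representation (GSNR), which introduces a graph-smoothness prior in the null-space of the sensing matrix to improve reconstruction quality in imaging inverse problems, clearly outperforming baselines across several tasks.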

Details Motivation: Conventional image priors (such as sparsity and smoothness) do not constrain the null-space component of the sensing matrix, which biases the reconstruction; meaningful null-space information is therefore needed. Method: Using graph signal processing, builds a null-restricted Laplacian and designs a low-dimensional projection matrix from the p smoothest spectral graph modes, applying structured regularization only to the unobservable null-space component. Result: On image deblurring, compressed sensing, demosaicing, and super-resolution, PSNR improves by up to 4.3 dB over baselines and up to 1 dB over end-to-end learned models, with better convergence, null-space variance coverage, and predictability. Conclusion: GSNR offers a theoretically grounded and practical way to model the null-space in inverse problems and plugs into mainstream frameworks such as PnP, DIP, and diffusion solvers. Abstract: Inverse problems in imaging are ill-posed, leading to infinitely many solutions consistent with the measurements due to the non-trivial null-space of the sensing matrix. Common image priors promote solutions on the general image manifold, such as sparsity, smoothness, or score function. However, as these priors do not constrain the null-space component, they can bias the reconstruction. Thus, we aim to incorporate meaningful null-space information in the reconstruction framework. Inspired by smooth image representation on graphs, we propose Graph-Smooth Null-Space Representation (GSNR), a mechanism that imposes structure only into the invisible component. Particularly, given a graph Laplacian, we construct a null-restricted Laplacian that encodes similarity between neighboring pixels in the null-space signal, and we design a low-dimensional projection matrix from the $p$-smoothest spectral graph modes (lowest graph frequencies). This approach has strong theoretical and practical implications: i) improved convergence via a null-only graph regularizer, ii) better coverage, how much null-space variance is captured by $p$ modes, and iii) high predictability, how well these modes can be inferred from the measurements. GSNR is incorporated into well-known inverse problem solvers, e.g., PnP, DIP, and diffusion solvers, in four scenarios: image deblurring, compressed sensing, demosaicing, and image super-resolution, providing consistent improvement of up to 4.3 dB over baseline formulations and up to 1 dB compared with end-to-end learned models in terms of PSNR.
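The range/null-space split that this abstract relies on can be written out in standard linear-algebra notation. The symbols below ($A$ for the sensing matrix, $L$ for the graph Laplacian, $P_{\mathcal{N}}$, $L_{\mathcal{N}}$, $U_p$) are introduced here for illustration, and the specific form $L_{\mathcal{N}} = P_{\mathcal{N}} L P_{\mathcal{N}}$ is an assumption about what "null-restricted" means, not a formula taken from the abstract.

```latex
% Any image x splits into a measured part and an invisible part:
\[
  x \;=\; \underbrace{A^{\dagger} A\, x}_{\text{determined by } y = Ax}
    \;+\; \underbrace{(I - A^{\dagger} A)\, x}_{\text{null-space component}},
  \qquad
  P_{\mathcal{N}} \;=\; I - A^{\dagger} A .
\]
% One plausible null-restricted Laplacian (assumed form):
\[
  L_{\mathcal{N}} \;=\; P_{\mathcal{N}}\, L\, P_{\mathcal{N}} .
\]
% U_p collects the p eigenvectors of L_N, taken within the null-space,
% with the smallest eigenvalues (the lowest graph frequencies):
\[
  L_{\mathcal{N}}\, u_i = \lambda_i u_i, \quad
  0 \le \lambda_1 \le \dots \le \lambda_p, \quad
  U_p = [\,u_1, \dots, u_p\,], \quad A\, u_i = 0 .
\]
```

Since $A P_{\mathcal{N}} = 0$, any term built from $P_{\mathcal{N}} x$ leaves the data-consistency part of the reconstruction untouched, which matches the abstract's description of imposing structure only on the invisible component.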

[31] Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking

Jingcheng Yang,Tianhu Xiong,Shengyi Qian,Klara Nahrstedt,Mingyuan Wu

Main category: cs.CV

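TL;DR: This paper presents the first framework for transparent circuit tracing in vision-language models (VLMs), revealing their multimodal reasoning mechanisms and verifying that the discovered circuits are causal and controllable.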

Details Motivation: Vision-language models (VLMs) are powerful but lack interpretability, so their multimodal reasoning needs systematic analysis. Method: Traces and analyzes circuits with transcoders, attribution graphs, and attention-based methods, and verifies causality through feature steering and circuit patching. Result: Finds that VLMs integrate visual and semantic concepts hierarchically and identifies distinct visual feature circuits that handle mathematical reasoning and support cross-modal associations. Conclusion: The framework confirms that these circuits are causal and controllable and lays the groundwork for more explainable and reliable VLMs. Abstract: Vision-language models (VLMs) are powerful but remain opaque black boxes. We introduce the first framework for transparent circuit tracing in VLMs to systematically analyze multimodal reasoning. By utilizing transcoders, attribution graphs, and attention-based methods, we uncover how VLMs hierarchically integrate visual and semantic concepts. We reveal that distinct visual feature circuits can handle mathematical reasoning and support cross-modal associations. Validated through feature steering and circuit patching, our framework proves these circuits are causal and controllable, laying the groundwork for more explainable and reliable VLMs.

[32] Large-scale Photorealistic Outdoor 3D Scene Reconstruction from UAV Imagery Using Gaussian Splatting Techniques

Christos Maikos,Georgios Angelidis,Georgios Th. Papadopoulos

Main category: cs.CV

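TL;DR: This paper presents an end-to-end pipeline for real-time 3D reconstruction from UAV video streams, combining RTMP streaming, sensor synchronization, pose estimation, and 3D Gaussian Splatting (3DGS) optimization to cut latency and raise rendering performance for immersive applications such as AR/VR.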

Details Motivation: Integrate the emerging 3D Gaussian Splatting (3DGS) technique into real-time UAV perception systems, filling the gap between end-to-end reconstruction and low-latency visualization. Method: Builds an end-to-end pipeline with RTMP video ingestion, synchronized sensor fusion, camera pose estimation, and online 3DGS optimization, supporting continuous model updates and low-latency interactive visualization. Result: Compared with NeRF-based methods, the approach delivers much higher rendering performance and much lower end-to-end latency, while reconstruction quality stays within 4-7% of high-fidelity offline references. Conclusion: The system performs well in visual fidelity, real-time operation, and scalability, making it suitable for real-time augmented perception from aerial platforms. Abstract: In this study, we present an end-to-end pipeline capable of converting drone-captured video streams into high-fidelity 3D reconstructions with minimal latency. Unmanned aerial vehicles (UAVs) are extensively used in aerial real-time perception applications. Moreover, recent advances in 3D Gaussian Splatting (3DGS) have demonstrated significant potential for real-time neural rendering. However, their integration into end-to-end UAV-based reconstruction and visualization systems remains underexplored. Our goal is to propose an efficient architecture that combines live video acquisition via RTMP streaming, synchronized sensor fusion, camera pose estimation, and 3DGS optimization, achieving continuous model updates and low-latency deployment within interactive visualization environments that support immersive augmented and virtual reality (AR/VR) applications. Experimental results demonstrate that the proposed method achieves competitive visual fidelity, while delivering significantly higher rendering performance and substantially reduced end-to-end latency, compared to NeRF-based approaches. Reconstruction quality remains within 4-7% of high-fidelity offline references, confirming the suitability of the proposed system for real-time, scalable augmented perception from aerial platforms.

[33] BiRQA: Bidirectional Robust Quality Assessment for Images

Aleksandr Gushchin,Dmitriy S. Vatolin,Anastasia Antsiferova

Main category: cs.CV

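TL;DR: BiRQA is an efficient, robust neural model for full-reference image quality assessment (FR IQA) that uses a bidirectional multiscale pyramid with uncertainty-aware attention and adds Anchored Adversarial Training for robustness, reaching state-of-the-art accuracy, speed, and attack resistance.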

Details Motivation: Existing neural image quality metrics are slow at inference and vulnerable to adversarial perturbations; a method that is both efficient and robust is needed. Method: Proposes BiRQA, a bidirectional multiscale pyramid over four fast complementary features; a bottom-up attention module fuses fine-scale cues through an uncertainty-aware gate; a top-down cross-gating block routes semantic context back; Anchored Adversarial Training combines clean anchor samples with a ranking loss to bound pointwise prediction error. Result: Matches or exceeds the state of the art on five public FR IQA benchmarks while running about 3x faster; under white-box attacks on KADID-10k, SROCC rises from 0.30-0.57 to 0.60-0.84. Conclusion: According to the authors, BiRQA is the only FR IQA model that combines competitive accuracy, real-time throughput, and strong adversarial robustness, making it a practical option for deployment. Abstract: Full-Reference image quality assessment (FR IQA) is important for image compression, restoration and generative modeling, yet current neural metrics remain slow and vulnerable to adversarial perturbations. We present BiRQA, a compact FR IQA metric model that processes four fast complementary features within a bidirectional multiscale pyramid. A bottom-up attention module injects fine-scale cues into coarse levels through an uncertainty-aware gate, while a top-down cross-gating block routes semantic context back to high resolution. To enhance robustness, we introduce Anchored Adversarial Training, a theoretically grounded strategy that uses clean "anchor" samples and a ranking loss to bound pointwise prediction error under attacks. On five public FR IQA benchmarks BiRQA outperforms or matches the previous state of the art (SOTA) while running ~3x faster than previous SOTA models. Under unseen white-box attacks it lifts SROCC from 0.30-0.57 to 0.60-0.84 on KADID-10k, demonstrating substantial robustness gains. To our knowledge, BiRQA is the only FR IQA model combining competitive accuracy with real-time throughput and strong adversarial resilience.
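The robustness figures in this abstract are reported as SROCC, the Spearman rank-order correlation coefficient between a metric's predictions and subjective quality scores. As a minimal sketch of the standard definition (Pearson correlation of the ranks, with ties sharing an average rank), the function names below are illustrative and not taken from the paper's code.

```python
import math


def average_ranks(values):
    """1-based ranks of `values`; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def srocc(predicted, subjective):
    """Spearman rank-order correlation between predictions and subjective scores."""
    if len(predicted) != len(subjective) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rp, rs = average_ranks(predicted), average_ranks(subjective)
    n = len(rp)
    mean_p, mean_s = sum(rp) / n, sum(rs) / n
    cov = sum((a - mean_p) * (b - mean_s) for a, b in zip(rp, rs))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_s = sum((b - mean_s) ** 2 for b in rs)
    if var_p == 0 or var_s == 0:
        raise ValueError("correlation is undefined when all ranks are equal")
    return cov / math.sqrt(var_p * var_s)
```

Because SROCC depends only on ordering, an attack lowers it by making the metric rank images differently from human raters, not merely by shifting its scores.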

[34] 3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism

Bhavik Chandna,Kelsey R. Allen

Main category: cs.CV

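TL;DR: This paper proposes 3DSPA, an automated framework for evaluating video realism without a reference video; it fuses 3D point trajectories, depth cues, and DINO semantic features, detects violations of physical laws and motion artifacts, and agrees closely with human judgments.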

Details Motivation: Current evaluation of video realism depends on human annotation or bespoke datasets with limited scope; an automated, general, reference-free method is lacking. Method: Proposes the 3D spatiotemporal point autoencoder (3DSPA), which jointly models 3D point trajectories, depth information, and DINO semantic features in a unified spatiotemporal representation for assessing realism, temporal consistency, and physical plausibility. Result: Across multiple datasets, 3DSPA identifies videos that violate physical laws more reliably, is more sensitive to motion artifacts, and correlates more closely with human ratings of video quality and realism than existing methods. Conclusion: Representations that combine 3D motion trajectories with semantic information give a more robust and physically meaningful basis for benchmarking generative video models and implicitly capture physical rule violations. Abstract: AI video generation is evolving rapidly. For video generators to be useful for applications ranging from robotics to film-making, they must consistently produce realistic videos. However, evaluating the realism of generated videos remains a largely manual process -- requiring human annotation or bespoke evaluation datasets which have restricted scope. Here we develop an automated evaluation framework for video realism which captures both semantics and coherent 3D structure and which does not require access to a reference video. Our method, 3DSPA, is a 3D spatiotemporal point autoencoder which integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation for video evaluation. 3DSPA models how objects move and what is happening in the scene, enabling robust assessments of realism, temporal consistency, and physical plausibility. Experiments show that 3DSPA reliably identifies videos which violate physical laws, is more sensitive to motion artifacts, and aligns more closely with human judgments of video quality and realism across multiple datasets. Our results demonstrate that enriching trajectory-based representations with 3D semantics offers a stronger foundation for benchmarking generative video models, and implicitly captures physical rule violations. The code and pretrained model weights will be available at https://github.com/TheProParadox/3dspa_code.

[35] Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field

Sheyang Tang,Armin Shafiee Sarvestani,Jialu Xu,Xiaoyu Xu,Zhou Wang

Main category: cs.CV

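TL;DR: This paper models a 3D aesthetic field from sparse image inputs: a Gaussian Splatting network distills 2D aesthetic knowledge into 3D space, and a two-stage search efficiently suggests aesthetically pleasing camera viewpoints without the high cost of dense captures or reinforcement learning.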

Details Motivation: Existing aesthetic viewpoint suggestion methods are either limited to single-view adjustments that lack geometric understanding, or depend on dense reconstruction or prebuilt 3D environments with costly reinforcement learning searches, so they struggle to be both efficient and geometry-aware. Method: Builds a 3D aesthetic field by using a feedforward 3D Gaussian Splatting network to distill knowledge from a pretrained 2D aesthetic model into 3D space, then runs a two-stage search that combines coarse viewpoint sampling with gradient-based refinement. Result: Across a variety of scenes, the method suggests viewpoints with better framing and composition than existing approaches, supporting the feasibility of efficient, geometry-aware 3D aesthetic modeling from sparse inputs. Conclusion: The work opens a direction toward 3D-aware aesthetic modeling and offers a lightweight, efficient framework for automatic composition in real scenes. Abstract: The aesthetic quality of a scene depends strongly on camera viewpoint. Existing approaches for aesthetic viewpoint suggestion are either single-view adjustments, predicting limited camera adjustments from a single image without understanding scene geometry, or 3D exploration approaches, which rely on dense captures or prebuilt 3D environments coupled with costly reinforcement learning (RL) searches. In this work, we introduce the notion of 3D aesthetic field that enables geometry-grounded aesthetic reasoning in 3D with sparse captures, allowing efficient viewpoint suggestions in contrast to costly RL searches. We opt to learn this 3D aesthetic field using a feedforward 3D Gaussian Splatting network that distills high-level aesthetic knowledge from a pretrained 2D aesthetic model into 3D space, enabling aesthetic prediction for novel viewpoints from only sparse input views. Building on this field, we propose a two-stage search pipeline that combines coarse viewpoint sampling with gradient-based refinement, efficiently identifying aesthetically appealing viewpoints without dense captures or RL exploration. Extensive experiments show that our method consistently suggests viewpoints with superior framing and composition compared to existing approaches, establishing a new direction toward 3D-aware aesthetic modeling.
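The two-stage search described above (coarse viewpoint sampling, then gradient-based refinement) can be sketched on a toy problem. Everything here is an illustrative assumption: `toy_aesthetic_score` is a hypothetical stand-in for the learned 3D aesthetic field, the viewpoint is reduced to two angles, and the gradient is taken by finite differences rather than by differentiating through a network.

```python
import math

def toy_aesthetic_score(azimuth, elevation):
    """Hypothetical smooth score with a single peak at (1.0, 0.3)."""
    return math.exp(-((azimuth - 1.0) ** 2 + (elevation - 0.3) ** 2))

def numerical_gradient(score_fn, az, el, eps=1e-5):
    """Central finite differences in both viewpoint coordinates."""
    d_az = (score_fn(az + eps, el) - score_fn(az - eps, el)) / (2 * eps)
    d_el = (score_fn(az, el + eps) - score_fn(az, el - eps)) / (2 * eps)
    return d_az, d_el

def two_stage_search(score_fn, n_az=12, n_el=5, steps=200, lr=0.1):
    # Stage 1: score a coarse grid of viewpoints and keep the best one.
    candidates = [
        (2 * math.pi * i / n_az, -0.5 + 1.0 * j / (n_el - 1))
        for i in range(n_az)
        for j in range(n_el)
    ]
    az, el = max(candidates, key=lambda c: score_fn(*c))
    # Stage 2: refine that candidate by gradient ascent on the score.
    for _ in range(steps):
        d_az, d_el = numerical_gradient(score_fn, az, el)
        az += lr * d_az
        el += lr * d_el
    return az, el, score_fn(az, el)

best_az, best_el, best_score = two_stage_search(toy_aesthetic_score)
print(best_az, best_el, best_score)
```

The coarse stage keeps the refinement from starting in a flat or poor region, and the refinement stage recovers precision that the grid spacing cannot provide.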

[36] CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation

Mainak Singha,Sarthak Mehrotra,Paolo Casari,Subhasis Chaudhuri,Elisa Ricci,Biplab Banerjee

Main category: cs.CV

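TL;DR: CLIPoint3D is the first CLIP-based framework for few-shot unsupervised 3D point cloud domain adaptation. It combines a frozen CLIP backbone, knowledge-driven prompt tuning, entropy-guided view sampling, and optimal-transport and uncertainty-aware prototype alignment losses, improving accuracy by 3-16% on PointDA-10 and GraspNetPC-10 while remaining efficient.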

Details Motivation: Vision-language models such as CLIP are fragile under cross-domain shifts on 3D point clouds, especially synthetic-to-real; conventional 3D domain adaptation methods rely on heavy trainable encoders and sacrifice efficiency. Method: Projects point clouds into multiple depth maps and uses a frozen CLIP backbone; introduces knowledge-driven prompt tuning that fuses language priors with geometric cues from a lightweight 3D encoder; applies parameter-efficient fine-tuning to CLIP's encoders; designs an entropy-guided view sampling strategy; and jointly uses an optimal-transport alignment loss and an uncertainty-aware prototype alignment loss. Result: Consistently outperforms CLIP-based and conventional encoder-based baselines on PointDA-10 and GraspNetPC-10, with accuracy gains of 3-16%. Conclusion: CLIPoint3D improves few-shot unsupervised 3D point cloud domain adaptation while keeping CLIP's efficiency, supporting the recipe of a frozen large model, lightweight adaptation, and fused multimodal priors. Abstract: Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines. Codes are available at https://github.com/SarthakM320/CLIPoint3D.
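One plausible reading of "entropy-guided view sampling for selecting confident projections" is to score each projected view by the entropy of its predicted class distribution and keep the lowest-entropy views. The sketch below assumes exactly that; the per-view logits are made up, and the selection rule (top-k by lowest entropy) is an assumption rather than the rule confirmed by the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_confident_views(view_logits, k):
    """Return indices of the k views with the lowest prediction entropy."""
    scored = sorted((entropy(softmax(l)), i) for i, l in enumerate(view_logits))
    return [i for _, i in scored[:k]]

# Hypothetical class logits for four depth-map views of one point cloud.
view_logits = [
    [5.0, 0.1, 0.2],  # confident
    [1.0, 1.0, 1.0],  # maximally uncertain
    [3.0, 2.5, 0.1],  # ambiguous between two classes
    [0.2, 6.0, 0.1],  # most confident
]
selected = select_confident_views(view_logits, k=2)
print(selected)
```

A uniform prediction over three classes has entropy ln 3, the maximum possible, so such a view is always dropped first under this rule.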

[37] SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images

Aayush Dhakal,Subash Khanal,Srikumar Sastry,Jacob Arndt,Philipe Ambrozio Dias,Dalton Lunga,Nathan Jacobs

Main category: cs.CV

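TL;DR: This paper proposes SimLBR, which uses Latent Blending Regularization (LBR) to learn a tight decision boundary around the real image distribution and treats fake images as a sink class. It substantially improves cross-generator generalization and detection robustness, and it introduces reliability-oriented evaluation metrics.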

Details Motivation: Existing AI-generated image detectors overfit to their training data and degrade sharply on hard test sets with strong distribution shifts, so a more robust and generalizable detection paradigm is needed. Method: Proposes SimLBR, built on Latent Blending Regularization (LBR). Rather than explicitly modeling the distribution of fake images, it learns a tight boundary around real images in latent space and treats all fakes as an external sink class; it also introduces risk-adjusted metrics and worst-case estimates to assess reliability. Result: Gains of up to +24.85% accuracy and +69.62% recall on the Chameleon benchmark; training is orders of magnitude faster than existing methods; cross-generator generalization is strong. Conclusion: Modeling the real distribution, paired with reliability-oriented evaluation, is a better path to robust and practical AI-generated image detection, and SimLBR offers a simple, efficient, and scalable instance of it. Abstract: The rapid advancement of generative models has made the detection of AI-generated images a critical challenge for both research and society. Recent works have shown that most state-of-the-art fake image detection methods overfit to their training data and catastrophically fail when evaluated on curated hard test sets with strong distribution shifts. In this work, we argue that it is more principled to learn a tight decision boundary around the real image distribution and treat the fake category as a sink class. To this end, we propose SimLBR, a simple and efficient framework for fake image detection using Latent Blending Regularization (LBR). Our method significantly improves cross-generator generalization, achieving up to +24.85% accuracy and +69.62% recall on the challenging Chameleon benchmark. SimLBR is also highly efficient, training orders of magnitude faster than existing approaches. Furthermore, we emphasize the need for reliability-oriented evaluation in fake image detection, introducing risk-adjusted metrics and worst-case estimates to better assess model robustness. All code and models will be released on HuggingFace and GitHub.

[38] gQIR: Generative Quanta Image Reconstruction

Aryan Garg,Sizhuo Ma,Mohit Gupta

Main category: cs.CV

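TL;DR: This paper adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging. By adding mechanisms that handle Bernoulli photon statistics, it achieves high-quality, perceptually pleasing reconstructions under extremely low light.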

Details Motivation: Single-photon avalanche diode (SPAD) sensors are promising under extremely low light, but their raw quanta frames are sparse, noisy, binary photon detections whose noise statistics defeat both classical restoration methods and modern generative models. Method: Adapts a large text-to-image latent diffusion model to quanta burst imaging, combining latent-space restoration with burst-level spatio-temporal reasoning and adding dedicated mechanisms to model Bernoulli photon statistics. Result: On synthetic benchmarks and new real-world datasets, including the first color SPAD burst dataset and a challenging Deforming (XD) video benchmark, the method substantially outperforms classical and modern learning-based baselines, preserving photometric fidelity and visual quality even under high-speed motion. Conclusion: Large generative priors can be transferred to extreme photon-limited sensing, opening a new path for computational imaging. Abstract: Capturing high-quality images from only a few detected photons is a fundamental challenge in computational imaging. Single-photon avalanche diode (SPAD) sensors promise high-quality imaging in regimes where conventional cameras fail, but raw quanta frames contain only sparse, noisy, binary photon detections. Recovering a coherent image from a burst of such frames requires handling alignment, denoising, and demosaicing (for color) under noise statistics far outside those assumed by standard restoration pipelines or modern generative models. We present an approach that adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging. Our method leverages the structural and semantic priors of internet-scale diffusion models while introducing mechanisms to handle Bernoulli photon statistics. By integrating latent-space restoration with burst-level spatio-temporal reasoning, our approach produces reconstructions that are both photometrically faithful and perceptually pleasing, even under high-speed motion. We evaluate the method on synthetic benchmarks and new real-world datasets, including the first color SPAD burst dataset and a challenging Deforming (XD) video benchmark. Across all settings, the approach substantially improves perceptual quality over classical and modern learning-based baselines, demonstrating the promise of adapting large generative priors to extreme photon-limited sensing. Code at https://github.com/Aryan-Garg/gQIR.
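For background on what "Bernoulli photon statistics" means, here is a sketch of the standard single-pixel quanta image formation model, not gQIR's own mechanism. It assumes Poisson photon arrivals with mean `flux` per frame, a pixel that reports 1 if at least one photon is detected, and no dark counts. Under those assumptions the detection probability is 1 - exp(-flux), and the maximum-likelihood flux estimate from a burst inverts that relation.

```python
import math
import random

def detection_probability(flux):
    """P(pixel reads 1) when photon counts are Poisson with mean `flux`."""
    return 1.0 - math.exp(-flux)

def simulate_binary_frames(flux, n_frames, rng):
    """Draw n_frames independent Bernoulli detections for one pixel."""
    p = detection_probability(flux)
    return [1 if rng.random() < p else 0 for _ in range(n_frames)]

def estimate_flux(frames):
    """Maximum-likelihood flux from binary frames: -ln(1 - mean)."""
    n = len(frames)
    ones = sum(frames)
    if ones == n:
        # Every frame fired: the pixel is saturated and the MLE diverges.
        return float("inf")
    return -math.log(1.0 - ones / n)

rng = random.Random(0)  # fixed seed so the run is repeatable
frames = simulate_binary_frames(flux=0.5, n_frames=20000, rng=rng)
print(estimate_flux(frames))
```

The response is nonlinear: averaging binary frames underestimates brightness at high flux, which is one reason pipelines built for Gaussian or Poisson noise do not transfer directly.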

[39] MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation

Taha Koleilat,Hojat Asgariandehkordi,Omid Nejati Manzari,Berardino Barile,Yiming Xiao,Hassan Rivaz

Main category: cs.CV

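TL;DR: This paper proposes MedCLIPSeg, which adapts CLIP to medical image segmentation. Using probabilistic cross-modal attention and a soft patch-level contrastive loss, it delivers data-efficient, robust, uncertainty-aware, text-guided segmentation.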

Details Motivation: Medical image segmentation faces scarce annotations, ambiguous anatomical features, and domain shifts; the potential of vision-language models such as CLIP for dense, text-guided medical segmentation remains underexplored. Method: Proposes MedCLIPSeg, which uses patch-level CLIP embeddings with probabilistic cross-modal attention for bidirectional image-text interaction and explicit modeling of predictive uncertainty, plus a soft patch-level contrastive loss that strengthens fine-grained semantic learning across diverse text prompts. Result: On 16 datasets spanning five imaging modalities and six organs, MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, and produces interpretable local uncertainty maps. Conclusion: The results support probabilistic vision-language modeling as an effective and promising approach to text-driven medical image segmentation. Abstract: Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision-language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight local reliability of segmentation results. This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation.

[40] SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens

Anindita Ghosh,Vladislav Golyanik,Taku Komura,Philipp Slusallek,Christian Theobalt,Rishabh Dabral

Main category: cs.CV

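TL;DR: This paper proposes SceMoS, which replaces computationally heavy 3D scene data with lightweight 2D scene representations (a bird's-eye-view image and local heightmaps) to decouple semantic intent from physical feasibility. It reaches SOTA on the TRUMANS benchmark with far fewer scene-encoding parameters.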

Details Motivation: Existing text-driven 3D human motion synthesis methods rely on computationally expensive 3D scene data (point clouds, voxels) and struggle to handle high-level semantic planning and low-level contact reasoning together. Method: SceMoS uses a two-stage decoupled design: (1) a text-conditioned autoregressive global motion planner operating on a bird's-eye-view (BEV) image encoded with DINOv2, and (2) a geometry-grounded motion tokenizer, trained as a conditional VQ-VAE on 2D local heightmaps, that embeds surface physics into a discrete vocabulary. Result: SOTA motion realism and contact accuracy on the TRUMANS benchmark, with over 50% fewer trainable parameters for scene encoding. Conclusion: Structured 2D scene representations (BEV semantics plus local heightmaps) can replace full 3D supervision for efficient, physically plausible 3D human-scene interaction modeling. Abstract: Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent ("walk to the couch") and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework that shows that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues and relying on (1) a text-conditioned autoregressive global motion planner that operates on a bird's-eye-view (BEV) image rendered from an elevated corner of the scene, encoded with DINOv2 features, as the scene representation, and (2) a geometry-grounded motion tokenizer trained via a conditional VQ-VAE, that uses 2D local scene heightmap, thus embedding surface physics directly into a discrete vocabulary. This 2D factorization reaches an efficiency-fidelity trade-off: BEV semantics capture spatial layout and affordance for global reasoning, while local heightmaps enforce fine-grained physical adherence without full 3D volumetric reasoning. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark, reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction.

[41] Path-Decoupled Hyperbolic Flow Matching for Few-Shot Adaptation

Lin Li,Ziqi Jiang,Gefan Ye,Zhenqi He,Jiahui Li,Jun Xiao,Kwang-Ting Cheng,Long Chen

Main category: cs.CV

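TL;DR: This paper proposes Hyperbolic Flow Matching (HFM) for cross-modal few-shot adaptation. It uses the exponential expansion of the Lorentz manifold to resolve the path entanglement of Euclidean flow matching, through centripetal hyperbolic alignment, a path-decoupled objective, and an adaptive diameter-based stopping rule.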

Details Motivation: Flow matching in Euclidean space is limited by flat geometry: polynomial volume growth cannot accommodate diverse feature distributions, which causes severe path entanglement. Method: Proposes path-decoupled Hyperbolic Flow Matching (HFM), comprising (1) centripetal hyperbolic alignment, which builds a hierarchy anchored at textual roots and pushes visual features toward the boundary to initialize orderly flows, and (2) a path-decoupled objective that acts as a "semantic guardrail," confining trajectories to class-specific geodesic corridors through step-wise supervision; an adaptive diameter-based stopping rule prevents over-transportation. Result: Extensive ablations on 11 benchmarks show that HFM consistently outperforms its Euclidean counterparts and sets a new SOTA. Conclusion: Hyperbolic geometry is a better-suited representation space for cross-modal few-shot adaptation; HFM mitigates path entanglement and improves generalization through structured flows and geometry-aware constraints. Abstract: Recent advances in cross-modal few-shot adaptation treat visual-semantic alignment as a continuous feature transport problem via Flow Matching (FM). However, we argue that Euclidean-based FM overlooks fundamental limitations of flat geometry, where polynomial volume growth fails to accommodate diverse feature distributions, leading to severe path entanglement. To this end, we propose path-decoupled Hyperbolic Flow Matching (HFM), leveraging the Lorentz manifold's exponential expansion for trajectory decoupling. HFM structures the transport via two key designs: 1) Centripetal hyperbolic alignment: It constructs a centripetal hierarchy by anchoring textual roots, which pushes visual leaves to the boundary to initialize orderly flows. 2) Path-decoupled objective: It acts as a "semantic guardrail" rigidly confining trajectories within isolated class-specific geodesic corridors via step-wise supervision. Furthermore, we devise an adaptive diameter-based stopping to prevent over-transportation into the crowded origin based on the intrinsic semantic scale. Extensive ablations on 11 benchmarks have shown that HFM establishes a new state-of-the-art, consistently outperforming its Euclidean counterparts. Our codes and models will be released.
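The Lorentz manifold mentioned above has a few standard building blocks that can be written down directly. The sketch below assumes curvature -1 and covers only the textbook definitions (Lorentzian inner product, exponential map at the origin, geodesic distance); it does not reproduce HFM's alignment or objective, and the function names are illustrative.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map_origin(v_spatial):
    """Map a tangent vector at the origin o = (1, 0, ..., 0) onto the hyperboloid.

    The tangent vector is given by its spatial part; its time component is 0.
    """
    norm = math.sqrt(sum(c * c for c in v_spatial))
    if norm == 0.0:
        return [1.0] + [0.0] * len(v_spatial)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * c for c in v_spatial]

def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L); clamp guards against rounding below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = [1.0, 0.0, 0.0]
point = exp_map_origin([0.3, -0.4])   # tangent vector of Euclidean norm 0.5
print(lorentz_inner(point, point))    # close to -1: the point lies on the manifold
print(lorentz_distance(origin, point))  # close to 0.5
```

The time coordinate grows as cosh of the tangent norm, so points a fixed geodesic distance apart spread out exponentially with radius. That is the "exponential expansion" the abstract contrasts with polynomial volume growth in flat space.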

[42] Pip-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching

Jintu Zheng,Qizhe Liu,HuangXin Xu,Zhuojie Chen

Main category: cs.CV

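TL;DR: This paper proposes PipStereo, which achieves accurate real-time stereo matching on edge devices through progressive iteration pruning, a collaborative monocular prior transfer framework, and a hardware-aware FlashGRU operator.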

Details Motivation: Existing RNN-based iterative stereo matching methods are hard to deploy on edge devices, and the spatial sparsity and temporal redundancy of their iterations have not been fully exploited. Method: (1) A progressive iteration pruning strategy that removes redundant update steps; (2) a collaborative monocular prior transfer framework that implicitly embeds depth priors without an extra encoder; (3) FlashGRU, a hardware-aware operator that uses structured sparsity and I/O-conscious design. Result: PipStereo processes 320x640 frames in 75 ms on a Jetson Orin NX (FP16) and 19 ms on an RTX 4090, matching the accuracy of large iterative models and clearly exceeding existing real-time methods; compared with native ConvGRU, FlashGRU is 7.28x faster, cuts peak memory by 76.6%, and reduces global memory requests by 80.9%. Conclusion: PipStereo keeps high accuracy while greatly improving the efficiency and practicality of stereo matching on edge hardware. Abstract: While iterative stereo matching achieves high accuracy, its dependence on Recurrent Neural Networks (RNN) hinders edge deployment, a challenge underexplored in existing research. We analyze iterative refinement and reveal that disparity updates are spatially sparse and temporally redundant. First, we introduce a progressive iteration pruning strategy that suppresses redundant update steps, effectively collapsing the recursive computation into a near-single-pass inference. Second, we propose a collaborative monocular prior transfer framework that implicitly embeds depth priors without requiring a dedicated monocular encoder, thereby eliminating its associated computational burden. Third, we develop FlashGRU, a hardware-aware RNN operator leveraging structured sparsity and I/O-conscious design, achieving a 7.28x speedup, 76.6% memory peak reduction and 80.9% global memory requests reduction over native ConvGRUs under 2K resolution. Our PipStereo enables real-time, high-fidelity stereo matching on edge hardware: it processes 320x640 frames in just 75ms on an NVIDIA Jetson Orin NX (FP16) and 19ms on RTX 4090, matching the accuracy of large iterative models, and its generalization ability and accuracy far exceed those of existing real-time methods. Our embedded AI projects will be updated at: https://github.com/XPENG-Aridge-AI.

[43] LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

Peiliang Cai,Jiacheng Liu,Haowen Xu,Xinyu Wang,Chang Zou,Linfeng Zhang

Main category: cs.CV

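TL;DR: This paper proposes LESA, a learnable, stage-aware feature prediction framework for accelerating Diffusion Transformers (DiTs). It uses two-stage training and a Kolmogorov-Arnold Network (KAN) to model temporal feature mappings, plus a multi-stage, multi-expert structure that improves prediction accuracy across noise levels, yielding substantial speedups and quality gains on several models.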

Details Motivation: Diffusion Transformers (DiTs) are computationally expensive, and existing feature caching methods based on simple reuse or training-free forecasting cannot track the complex, stage-dependent dynamics of the diffusion process, which degrades generation quality and breaks consistency with standard denoising. Method: Proposes the LEarnable Stage-Aware (LESA) predictor framework: (1) a Kolmogorov-Arnold Network (KAN) models temporal feature mappings; (2) a two-stage training strategy; (3) a multi-stage, multi-expert architecture that assigns dedicated predictors to different noise-level stages. Result: 5.00x acceleration on FLUX.1-dev with only a 1.0% quality drop; 6.25x on Qwen-Image with a 20.2% quality improvement over TaylorSeer; 5.00x on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer; SOTA on both text-to-image and text-to-video tasks. Conclusion: LESA is a general, efficient DiT acceleration framework whose trained, stage-aware prediction outperforms existing training-free or heuristic methods and generalizes across models. Abstract: Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-fidelity generation. Experiments demonstrate 5.00x acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is included in the supplementary materials and will be released on GitHub.
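The control flow of feature caching with stage-specific predictors can be illustrated with a toy loop. This is a heavily simplified sketch under stated assumptions: the expensive network is replaced by a hypothetical `full_compute` callable, each "expert" is a one-line toy function rather than a trained KAN, stages are equal-width slices of the step range, and a full computation happens at a fixed interval. None of these choices is confirmed by the paper.

```python
def stage_index(step, total_steps, num_stages):
    """Map a denoising step (0 = noisiest) to one of num_stages equal slices."""
    return min(step * num_stages // total_steps, num_stages - 1)

def run_with_cache(total_steps, interval, num_stages, full_compute, experts):
    """Run full computation every `interval` steps; forecast features otherwise."""
    feature = None
    trace = []
    for step in range(total_steps):
        if step % interval == 0:
            # Refresh the cached feature with the expensive computation.
            feature = full_compute(step)
            trace.append(("full", step))
        else:
            # Route to the predictor that owns this noise-level stage.
            expert = experts[stage_index(step, total_steps, num_stages)]
            feature = expert(feature, step)
            trace.append(("predicted", step))
    return feature, trace

# Toy stand-ins: three hypothetical stage experts with different update rules.
toy_experts = [
    lambda f, s: f * 0.9,
    lambda f, s: f * 0.95,
    lambda f, s: f * 0.99,
]
final_feature, trace = run_with_cache(
    total_steps=10, interval=5, num_stages=3,
    full_compute=lambda step: float(step + 1), experts=toy_experts,
)
full_steps = sum(1 for kind, _ in trace if kind == "full")
print(full_steps, len(trace) / full_steps)
```

The ratio of total steps to fully computed steps is only a rough proxy for speedup, since predictor cost and uncached layers also contribute; the reported figures come from the authors' own measurements.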

[44] Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models

Qing Zhang,Xuesong Li,Jing Zhang

Main category: cs.CV

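TL;DR: This paper argues that a visual system needs both geometric perception and interaction perception to truly understand affordance. Fusing DINO's geometric structure with Flux's interaction attention maps, zero-shot and without training, yields competitive affordance estimation.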

Details Motivation: To examine what it takes for a visual system to truly understand affordance, hypothesizing that it depends on two complementary capacities: geometric perception and interaction perception. Method: Systematically probes Visual Foundation Models (VFMs), extracting part-level geometric structure from DINO and verb-conditioned spatial attention maps from Flux, and fuses them zero-shot without training. Result: DINO inherently encodes geometric structure and Flux contains implicit interaction priors; the two are composable, and their zero-shot fusion gives affordance estimates competitive with weakly supervised methods. Conclusion: Geometric and interaction perception are the basic building blocks of affordance understanding in VFMs, offering a mechanistic account of how perception grounds action. Abstract: What does it mean for a visual system to truly understand affordance? We argue that this understanding hinges on two complementary capacities: geometric perception, which identifies the structural parts of objects that enable interaction, and interaction perception, which models how an agent's actions engage with those parts. To test this hypothesis, we conduct a systematic probing of Visual Foundation Models (VFMs). We find that models like DINO inherently encode part-level geometric structures, while generative models like Flux contain rich, verb-conditioned spatial attention maps that serve as implicit interaction priors. Crucially, we demonstrate that these two dimensions are not merely correlated but are composable elements of affordance. By simply fusing DINO's geometric prototypes with Flux's interaction maps in a training-free and zero-shot manner, we achieve affordance estimation competitive with weakly-supervised methods. This final fusion experiment confirms that geometric and interaction perception are the fundamental building blocks of affordance understanding in VFMs, providing a mechanistic account of how perception grounds action.

[45] Leveraging Causal Reasoning Method for Explaining Medical Image Segmentation Models

Limai Jiang,Ruitao Xie,Bokai Yang,Huazhen Huang,Juan He,Yufu Huo,Zikai Wang,Yang Wei,Yunpeng Cai

Main category: cs.CV

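TL;DR: This paper proposes a causal-inference-based explanation method for medical image segmentation. It backpropagates the average treatment effect (ATE) to quantify how input regions and network components influence segmentation results, and it validates the faithfulness and insight of the explanations on multiple models and datasets.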

Details Motivation: Most deep segmentation models are black boxes, and current explanation techniques mainly target classification, leaving segmentation explainability comparatively underexplored. Method: Builds a causal-reasoning-based explanation model for segmentation that backpropagates the average treatment effect (ATE) into a quantification metric, measuring the influence of input regions and network components on target segmentation areas. Result: On two representative medical imaging datasets, the method gives more faithful explanations than existing segmentation explainability methods; the causal analysis reveals significant heterogeneity in perceptual strategies across segmentation models and even across inputs to the same model. Conclusion: The method improves the explainability of segmentation models and can offer useful insights for model optimization, with potential to raise clinical trust. Abstract: Medical image segmentation plays a vital role in clinical decision-making, enabling precise localization of lesions and guiding interventions. Despite significant advances in segmentation accuracy, the black-box nature of most deep models has raised growing concerns about their trustworthiness in high-stakes medical scenarios. Current explanation techniques have primarily focused on classification tasks, leaving the segmentation domain relatively underexplored. We introduce an explanation model for segmentation tasks which employs the causal inference framework and backpropagates the average treatment effect (ATE) into a quantification metric to determine the influence of input regions, as well as network components, on target segmentation areas. Through comparison with recent segmentation explainability techniques on two representative medical imaging datasets, we demonstrate that our approach provides more faithful explanations than existing approaches. Furthermore, we carry out a systematic causal analysis of multiple foundational segmentation models using our method, which reveals significant heterogeneity in perceptual strategies across different models, and even between different inputs for the same model, suggesting the potential of our method to provide notable insights for optimizing segmentation models. Our code can be found at https://github.com/lcmmai/PdCR.
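The average treatment effect at the core of this method has a simple definition: the mean difference in an outcome between treated and untreated units. A toy sketch follows, assuming the "treatment" is masking an input region and the "outcome" is a scalar score for the target segmentation area. The scoring function and images are hypothetical, and the paper's backpropagated ATE is a more elaborate estimator than this direct intervention.

```python
def mask_region(image, region):
    """Return a copy of a flat image with the pixels in `region` set to zero."""
    return [0.0 if i in region else v for i, v in enumerate(image)]

def average_treatment_effect(model_score, images, region):
    """Mean change in the model's target score when `region` is masked out."""
    effects = [
        model_score(mask_region(img, region)) - model_score(img)
        for img in images
    ]
    return sum(effects) / len(effects)

def toy_target_score(image):
    """Hypothetical segmentation score that depends only on pixels 0 and 1."""
    return (image[0] + image[1]) / 2

images = [
    [1.0, 0.5, 0.2, 0.9],
    [0.8, 0.6, 0.1, 0.3],
]
ate_relevant = average_treatment_effect(toy_target_score, images, region={0, 1})
ate_irrelevant = average_treatment_effect(toy_target_score, images, region={2, 3})
print(ate_relevant, ate_irrelevant)
```

A region whose masking leaves the score unchanged gets an ATE of zero, so the magnitude of the ATE acts as an importance score for that region.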

[46] Communication-Inspired Tokenization for Structured Image Representations

Aram Davtyan,Yusuf Sahin,Yasaman Haghighi,Sebastian Stapf,Pablo Acuaviva,Alexandre Alahi,Paolo Favaro

Main category: cs.CV

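TL;DR: This paper proposes COMiT, a framework inspired by human communication that iteratively observes local image crops and recurrently updates a discrete representation. It learns structured visual token sequences with better object-level semantic structure and compositional generalization.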

Details Motivation: Existing discrete image tokenizers are optimized mainly for reconstruction and compression, so their tokens favor local texture over object-level semantic structure; the incremental, compositional nature of human communication suggests a way to build more semantic, structured visual tokens. Method: Proposes COMmunication inspired Tokenization (COMiT): within a fixed token budget, the model iteratively observes local image crops and recurrently updates its discrete representation, integrating new visual information and reorganizing the existing token sequence at each step; the final message conditions a flow-matching decoder that reconstructs the image. Encoding and decoding share a single transformer, trained end-to-end with flow-matching reconstruction and semantic representation alignment losses. Result: Experiments show that semantic alignment provides grounding, while attentive sequential tokenization is critical for interpretable, object-centric token structure and substantially improves compositional generalization and relational reasoning. Conclusion: By mimicking human communication, COMiT learns better structured visual tokens and improves on prior methods in semantics, interpretability, and downstream reasoning. Abstract: Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow-matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end-to-end using a combination of flow-matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.

[47] TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering

Hanshen Zhu,Yuliang Liu,Xuecheng Wu,An-Lan Wang,Hao Feng,Dingkang Yang,Chao Feng,Can Huang,Jingqun Tang,Xiang Bai

Main category: cs.CV

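TL;DR: This paper proposes TextPecker. By building a dataset with character-level structural-anomaly annotations and a stroke-editing synthesis engine, it improves how well multimodal LLMs and OCR models perceive structural anomalies in text, enabling more reliable visual text rendering in text-to-image generation.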

Details Motivation: Text-to-image models often produce structural anomalies (distortion, blurriness, misalignment) in visual text rendering (VTR), yet mainstream MLLMs and OCR models fail to recognize them reliably, which obstructs both evaluation and RL-based optimization. Method: Proposes TextPecker, a plug-and-play RL strategy that perceives structural anomalies; builds a dataset with character-level structural-anomaly annotations; and develops a stroke-editing synthesis engine to broaden structural-error coverage. Result: TextPecker improves a range of text-to-image models; on Qwen-Image it raises Chinese text structural fidelity by 4% and semantic alignment by 8.7% on average, a new SOTA for high-fidelity VTR. Conclusion: TextPecker fills a key gap in VTR optimization and offers a foundation for reliable, structurally faithful visual text generation. Abstract: Visual Text Rendering (VTR) remains a critical challenge in text-to-image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and RL-based optimization. As a result, even state-of-the-art generators (e.g., SeedDream4.0, Qwen-Image) still struggle to render structurally faithful text. To address this, we propose TextPecker, a plug-and-play structural anomaly perceptive RL strategy that mitigates noisy reward signals and works with any text-to-image generator. To enable this capability, we construct a recognition dataset with character-level structural-anomaly annotations and develop a stroke-editing synthesis engine to expand structural-error coverage. Experiments show that TextPecker consistently improves diverse text-to-image models; even on the well-optimized Qwen-Image, it yields significant average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing a new state-of-the-art in high-fidelity VTR. Our work fills a gap in VTR optimization, providing a foundational step towards reliable and structurally faithful visual text generation.

[48] Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Christian Simon,Masato Ishii,Wei-Yao Wang,Koichi Saito,Akio Hayakawa,Dongseok Shim,Zhi Zhong,Shuyang Cui,Shusuke Takahashi,Takashi Shibuya,Yuki Mitsufuji

Main category: cs.CV

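TL;DR: This paper proposes MMHNet, a multimodal hierarchical network that combines a hierarchical method with non-causal Mamba. Trained on short videos, it generalizes to long audio generation (more than 5 minutes) and clearly outperforms existing video-to-audio methods.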

Details Motivation: Multimodal alignment suffers from limited data and a mismatch between text descriptions and frame-level video information; in particular, video-to-audio models struggle to scale to long-form audio generation. Method: Proposes MMHNet, a multimodal hierarchical network that extends current video-to-audio models by integrating hierarchical modeling with a non-causal Mamba architecture for long-form generation, and tests whether training on short clips transfers to long ones. Result: Clearly outperforms prior methods on long-video-to-audio benchmarks and generates audio longer than 5 minutes, which earlier methods cannot do. Conclusion: In video-to-audio generation, training only on short samples can generalize to long-form audio, and MMHNet is an effective approach to long-form multimodal generation. Abstract: Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To tackle this challenge, we present multimodal hierarchical networks, called MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. Our proposed method significantly improves long audio generation, up to more than 5 minutes. We also show that training short and testing long is possible in video-to-audio generation tasks without training on the longer durations. We show in our experiments that our proposed method achieves remarkable results on long-video-to-audio benchmarks, beating prior works in video-to-audio tasks. Moreover, we showcase our model's capability in generating more than 5 minutes of audio, while prior video-to-audio methods fall short at long durations.

[49] From Perception to Action: An Interactive Benchmark for Vision Reasoning

Yuhao Wu,Maojia Song,Yihuai Lan,Lei Wang,Zhiqiang Hu,Yao Xiao,Heng Zhou,Weihua Zheng,Dylan Raharja,Soujanya Poria,Roy Ka-Wei Lee

Main category: cs.CV

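TL;DR: This paper introduces the CHAIN benchmark for evaluating how well vision-language models understand physical structure, plan, and execute. It highlights the weakness of current models in handling geometry, contact, and support relations in dynamic environments.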

Details Motivation: Current vision-language model evaluations focus on structure-agnostic, single-turn settings and cannot assess reasoning and action planning under physical constraints in dynamic environments. Method: Introduces CHAIN, an interactive 3D, physics-driven benchmark covering tasks such as mechanical puzzles and 3D stacking and packing, and studies state-of-the-art vision-language models and diffusion-based models under a unified interactive setting. Result: Top-performing models still struggle to internalize physical structure and causal constraints, and they perform poorly at long-horizon planning and at turning perceived structure into effective actions. Conclusion: CHAIN exposes significant limitations of current models in physical-world understanding and interaction, and gives future work a new evaluation target. Abstract: Understanding the physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans or to robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.

[50] OmniOCR: Generalist OCR for Ethnic Minority Languages

Bonan Liu,Zeyu Zhang,Bingbing Meng,Han Wang,Hanshuo Zhang,Chengping Wang,Daji Ergu,Ying Cai

Main category: cs.CV

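TL;DR: This paper proposes OmniOCR, which uses Dynamic Low-Rank Adaptation (Dynamic LoRA) and sparsity regularization to address poor generalization of ethnic minority script OCR in low-resource and zero-shot settings. It substantially improves accuracy on several minority-script datasets while staying parameter-efficient.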

Details Motivation: OCR for ethnic minority scripts faces complex writing systems, scarce annotations, and diverse historical and modern forms, so existing methods generalize poorly in low-resource or zero-shot settings. Method: Proposes OmniOCR, a universal framework built on Dynamic LoRA, which allocates model capacity dynamically across layers and scripts, and sparsity regularization, which prunes redundant updates, giving efficient, compact adaptation without extra inference cost. Result: On TibetanMNIST, Shui, ancient Yi, and Dongba, OmniOCR clearly outperforms zero-shot foundation models and standard post-training, improving accuracy over SOTA baselines by 39%-66% with better parameter efficiency. Conclusion: OmniOCR is an efficient, scalable, general solution for minority-script OCR, and it supports dynamic adaptation with sparse optimization in low-resource, multi-script settings. Abstract: Optical character recognition (OCR) has advanced rapidly with deep learning and multimodal models, yet most methods focus on well-resourced scripts such as Latin and Chinese. Ethnic minority languages remain underexplored due to complex writing systems, scarce annotations, and diverse historical and modern forms, making generalization in low-resource or zero-shot settings challenging. To address these challenges, we present OmniOCR, a universal framework for ethnic minority scripts. OmniOCR introduces Dynamic Low-Rank Adaptation (Dynamic LoRA) to allocate model capacity across layers and scripts, enabling effective adaptation while preserving knowledge. A sparsity regularization prunes redundant updates, ensuring compact and efficient adaptation without extra inference cost. Evaluations on TibetanMNIST, Shui, ancient Yi, and Dongba show that OmniOCR outperforms zero-shot foundation models and standard post-training, achieving state-of-the-art accuracy with superior parameter efficiency, and compared with the state-of-the-art baseline models, it improves accuracy by 39%-66% on these four datasets. Code: https://github.com/AIGeeksGroup/OmniOCR.
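The low-rank update that LoRA adds to a frozen weight matrix is delta_W = (alpha / r) * B A, with B of shape d x r and A of shape r x k. The sketch below shows that standard update plus one possible way a sparsity term could prune capacity: a per-rank gate vector with an L1 penalty. The gates and the penalty form are assumptions made for illustration; the abstract does not specify how Dynamic LoRA or its regularizer are parameterized.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_delta(B, A, gates, alpha):
    """Compute (alpha / r) * B @ diag(gates) @ A; a zero gate disables one rank."""
    r = len(A)
    gated_A = [[gates[t] * val for val in A[t]] for t in range(r)]
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, gated_A)]

def l1_penalty(gates, lam):
    """L1 regularizer that pushes gates toward exactly zero (hypothetical form)."""
    return lam * sum(abs(g) for g in gates)

# Toy shapes: d = 3 output rows, k = 2 input columns, rank r = 2.
B = [[1, 0], [0, 1], [1, 1]]
A = [[1, 2], [3, 4]]
gates = [1.0, 0.0]  # the second rank component has been pruned
delta = lora_delta(B, A, gates, alpha=2)
print(delta)
print(l1_penalty(gates, lam=0.1))
```

After training, the update can be merged into the frozen weight (W + delta_W), which is the usual reason LoRA-style adapters add no inference cost, consistent with the "without extra inference cost" claim above.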

[51] OCR-Agent: Agentic OCR with Capability and Memory Reflection

Shimin Wen,Zeyu Zhang,Xingdou Bian,Hongjie Zhu,Lulu He,Layi Shama,Daji Ergu,Ying Cai

Main category: cs.CV

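TL;DR: This paper proposes OCR-Agent, an iterative self-correction framework that uses Capability Reflection and Memory Reflection to make large vision-language models (VLMs) more robust across multi-turn revisions. Without additional training, it surpasses the existing SOTA on OCRBench v2.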

Details Motivation: Large vision-language models lack effective self-correction mechanisms and tend to fall into repetitive, ineffective multi-turn revisions that do not reliably improve answer quality. Method: Proposes an iterative self-correction framework with Capability Reflection (diagnose errors and generate a correction plan) and Memory Reflection (review past attempts to avoid repetition and explore new solutions), followed by rigorous re-reasoning to refine the answer. Result: On OCRBench v2, OCR-Agent beats InternVL3-8B by 2.0 points on the English subset and 1.2 on the Chinese subset, and reaches SOTA scores of 79.9 in Visual Understanding and 66.5 in Reasoning. Conclusion: Structured, self-aware reflection can significantly strengthen VLM reasoning robustness without additional training. Abstract: Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer quality. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally, optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on English and +1.2 on Chinese subsets, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5) - surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs' reasoning robustness without additional training. Code: https://github.com/AIGeeksGroup/OCR-Agent.
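The diagnose, remember, and re-reason loop described above can be sketched as plain control flow. All three callables are hypothetical placeholders: `answer_fn` stands in for the VLM, `reflect_fn` for Capability Reflection, and `verify_fn` for whatever stopping check the system uses, which the abstract does not specify. The `memory` list plays the role of Memory Reflection by letting later attempts see and avoid earlier ones.

```python
def self_correct(question, answer_fn, verify_fn, reflect_fn, max_rounds=3):
    """Iteratively revise an answer, remembering past attempts to avoid repeats."""
    memory = []
    answer = answer_fn(question, plan=None, memory=memory)
    for _ in range(max_rounds):
        if verify_fn(question, answer):
            break
        memory.append(answer)                # memory reflection: record the failed attempt
        plan = reflect_fn(question, answer)  # capability reflection: diagnose and plan
        answer = answer_fn(question, plan=plan, memory=memory)
    return answer, memory

# Toy stand-ins so the loop can run without a real model.
CANDIDATES = ["24", "4 2", "42"]

def toy_answer(question, plan, memory):
    """Return the first candidate that has not been tried before."""
    for candidate in CANDIDATES:
        if candidate not in memory:
            return candidate
    return CANDIDATES[-1]

def toy_verify(question, answer):
    return answer == "42"

def toy_reflect(question, answer):
    return f"'{answer}' looks wrong; re-read the digits in order"

final_answer, attempts = self_correct(
    "What number is printed on the sign?", toy_answer, toy_verify, toy_reflect
)
print(final_answer, attempts)
```

Passing the attempt history back into `answer_fn` is what prevents the repetitive revisions the abstract describes; without it, the toy model would return "24" on every round.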