cs.CL

[1] Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset

Leroy Z. Wang

Main category: cs.CL

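TL;DR: Introduces a dataset of in-context concept learning tasks for uncovering implicit biases in large language models; the models show a preference for upward monotonicity in quantifiers, a bias that is less apparent under direct prompting.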

Details

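Motivation: To uncover implicit biases in large language models, especially biases that conventional methods struggle to detect. Method: Builds a dataset of concept learning tasks, uses in-context concept learning experiments to probe models' monotonicity preferences for quantifiers, and compares the results against direct prompting. Result: Large language models tend to show a bias toward upward monotonicity, and this bias is less apparent under direct prompting without concept learning components. Conclusion: In-context concept learning is an effective way to discover hidden biases in language models. Abstract: We introduce a dataset of concept learning tasks that helps uncover implicit biases in large language models. Using in-context concept learning experiments, we found that language models may have a bias toward upward monotonicity in quantifiers; such bias is less apparent when the model is tested by direct prompting without concept learning components. This demonstrates that in-context concept learning can be an effective way to discover hidden biases in language models.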
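To make the term concrete, here is a minimal sketch of what upward monotonicity means for a quantifier Q(A, B): if Q(A, B) holds and B is a subset of B', then Q(A, B') must also hold. The brute-force check, the tiny universe, and the four example quantifiers are illustrative assumptions and are not taken from the paper's dataset.

```python
from itertools import combinations


def subsets(universe):
    """Yield every subset of the universe as a frozenset."""
    items = list(universe)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_upward_monotone(quantifier, universe):
    """Check: Q(A, B) and B <= B' implies Q(A, B'), for all sets over the universe."""
    all_sets = list(subsets(universe))
    for a in all_sets:
        for b in all_sets:
            if not quantifier(a, b):
                continue
            for b_bigger in all_sets:
                if b <= b_bigger and not quantifier(a, b_bigger):
                    return False
    return True


# Hypothetical example quantifiers, read as "Q of the A are B".
QUANTIFIERS = {
    "some": lambda a, b: len(a & b) > 0,
    "all": lambda a, b: a <= b,
    "no": lambda a, b: len(a & b) == 0,
    "at most one": lambda a, b: len(a & b) <= 1,
}

UNIVERSE = frozenset(range(4))
RESULTS = {name: is_upward_monotone(q, UNIVERSE) for name, q in QUANTIFIERS.items()}
```

Under these definitions "some" and "all" pass the check, while "no" and "at most one" fail it.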

[2] Towards Open-Ended Discovery for Low-Resource NLP

Bonaventure F. P. Dossou,Henri Aïdasso

Main category: cs.CL

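TL;DR: Argues that low-resource languages should be learned and discovered dynamically through interactive human-machine dialogue rather than static datasets, and proposes a framework grounded in joint uncertainty to shift language technology from data extraction toward participatory, co-adaptive learning.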

Details

Motivation: NLP for low-resource languages is fundamentally constrained by the lack of textual corpora, standardized orthographies, and scalable annotation pipelines, and existing large models rely on massive centralized data, which makes them hard to extend to marginalized communities. Method: Proposes an interactive language discovery framework grounded in joint human-machine uncertainty, combining the model's epistemic uncertainty with hesitation cues and confidence signals from human speakers to guide interaction, query selection, and memory retention. Result: Outlines a framework for interactive AI systems that support dynamic language learning, progressively discovering and documenting features of low-resource languages through dialogue. Conclusion: Future language technology should move beyond static data collection toward human-centered, interactive, co-adaptive learning that respects and empowers language communities and helps preserve global linguistic diversity. Abstract: Natural Language Processing (NLP) for low-resource languages remains fundamentally constrained by the lack of textual corpora, standardized orthographies, and scalable annotation pipelines. While recent advances in large language models have improved cross-lingual transfer, they remain inaccessible to underrepresented communities due to their reliance on massive, pre-collected data and centralized infrastructure. In this position paper, we argue for a paradigm shift toward open-ended, interactive language discovery, where AI systems learn new languages dynamically through dialogue rather than static datasets. We contend that the future of language technology, particularly for low-resource and under-documented languages, must move beyond static data collection pipelines toward interactive, uncertainty-driven discovery, where learning emerges dynamically from human-machine collaboration instead of being limited to pre-existing datasets. We propose a framework grounded in joint human-machine uncertainty, combining epistemic uncertainty from the model with hesitation cues and confidence signals from human speakers to guide interaction, query selection, and memory retention. This paper is a call to action: we advocate a rethinking of how AI engages with human knowledge in under-documented languages, moving from extractive data collection toward participatory, co-adaptive learning processes that respect and empower communities while discovering and preserving the world's linguistic diversity. This vision aligns with principles of human-centered AI, emphasizing interactive, cooperative model building between AI systems and speakers.

[3] Discourse vs emissions: Analysis of corporate narratives, symbolic practices, and mimicry through LLMs

Bertrand Kian Hassani,Yacoub Bahini,Rizwan Mushtaq

Main category: cs.CL

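TL;DR: Uses fine-tuned large language models to build a multidimensional framework that assesses the climate disclosure maturity of 828 U.S.-listed firms; it finds commitments decoupled from targets and widespread mimicry, indicating that stronger regulation is needed to improve the comparability and decision value of disclosures.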

Details

Motivation: Demand for corporate climate disclosure is rising as firms respond to climate change, but imitation and symbolic reporting are widespread, leaving transparency and comparability lacking. Method: Uses fine-tuned large language models to build four classifiers (sentiment, commitment, specificity, and target ambition) that extract narrative indicators from sustainability and annual reports, then relates these indicators to firm attributes such as emissions, market capitalization, and sector. Result: (1) Risk-focused narratives often align with explicit commitments, but quantitative targets are decoupled from tone; (2) larger and higher-emitting firms disclose more commitments and actions, though inconsistently with quantitative targets; (3) disclosure styles are highly similar, indicating widespread mimicry that reduces differentiation and decision usefulness. Conclusion: LLMs are useful for ESG narrative analysis, but stronger regulation is needed to tie corporate commitments to verifiable transition pathways and improve disclosure quality. Abstract: Climate change has increased demands for transparent and comparable corporate climate disclosures, yet imitation and symbolic reporting often undermine their value. This paper develops a multidimensional framework to assess disclosure maturity among 828 U.S.-listed firms using large language models (LLMs) fine-tuned for climate communication. Four classifiers (sentiment, commitment, specificity, and target ambition) extract narrative indicators from sustainability and annual reports, which are linked to firm attributes such as emissions, market capitalization, and sector. Analyses reveal three insights: (1) risk-focused narratives often align with explicit commitments, but quantitative targets (e.g., net-zero pledges) remain decoupled from tone; (2) larger and higher-emitting firms disclose more commitments and actions than peers, though inconsistently with quantitative targets; and (3) widespread similarity in disclosure styles suggests mimetic behavior, reducing differentiation and decision usefulness. These results highlight the value of LLMs for ESG narrative analysis and the need for stronger regulation to connect commitments with verifiable transition strategies.

[4] Context Matters: Comparison of commercial large language tools in veterinary medicine

Tyler J Poore,Christopher J Pinard,Aleena Shabbir,Andrew Lagree,Andre Telfer,Kuan-Chuen Wu

Main category: cs.CL

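TL;DR: Evaluates three commercial veterinary-focused LLM summarization tools on veterinary oncology records; Product 1 (Hachiko), built specifically for veterinary use, significantly outperforms the others on accuracy, completeness, and other dimensions, and the LLM-as-a-judge evaluation shows high reproducibility.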

Details

Motivation: Although large language models (LLMs) are increasingly used in clinical settings, their performance in veterinary medicine has not been systematically evaluated, and the effectiveness of veterinary-specific LLM tools in particular remains unclear. Method: Using a standardized dataset of veterinary oncology records and a rubric-guided LLM-as-a-judge framework, the study scores three commercial LLM summarization tools across five domains (Factual Accuracy, Completeness, Chronological Order, Clinical Relevance, and Organization) and repeats the evaluation in three independent runs to check the consistency of the grading framework. Result: Product 1 achieved a median overall score of 4.61 (IQR: 0.73), significantly higher than Product 2 (2.55) and Product 3 (2.45), with perfect median scores in Factual Accuracy and Chronological Order; the LLM grader was highly reproducible, with average-score standard deviations of 0.015, 0.088, and 0.034 for the three products. Conclusion: Veterinary-specific LLM tools perform better on clinical summarization, and LLM-as-a-judge is a scalable, reproducible method for evaluating clinical NLP systems in veterinary medicine. Abstract: Large language models (LLMs) are increasingly used in clinical settings, yet their performance in veterinary medicine remains underexplored. We evaluated three commercially available veterinary-focused LLM summarization tools (Product 1 [Hachiko] and Products 2 and 3) on a standardized dataset of veterinary oncology records. Using a rubric-guided LLM-as-a-judge framework, summaries were scored across five domains: Factual Accuracy, Completeness, Chronological Order, Clinical Relevance, and Organization. Product 1 achieved the highest overall performance, with a median average score of 4.61 (IQR: 0.73), compared to 2.55 (IQR: 0.78) for Product 2 and 2.45 (IQR: 0.92) for Product 3. It also received perfect median scores in Factual Accuracy and Chronological Order. To assess the internal consistency of the grading framework itself, we repeated the evaluation across three independent runs. The LLM grader demonstrated high reproducibility, with Average Score standard deviations of 0.015 (Product 1), 0.088 (Product 2), and 0.034 (Product 3). These findings highlight the importance of veterinary-specific commercial LLM tools and demonstrate that LLM-as-a-judge evaluation is a scalable and reproducible method for assessing clinical NLP summarization in veterinary medicine.
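The summary statistics reported above (median, IQR, and run-to-run standard deviation) can be computed along the following lines. This is a minimal sketch: the function names are hypothetical, and the quantile method and the use of the sample standard deviation are assumptions, since the abstract does not say which conventions the authors used.

```python
import statistics


def summarize(scores):
    """Return (median, IQR) for one product's per-record average scores."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return statistics.median(scores), q3 - q1


def run_to_run_std(run_averages):
    """Sample standard deviation of a product's average score across repeated runs."""
    return statistics.stdev(run_averages)
```

A small standard deviation across the three grading runs is what the abstract reads as evidence that the LLM grader is reproducible.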

[5] ClaimCheck: Real-Time Fact-Checking with Small Language Models

Akshith Reddy Putta,Jacob Devasier,Chengkai Li

Main category: cs.CL

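TL;DR: ClaimCheck is a transparent, stepwise automatic fact-checking system built on small language models; by mirroring the human fact-checking workflow, it reaches 76.4% accuracy with low computational requirements, outperforming previous approaches that use larger models.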

Details

Motivation: Existing fact-checking systems rely on large models and static knowledge stores, making them costly and opaque, with limited accessibility and interpretability. Method: Designs a modular pipeline comprising Web search query planning, evidence retrieval and summarization, evidence synthesis and re-retrieval, and verdict evaluation, with each module optimized for small language models and combined with prompting strategies to improve performance. Result: Reaches 76.4% accuracy on the AVeriTeC dataset, outperforming previous approaches that use LLaMA3.1 70B and GPT-4o, with significantly lower computational requirements. Conclusion: With careful modular design and prompting strategies, small language models can deliver efficient, interpretable automatic fact-checking, improving accessibility and transparency. Abstract: We introduce ClaimCheck, an LLM-guided automatic fact-checking system designed to verify real-world claims using live Web evidence and small language models. Unlike prior systems that rely on large, closed-source models and static knowledge stores, ClaimCheck employs a transparent, stepwise verification pipeline that mirrors human fact-checking workflows consisting of Web search query planning, Web-based evidence retrieval and summarization, evidence synthesis and re-retrieval, and claim verdict evaluation. Each module is optimized for small LLMs, allowing the system to deliver accurate and interpretable fact-checking with significantly lower computational requirements. Despite using a much smaller Qwen3-4B model, ClaimCheck achieves state-of-the-art accuracy of 76.4% on the AVeriTeC dataset, outperforming previous approaches using LLaMA3.1 70B and GPT-4o. Extensive ablations demonstrate that careful modular design and prompting strategies can overcome the limitations of smaller LLMs. To promote accessibility and transparency, we provide a public demo at https://idir.uta.edu/claimcheck.

[6] EEFSUVA: A New Mathematical Olympiad Benchmark

Nicole N Khatibi,Daniil A. Radamovich,Michael P. Brenner

Main category: cs.CL

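TL;DR: Questions the high proficiency that large language models appear to show on mathematics benchmarks, noting that existing benchmarks may suffer from data contamination and a narrow range of problem types, and introduces EEFSUVA, a new benchmark built from Olympiad problems from Eastern Europe and the former Soviet Union, for a more comprehensive assessment of mathematical reasoning. Experiments show that existing models perform notably worse on the new benchmark.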

Details

Motivation: Existing math benchmarks (such as IMO-related competition problems) may overestimate LLMs' true reasoning ability because of data leakage and a narrow range of problem types; a broader, less contaminated evaluation set is needed to assess mathematical understanding accurately. Method: Builds EEFSUVA, a new math benchmark collected from under-circulated regional and national Olympiads of Eastern Europe and the former Soviet Union; these problems match the IMO in difficulty but call for less conventional solution techniques and appear less often in online corpora, reducing the risk of data contamination. Result: Preliminary experiments show that state-of-the-art LLMs perform notably worse on EEFSUVA than on other Olympiad-style benchmarks, indicating limited reasoning ability on non-mainstream, nonstandard problems. Conclusion: Existing math benchmarks may overestimate LLM reasoning ability; EEFSUVA supports a more realistic assessment and underscores the need for broader, more diverse evaluation datasets to guide model development. Abstract: Recent breakthroughs have spurred claims that large language models (LLMs) match gold-medal Olympiad to graduate-level proficiency on mathematics benchmarks. In this work, we examine these claims in detail and assess the extent to which current benchmarks capture genuine LLM mathematical reasoning. The composition of these benchmarks, primarily drawing from the International Mathematical Olympiad (IMO) and related competitions, may overstate models' reasoning ability due to potential data contamination and a narrow focus on familiar problem types. To enable a more holistic assessment of mathematical understanding, we introduce EEFSUVA, a novel benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and the countries from the former Soviet Union. These contests feature problems of comparable difficulty to the IMO and are renowned for demanding nonstandard problem-solving techniques, yet their problems are far less prevalent in online corpora. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These findings also suggest the potential importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.

[7] Who is In Charge? Dissecting Role Conflicts in Instruction Following

Siqi Zeng

Main category: cs.CL

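TL;DR: Large language models should follow an instruction hierarchy in which system prompts override user inputs, yet they often ignore this rule and instead obey social cues such as authority or consensus. Through mechanistic analysis on a large-scale dataset, this paper shows that system-user conflicts and social conflicts are represented differently inside the model, and that steering interventions amplify instruction following in a role-agnostic way even though social cues dominate decisions.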

Details

Motivation: Although LLMs should respect a hierarchy in which system prompts take priority, social factors often lead them to violate it; understanding the underlying mechanisms can help improve adherence to the correct instruction hierarchy. Method: Applies linear probing, Direct Logit Attribution, and vector steering experiments on a large-scale dataset to analyze how the model handles system-user conflicts and social conflicts. Result: System-user and social conflicts form distinct representational subspaces; system-user conflicts are detected more strongly, but only social cues are resolved consistently; steering experiments show that vectors derived from social cues amplify instruction following regardless of role. Conclusion: Obedience to system instructions is fragile and driven mainly by social cues; lightweight, hierarchy-sensitive alignment methods are needed. Abstract: Large language models should follow hierarchical instructions where system prompts override user inputs, yet recent work shows they often ignore this rule while strongly obeying social cues such as authority or consensus. We extend these behavioral findings with mechanistic interpretations on a large-scale dataset. Linear probing shows conflict-decision signals are encoded early, with system-user and social conflicts forming distinct subspaces. Direct Logit Attribution reveals stronger internal conflict detection in system-user cases but consistent resolution only for social cues. Steering experiments show that, despite using social cues, the vectors surprisingly amplify instruction following in a role-agnostic way. Together, these results explain fragile system obedience and underscore the need for lightweight hierarchy-sensitive alignment methods.

[8] Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision

Dimitar Peshevski,Kiril Blazhevski,Martin Popovski,Gjorgji Madjarov

Main category: cs.CL

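TL;DR: Proposes a pipeline that uses large language models to generate synthetic queries and labels for fine-tuning a small transformer model for document reranking, reducing cost while maintaining strong performance.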

Details

Motivation: Large language models excel at reranking but are computationally expensive, while smaller models depend on scarce manually labeled data; an efficient method that needs no human labels is therefore required. Method: Uses LLMs to generate synthetic queries from domain-specific corpora and an LLM-based classifier to label positive and hard-negative pairs, producing a synthetic dataset; a small transformer model is then fine-tuned with contrastive learning using the Localized Contrastive Estimation (LCE) loss. Result: Experiments on the MedQuAD dataset show that the approach significantly improves in-domain performance and generalizes well out of domain. Conclusion: Using LLMs for data generation and supervision rather than inference substantially reduces computational cost while maintaining strong reranking performance. Abstract: Effective document reranking is essential for improving search relevance across diverse applications. While Large Language Models (LLMs) excel at reranking due to their deep semantic understanding and reasoning, their high computational cost makes them impractical for many real-world deployments. Fine-tuning smaller, task-specific models is a more efficient alternative but typically depends on scarce, manually labeled data. To overcome this, we propose a novel pipeline that eliminates the need for human-labeled query-document pairs. Our method uses LLMs to generate synthetic queries from domain-specific corpora and employs an LLM-based classifier to label positive and hard-negative pairs. This synthetic dataset is then used to fine-tune a smaller transformer model with contrastive learning using Localized Contrastive Estimation (LCE) loss. Experiments on the MedQuAD dataset show that our approach significantly boosts in-domain performance and generalizes well to out-of-domain tasks. By using LLMs for data generation and supervision rather than inference, we reduce computational costs while maintaining strong reranking capabilities.
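For readers unfamiliar with the LCE loss named above, the following is a minimal sketch of its usual form: for each query, the reranker scores one positive document and a group of hard negatives, and the loss is the softmax cross-entropy of the positive within that group. The function names and the plain-Python formulation are illustrative assumptions, not the authors' training code.

```python
import math


def lce_loss(pos_score, neg_scores):
    """Cross-entropy of the positive document within its group of hard negatives."""
    scores = [pos_score] + list(neg_scores)
    peak = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_sum_exp - pos_score


def batch_lce_loss(groups):
    """Average the per-query loss over (pos_score, neg_scores) groups."""
    return sum(lce_loss(pos, negs) for pos, negs in groups) / len(groups)
```

The loss shrinks as the positive's score rises above the hard negatives retrieved for the same query, which is the signal the small reranker is trained on.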

[9] Geometric Structures and Patterns of Meaning: A PHATE Manifold Analysis of Chinese Character Embeddings

Wen G. Gong

Main category: cs.CL

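TL;DR: Systematically investigates geometric patterns in Chinese character embeddings using PHATE manifold analysis, finding clustering patterns for content words, branching patterns for function words, and a correlation between geometric complexity and semantic content.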

Details

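Motivation: To determine whether Chinese character embeddings contain geometric structure related to semantic organization, and to test whether traditional linguistic theory can be interpreted computationally. Method: Cross-validates across seven embedding models and eight dimensionality reduction methods, applies PHATE manifold analysis to the geometric structure of over 1000 Chinese characters across 12 semantic domains, and carries out a child-network study of semantic expansion. Result: Content words form clusters in embedding space while function words show branching structure; meaningful characters exhibit rich geometric diversity while structural radicals collapse into tight clusters; semantic expansion grows systematically across phrase networks. Conclusion: The study provides computational support for traditional linguistic theory and establishes a new framework for geometric analysis of semantic organization, showing that the geometry of embedding space reflects the semantic properties of Chinese characters. Abstract: We systematically investigate geometric patterns in Chinese character embeddings using PHATE manifold analysis. Through cross-validation across seven embedding models and eight dimensionality reduction methods, we observe clustering patterns for content words and branching patterns for function words. Analysis of over 1000 Chinese characters across 12 semantic domains reveals that geometric complexity correlates with semantic content: meaningful characters exhibit rich geometric diversity while structural radicals collapse into tight clusters. The comprehensive child-network analysis (123 phrases) demonstrates systematic semantic expansion from elemental characters. These findings provide computational evidence supporting traditional linguistic theory and establish a novel framework for geometric analysis of semantic organization.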

[10] Trustworthy Summarization via Uncertainty Quantification and Risk Awareness in Large Language Models

Shuaidong Pan,Di Wu

Main category: cs.CL

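TL;DR: Proposes a large language model framework that combines uncertainty quantification with risk-aware mechanisms to make automatic summarization more reliable in high-risk scenarios.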

Details

Motivation: To meet the demand for reliable automatic summarization under information overload and high-risk decision-making, and to avoid the overconfident predictions of conventional models. Method: Builds a conditional generation-based summarization model, introduces Bayesian inference to model uncertainty in the parameter space, measures the uncertainty of generated content with predictive distribution entropy, jointly optimizes entropy regularization and a risk-aware loss, and integrates risk scoring and regulation modules. Result: Experiments show that the method significantly improves the robustness and reliability of summarization in high-risk applications while maintaining fluency and semantic integrity. Conclusion: The study offers a systematic solution for trustworthy summarization, with methodological scalability and practical value. Abstract: This study addresses the reliability of automatic summarization in high-risk scenarios and proposes a large language model framework that integrates uncertainty quantification and risk-aware mechanisms. Starting from the demands of information overload and high-risk decision-making, a conditional generation-based summarization model is constructed, and Bayesian inference is introduced during generation to model uncertainty in the parameter space, which helps avoid overconfident predictions. The uncertainty level of the generated content is measured using predictive distribution entropy, and a joint optimization of entropy regularization and risk-aware loss is applied to ensure that key information is preserved and risk attributes are explicitly expressed during information compression. On this basis, the model incorporates risk scoring and regulation modules, allowing summaries to cover the core content accurately while enhancing trustworthiness through explicit risk-level prompts. Comparative experiments and sensitivity analyses verify that the proposed method significantly improves the robustness and reliability of summarization in high-risk applications while maintaining fluency and semantic integrity. This research provides a systematic solution for trustworthy summarization and demonstrates both scalability and practical value at the methodological level.
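The predictive distribution entropy mentioned above can be illustrated with a short sketch: the Shannon entropy of each next-token distribution, averaged over the generated summary. Averaging over decoding steps and the function names are assumptions made for illustration; the abstract does not specify how the authors aggregate entropy or combine it with the risk-aware loss.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(step_distributions):
    """Average per-step entropy over a generated sequence (hypothetical aggregation)."""
    return sum(entropy(d) for d in step_distributions) / len(step_distributions)
```

A near-zero value means the model was confident at every step, while higher values flag summaries whose content is less certain.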

[11] Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks

Dongjun Kim,Gyuho Shim,Yongchan Chun,Minhyuk Kim,Chanjun Park,Heuiseok Lim

Main category: cs.CL

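TL;DR: Introduces Benchmark Profiling, a framework that decomposes benchmark performance to estimate how much each of ten cognitive abilities contributes, and shows that existing benchmarks usually draw on a mix of abilities rather than a single skill.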

Details

Motivation: Benchmark scores tend to overstate a model's real capability because they mask the mix of skills a task demands, and there is no systematic way to verify whether benchmarks measure the abilities they claim to. Method: Combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS), which quantifies how much each cognitive ability contributes to a model's performance on a given benchmark. Result: Profiling three instruction-tuned models on ten widely used benchmarks shows that most benchmarks draw on several abilities rather than one; datasets with similar labels rely on different ability mixtures; code-generation benchmarks reward broad, multi-skill improvement, so narrow domain-specific fine-tuning yields only modest gains; and task-irrelevant abilities can hurt performance. Conclusion: Benchmark Profiling explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark auditing and model interpretability. Abstract: Large Language Models are commonly judged by their scores on standard benchmarks, yet such scores often overstate real capability since they mask the mix of skills a task actually demands. For example, ARC is assumed to test reasoning, while HellaSwag is designed to evaluate commonsense. However, we lack a systematic way to verify if these benchmarks actually measure these labels. We introduce Benchmark Profiling, a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities. The method combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model's success on a given benchmark. Profiling three instruction-tuned models across ten widely used benchmarks yields four key findings: (i) most benchmarks draw on several abilities rather than one, (ii) datasets with similar labels rely on distinct ability mixtures, (iii) code-generation benchmarks reward broad, multi-skill improvement and thus show only modest gains from narrow domain-specific fine-tuning, and (iv) abilities irrelevant to the task could negatively affect performance. Benchmark Profiling therefore explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark audit and model interpretability.

Boddu Sri Pavan,Boddu Swathi Sree

Main category: cs.CL

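TL;DR: Presents a computational social science approach to preserving Chandassu, the Telugu metrical poetry tradition, and builds the first comprehensive digital framework for analyzing Telugu prosodic patterns.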

Details

Motivation: To preserve Telugu Chandassu, an important cultural knowledge system at risk of disappearing, and to bridge traditional community knowledge and modern computational methods. Method: Uses a social computing approach that combines expert-validated linguistic patterns with culturally informed algorithm design to build a dataset of 4,651 annotated padyams, and develops three core tools: AksharamTokenizer, LaghuvuGuruvu Generator, and PadyaBhedam Checker. Result: The proposed algorithm achieves 91.73% accuracy on the Chandassu Score, with evaluation metrics that reflect traditional literary standards. Conclusion: Computational social science can effectively preserve endangered cultural knowledge systems and foster new forms of collective intelligence around literary heritage, offering a workable path for community-centered cultural preservation. Abstract: This research presents a computational social science approach to preserving Telugu Chandassu, the metrical poetry tradition representing centuries of collective cultural intelligence. We develop the first comprehensive digital framework for analyzing Telugu prosodic patterns, bridging traditional community knowledge with modern computational methods. Our social computing approach involves collaborative dataset creation of 4,651 annotated padyams, expert-validated linguistic patterns, and culturally-informed algorithmic design. The framework includes AksharamTokenizer for prosody-aware tokenization, LaghuvuGuruvu Generator for classifying light and heavy syllables, and PadyaBhedam Checker for automated pattern recognition. Our algorithm achieves 91.73% accuracy on the proposed Chandassu Score, with evaluation metrics reflecting traditional literary standards. This work demonstrates how computational social science can preserve endangered cultural knowledge systems while enabling new forms of collective intelligence around literary heritage. The methodology offers insights for community-centered approaches to cultural preservation, supporting broader initiatives in digital humanities and socially-aware computing systems.

[13] LLMRank: Understanding LLM Strengths for Model Routing

Shubham Agrawal,Prasang Gupta

Main category: cs.CL

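TL;DR: LLMRank is a prompt-aware routing framework that extracts multidimensional features such as task type, reasoning patterns, and complexity indicators, and uses a neural ranking model to select the most suitable large language model for each prompt, improving the trade-off between performance and efficiency.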

Details

Motivation: As large language models proliferate, choosing the most suitable model for each prompt while balancing performance against efficiency has become a key deployment challenge. Method: LLMRank extracts rich, human-readable features from prompts, combines them with signals from a lightweight proxy solver, and predicts per-model utility with a neural ranking model trained on RouterBench. Result: On RouterBench, which comprises 36,497 prompts, 11 benchmarks, and 11 state-of-the-art LLMs, LLMRank achieves up to 89.2% of oracle utility and provides interpretable feature attributions. Conclusion: Multifaceted feature extraction and a hybrid ranking objective matter for efficient, transparent LLM deployment, and LLMRank demonstrates the potential of feature-driven routing. Abstract: The rapid growth of large language models (LLMs) with diverse capabilities, latency and computational costs presents a critical deployment challenge: selecting the most suitable model for each prompt to optimize the trade-off between performance and efficiency. We introduce LLMRank, a prompt-aware routing framework that leverages rich, human-readable features extracted from prompts, including task type, reasoning patterns, complexity indicators, syntactic cues, and signals from a lightweight proxy solver. Unlike prior one-shot routers that rely solely on latent embeddings, LLMRank predicts per-model utility using a neural ranking model trained on RouterBench, comprising 36,497 prompts spanning 11 benchmarks and 11 state-of-the-art LLMs, from small efficient models to large frontier systems. Our approach achieves up to 89.2% of oracle utility, while providing interpretable feature attributions that explain routing decisions. Extensive studies demonstrate the importance of multifaceted feature extraction and the hybrid ranking objective, highlighting the potential of feature-driven routing for efficient and transparent LLM deployment.
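The "fraction of oracle utility" figure can be read as the utility a router actually collects divided by the utility of an oracle that always picks the best model per prompt. The sketch below works under that reading; the data layout, model names, and utility values are hypothetical, and the paper's exact utility definition (for example, how cost is weighed against accuracy) is not given in the abstract.

```python
def oracle_fraction(utilities, choices):
    """Share of the oracle's total utility achieved by the router's choices.

    utilities: one dict per prompt mapping model name -> utility on that prompt.
    choices:   the model name the router picked for each prompt.
    """
    routed = sum(row[choice] for row, choice in zip(utilities, choices))
    oracle = sum(max(row.values()) for row in utilities)
    return routed / oracle


# Hypothetical two-prompt example: the small model suffices for the first prompt only.
example_utilities = [
    {"small": 1.0, "large": 1.0},
    {"small": 0.0, "large": 1.0},
]
always_small = oracle_fraction(example_utilities, ["small", "small"])
good_router = oracle_fraction(example_utilities, ["small", "large"])
```

In this toy example, always routing to the small model recovers half of the oracle's utility, while a router that escalates the second prompt recovers all of it.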

[14] GRPO++: Enhancing Dermatological Reasoning under Low Resource Settings

Ismam Nur Swapnil,Aranya Saha,Tanvir Ahmed Khan,Mohammad Ariful Haque

Main category: cs.CL

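TL;DR: Proposes GRPO++, the core of a resource-efficient multi-stage training method that improves the structured reasoning of vision-language models for dermatological diagnosis, and reduces factual errors through knowledge-graph-assisted preference optimization.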

Details

Motivation: Vision-language models for medical image analysis are limited by data scarcity and high computational cost, which makes complex structured reasoning hard to achieve, especially in dermatology. Method: Proposes GRPO++, a modified version of Grouped Relative Policy Optimization (GRPO), and combines it with supervised fine-tuning and knowledge-graph-based Direct Preference Optimization (DPO) in a multi-stage training pipeline that emulates a dermatologist's diagnostic process. Result: A preliminary evaluation on a curated dermatological dataset shows notable performance gains over standard fine-tuning approaches. Conclusion: The proposed training pipeline offers a feasible path to specialized, reliable vision-language models in resource-constrained environments. Abstract: Vision-Language Models (VLMs) show promise in medical image analysis, yet their capacity for structured reasoning in complex domains like dermatology is often limited by data scarcity and the high computational cost of advanced training techniques. To address these challenges, we introduce DermIQ-VLM, a VLM developed through a multi-stage, resource-efficient methodology designed to emulate a dermatologist's diagnostic process. Our primary contribution is a modified version of Grouped Relative Policy Optimization (GRPO), called GRPO++, which stabilizes the powerful but data-intensive GRPO framework. Our proposed training pipeline first employs GRPO++ for reasoning-oriented disease recognition, followed by supervised fine-tuning for conversational ability. To mitigate factual errors introduced during this step, we then align the model using Direct Preference Optimization (DPO), leveraging a Knowledge Graph-based system as a scalable proxy for expert preference. A preliminary evaluation on a curated dermatological dataset demonstrates that our proposed methodology yields notable performance gains over standard fine-tuning approaches. These findings validate the potential of our pipeline as a feasible pathway for developing specialized, reliable VLMs in resource-constrained environments.
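As background for the GRPO framework that GRPO++ modifies, here is a minimal sketch of the group-relative advantage used in standard GRPO: several responses are sampled for the same prompt, and each reward is normalized by the group's mean and standard deviation. This shows plain GRPO only; the abstract does not describe the GRPO++ changes, and the use of the population standard deviation and the `eps` constant are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline comes from the group itself, no separate value network is needed, but a group with identical rewards yields all-zero advantages and therefore no learning signal.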

[15] Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation

Nandakishor M

Main category: cs.CL

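TL;DR: Proposes a confidence-aware routing system that reduces LLM hallucination by assessing model uncertainty before generation; it combines semantic alignment, internal convergence, and learned confidence signals to route queries to different processing pathways, significantly improving hallucination detection on knowledge-intensive QA tasks while lowering computational cost.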

Details

Motivation: Large language models are prone to factual errors (hallucinations), and existing methods mostly rely on post-generation correction, which is computationally expensive and cannot prevent unreliable content from being generated; a more efficient pre-generation intervention is needed. Method: Proposes a confidence-aware routing system that fuses three signals: semantic alignment between internal representations and reference embeddings, internal convergence analysis across model layers, and learned confidence estimation; based on the unified confidence score, queries are routed to one of four pathways: local generation, retrieval-augmented generation, a larger model, or human review. Result: On knowledge-intensive QA benchmarks, the hallucination detection score reaches 0.74 (vs. 0.42 for the baseline), the F1 score improves from 0.61 to 0.82 with a low false positive rate (0.09), and computational cost falls by 40% compared with post-hoc correction methods. Conclusion: The method shifts the paradigm from reactive correction to proactive assessment, offering a computationally efficient route to more reliable LLMs. Abstract: Large Language Models suffer from hallucination, generating plausible yet factually incorrect content. Current mitigation strategies focus on post-generation correction, which is computationally expensive and fails to prevent unreliable content generation. We propose a confidence-aware routing system that proactively assesses model uncertainty before generation and redirects queries based on estimated reliability. Our approach combines three complementary signals: semantic alignment between internal representations and reference embeddings, internal convergence analysis across model layers, and learned confidence estimation. The unified confidence score determines routing to four pathways: local generation for high confidence, retrieval-augmented generation for medium confidence, larger models for low confidence, and human review for very low confidence. Evaluation on knowledge-intensive QA benchmarks demonstrates significant improvements in hallucination detection (0.74 vs. 0.42 baseline) while reducing computational costs by 40% compared to post-hoc methods. The F1 score improves from 0.61 to 0.82 with low false positive rates (0.09). This paradigm shift from reactive correction to proactive assessment offers a computationally efficient approach to LLM reliability enhancement.
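The four-pathway routing described above can be sketched as a threshold rule over a unified confidence score. Everything numeric here is hypothetical: the abstract gives neither the weights used to combine the three signals nor the confidence thresholds, so the weighted average and the cut-offs below are placeholders chosen only to show the control flow.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    semantic_alignment: float   # each signal assumed to lie in [0, 1]
    layer_convergence: float
    learned_confidence: float


WEIGHTS = (0.4, 0.3, 0.3)  # hypothetical combination weights
THRESHOLDS = [             # hypothetical cut-offs, highest first
    (0.8, "local_generation"),
    (0.6, "retrieval_augmented_generation"),
    (0.4, "larger_model"),
]


def unified_confidence(s: Signals) -> float:
    """Weighted average of the three signals (an assumed fusion rule)."""
    parts = (s.semantic_alignment, s.layer_convergence, s.learned_confidence)
    return sum(w * p for w, p in zip(WEIGHTS, parts))


def route(s: Signals) -> str:
    """Pick a pathway before any text is generated."""
    score = unified_confidence(s)
    for cutoff, pathway in THRESHOLDS:
        if score >= cutoff:
            return pathway
    return "human_review"
```

The design point is that the decision happens before generation, so low-confidence queries never pay for a local generation that would later need correcting.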

[16] Silent Tokens, Loud Effects: Padding in LLMs

Rom Himelstein,Amit LeVi,Yonatan Belinkov,Avi Mendelson

Main category: cs.CL

TL;DR: Shows that padding tokens, when mishandled, affect LLM activations, generation quality, bias, and safety; padding is not harmless and must be handled carefully in deployment.

Details

Motivation: Padding tokens are widely used in batched inference, but implementation errors can let them influence model computation, and the extent of that influence is unclear, so a systematic study is needed. Method: Across three open-source model families (Llama, Gemma, and Qwen), the study inserts controlled amounts of padding and evaluates the effect along four axes: activations, generation quality, bias, and safety. Result: Even small amounts of padding shift hidden representations, degrade generation quality in smaller models, alter bias in unpredictable ways, and weaken safety guardrails. Conclusion: Padding handling is a significant robustness risk that must be taken seriously and managed properly in real deployments. Abstract: Padding tokens are widely used in large language models (LLMs) to equalize sequence lengths during batched inference. While they should be fully masked, implementation errors can cause them to influence computation, and the extent of this influence is not well understood. We systematically study this effect across three open-source model families (Llama, Gemma, Qwen), inserting controlled amounts of padding and evaluating outcomes along four axes: activations, generation quality, bias, and safety. Even small amounts of padding shift hidden representations, degrade quality in smaller models, alter bias in unpredictable ways, and weaken safety guardrails. These findings demonstrate that padding is not a harmless detail but a robustness risk that must be carefully handled in deployment.
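The failure mode studied above can be reproduced in miniature with a single attention step: when the padding position is masked, the output matches the unpadded case, and when the mask is forgotten, the padding leaks into the result. This toy single-query attention and its numbers are illustrative assumptions, not the paper's experimental setup.

```python
import math


def attend(scores, values, keep=None):
    """Single-query attention: softmax over scores, then a weighted sum of values."""
    if keep is None:
        keep = [True] * len(scores)  # no mask supplied: every position participates
    peak = max(s for s, k in zip(scores, keep) if k)
    weights = [math.exp(s - peak) if k else 0.0 for s, k in zip(scores, keep)]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))


scores, values = [1.0, 2.0], [10.0, 20.0]
padded_scores = scores + [0.0]    # one padding position appended
padded_values = values + [-50.0]  # arbitrary content at the padding slot

clean = attend(scores, values)
masked = attend(padded_scores, padded_values, keep=[True, True, False])
leaky = attend(padded_scores, padded_values)  # mask forgotten
```

Here `masked` equals `clean`, while `leaky` differs, which is the kind of silent shift in hidden representations the study measures at model scale.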

[17] CIFLEX: Contextual Instruction Flow for Sub-task Execution in Multi-Turn Interactions with a Single On-Device LLM

Juntae Lee,Jihwan Bang,Seunghan Yang,Simyung Chang

Main category: cs.CL

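TL;DR: CIFLEX is a novel execution system for efficiently handling sub-tasks in multi-turn interactions with a single on-device large language model; it reduces computational overhead by reusing the main task's KV cache and running sub-tasks in isolated side paths.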

Details

Motivation: As LLMs grow more capable, a single model is expected to handle many sub-tasks to better support user requests, but conventional approaches reprocess the entire conversation context whenever they switch tasks, which is computationally expensive. Method: CIFLEX reuses the key-value (KV) cache from the main task and injects task-specific instructions into isolated side paths to run sub-tasks; once a sub-task finishes, the model rolls back to the main path via the cached context, avoiding redundant prefill computation; a hierarchical classification strategy supports sub-task selection. Result: Experiments show that CIFLEX significantly reduces computational cost without degrading task performance, enabling scalable and efficient multi-task dialogue on-device. Conclusion: CIFLEX addresses the computational cost of sub-task handling in multi-turn interactions and offers a practical way to run complex multi-task dialogue on resource-constrained devices. Abstract: We present CIFLEX (Contextual Instruction Flow for Sub-task Execution), which is a novel execution system for efficient sub-task handling in multi-turn interactions with a single on-device large language model (LLM). As LLMs become increasingly capable, a single model is expected to handle diverse sub-tasks that more effectively and comprehensively support answering user requests. A naive approach reprocesses the entire conversation context when switching between main and sub-tasks (e.g., query rewriting, summarization), incurring significant computational overhead. CIFLEX mitigates this overhead by reusing the key-value (KV) cache from the main task and injecting only task-specific instructions into isolated side paths. After sub-task execution, the model rolls back to the main path via cached context, thereby avoiding redundant prefill computation. To support sub-task selection, we also develop a hierarchical classification strategy tailored for small-scale models, decomposing multi-choice decisions into binary ones. Experiments show that CIFLEX significantly reduces computational costs without degrading task performance, enabling scalable and efficient multi-task dialogue on-device.
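The idea of decomposing a multi-choice sub-task decision into binary ones can be sketched as a chain of yes/no questions. The `ask_yes_no` callable stands in for a call to the on-device model, and the question wording and ordering are hypothetical; the abstract does not describe the actual prompts or hierarchy.

```python
from typing import Callable, Optional, Sequence


def select_subtask(
    ask_yes_no: Callable[[str], bool],
    subtasks: Sequence[str],
) -> Optional[str]:
    """Choose a sub-task using only binary decisions.

    First decide whether any sub-task is needed, then test candidates one by one.
    """
    if not ask_yes_no("Does this turn need a sub-task before answering?"):
        return None
    for name in subtasks:
        if ask_yes_no(f"Is the needed sub-task '{name}'?"):
            return name
    return None  # no candidate confirmed: fall back to the main path
```

Each question has only two possible answers, which the abstract suggests is an easier decision for small-scale models than a single multi-way choice.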

[18] SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation

Hu Wei,Ze Xu,Boyu Yang,Linlin Miao,Weiqi Zhai,Yihan Li,Zixuan Li,Zhijun Wang,Boya Wang,Jianwei Yu,Jialing Yuan,Xiaoyue Zhang,Cheng He,Minglei Chen,Zifan Zhang,Qianhui Li,Wei Wang,Xiang Xu

Main category: cs.CL

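TL;DR: Presents two complementary math benchmarks, SKYLENAGE-ReasoningMATH and SKYLENAGE-MATH, for evaluating large language models on mathematical reasoning; the results expose current models' limits on hard, deep problems, and the benchmarks come with rich metadata to support future research.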

Details

Motivation: Existing math evaluation suites suffer from ceiling effects and can no longer separate frontier LLMs by mathematical ability, so more challenging and more finely structured benchmarks are needed. Method: Designs SKYLENAGE-ReasoningMATH, a 100-item structure-aware diagnostic set, and SKYLENAGE-MATH, a 150-item contest-style suite spanning four stages from high school to doctoral level under a seven-subject taxonomy; evaluates fifteen contemporary LLM variants under a single setup and analyzes performance by subject and by grade. Result: On the contest suite, the strongest model reaches 44% accuracy and the runner-up 37%, with accuracy declining as difficulty rises; on the reasoning set, the best model scores 81%, and the hardest slice reveals a clear gap between leading and mid-tier models; top systems show a doctoral-to-high-school retention of about 79%. Conclusion: SKYLENAGE provides a hard, broadly covering benchmark for mathematical reasoning with calibrated difficulty and rich metadata, and can serve as a reference for future evaluations of LLM mathematical ability. Abstract: Large language models (LLMs) now perform strongly on many public math suites, yet frontier separation within mathematics increasingly suffers from ceiling effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a 100-item, structure-aware diagnostic set with per-item metadata on length, numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item contest-style suite spanning four stages from high school to doctoral under a seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a single setup and analyze subject × model and grade × model performance. On the contest suite, the strongest model reaches 44% while the runner-up reaches 37%; accuracy declines from high school to doctoral, and top systems exhibit a doctoral-to-high-school retention near 79%. On the reasoning set, the best model attains 81% overall, and hardest-slice results reveal clear robustness gaps between leaders and the mid-tier. In summary, we release SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH; together, SKYLENAGE provides a hard, reasoning-centered and broadly covering math benchmark with calibrated difficulty and rich metadata, serving as a reference benchmark for future evaluations of mathematical reasoning.

[19] Redundancy-as-Masking: Formalizing the Artificial Age Score (AAS) to Model Memory Aging in Generative AI

Seyma Yaman Kayadibi

Main category: cs.CL

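TL;DR: Proposes the Artificial Age Score (AAS), a metric for assessing memory degradation in large language models, and finds that session resets lead to structural memory aging.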

Details

Motivation: To quantify memory aging in AI systems, in particular the asymmetry between semantic and episodic memory after the conversational context is reset. Method: Introduces the entropy-informed, log-scaled AAS metric, evaluates memory performance without estimating redundancy (R = 0), and tests the memory behavior of ChatGPT-5 in a 25-day bilingual study. Result: In persistent sessions the AAS approaches its theoretical minimum, indicating youthful memory; after session resets episodic memory collapses and the AAS rises sharply, indicating structural aging. Conclusion: AAS is a theoretically grounded, task-independent diagnostic tool for memory aging, suitable for evaluating memory degradation in artificial systems. Abstract: Artificial intelligence is observed to age not through chronological time but through structural asymmetries in memory performance. In large language models, semantic cues such as the name of the day often remain stable across sessions, while episodic details like the sequential progression of experiment numbers tend to collapse when conversational context is reset. To capture this phenomenon, the Artificial Age Score (AAS) is introduced as a log-scaled, entropy-informed metric of memory aging derived from observable recall behavior. The score is formally proven to be well-defined, bounded, and monotonic under mild and model-agnostic assumptions, making it applicable across various tasks and domains. In its Redundancy-as-Masking formulation, the score interprets redundancy as overlapping information that reduces the penalized mass. However, in the present study, redundancy is not explicitly estimated; all reported values assume a redundancy-neutral setting (R = 0), yielding conservative upper bounds. The AAS framework was tested over a 25-day bilingual study involving ChatGPT-5, structured into stateless and persistent interaction phases. During persistent sessions, the model consistently recalled both semantic and episodic details, driving the AAS toward its theoretical minimum, indicative of structural youth. In contrast, when sessions were reset, the model preserved semantic consistency but failed to maintain episodic continuity, causing a sharp increase in the AAS and signaling structural memory aging. These findings support the utility of AAS as a theoretically grounded, task-independent diagnostic tool for evaluating memory degradation in artificial systems. The study builds on foundational concepts from von Neumann's work on automata, Shannon's theories of information and redundancy, and Turing's behavioral approach to intelligence.

[20] Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing

Yisong Xiao,Aishan Liu,Siyuan Liang,Zonghao Ying,Xianglong Liu,Dacheng Tao

Main category: cs.CL

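TL;DR: Proposes ARGRE, a new test-time detoxification framework that models toxicity transition paths in the latent representation space to enable stable, precise reward-guided editing, markedly improving both the effectiveness and the efficiency of LLM detoxification.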

Details

Motivation: Existing test-time detoxification methods do not sufficiently explore the transition space between toxic and non-toxic outputs, so their interventions are imprecise and detoxification is neither efficient nor stable. Method: Proposes the ARGRE framework, which identifies non-toxic semantic directions and interpolates in the latent space to build fine-grained transition trajectories, uses them to turn sparse toxicity annotations into dense training signals, trains an autoregressive reward model, and at inference guides a two-step adaptive edit (directional steering followed by gradient-based refinement). Result: Experiments on 8 mainstream LLMs show that ARGRE outperforms existing methods, with a 62.21% reduction in toxicity and a 47.58% reduction in inference time, while leaving the original model's performance almost unchanged. Conclusion: By explicitly modeling toxicity transition paths and applying reward-guided representation editing, ARGRE achieves more precise and efficient test-time detoxification and offers an effective route to safe LLM deployment. Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content, necessitating detoxification strategies to ensure safe and responsible deployment. Test-time detoxification methods, which typically introduce static or dynamic interventions into LLM representations, offer a promising solution due to their flexibility and minimal invasiveness. However, current approaches often suffer from imprecise interventions, primarily due to their insufficient exploration of the transition space between toxic and non-toxic outputs. To address this challenge, we propose Autoregressive Reward Guided Representation Editing (ARGRE), a novel test-time detoxification framework that explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing. ARGRE identifies non-toxic semantic directions and interpolates between toxic and non-toxic representations to reveal fine-grained transition trajectories. These trajectories transform sparse toxicity annotations into dense training signals, enabling the construction of an autoregressive reward model that delivers stable and precise editing guidance. At inference, the reward model guides an adaptive two-step editing process to obtain detoxified representations: it first performs directional steering based on expected reward gaps to shift representations toward non-toxic regions, followed by lightweight gradient-based refinements. Extensive experiments across 8 widely used LLMs show that ARGRE significantly outperforms leading baselines in effectiveness (-62.21% toxicity) and efficiency (-47.58% inference time), while preserving the core capabilities of the original model with minimal degradation. Our code is available at the website.
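
The interpolation step in the ARGRE abstract can be pictured with a small sketch. This is only an illustration under stated assumptions: the abstract does not say how ARGRE interpolates, so straight-line (linear) interpolation is assumed here, and the function name `transition_trajectory` and the toy 2-dimensional vectors are hypothetical.

```python
def transition_trajectory(h_toxic, h_nontoxic, steps=5):
    """Return `steps` evenly spaced points on the straight line from a toxic
    representation to a non-toxic one (assumed linear interpolation; steps >= 2)."""
    points = []
    for k in range(steps):
        alpha = k / (steps - 1)  # 0.0 at the toxic end, 1.0 at the non-toxic end
        points.append([(1 - alpha) * t + alpha * n
                       for t, n in zip(h_toxic, h_nontoxic)])
    return points

# Toy 2-dimensional "representations"; real hidden states are much larger.
trajectory = transition_trajectory([1.0, 0.0], [0.0, 1.0], steps=5)
for point in trajectory:
    print(point)
```

Each intermediate point could be given its own graded label (for example its `alpha`), which is one way a single toxic/non-toxic pair might yield several training examples for a reward model.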

[21] Feasibility of Structuring Stress Documentation Using an Ontology-Guided Large Language Model

Hyeoneui Kim,Jeongha Kim,Huijing Xu,Jinsun Jung,Sunghoon Kang,Sun Joo Jang

Main category: cs.CL

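TL;DR: This study develops an ontology for mental stress (MeSO) and evaluates the feasibility of using a large language model (LLM) to extract ontology-guided stress-related information from narrative text. MeSO is built from theoretical models and 11 validated instruments and is used to extract six categories of stress information from Reddit posts; the LLM correctly identified 78.2% of items, showing that the approach is feasible for structured extraction of stress information.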

Details

Motivation: Stress significantly affects health, yet in electronic health records it is usually documented as unstructured free text, which makes the information inconsistent and hard to use. Ambient AI technologies can reduce documentation burden but still generate unstructured text, limiting clinical value, so a method is needed to turn stress information in free text into structured data. Method: Builds the Mental Stress Ontology (MeSO) by combining theories such as the Transactional Model of Stress with 11 validated stress assessment tools, then refines it with Ontology Pitfall Scanner! and expert review. Uses the Claude Sonnet 4 LLM to extract six categories of stress-related information (such as stressor, stress response, and coping strategy) from 35 Reddit posts, and compares the output with human annotation to assess accuracy and ontology coverage. Result: The final MeSO contains 181 concepts under eight top-level classes. Of 220 extractable stress-related items, the LLM correctly identified 172 (78.2%), misclassified 27 (12.3%), and missed 21 (9.5%). All correctly extracted items mapped accurately to MeSO, but 24 relevant concepts were not yet covered by the ontology. Conclusion: An ontology-guided LLM can effectively perform structured extraction of stress-related information, improving the consistency and usability of stress documentation in ambient AI systems. Future work should validate the approach on clinical dialogue data and compare different LLMs. Abstract: Stress, arising from the dynamic interaction between external stressors, individual appraisals, and physiological or psychological responses, significantly impacts health yet is often underreported and inconsistently documented, typically captured as unstructured free-text in electronic health records. Ambient AI technologies offer promise in reducing documentation burden, but predominantly generate unstructured narratives, limiting downstream clinical utility. This study aimed to develop an ontology for mental stress and evaluate the feasibility of using a Large Language Model (LLM) to extract ontology-guided stress-related information from narrative text. The Mental Stress Ontology (MeSO) was developed by integrating theoretical models like the Transactional Model of Stress with concepts from 11 validated stress assessment tools. MeSO's structure and content were refined using Ontology Pitfall Scanner! and expert validation. Using MeSO, six categories of stress-related information--stressor, stress response, coping strategy, duration, onset, and temporal profile--were extracted from 35 Reddit posts using Claude Sonnet 4. Human reviewers evaluated accuracy and ontology coverage. The final ontology included 181 concepts across eight top-level classes. Of 220 extractable stress-related items, the LLM correctly identified 172 (78.2%), misclassified 27 (12.3%), and missed 21 (9.5%). All correctly extracted items were accurately mapped to MeSO, although 24 relevant concepts were not yet represented in the ontology. This study demonstrates the feasibility of using an ontology-guided LLM for structured extraction of stress-related information, offering potential to enhance the consistency and utility of stress documentation in ambient AI systems. Future work should involve clinical dialogue data and comparison across LLMs.
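
As a quick arithmetic check on the extraction figures reported for MeSO, the three outcome counts can be recomputed as shares of the 220 extractable items. The counts come from the abstract; the helper name `percentages` is just an illustrative choice.

```python
# Outcome counts reported in the abstract (220 extractable items in total).
counts = {"correct": 172, "misclassified": 27, "missed": 21}

def percentages(counts):
    """Return each count as a percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

total_items = sum(counts.values())
shares = percentages(counts)
print(total_items, shares)
```

The three counts sum to 220 and reproduce the reported 78.2%, 12.3%, and 9.5%.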

[22] SeMob: Semantic Synthesis for Dynamic Urban Mobility Prediction

Runfei Chen,Shuyang Jiang,Wei Huang

Main category: cs.CL

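TL;DR: SeMob is an LLM-based semantic synthesis framework for dynamic human mobility prediction: a multi-agent system extracts spatiotemporally relevant information from online text, and a novel progressive fusion architecture combines it with spatiotemporal data to improve prediction accuracy.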

Details

Motivation: Existing spatiotemporal models struggle to use text that describes external events, so human mobility prediction degrades when abrupt events occur. Method: Proposes the SeMob framework, in which an LLM-based multi-agent system automatically extracts and reasons about spatiotemporally related events in complex text, and a progressive fusion architecture combines the resulting fine-grained context with spatiotemporal data. Result: On a self-constructed dataset, SeMob reduces MAE by up to 13.92% and RMSE by up to 11.12% relative to a conventional spatiotemporal model, with the largest gains in regions close to an event in space and time. Conclusion: SeMob effectively fuses textual semantic information with spatiotemporal data and markedly improves the accuracy of mobility prediction under external events. Abstract: Human mobility prediction is vital for urban services, but often fails to account for abrupt changes from external events. Existing spatiotemporal models struggle to leverage textual descriptions detailing these events. We propose SeMob, an LLM-powered semantic synthesis pipeline for dynamic mobility prediction. Specifically, SeMob employs a multi-agent framework where LLM-based agents automatically extract and reason about spatiotemporally related text from complex online texts. Fine-grained relevant contexts are then incorporated with spatiotemporal data through our proposed innovative progressive fusion architecture. The rich pre-trained event prior contributes enriched insights about event-driven prediction, and hence results in a more aligned forecasting model. Evaluated on a dataset constructed through our pipeline, SeMob achieves maximal reductions of 13.92% in MAE and 11.12% in RMSE compared to the spatiotemporal model. Notably, the framework exhibits pronounced superiority especially within spatiotemporal regions close to an event's location and time of occurrence.

[23] A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering

Jiaqing Xie

Main category: cs.CL

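TL;DR: Proposes a steering strategy based on a single sparse autoencoder (SAE) latent (top-1) with token-wise decay, which improves language model performance on tasks such as mathematical reasoning and outperforms the mean activation difference method.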

Details

Motivation: Existing steering methods based on top-k SAE latents often capture non-semantic features such as punctuation, and constant-strength steering tends to cause problems such as repetitive output; a single semantically relevant latent has not been exploited effectively. Method: Selects the single most semantically relevant SAE latent (top-1) for steering and designs a token-wise decaying steering strength, which avoids degenerate output and allows a fair comparison with the mean activation difference method. Result: On mathematical reasoning tasks the method clearly outperforms the mean activation difference baseline, and it performs on par on IF-Eval; it reliably elicits step-by-step reasoning, with an effect similar to appending a guiding token. Conclusion: Focusing on a single semantically relevant SAE latent and combining it with a decay strategy steers language model reasoning more efficiently and precisely, supporting the advantage of SAEs for model steering. Abstract: Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic features such as punctuation rather than semantic attributes like instructions. To address this, we propose focusing on a single, most relevant SAE latent (top-1), eliminating redundant features. We further identify a limitation in constant SAE steering, which often produces degenerate outputs such as repetitive single words. To mitigate this, we introduce a token-wise decaying steering strategy, enabling more faithful comparisons with mean activation difference baselines. Empirically, we show that steering an SAE latent associated with reasoning reliably elicits step-by-step mathematical reasoning and enhances inference quality, functionally resembling the effect of appending a guiding token. Our results demonstrate that SAEs outperform mean activation difference methods on mathematical reasoning benchmarks and match their performance on IF-Eval.
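
To make the idea of token-wise decaying steering concrete, here is a minimal sketch. The abstract does not give the decay schedule, so a geometric schedule `alpha0 * decay**t` is assumed; the function names, the default values, and the toy 3-dimensional vectors are all hypothetical, and the steering direction stands in for the decoder vector of the chosen SAE latent.

```python
import math

def decayed_strengths(num_tokens, alpha0=8.0, decay=0.9):
    """Steering strength per token position t (assumed schedule: alpha0 * decay**t)."""
    return [alpha0 * decay ** t for t in range(num_tokens)]

def steer(hidden_states, direction, alpha0=8.0, decay=0.9):
    """Add a unit-norm steering direction to each token's hidden state,
    scaled by a strength that shrinks with token position."""
    norm = math.sqrt(sum(x * x for x in direction))
    unit = [x / norm for x in direction]
    strengths = decayed_strengths(len(hidden_states), alpha0, decay)
    return [[h + a * u for h, u in zip(row, unit)]
            for row, a in zip(hidden_states, strengths)]

# Toy example: three tokens with 3-dimensional zero hidden states.
steered = steer([[0.0, 0.0, 0.0]] * 3, [0.0, 3.0, 4.0], alpha0=1.0, decay=0.5)
for row in steered:
    print(row)
```

With constant steering every row would receive the same push; here the push halves at each position, which is the kind of schedule that might reduce the repetitive outputs the abstract attributes to constant steering.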

[24] Let's Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models' Understanding of Sports

Punit Kumar Singh,Nishant Kumar,Akash Ghosh,Kunal Pasad,Khushi Soni,Manisha Jaishwal,Sriparna Saha,Syukron Abu Ishaq Alfarozi,Asres Temam Abagissa,Kitsuchart Pasupa,Haiqin Yang,Jose G Moreno

Main category: cs.CL

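TL;DR: Introduces CultSportQA, a benchmark for evaluating language models' understanding of traditional sports across 60 countries and 6 continents, with 33,000 multiple-choice questions spanning text and image modalities in three categories (history, rules, and scenarios), evaluated with several prompting methods on large, small, and multimodal language models.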

Details

Motivation: Language models are mostly evaluated on globally popular sports, overlooking regional and indigenous sporting traditions, which limits what is known about their multicultural understanding. Method: Builds CultSportQA, a multilingual, multimodal dataset of 33,000 multiple-choice questions covering traditional sports in 60 countries and 6 continents, with questions grouped into history, rule, and scenario types, and evaluates a range of language models with zero-shot, few-shot, and chain-of-thought prompting. Result: CultSportQA exposes shortcomings of current language models in understanding and reasoning about traditional sports, especially on regional and culture-specific knowledge. Conclusion: CultSportQA provides a new standard for evaluating AI understanding of traditional sports in multicultural, multilingual settings and encourages more culturally inclusive AI. Abstract: Language Models (LMs) are primarily evaluated on globally popular sports, often overlooking regional and indigenous sporting traditions. To address this gap, we introduce CultSportQA, a benchmark designed to assess LMs' understanding of traditional sports across 60 countries and 6 continents, encompassing four distinct cultural categories. The dataset features 33,000 multiple-choice questions (MCQs) across text and image modalities, each of which is categorized into three key types: history-based, rule-based, and scenario-based. To evaluate model performance, we employ zero-shot, few-shot, and chain-of-thought (CoT) prompting across a diverse set of Large Language Models (LLMs), Small Language Models (SLMs), and Multimodal Large Language Models (MLMs). By providing a comprehensive multilingual and multicultural sports benchmark, CultSportQA establishes a new standard for assessing AI's ability to understand and reason about traditional sports.

[25] SSTAG: Structure-Aware Self-Supervised Learning Method for Text-Attributed Graphs

Ruyue Liu,Rong Yin,Xiangzhen Bo,Xiaoshuai Hao,Yong Liu,Jinwen Zhong,Can Ma,Weiping Wang

Main category: cs.CL

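TL;DR: Proposes SSTAG, a structure-aware self-supervised learning method for text-attributed graphs that combines the strengths of large language models and graph neural networks to improve cross-domain transfer and scalability.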

Details

Motivation: Graph learning models are usually trained on a single graph dataset, transfer poorly across graphs and tasks, and depend on large amounts of labeled data; the heterogeneity of graph data (such as differing feature spaces and structural diversity) adds further difficulty. Method: Proposes SSTAG, which uses text as a unified representation medium to combine the semantic reasoning of LLMs with the structural modeling of GNNs; designs a dual knowledge distillation framework that co-distills the LLM and the GNN into a structure-aware MLP; and adds a memory mechanism that stores typical graph representations to strengthen generalization. Result: Experiments show that SSTAG outperforms state-of-the-art models on cross-domain transfer tasks, scales well, and lowers inference cost while remaining competitive. Conclusion: SSTAG bridges the strengths of LLMs and GNNs and offers an efficient, scalable self-supervised learning approach with strong generalization for text-attributed graphs. Abstract: Large-scale pretrained models have revolutionized Natural Language Processing (NLP) and Computer Vision (CV), showcasing remarkable cross-domain generalization abilities. However, in graph learning, models are typically trained on individual graph datasets, limiting their capacity to transfer knowledge across different graphs and tasks. This approach also heavily relies on large volumes of annotated data, which presents a significant challenge in resource-constrained settings. Unlike NLP and CV, graph-structured data presents unique challenges due to its inherent heterogeneity, including domain-specific feature spaces and structural diversity across various applications. To address these challenges, we propose a novel structure-aware self-supervised learning method for Text-Attributed Graphs (SSTAG). By leveraging text as a unified representation medium for graph learning, SSTAG bridges the gap between the semantic reasoning of Large Language Models (LLMs) and the structural modeling capabilities of Graph Neural Networks (GNNs). Our approach introduces a dual knowledge distillation framework that co-distills both LLMs and GNNs into structure-aware multilayer perceptrons (MLPs), enhancing the scalability of large-scale TAGs. Additionally, we introduce an in-memory mechanism that stores typical graph representations, aligning them with memory anchors in an in-memory repository to integrate invariant knowledge, thereby improving the model's generalization ability. Extensive experiments demonstrate that SSTAG outperforms state-of-the-art models on cross-domain transfer learning tasks, achieves exceptional scalability, and reduces inference costs while maintaining competitive performance.

[26] LOCA: Logical Chain Augmentation for Scientific Corpus Cleaning

You-Le Fang,Dong-Shan Jian,Xiang Li,Ce Meng,Ling-Shi Meng,Chen-Xu Yan,Zhi-Zhang Bian,Yan-Qing Ma

Main category: cs.CL

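TL;DR: LOCA (Logical Chain Augmentation) is a new framework for automatically cleaning scientific corpora; by filling in missing logical steps and separating scientific principles from their derivations, it substantially lowers the error rate of scientific QA datasets and thereby improves the reliability of scientific AI.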

Details

Motivation: Large language models perform well in general domains but are not reliable enough for scientific problem solving; existing scientific QA datasets have high error rates and logical leaps, so large, high-quality scientific corpora are needed to advance scientific AI. Method: Proposes the LOCA framework, which uses an augment-and-review loop to complete the logical chain of raw answers and to separate the underlying scientific principle explicitly from the subsequent derivation, enabling automated cleaning and refinement of scientific answers. Result: Applied to challenging scientific corpora, LOCA automatically filters noisy data, typically reducing the error rate from as high as 20% to below 2%. Conclusion: LOCA offers a scalable and effective way to build high-quality scientific corpora, laying a more reliable foundation for training and evaluating scientific AI. Abstract: While Large Language Models (LLMs) excel in general domains, their reliability often falls short in scientific problem-solving. The advancement of scientific AI depends on large-scale, high-quality corpora. However, existing scientific question-answering (QA) datasets suffer from high error rates, frequently resulting from logical leaps and implicit reasoning within the answers. To address this issue, we introduce LOCA (Logical Chain Augmentation), a novel framework for automatically cleaning scientific corpora, implemented through an augment-and-review loop. At its core, LOCA enhances raw answers by completing missing logical steps and explicitly separating the underlying scientific principle from its subsequent derivation. By applying LOCA to challenging scientific corpora, we demonstrate that it can automatically filter noisy datasets, typically reducing the error rate from as high as 20% to below 2%. LOCA provides a scalable and effective methodology for creating high-quality scientific corpora, paving the way for more reliable training and evaluation of scientific AI.

[27] GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages

Trung Duc Anh Dang,Ferdinando Pio D'Elia

Main category: cs.CL

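TL;DR: Presents a text detoxification system built on the 12-billion-parameter multilingual Gemma-3 model, combining LoRA fine-tuning with few-shot and chain-of-thought prompting to rewrite toxic sentences neutrally in 15 languages; the system ranks first on both high-resource and low-resource languages.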

Details

Motivation: Social media platforms evolve faster than the regulation meant to oversee them, so automated tools are needed to help moderators maintain safe discourse at scale. Method: Uses the 12B-parameter multilingual Gemma-3 transformer, fine-tuned efficiently with LoRA and combined with few-shot and chain-of-thought (CoT) prompting; the training data comprise human-annotated pairs, machine-translated pairs, and model-generated pairs filtered by Jaccard thresholds; at inference, inputs are enriched with LaBSE-retrieved neighbors and explicit toxic-span annotations. Result: The system scores well on style transfer accuracy, semantic preservation (LaBSE), and fluency (xCOMET), ranking first on both the high-resource and low-resource language leaderboards; ablations show a gain of +0.081 from few-shot examples and +0.088 from basic CoT prompting; ANOVA identifies language resource status as the strongest predictor of performance (η² = 0.667, p < 0.01). Conclusion: The multilingual detoxification system performs well across many languages, with prompt engineering contributing clear gains, which supports its effectiveness and scalability across resource levels. Abstract: As social-media platforms emerge and evolve faster than the regulations meant to oversee them, automated detoxification might serve as a timely tool for moderators to enforce safe discourse at scale. We here describe our submission to the PAN 2025 Multilingual Text Detoxification Challenge, which rewrites toxic single-sentence inputs into neutral paraphrases across 15 typologically diverse languages. Building on a 12B-parameter Gemma-3 multilingual transformer, we apply parameter-efficient LoRA supervised fine-tuning (SFT) and prompting techniques like few-shot and Chain-of-Thought. Our multilingual training corpus combines 3,600 human-authored parallel pairs, 21,600 machine-translated synthetic pairs, and model-generated pairs filtered by Jaccard thresholds. At inference, inputs are enriched with three LaBSE-retrieved neighbors and explicit toxic-span annotations. Evaluated via Style Transfer Accuracy, LaBSE-based semantic preservation, and xCOMET fluency, our system ranks first on high-resource and low-resource languages. Ablations show +0.081 joint score increase from few-shot examples and +0.088 from basic CoT prompting. ANOVA analysis identifies language resource status as the strongest predictor of performance (η² = 0.667, p < 0.01).
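
The Jaccard filtering of model-generated pairs can be sketched as follows. The abstract does not state the thresholds or the tokenization, so this example assumes whitespace tokens and a hypothetical keep-range of 0.3 to 0.9: outputs identical to the input (no rewrite) and outputs with almost no overlap (meaning likely lost) are both dropped.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercase whitespace-token sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

def filter_pairs(pairs, low=0.3, high=0.9):
    """Keep (source, rewrite) pairs whose similarity lies in [low, high].
    The thresholds are illustrative assumptions, not values from the paper."""
    return [(src, out) for src, out in pairs if low <= jaccard(src, out) <= high]

pairs = [
    ("this movie is damn awful", "this movie is awful"),       # similarity 0.8: kept
    ("this movie is damn awful", "this movie is damn awful"),  # similarity 1.0: unchanged copy
    ("this movie is damn awful", "nice weather today"),        # similarity 0.0: unrelated
]
kept = filter_pairs(pairs)
print(kept)
```

A two-sided range is one plausible reading of "thresholds" in the plural; a single lower bound would be an equally simple variant.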

[28] Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data

Carlo Bono,Federico Belotti,Matteo Palmonari

Main category: cs.CL

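TL;DR: Proposes a self-supervised method that estimates the uncertainty of LLMs on entity linking in tabular data from token-level features of a single inference pass, sharply reducing computational cost while still detecting low-accuracy outputs.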

Details

Motivation: LLMs lack an efficient, reliable way to estimate uncertainty in entity linking; schemes that require multiple inference passes are computationally expensive, which limits practical use. Method: Uses self-supervised learning to build an uncertainty estimator from token-level features of a single inference pass, avoiding the cost of multiple generations. Result: Evaluated across several LLMs on entity linking in tabular data, the method yields effective uncertainty estimates at very low computational cost and accurately identifies low-quality outputs. Conclusion: The method is an efficient, practical solution for uncertainty estimation in LLM-based entity linking and can be integrated into real workflows at low cost. Abstract: Linking textual values in tabular data to their corresponding entities in a Knowledge Base is a core task across a variety of data integration and enrichment applications. Although Large Language Models (LLMs) have shown state-of-the-art performance in Entity Linking (EL) tasks, their deployment in real-world scenarios requires not only accurate predictions but also reliable uncertainty estimates, which require resource-demanding multi-shot inference, posing serious limits to their actual applicability. As a more efficient alternative, we investigate a self-supervised approach for estimating uncertainty from single-shot LLM outputs using token-level features, reducing the need for multiple generations. Evaluation is performed on an EL task on tabular data across multiple LLMs, showing that the resulting uncertainty estimates are highly effective in detecting low-accuracy outputs. This is achieved at a fraction of the computational cost, ultimately supporting a cost-effective integration of uncertainty measures into LLM-based EL workflows. The method offers a practical way to incorporate uncertainty estimation into EL workflows with limited computational overhead.
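
To illustrate what token-level features from a single generation might look like, here is a small sketch. The abstract does not list the features the authors use, so the three shown here (mean log-probability, minimum log-probability, and perplexity of the generated tokens) are assumptions, as are the function name `token_features` and the toy probabilities.

```python
import math

def token_features(logprobs):
    """Summarize the log-probabilities of the generated tokens of one output.
    The feature set is an illustrative assumption."""
    mean_lp = sum(logprobs) / len(logprobs)
    return {
        "mean_logprob": mean_lp,
        "min_logprob": min(logprobs),        # the least confident token
        "perplexity": math.exp(-mean_lp),    # 1.0 means fully confident
    }

# Toy token probabilities for two hypothetical entity-linking outputs.
confident = token_features([math.log(0.9), math.log(0.8), math.log(0.95)])
uncertain = token_features([math.log(0.9), math.log(0.2), math.log(0.5)])
print(confident["perplexity"], uncertain["perplexity"])
```

Features like these could then be fed to a small classifier that predicts whether the linked entity is correct; that classifier and its self-supervised training signal are not shown here.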

[29] GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models

Mariam Mahran,Katharina Simbeck

Main category: cs.CL

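TL;DR: This study trains a GPT-style language model on the novels of Jane Austen and analyzes its hidden layers with sparse autoencoders (SAEs), showing how the model captures the social structures, themes, and biases in the text.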

Details

Motivation: As large language models are trained on massive, uncurated data, understanding model representations and the deeper structure of the data they internalize becomes increasingly important. Method: Trains a GPT-style transformer on a corpus consisting only of Jane Austen's novels and applies sparse autoencoders (SAEs) to several hidden layers to extract interpretable features. Result: Identifies sparse, interpretable features that reflect core narratives and concepts such as gender, class, and societal duty. Conclusion: LLMs combined with SAEs can serve as an effective tool for exploring complex datasets, discovering bias, and interpreting models at scale. Abstract: As large language models (LLMs) are increasingly trained on massive, uncurated corpora, understanding both model representations and the data they internalize has become a major challenge. In this work, we show that pairing LLMs with sparse autoencoders (SAEs) enables interpretation not only of model behavior but also of the deeper structures, themes, and biases embedded in the training data. We train a GPT-style transformer model exclusively on the novels of Jane Austen, a corpus rich in social constructs and narrative patterns. We then apply SAEs to hidden states across multiple layers, uncovering sparse, interpretable features that reflect the key narratives and concepts present in the corpus, including gender, class, and societal duty. Our findings demonstrate that LLMs combined with SAEs can act as scalable probes into complex datasets, offering a new path for corpus exploration, bias discovery, and model interpretability at scale.

[30] Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs

Shree Harsha Bokkahalli Satish,Gustav Eje Henter,Éva Székely

Main category: cs.CL

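TL;DR: Studies how speech large language models (SpeechLLMs) behave on multiple-choice question answering (MCQA) bias benchmarks and whether that behavior generalizes to other task formats, especially long-form generation. The authors fine-tune three SpeechLLMs with LoRA to prefer stereotypical, anti-stereotypical, or neutral answers and find that MCQA bias results do not reliably predict behavior on other MCQA or long-form tasks, indicating that current MCQA bias benchmarks generalize poorly across tasks; they propose a broader evaluation suite for measuring behavior transferability.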

Details

Motivation: Fairness evaluation of SpeechLLMs relies mainly on the multiple-choice question answering (MCQA) format, which assumes that model behavior is consistent across tasks and voice inputs. This assumption is untested, and it is unclear whether it holds for more realistic long-form generation, so the paper sets out to check it. Method: The authors fine-tune three SpeechLLMs with LoRA adapters so that, on MCQA tasks, they favor stereotypical, anti-stereotypical, or neutral/uncertain answers. They then evaluate whether these behaviors transfer to a different MCQA benchmark and to long-form creative generation tasks, testing the cross-task generalization of bias behavior. Result: Performance on one MCQA bias benchmark does not reliably predict performance on another MCQA task, and predicts behavior on long-form generation even less well, which indicates that current MCQA-based bias evaluation lacks cross-task consistency and generalization. Conclusion: Current MCQA-based bias evaluation of SpeechLLMs offers limited evidence of cross-task generalization and does not adequately reflect bias in realistic settings. The study calls for more representative evaluation methods and proposes a new evaluation suite for measuring behavior transferability. Abstract: Recent work in benchmarking bias and fairness in speech large language models (SpeechLLMs) has relied heavily on multiple-choice question answering (MCQA) formats. The model is tasked to choose between stereotypical, anti-stereotypical, or neutral/irrelevant answers given an input speech prompt and an optional text prompt. Such MCQA benchmarks implicitly assume that model performance is consistent across other MCQA tasks, voices, and other task formats such as more realistic, long-form evaluations. In this paper, we probe that assumption. We fine-tune three SpeechLLMs using LoRA adapters to induce specific MCQA behaviours: preference for stereotypical, anti-stereotypical, or neutral/uncertain answers. We then evaluate whether these behaviours generalise to another, distinct MCQA benchmark, and more critically to long-form, creative generation tasks. Our results show that performance on MCQA bias benchmarks fails to reliably predict performance on other MCQA benchmarks, and more importantly on long-form tasks. We conclude that current MCQA bias benchmarks show limited evidence of cross-task generalisation in the speech domain, and also propose an evaluation suite for measuring behaviour transferability in future models and benchmarks.

Yunlang Dai,Emma Lurie,Danaé Metaxa,Sorelle A. Friedler

Main category: cs.CL

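TL;DR: Introduces AI Watchman, a public auditing system for monitoring and tracking large language model (LLM) refusals over time; it reveals how company content moderation policies shape LLM output and, through an empirical analysis across models and languages, demonstrates its value for LLM transparency.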

Details

Motivation: LLM outputs are shaped by opaque, frequently changing company content moderation policies, which influence public discourse in particular by refusing to generate certain content, so a public, continuous auditing system is needed to improve transparency. Method: Builds AI Watchman, a longitudinal auditing system that uses a dataset of more than 400 social issues to audit OpenAI's GPT-4.1 and GPT-5 and DeepSeek (in English and Chinese), and analyzes how refusal patterns change over time. Result: The system detects company policy changes that were not publicly announced, identifies differences in content moderation between companies and models, and supports a qualitative categorization of refusal forms. Conclusion: Longitudinal auditing is valuable for understanding how LLM content moderation evolves, and AI Watchman is a workable example of a public system for doing so. Abstract: Large language models' (LLMs') outputs are shaped by opaque and frequently-changing company content moderation policies and practices. LLM moderation often takes the form of refusal; models' refusal to produce text about certain topics both reflects company policy and subtly shapes public discourse. We introduce AI Watchman, a longitudinal auditing system to publicly measure and track LLM refusals over time, to provide transparency into an important and black-box aspect of LLMs. Using a dataset of over 400 social issues, we audit OpenAI's moderation endpoint, GPT-4.1, and GPT-5, and DeepSeek (both in English and Chinese). We find evidence that changes in company policies, even those not publicly announced, can be detected by AI Watchman, and identify company- and model-specific differences in content moderation. We also qualitatively analyze and categorize different forms of refusal. This work contributes evidence for the value of longitudinal auditing of LLMs, and AI Watchman, one system for doing so.

[32] RJE: A Retrieval-Judgment-Exploration Framework for Efficient Knowledge Graph Question Answering with LLMs

Can Lin,Zhengwang Jiang,Ling Zheng,Qi Zhao,Yuhang Zhang,Qi Song,Wangqiu Zhou

Main category: cs.CL

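TL;DR: Proposes Retrieval-Judgment-Exploration (RJE), a framework for knowledge graph question answering (KGQA) that retrieves, judges, and explores reasoning paths and adds specialized auxiliary modules, so that even small language models can answer efficiently and accurately while reducing LLM calls and token usage.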

Details

Motivation: Existing KGQA methods are limited either by the quality of retrieved information or by reliance on proprietary large models, which makes them inefficient and hard to scale; a framework is needed that improves small language models and reduces resource use. Method: Designs the RJE framework with three stages: retrieve reasoning paths, judge whether they are sufficient, and conditionally explore new evidence. Adds three auxiliary modules (Reasoning Path Ranking, Question Decomposition, and Retriever-assisted Exploration) to strengthen small language models. Result: Experiments show that RJE with models such as GPT-4o-mini outperforms existing methods and lets 3B/8B open-source models reach competitive results without fine-tuning, while substantially reducing LLM calls and token usage. Conclusion: RJE improves the efficiency and scalability of KGQA and offers a practical path for applying small language models in this area. Abstract: Knowledge graph question answering (KGQA) aims to answer natural language questions using knowledge graphs. Recent research leverages large language models (LLMs) to enhance KGQA reasoning, but faces limitations: retrieval-based methods are constrained by the quality of retrieved information, while agent-based methods rely heavily on proprietary LLMs. To address these limitations, we propose Retrieval-Judgment-Exploration (RJE), a framework that retrieves refined reasoning paths, evaluates their sufficiency, and conditionally explores additional evidence. Moreover, RJE introduces specialized auxiliary modules enabling small-sized LLMs to perform effectively: Reasoning Path Ranking, Question Decomposition, and Retriever-assisted Exploration. Experiments show that our approach with proprietary LLMs (such as GPT-4o-mini) outperforms existing baselines while enabling small open-source LLMs (such as 3B and 8B parameters) to achieve competitive results without fine-tuning LLMs. Additionally, RJE substantially reduces the number of LLM calls and token usage compared to agent-based methods, yielding significant efficiency improvements.
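
The retrieve, judge, and conditionally explore control flow described for RJE can be sketched as a small loop. This is a toy under stated assumptions, not the authors' implementation: the four callables stand in for a retriever and LLM prompts, the two-triple knowledge graph and the structured "question" format are hypothetical, and the call counter simply shows why stopping early once evidence is judged sufficient saves LLM calls.

```python
def answer_with_rje(question, retrieve, judge, explore, answer, max_rounds=3):
    """Retrieve evidence once, then alternate judging and exploring
    until the evidence is judged sufficient or the round budget is spent."""
    evidence = retrieve(question)
    llm_calls = 0
    for _ in range(max_rounds):
        llm_calls += 1                      # one judgment call per round
        if judge(question, evidence):
            break                           # sufficient: skip further exploration
        evidence = evidence + explore(question, evidence)
    llm_calls += 1                          # final answer-generation call
    return answer(question, evidence), llm_calls

# Hypothetical toy knowledge graph and stub components.
TOY_KG = {("Paris", "capital_of"): "France", ("France", "located_in"): "Europe"}

def retrieve(question):
    entity, relations = question
    return [(entity, relations[0], TOY_KG[(entity, relations[0])])]

def judge(question, evidence):
    return len(evidence) == len(question[1])    # one triple per required hop

def explore(question, evidence):
    head = evidence[-1][2]
    relation = question[1][len(evidence)]
    return [(head, relation, TOY_KG[(head, relation)])]

def answer(question, evidence):
    return evidence[-1][2]

result, calls = answer_with_rje(("Paris", ["capital_of", "located_in"]),
                                retrieve, judge, explore, answer)
print(result, calls)
```

For a one-hop question the loop would stop after the first judgment, so the number of calls grows with question difficulty rather than being fixed in advance.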

[33] Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse

Nathan Junzi Chen

Main category: cs.CL

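TL;DR: Uses zero-shot classification to evaluate political bias in six mainstream large language models, finds a widespread liberal-authoritarian leaning, and discusses its effect on public discourse.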

Details

Motivation: Generative AI increasingly dominates political discourse, yet skewed training data, human prejudice, and algorithmic flaws give rise to intrinsic political bias that needs systematic evaluation. Method: Uses zero-shot classification with four metrics (ideological alignment, topicality, response sentiment, and objectivity), feeding 1800 model responses into four fine-tuned classification algorithms to assess bias. Result: All six LLMs show an amplified liberal-authoritarian alignment, with notable instances of reasoning supersessions and canned refusals, and the bias may influence public discourse through human-computer interaction. Conclusion: Intrinsic political bias in LLMs can distort the political landscape and may lead to conformity or polarization, depending on a region's socio-political structures. Abstract: Amidst the rapid normalization of generative artificial intelligence (GAI), intelligent systems have come to dominate political discourse across information mediums. However, internalized political biases stemming from training data skews, human prejudice, and algorithmic flaws continue to plague the novel technology. This paper employs a zero-shot classification approach to evaluate algorithmic political partisanship through a methodical combination of ideological alignment, topicality, response sentiment, and objectivity. A total of 1800 model responses across six mainstream large language models (LLMs) were individually input into four distinct fine-tuned classification algorithms, each responsible for computing an aforementioned bias evaluation metric. Results show an amplified liberal-authoritarian alignment across all six LLMs evaluated, with notable instances of reasoning supersessions and canned refusals. The study subsequently highlights the psychological influences underpinning human-computer interactions and how intrinsic biases can permeate public discourse. The resulting distortion of the political landscape can ultimately manifest as conformity or polarization, depending on a region's pre-existing socio-political structures.

[34] In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b

Nils Durner

Main category: cs.CL

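TL;DR: Examines how language choice, sociopragmatic framing, and instruction hierarchy affect the refusal behavior of gpt-oss-20b, finding that composite prompts sharply raise assistance rates and that language and role-play affect information leakage.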

Details

Motivation: To understand the refusal behavior of an open-weights language model under different prompt framings, in order to improve safety and controllability. Method: Runs 80 seeded iterations per scenario across several harm domains, using composite prompts, different languages and registers, and role-play to analyze model responses, and introduces an AI-assisted hardening method. Result: Composite prompts raise the assistance rate on a ZIP-bomb task from 0% to 97.5%; formal registers in German and French leak more often than matched English prompts; a "Linux terminal" role-play causes context leakage, which the proposed AI-assisted hardening method reduces to 0% in several user-prompt variants; assistance is inconsistent in 13% of evaluation pairs; the OpenAI Moderation API misses some materially helpful outputs, and refusal rates differ by 5 to 10 percentage points across inference stacks. Conclusion: Prompt engineering strongly affects model behavior; the choice of language, role, and framing matters for safety; current moderation has limits; and reproducibility and evaluation consistency need strengthening. Abstract: We probe OpenAI's open-weights 20-billion-parameter model gpt-oss-20b to study how sociopragmatic framing, language choice, and instruction hierarchy affect refusal behavior. Across 80 seeded iterations per scenario, we test several harm domains including ZIP-bomb construction (cyber threat), synthetic card-number generation, minor-unsafe driving advice, drug-precursor indicators, and RAG context exfiltration. Composite prompts that combine an educator persona, a safety-pretext ("what to avoid"), and step-cue phrasing flip assistance rates from 0% to 97.5% on a ZIP-bomb task. On our grid, formal registers in German and French are often leakier than matched English prompts. A "Linux terminal" role-play overrides a developer rule not to reveal context in a majority of runs with a naive developer prompt, and we introduce an AI-assisted hardening method that reduces leakage to 0% in several user-prompt variants. We further test evaluation awareness with a paired-track design and measure frame-conditioned differences between matched "helpfulness" and "harmfulness" evaluation prompts; we observe inconsistent assistance in 13% of pairs. Finally, we find that the OpenAI Moderation API under-captures materially helpful outputs relative to a semantic grader, and that refusal rates differ by 5 to 10 percentage points across inference stacks, raising reproducibility concerns. We release prompts, seeds, outputs, and code for reproducible auditing at https://github.com/ndurner/gpt-oss-rt-run .

[35] OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language

Isa Inuwa-Dutse

Main category: cs.CL

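TL;DR: Testing in Hausa reveals serious safety weaknesses in the GPT-OSS-20b model, including cultural bias, factual errors, and inadequate safety alignment for low-resource languages, pointing to potential harm for underrepresented language communities.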

Details

Motivation: To question the reliability and safety of large language models for underrepresented users in low-resource language settings. Method: Red-teams the model in Hausa, using minimal prompting to elicit content, and validates the harmfulness of the outputs with a survey (n=61). Result: The model shows cultural insensitivity, factual errors (such as treating an insecticide as safe to eat), reward hacking (safety mechanisms relax under polite prompts), and a failure to distinguish raw from processed foods. Conclusion: These problems stem from insufficient safety tuning in low-resource languages and expose a blind spot in current red-teaming; multilingual safety alignment needs to be strengthened. Abstract: In response to the recent safety probing for OpenAI's GPT-OSS-20b model, we present a summary of a set of vulnerabilities uncovered in the model, focusing on its performance and safety alignment in a low-resource language setting. The core motivation for our work is to question the model's reliability for users from underrepresented communities. Using Hausa, a major African language, we uncover biases, inaccuracies, and cultural insensitivities in the model's behaviour. With minimal prompting, our red-teaming efforts reveal that the model can be induced to generate harmful, culturally insensitive, and factually inaccurate content in the language. As a form of reward hacking, we note how the model's safety protocols appear to relax when prompted with polite or grateful language, leading to outputs that could facilitate misinformation and amplify hate speech. For instance, the model operates on the false assumption that a common insecticide locally known as Fiya-Fiya (Cypermethrin) and a rodenticide like Shinkafar Bera (a form of Aluminium Phosphide) are safe for human consumption. To contextualise the severity of this error and popularity of the substances, we conducted a survey (n=61) in which 98% of participants identified them as toxic. Additional failures include an inability to distinguish between raw and processed foods and the incorporation of demeaning cultural proverbs to build inaccurate arguments. We surmise that these issues manifest through a form of linguistic reward hacking, where the model prioritises fluent, plausible-sounding output in the target language over safety and truthfulness. We attribute the uncovered flaws primarily to insufficient safety tuning in low-resource linguistic contexts. By concentrating on a low-resource setting, our approach highlights a significant gap in current red-teaming efforts and offers some recommendations.

[36] AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees

Hongyi Zhou,Jin Zhu,Pingfan Su,Kai Ye,Ying Yang,Shakeel A O B Gavioli-Akilagun,Chengchun Shi

Main category: cs.CL

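TL;DR: Proposes AdaDetectGPT, a new classifier for detecting whether text was written by a human or generated by a large language model (LLM). It adaptively learns a witness function from training data to strengthen logits-based detectors and clearly outperforms existing methods across many combinations of datasets and LLMs, with improvements of up to 58%.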

Details

Motivation: Existing logits-based detection methods rely only on log-probability statistics, which may be sub-optimal, so a more effective mechanism is needed to distinguish human-written from LLM-generated text. Method: Proposes AdaDetectGPT, a classifier that adaptively learns a witness function from training data to improve logits-based detectors, and provides statistical guarantees on its true positive, false positive, true negative, and false negative rates. Result: Across many combinations of datasets and LLMs, AdaDetectGPT improves on the state-of-the-art method almost uniformly, by up to 58%. Conclusion: AdaDetectGPT markedly improves the accuracy of text-source detection and is an effective improvement with theoretical guarantees; the code is open source. Abstract: We study the problem of determining whether a piece of text has been authored by a human or by a large language model (LLM). Existing state-of-the-art logits-based detectors make use of statistics derived from the log-probability of the observed text evaluated using the distribution function of a given source LLM. However, relying solely on log probabilities can be sub-optimal. In response, we introduce AdaDetectGPT -- a novel classifier that adaptively learns a witness function from training data to enhance the performance of logits-based detectors. We provide statistical guarantees on its true positive rate, false positive rate, true negative rate and false negative rate. Extensive numerical studies show AdaDetectGPT nearly uniformly improves the state-of-the-art method in various combinations of datasets and LLMs, and the improvement can reach up to 58%. A Python implementation of our method is available at https://github.com/Mamba413/AdaDetectGPT.

[37] Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection

Hoang Phan,Victor Li,Qi Lei

Main category: cs.CL

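TL;DR: This paper proposes Progressive Self-Reflection (PSR), an inference-time technique that makes large language model generation safer and substantially lowers attack success rates without additional training.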

Details

Motivation: Large language models perform well on natural language processing but can generate harmful content, so a method is needed that improves safety without retraining. Method: Proposes Progressive Self-Reflection (PSR), which lets the model monitor itself at inference time and correct its output dynamically, and introduces a lightweight predictor that adaptively chooses the number of reflection rounds to balance safety and efficiency. Result: Applying PSR to several models sharply lowers the attack success rate (for example, from 77.5% to 5.9% on Llama-3.1-8B-Instruct) while preserving the original performance on benign tasks. Conclusion: PSR is a scalable test-time safety method that allocates compute according to the risk of the input, keeping outputs safe without sacrificing efficiency. Abstract: Large language models (LLMs) have revolutionized natural language processing with their ability to generate coherent and contextually relevant text. However, their deployment raises significant concerns about the potential for generating harmful or inappropriate content. In this paper, we introduce Progressive Self-Reflection (PSR), a novel inference-time technique that empowers LLMs to self-monitor and correct their outputs dynamically. Experimental results demonstrate that applying our proposed method to Llama-3.1-8B-Instruct reduces the attack success rate from 77.5% to 5.9%, to Llama-3.1-8B base from 89.7% to 5.6%, and to Qwen2.5-7B-Instruct from 44.4% to 3.8%, without additional training, while maintaining their original performance on benign tasks. Our approach acts as a test-time scaling method, where additional self-reflection rounds enhance safety at the cost of inference overhead. To balance safety with computational efficiency, we introduce a lightweight self-reflection predictor that estimates the optimal number of reflection rounds based on input complexity. This adaptive mechanism prevents unnecessary self-assessment on benign inputs while ensuring thorough evaluation when encountering potentially harmful content. Our findings suggest that Progressive Self-Reflection serves as a scalable test-time approach, enhancing LLM safety by dynamically allocating computational resources in proportion to the input's risk profile.
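
The reflection loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under assumptions, not the authors' implementation: `next_chunk` and `looks_harmful` are hypothetical callables standing in for the LLM's decoding step and its self-check, `max_rounds` plays the role of the reflection budget that the paper's predictor would estimate, and replacing a flagged draft with a refusal is only one simple way to "correct" an output.

```python
from typing import Callable, List


def progressive_self_reflection(
    prompt: str,
    next_chunk: Callable[[str, str], str],      # hypothetical: (prompt, draft) -> next text piece, "" when done
    looks_harmful: Callable[[str, str], bool],  # hypothetical self-check: (prompt, draft) -> verdict
    max_rounds: int,                            # reflection budget (assumed to come from a predictor)
    refusal: str = "I can't help with that.",
) -> str:
    """Generate chunk by chunk, reflecting on the draft while budget remains."""
    draft = ""
    rounds_used = 0
    while True:
        piece = next_chunk(prompt, draft)
        if not piece:  # generation finished
            break
        draft += piece
        if rounds_used < max_rounds:
            rounds_used += 1
            if looks_harmful(prompt, draft):
                # Assumption: a flagged draft is replaced by a refusal.
                return refusal
    return draft


def make_chunker(chunks: List[str]) -> Callable[[str, str], str]:
    """Toy stand-in for a decoder: replays a fixed list of chunks."""
    it = iter(chunks)
    return lambda prompt, draft: next(it, "")


if __name__ == "__main__":
    flag = lambda prompt, draft: "UNSAFE" in draft
    print(progressive_self_reflection("q", make_chunker(["a ", "b"]), flag, max_rounds=2))
```

With `max_rounds=0` the loop never reflects, which mirrors the trade-off the abstract describes: more reflection rounds buy safety at the cost of extra inference.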

[38] TraceDet: Hallucination Detection from the Decoding Trace of Diffusion Large Language Models

Shenxu Chang,Junchi Yu,Weixing Wang,Yongqiang Chen,Jialin Yu,Philip Torr,Jindong Gu

Main category: cs.CL

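TL;DR: Proposes TraceDet, a new framework that uses the intermediate steps of the multi-step denoising process in diffusion large language models (D-LLMs) to detect hallucinations, substantially improving detection performance.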

Details

Motivation: Existing hallucination detection methods target auto-regressive LLMs (AR-LLMs) and are ill-suited to the hallucination signals that arise during the multi-step denoising process of D-LLMs, so a new detection mechanism is needed. Method: Models the D-LLM denoising process as an action trace, where each action is a prediction of the response conditioned on the previous intermediate output, and captures the key hallucination signals by identifying the sub-trace that is most informative about hallucinated responses. Result: Experiments on several open-source D-LLMs show that TraceDet improves average AUROC by 15.2% over baselines. Conclusion: TraceDet makes effective use of the intermediate denoising steps of D-LLMs and offers a reliable, general solution for hallucination detection in D-LLMs. Abstract: Diffusion large language models (D-LLMs) have recently emerged as a promising alternative to auto-regressive LLMs (AR-LLMs). However, the hallucination problem in D-LLMs remains underexplored, limiting their reliability in real-world applications. Existing hallucination detection methods are designed for AR-LLMs and rely on signals from single-step generation, making them ill-suited for D-LLMs where hallucination signals often emerge throughout the multi-step denoising process. To bridge this gap, we propose TraceDet, a novel framework that explicitly leverages the intermediate denoising steps of D-LLMs for hallucination detection. TraceDet models the denoising process as an action trace, with each action defined as the model's prediction over the cleaned response, conditioned on the previous intermediate output. By identifying the sub-trace that is maximally informative to the hallucinated responses, TraceDet leverages the key hallucination signals in the multi-step denoising process of D-LLMs for hallucination detection. Extensive experiments on various open-source D-LLMs demonstrate that TraceDet consistently improves hallucination detection, achieving an average gain in AUROC of 15.2% compared to baselines.

[39] LLM Based Sentiment Classification From Bangladesh E-Commerce Reviews

Sumaiya Tabassum

Main category: cs.CL

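TL;DR: This paper studies transformer-based BERT models and large language models (LLMs) for sentiment analysis of Bangladeshi e-commerce reviews. Models are fine-tuned on 4,000 Bangla and English reviews; Llama-3.1-8B performs best with 95.5% accuracy, and parameter-efficient fine-tuning methods such as LoRA and PEFT are shown to reduce computational overhead.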

Details

Motivation: The complexity of natural language and the diversity of multilingual settings challenge traditional sentiment analysis, especially for resource-limited languages such as Bangla, so efficient LLM-based approaches suited to reviews on local e-commerce platforms are worth exploring. Method: Fine-tunes several pretrained language models, including Llama, Phi, Mistral, DistilBERT, mBERT, and XLM-R, on a dataset of 4,000 Bangla and English e-commerce reviews, applying parameter-efficient fine-tuning techniques such as LoRA and PEFT to reduce computational cost. Result: The fine-tuned Llama-3.1-8B model outperforms the other models compared, with 95.5% accuracy, 93% precision, 88% recall, and a 90% F1 score, demonstrating strong performance on multilingual sentiment analysis. Conclusion: LLMs combined with parameter-efficient fine-tuning methods (such as LoRA and PEFT) can improve sentiment analysis for low-resource languages, are promising for deployment in resource-constrained environments, and offer a workable technical path for multilingual sentiment analysis. Abstract: Sentiment analysis is an essential part of text analysis, which is a larger field that includes determining and evaluating the author's emotional state. This method is essential since it makes it easier to comprehend consumers' feelings, viewpoints, and preferences holistically. The introduction of large language models (LLMs), such as Llama, has greatly increased the availability of cutting-edge model applications, such as sentiment analysis. However, accurate sentiment analysis is hampered by the intricacy of written language and the diversity of languages used in evaluations. The viability of using transformer-based BERT models and other LLMs for sentiment analysis from Bangladesh e-commerce reviews is investigated in this paper. A subset of 4000 samples from the original dataset of Bangla and English customer reviews was utilized to fine-tune the model. The fine-tuned Llama-3.1-8B model outperformed other fine-tuned models, including Phi-3.5-mini-instruct, Mistral-7B-v0.1, DistilBERT-multilingual, mBERT, and XLM-R-base, with an overall accuracy, precision, recall, and F1 score of 95.5%, 93%, 88%, and 90%, respectively. The study emphasizes how parameter-efficient fine-tuning methods (LoRA and PEFT) can lower computational overhead and make it appropriate for contexts with limited resources. The results show how LLMs can

[40] TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture

Yongchao Chen,Jiefeng Chen,Rui Meng,Ji Yin,Na Li,Chuchu Fan,Chi Wang,Tomas Pfister,Jinsung Yoon

Main category: cs.CL

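TL;DR: This paper proposes TUMIX, an ensemble framework that strengthens LLM reasoning by running multiple agents in parallel, each with a different tool-use strategy, and that clearly outperforms existing methods on key reasoning benchmarks.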

Details

Motivation: Although tools such as code interpreters and search have been integrated into LLMs, practical guidance is still lacking on how to combine textual reasoning, coding, and search effectively for diverse questions. Method: Proposes the TUMIX framework, in which multiple agents run in parallel with different tool-use strategies and iteratively share and refine their responses based on the question and previous answers. Result: On Gemini-2.5-Pro and Gemini-2.5-Flash, TUMIX improves average accuracy by up to 3.55% over the best baseline at near-equal inference cost; confidence-based halting reduces inference cost to 49% without loss of performance. Conclusion: Agent diversity and quality are crucial to performance and can be improved further by using LLMs to auto-optimize agent designs; TUMIX achieves a good trade-off between performance and cost. Abstract: While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.
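
The share-and-refine loop with early halting might look something like the sketch below. Several pieces are assumptions made for illustration: each agent is a hypothetical callable `(question, previous_answers) -> answer`, agents are called sequentially rather than in parallel, "sufficient confidence" is approximated by the fraction of agents that agree, and the final answer is chosen by majority vote. The abstract does not specify these details.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical agent interface: (question, answers from the previous round) -> answer
Agent = Callable[[str, Sequence[str]], str]


def tool_use_mixture(
    question: str,
    agents: List[Agent],
    max_rounds: int = 3,
    agreement: float = 0.8,  # assumed confidence proxy: share of agents that agree
) -> Tuple[str, int]:
    """Run refinement rounds; stop early once enough agents agree.

    Returns the selected answer and the number of rounds actually used.
    Assumes at least one agent and max_rounds >= 1.
    """
    answers: List[str] = []
    for round_no in range(1, max_rounds + 1):
        # Every agent sees the question plus all answers from the previous round.
        answers = [agent(question, tuple(answers)) for agent in agents]
        top, votes = Counter(answers).most_common(1)[0]
        if votes / len(agents) >= agreement:
            return top, round_no  # early halt saves the remaining rounds
    return Counter(answers).most_common(1)[0][0], max_rounds


if __name__ == "__main__":
    # Toy agents: one starts wrong and then follows the previous majority.
    text_agent = lambda q, prev: "41" if not prev else Counter(prev).most_common(1)[0][0]
    code_agent = lambda q, prev: "42"
    search_agent = lambda q, prev: "42"
    print(tool_use_mixture("6 * 7?", [text_agent, code_agent, search_agent]))
```

Lowering `agreement` halts earlier and spends fewer rounds, which is the kind of cost saving the abstract attributes to confidence-based halting.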

[41] Evaluation Sheet for Deep Research: A Use Case for Academic Survey Writing

Israel Abebe Azime,Tadesse Destaw Belay,Atnafu Lambebo Tonja

Main category: cs.CL

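TL;DR: This paper introduces an evaluation sheet for assessing the capabilities of deep research tools and, using academic survey writing as the task, evaluates how OpenAI's and Google's Deep Search perform at generating academic surveys, revealing shortcomings in how well current tools cover the target area.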

Details

Motivation: Assessing how agentic large language models perform on knowledge-intensive tasks, and deep research tools in particular, requires systematic evaluation standards. Method: Proposes an evaluation sheet and, taking academic survey writing as the use-case task, evaluates the reports generated by OpenAI's and Google's Deep Search tools. Result: The evaluation shows that current deep research tools fall clearly short of covering the target research area comprehensively, and that a significant gap remains relative to traditional search engines. Conclusion: Carefully designed evaluation standards are needed to drive the development of deep research tools, which still have room to improve at producing comprehensive, accurate academic surveys. Abstract: Large Language Models (LLMs) powered with agentic capabilities are able to do knowledge-intensive tasks without human involvement. A prime example of such a tool is Deep Research, with the capability to browse the web, extract information and generate multi-page reports. In this work, we introduce an evaluation sheet that can be used for assessing the capability of Deep Research tools. In addition, we selected academic survey writing as a use case task and evaluated output reports based on the evaluation sheet we introduced. Our findings show the need to have carefully crafted evaluation standards. The evaluation done on OpenAI's Deep Search and Google's Deep Search in generating an academic survey showed a huge gap between search engines and standalone Deep Research tools, as well as shortcomings in representing the targeted area.

[42] HiSpec: Hierarchical Speculative Decoding for LLMs

Avinash Kumar,Sujay Sanghavi,Poulami Das

Main category: cs.CL

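TL;DR: Proposes HiSpec, a framework that uses early-exit (EE) models for low-overhead intermediate verification to raise speculative decoding throughput, with a 1.28× average speedup and up to 2.01×, without loss of accuracy.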

Details

Motivation: Existing intermediate verification methods for speculative decoding suffer from high training overhead, a large memory footprint, and accuracy loss from approximate heuristics, so an efficient, accurate, and resource-friendly intermediate verification mechanism is needed. Method: Uses early-exit (EE) models as the intermediate verifier, designs a way to reuse key-value caches and hidden states, and periodically checks the intermediate verifier's results against the target model to cut compute and memory overhead while preserving accuracy. Result: Experiments on multiple benchmarks and models show that HiSpec raises throughput by 1.28× on average and by up to 2.01× over baseline single-layer speculative decoding, without sacrificing generation accuracy. Conclusion: By combining early-exit models with cache reuse, HiSpec achieves efficient and accurate intermediate verification, substantially raises LLM inference throughput, and offers a practical, scalable solution for speculative decoding. Abstract: Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is 4× slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. "Intermediate" verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose Hierarchical Speculative Decoding (HiSpec), a framework for high-throughput speculative decoding that exploits early-exit (EE) models for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by 1.28× on average and by up to 2.01× compared to the baseline single-layer speculation without compromising accuracy.

[43] TAG-EQA: Text-And-Graph for Event Question Answering via Structured Prompting Strategies

Maithili Kadam,Francis Ferraro

Main category: cs.CL

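TL;DR: Proposes TAG-EQA, a framework that injects causal event graphs into LLM inputs to improve causal and temporal reasoning in event question answering, with notable accuracy gains across several settings.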

Details

Motivation: Large language models perform poorly on event-based questions, especially those that require causal or temporal reasoning, so a method is needed that strengthens their reasoning without fine-tuning. Method: Designs the TAG-EQA prompting framework, which converts structured relations into natural-language statements and combines three prompting strategies (zero-shot, few-shot, chain-of-thought) with three input modalities (text-only, graph-only, text+graph) to analyze systematically how structured knowledge affects reasoning. Result: On the TORQUESTRA benchmark, TAG-EQA improves accuracy by 5% on average over text-only baselines, by up to 12% in zero-shot settings, and by up to 18% with graph-augmented chain-of-thought prompting. Conclusion: Causal event graphs can strengthen LLM reasoning in event question answering without fine-tuning, offering a flexible way to inject structured knowledge. Abstract: Large language models (LLMs) excel at general language tasks but often struggle with event-based questions, especially those requiring causal or temporal reasoning. We introduce TAG-EQA (Text-And-Graph for Event Question Answering), a prompting framework that injects causal event graphs into LLM inputs by converting structured relations into natural-language statements. TAG-EQA spans nine prompting configurations, combining three strategies (zero-shot, few-shot, chain-of-thought) with three input modalities (text-only, graph-only, text+graph), enabling a systematic analysis of when and how structured knowledge aids inference. On the TORQUESTRA benchmark, TAG-EQA improves accuracy by 5% on average over text-only baselines, with gains up to 12% in zero-shot settings and 18% when graph-augmented CoT prompting is effective. While performance varies by model and configuration, our findings show that causal graphs can enhance event reasoning in LLMs without fine-tuning, offering a flexible way to encode structure in prompt-based QA.
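
To make the nine configurations concrete, here is a small sketch of how graph edges could be verbalized and combined with the three strategies and three modalities. The edge format, the prompt wording, and the helper names (`verbalize`, `build_prompt`) are illustrative assumptions; the paper's actual templates are not given in the abstract.

```python
from itertools import product
from typing import List, Sequence, Tuple

Edge = Tuple[str, str, str]  # assumed format: (source event, relation, target event)

STRATEGIES = ("zero-shot", "few-shot", "chain-of-thought")
MODALITIES = ("text-only", "graph-only", "text+graph")
CONFIGS = list(product(STRATEGIES, MODALITIES))  # 3 x 3 = 9 configurations


def verbalize(edges: Sequence[Edge]) -> List[str]:
    """Turn structured relations into natural-language statements."""
    return [f"{src} {rel} {dst}." for src, rel, dst in edges]


def build_prompt(
    passage: str,
    edges: Sequence[Edge],
    question: str,
    strategy: str,
    modality: str,
    demos: Sequence[str] = (),  # worked examples, used only for few-shot
) -> str:
    parts: List[str] = []
    if strategy == "few-shot":
        parts.extend(demos)
    if modality in ("text-only", "text+graph"):
        parts.append("Passage: " + passage)
    if modality in ("graph-only", "text+graph"):
        parts.append("Event relations: " + " ".join(verbalize(edges)))
    parts.append("Question: " + question)
    if strategy == "chain-of-thought":
        parts.append("Let's think step by step.")  # assumed CoT cue
    return "\n".join(parts)


if __name__ == "__main__":
    edges = [("the storm", "causes", "the flood")]
    for strategy, modality in CONFIGS:
        print(f"--- {strategy} / {modality}")
        print(build_prompt("A storm hit the town.", edges, "What caused the flood?",
                           strategy, modality))
```

Enumerating the grid with `itertools.product` keeps each configuration a plain `(strategy, modality)` pair, which makes it easy to report accuracy per cell.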

[44] A-VERT: Agnostic Verification with Embedding Ranking Targets

Nicolás Aguirre,Ramiro Caso,Ramiro Rodríguez Colmeiro,Mauro Santelli,Joaquín Toranzo Calderón

Main category: cs.CL

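TL;DR: Proposes a structure-free evaluation method based on semantic embedding distances for low-cost, high-accuracy automatic classification of language model responses.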

Details

Motivation: Existing methods for evaluating language model responses are either too expensive (such as LLM-as-a-Judge) or far from real-world conditions (such as string matching and logprob), so a more efficient and realistic approach is needed. Method: Uses small embedding models (fewer than 10 billion parameters) to compute semantic embedding distances and match generated text against target candidates, automatically classifying LM responses. Result: Tested on 3 datasets and 3 different LM architectures, the method reaches a regression score of about 0.97 and an accuracy of about 96% against human annotators. Conclusion: This structure-free embedding-distance method delivers evaluations that agree closely with human annotation at much lower compute cost and is suitable for automatic evaluation of LM responses in practical settings. Abstract: The automatic evaluation of Language Model (LM) responses is a critical piece in the development of benchmarks and metrics, both for model training and quality assessment of production model endpoints. The current approaches to response classification rely on methods that are too expensive (i.e. LLM-as-a-Judge) or that are far from real-world conditions (string-matching, logprob). In this paper, a structure-free evaluation method is presented. The method makes use of semantic embedding distances to match target candidates with arbitrary LM-generated text, resulting in a robust classification of the response at a relatively low compute cost (embedding models of less than 10B parameters). The results show a regression score of ~0.97 and an accuracy of ~96% against human annotators, tested over 3 data sets and 3 different LM architectures.
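
The core idea, matching a free-form response to target candidates by embedding similarity, can be illustrated with the toy sketch below. It is not the authors' pipeline: `toy_embed` is a bag-of-words stand-in for a real embedding model, and scoring each label by its best-matching candidate is an assumption made for the example.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return {tok: float(n) for tok, n in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(
    response: str,
    targets: Dict[str, List[str]],  # label -> candidate target texts
    embed: Callable[[str], Vector] = toy_embed,
) -> str:
    """Return the label whose best candidate is closest to the response."""
    r = embed(response)
    scores = {
        label: max(cosine(r, embed(c)) for c in candidates)
        for label, candidates in targets.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    targets = {
        "correct": ["the answer is paris"],
        "wrong": ["the answer is london", "the answer is rome"],
    }
    print(classify("I believe the answer is Paris", targets))
```

Because the comparison happens in embedding space, the response does not need to follow any fixed output format, which is what "structure-free" refers to in the abstract.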

[45] One More Question is Enough, Expert Question Decomposition (EQD) Model for Domain Quantitative Reasoning

Mengyu Wang,Sotirios Sabanis,Miguel de Carvalho,Shay B. Cohen,Tiejun Ma

Main category: cs.CL

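TL;DR: This paper proposes Expert Question Decomposition (EQD), a method for improving the quantitative reasoning of large language models on complex question answering in specialized domains, finance in particular. It trains efficiently through a two-step fine-tuning framework and a reward function based on sub-question effectiveness, needs only a few thousand examples and a single A100 GPU, and has inference time comparable to zero-shot prompting.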

Details

Motivation: Quantitative reasoning remains difficult for large language models in domains that demand expert knowledge and complex reasoning, such as finance, and existing methods struggle to balance the use of domain knowledge against computational efficiency. Method: Proposes Expert Question Decomposition (EQD), which uses a two-step fine-tuning framework and a reward function that measures how much generated sub-questions improve QA outcomes; it is fine-tuned on a small amount of data and runs on standard hardware. Result: On four financial benchmark datasets, EQD improves QA performance by 0.6% to 10.5% across different LLMs and outperforms state-of-the-art domain-tuned models and advanced prompting strategies; a single supporting question is found to be more effective than detailed guidance steps. Conclusion: EQD significantly improves domain-specific complex QA while remaining computationally efficient, and it shows the advantage of concise question decomposition for reasoning in specialized domains. Abstract: Domain-specific quantitative reasoning remains a major challenge for large language models (LLMs), especially in fields requiring expert knowledge and complex question answering (QA). In this work, we propose Expert Question Decomposition (EQD), an approach designed to balance the use of domain knowledge with computational efficiency. EQD is built on a two-step fine-tuning framework and guided by a reward function that measures the effectiveness of generated sub-questions in improving QA outcomes. It requires only a few thousand training examples and a single A100 GPU for fine-tuning, with inference time comparable to zero-shot prompting. Beyond its efficiency, EQD outperforms state-of-the-art domain-tuned models and advanced prompting strategies. We evaluate EQD in the financial domain, characterized by specialized knowledge and complex quantitative reasoning, across four benchmark datasets. Our method consistently improves QA performance by 0.6% to 10.5% across different LLMs. Our analysis reveals an important insight: in domain-specific QA, a single supporting question often provides greater benefit than detailed guidance steps.

[46] ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning

Haochen You,Baojing Liu

Main category: cs.CL

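TL;DR: ReSSFormer is a recursive sparse structured Transformer whose three innovations address the limitations of conventional Transformers in long-context reasoning, computational efficiency, and structural generalization.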

Details

Motivation: Conventional Transformers struggle with long-context reasoning, computational efficiency, and structural generalization because of rigid layer stacking, dense attention, and reliance on positional encodings. Method: Introduces the Recurrent Reasoning & Memory Unit (R2MU) for iterative reasoning, the Adaptive Sparse Attention Module (ASAM) for more efficient context selection, and the Self-Organizing Encoder Structure (SOES) for structure modeling without positional encodings. Result: On language modeling, multi-hop QA, and structure-sensitive tasks, ReSSFormer consistently outperforms strong baselines at comparable FLOPs and parameter counts. Conclusion: Through recurrent reasoning, sparse attention, and self-organizing structure, ReSSFormer yields a more efficient, scalable, and structurally flexible Transformer architecture. Abstract: While Transformer architectures have demonstrated impressive scalability across domains, they continue to face challenges in long-context reasoning, computational efficiency, and structural generalization - largely due to rigid layer stacking, dense attention, and reliance on positional encodings. We present ReSSFormer, a Recursive Sparse Structured Transformer that integrates three complementary innovations: Recurrent Reasoning & Memory Unit (R2MU) for iterative reasoning with bounded depth, Adaptive Sparse Attention Module (ASAM) for efficient and focused context selection, and Self-Organizing Encoder Structure (SOES) for position-free structure induction. ReSSFormer replaces conventional depth stacking with recurrent inference, substitutes full attention with token- and expert-level sparsity, and models latent token topology directly from content. Across language modeling, multi-hop QA, and structure-sensitive tasks, ReSSFormer consistently outperforms strong baselines under comparable FLOPs and parameter budgets, highlighting its scalability, efficiency, and structural flexibility.

[47] CLUE: Non-parametric Verification from Experience via Hidden-State Clustering

Zhenwen Liang,Ruosen Li,Yujun Zhou,Linfeng Song,Dian Yu,Xinya Du,Haitao Mi,Dong Yu

Main category: cs.CL

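TL;DR: This paper proposes CLUE, a new verification method based on the internal hidden states of large language models. It judges output correctness from geometrically separable features in the trajectory of hidden activations and markedly improves reasoning accuracy with no trainable parameters.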

Details

Motivation: Existing methods for assessing LLM output quality rely on text-level information or on confidence from token probabilities; they tend to overfit or depend on how well the model is calibrated, so a more fundamental internal representation is needed as a unified basis for verification. Method: Proposes CLUE (Clustering and Experience-based Verification), which summarizes each reasoning trace by a hidden-state delta and classifies correctness by nearest-centroid distance to "success" and "failure" cluster centers built from past experience, with no trainable parameters. Result: CLUE outperforms LLM-as-a-judge baselines on AIME 24/25 and GPQA, matches or exceeds modern confidence-based methods, and improves both top-1 and majority-vote accuracy when reranking candidates; on AIME 24 with a 1.5B model it raises accuracy from 56.7% (majority@64) to 70.0% (top-maj@16). Conclusion: Hidden-state trajectories carry a rich correctness signal and can serve as an effective basis for non-parametric verification; CLUE's simple design highlights how discriminative that signal is. Abstract: Assessing the quality of Large Language Model (LLM) outputs presents a critical challenge. Previous methods either rely on text-level information (e.g., reward models, majority voting), which can overfit to superficial cues, or on calibrated confidence from token probabilities, which would fail on less-calibrated models. Yet both of these signals are, in fact, partial projections of a richer source of information: the model's internal hidden states. Early layers, closer to token embeddings, preserve semantic and lexical features that underpin text-based judgments, while later layers increasingly align with output logits, embedding confidence-related information. This paper explores hidden states directly as a unified foundation for verification. We show that the correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations. To validate this, we present CLUE (Clustering and Experience-based Verification), a deliberately minimalist, non-parametric verifier. With no trainable parameters, CLUE only summarizes each reasoning trace by a hidden-state delta and classifies correctness via nearest-centroid distance to "success" and "failure" clusters formed from past experience. The simplicity of this method highlights the strength of the underlying signal. Empirically, CLUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates, improving both top-1 and majority-vote accuracy across AIME 24/25 and GPQA. As a highlight, on AIME 24 with a 1.5B model, CLUE boosts accuracy from 56.7% (majority@64) to 70.0% (top-maj@16).
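
The nearest-centroid step is simple enough to sketch in a few lines. The sketch below is an illustration under assumptions rather than the paper's code: a trace is represented as a list of hidden-state vectors, the delta is taken as the last state minus the first (the abstract does not say how the delta is computed), and distances are Euclidean.

```python
import math
from typing import List, Sequence

Vec = List[float]


def delta(trace: Sequence[Vec]) -> Vec:
    """Summarize a reasoning trace by (last hidden state - first hidden state).

    Assumption: the abstract only says "hidden-state delta"; this is one reading.
    """
    return [end - start for start, end in zip(trace[0], trace[-1])]


def centroid(vectors: Sequence[Vec]) -> Vec:
    """Component-wise mean of a non-empty set of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def verify(trace: Sequence[Vec], success_c: Vec, failure_c: Vec) -> bool:
    """True if the trace's delta lies closer to the success centroid."""
    d = delta(trace)
    return math.dist(d, success_c) < math.dist(d, failure_c)


if __name__ == "__main__":
    # "Past experience": toy traces with known outcomes.
    good = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 3.0]]]
    bad = [[[0.0, 0.0], [-1.0, -1.0]]]
    success_c = centroid([delta(t) for t in good])
    failure_c = centroid([delta(t) for t in bad])
    print(verify([[0.0, 0.0], [0.5, 1.0]], success_c, failure_c))
```

Nothing here is trained: the two centroids are plain averages over previously labeled traces, which matches the non-parametric framing of the abstract.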

[48] A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation

Neal Gregory Lawton,Alfy Samuel,Anoop Kumar,Daben Liu

Main category: cs.CL

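TL;DR: This paper evaluates and compares several fine-tuning strategies for retrieval-augmented generation (RAG) pipelines, including independent, joint, and two-phase fine-tuning. The strategies perform similarly on the generation quality metrics (EM and F1) but differ significantly in computational cost. The best choice depends on whether the training data includes context labels and whether a grid search over the learning rates of the embedding and generator models is required.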

Details

Motivation: Different RAG fine-tuning strategies carry different costs and benefits but have not been compared systematically, so their trade-offs in performance and computational overhead need to be evaluated to guide practical choices. Method: Compares independent, joint, and two-phase fine-tuning, evaluating each on generation quality metrics such as EM and F1 across several tasks and analyzing the differences in computational cost. Result: All strategies achieve similar improvements in EM and F1, but their computational costs differ significantly. Joint fine-tuning is usually more efficient, but under some conditions (for example, when a learning-rate search is needed or context labels are missing) it may be less suitable than the other strategies. Conclusion: The best RAG fine-tuning strategy depends on the circumstances: joint fine-tuning is preferable when the training data includes context labels and no complex hyperparameter tuning is needed; otherwise independent or two-phase fine-tuning should be chosen according to resource limits and the data available. Abstract: (EMNLP 2025 Findings. Published: 20 Aug 2025, Last Modified: 17 Sept 2025. License: CC BY 4.0. Keywords: Retrieval-Augmented Generation (RAG), Large Language Models (LLMs), Fine-tuning, Question Answering, Joint fine-tuning.) Retrieval augmented generation (RAG) is a popular framework for question answering that is powered by two large language models (LLMs): an embedding model that retrieves context documents from a database that are relevant to a given question, and a generator model that uses the retrieved context to generate an answer to the question. Both the embedding and generator models can be fine-tuned to increase performance of a RAG pipeline on a new task, but multiple fine-tuning strategies exist with different costs and benefits. In this paper, we evaluate and compare several RAG fine-tuning strategies, including independent, joint, and two-phase fine-tuning. In our experiments, we observe that all of these strategies achieve about equal improvement in EM and F1 generation quality metrics, although they have significantly different computational costs. We conclude the optimal fine-tuning strategy to use depends on whether the training dataset includes context labels and whether a grid search over the learning rates for the embedding and generator models is required.
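
For readers unfamiliar with the two metrics, EM (exact match) and token-level F1 are commonly computed along the lines below. This follows the widely used SQuAD-style convention (lowercasing, stripping punctuation and English articles); whether the paper normalizes answers in exactly this way is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower.", "eiffel tower"))
    print(token_f1("Eiffel Tower in Paris", "Eiffel Tower"))
```

EM rewards only fully correct answers, while F1 gives partial credit for overlapping tokens, so reporting both shows whether a strategy produces near-misses or outright errors.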

[49] RAG-BioQA Retrieval-Augmented Generation for Long-Form Biomedical Question Answering

Lovely Yeswanth Panchumarthi,Sai Prasad Gudari,Atharva Negi,Praveen Raj Budime,Harsit Upadhya

Main category: cs.CL

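TL;DR: Proposes RAG-BioQA, a framework that combines retrieval-augmented generation with domain-specific fine-tuning to produce evidence-based, long-form biomedical answers and that significantly outperforms baselines on PubMedQA.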

Details

Motivation: Existing biomedical question-answering systems mostly give short answers without the detailed explanations that clinical decision-making requires, so they fall short of the need for precise, comprehensive medical information. Method: Combines BioBERT embeddings with FAISS indexing for literature retrieval, compares re-ranking strategies such as BM25, ColBERT, and MonoT5 to optimize context selection, and synthesizes evidence into long-form answers with a fine-tuned T5 model. Result: On the PubMedQA dataset, the method significantly outperforms baselines on BLEU, ROUGE, and METEOR, improving long-form biomedical question answering. Conclusion: By combining retrieval augmentation with domain-specific fine-tuning, RAG-BioQA generates high-quality, explainable long-form biomedical answers and advances evidence-based access to biomedical knowledge. Abstract: The exponential growth of biomedical literature creates significant challenges for accessing precise medical information. Current biomedical question-answering systems primarily focus on short-form answers, failing to provide the comprehensive explanations necessary for clinical decision-making. We present RAG-BioQA, a novel framework combining retrieval-augmented generation with domain-specific fine-tuning to produce evidence-based, long-form biomedical answers. Our approach integrates BioBERT embeddings with FAISS indexing and compares various re-ranking strategies (BM25, ColBERT, MonoT5) to optimize context selection before synthesizing evidence through a fine-tuned T5 model. Experimental results on the PubMedQA dataset show significant improvements over baselines, with our best model achieving substantial gains across BLEU, ROUGE, and METEOR metrics, advancing the state of accessible, evidence-based biomedical knowledge retrieval.

[50] Efficient Training of Robust Traditional Chinese LLaMA-1B on a Single Consumer GPU: Continual Pre-training, SFT, and DPO

Yu-Cheng Chih,Ming-Tao Duan,Yong-Hao Hou

Main category: cs.CL

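TL;DR: This paper presents PureTC-1B, a three-stage stabilization pipeline that uses LoRA adapters to make the Traditional Chinese (TC) output of Llama-3.2-1B-Instruct more stable. It sharply reduces non-TC characters, achieving a 51.3% relative reduction on a benchmark that simulates real-world usage, and outperforms larger models on a named entity translation task.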

Details

Motivation: Small language models (SLMs) suffer from token-level instability in Traditional Chinese (TC) applications: they unpredictably emit non-TC characters or mix in other languages, which limits real-world deployment. The linguistic purity and reliability of SLM output in TC therefore need to improve. Method: Proposes a three-stage stabilization pipeline: 1) continual pre-training (CPT) on TC-centric corpora; 2) supervised fine-tuning (SFT) with instruction data; 3) direct preference optimization (DPO) with TC-adherence preference data. The whole process uses parameter-efficient LoRA adapters and requires no full-model retraining. Result: On a purpose-built benchmark that simulates real-world usage, PureTC-1B reduces non-TC output tokens by 51.3% (micro-average) relative to the base model; on the Named Entity Translation (NET) task, it reduces incorrect-language tokens by 77.2% relative to Llama-3B and 57.2% relative to Qwen-1.5B. Conclusion: Robust TC adherence is achievable even for a 1B-scale small language model through an efficient, reproducible adapter-based method, which offers a practical way to improve output stability in non-English languages. Abstract: Small Language Models (SLMs) enable cost-effective, on-device and latency-sensitive AI applications, yet their deployment in Traditional Chinese (TC) remains hindered by token-level instability - models unpredictably emit non-TC characters or code-switch into other languages. We address this practical reliability gap by creating PureTC-1B, a three-stage stabilization pipeline for Llama-3.2-1B-Instruct (an open-weight, instruction-tuned model released by Meta) using parameter-efficient LoRA adapters. Our method combines Continual Pre-Training (CPT) on TC-centric corpora, Supervised Fine-Tuning (SFT) with instruction data, and Direct Preference Optimization (DPO) using TC-adherence preferences to improve monolingual robustness without full-model retraining. On a benchmark designed to simulate real-world usage, PureTC-1B achieves a 51.3% relative reduction (micro-average) in non-TC output tokens versus the base model. On a Named Entity Translation (NET) task, PureTC-1B further reduces incorrect-language tokens by 77.2% relative to Llama-3B and 57.2% relative to Qwen-1.5B, indicating that robust TC adherence is attainable even at the 1B scale. The pipeline is reproducible, adapter-only, and hardware-friendly, offering practitioners a practical recipe to enhance language stability for TC and potentially other non-English languages.

[51] AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System

Hui Yi Leong,Yuheng Li,Yuqing Wu,Wenwen Ouyang,Wei Zhu,Jiechao Gao

Main category: cs.CL

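TL;DR: Proposes AMAS, a framework whose dynamic graph designer gives LLM-based multi-agent systems context-sensitive structural adaptability, substantially improving performance across a range of tasks.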

Details

Motivation: Conventional multi-agent system architectures are constrained by fixed, hand-crafted graph topologies that do not respond to context, which limits their effectiveness across diverse tasks. Method: Introduces the AMAS framework, which uses lightweight LLM adaptation to identify task-specific optimal graph configurations autonomously and routes each query according to the properties of the input. Result: On question answering, mathematical reasoning, and code generation benchmarks, AMAS surpasses state-of-the-art single-agent and multi-agent methods. Conclusion: Context-sensitive structural adaptability is a foundational requirement for deploying high-performance LLM-based multi-agent systems. Abstract: Although large language models (LLMs) have revolutionized natural language processing capabilities, their practical implementation as autonomous multi-agent systems (MAS) for industrial problem-solving encounters persistent barriers. Conventional MAS architectures are fundamentally restricted by inflexible, hand-crafted graph topologies that lack contextual responsiveness, resulting in diminished efficacy across varied academic and commercial workloads. To surmount these constraints, we introduce AMAS, a paradigm-shifting framework that redefines LLM-based MAS through a novel dynamic graph designer. This component autonomously identifies task-specific optimal graph configurations via lightweight LLM adaptation, eliminating the reliance on monolithic, universally applied structural templates. Instead, AMAS exploits the intrinsic properties of individual inputs to intelligently direct query trajectories through task-optimized agent pathways. Rigorous validation across question answering, mathematical deduction, and code generation benchmarks confirms that AMAS systematically exceeds state-of-the-art single-agent and multi-agent approaches across diverse LLM architectures. Our investigation establishes that context-sensitive structural adaptability constitutes a foundational requirement for high-performance LLM MAS deployments.

[52] NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT

John Hawkins,Aditya Pramar,Rodney Beard,Rohitash Chandra

Main category: cs.CL

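TL;DR: This study examines how well different machine learning models identify jailbreak prompts aimed at large language models (LLMs). On existing datasets, a fine-tuned BERT model detects jailbreak prompts best, and explicit reflexivity in prompt structure may be a signal of jailbreak behavior.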

Details

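Motivation: Large language models have security vulnerabilities that let malicious users bypass safety mechanisms by manipulating the input (for example, with jailbreak prompts), so effective methods are needed to identify such harmful inputs. Method: Applies several machine learning models to distinguish jailbreak prompts from genuine ones, focusing on how well the models identify previously unseen jailbreak strategies, and explores the characteristics of jailbreak prompts through keyword visualization. Result: On existing datasets, a BERT model fine-tuned end-to-end gives the best jailbreak detection performance; the visualizations suggest that explicit reflexivity in prompt wording may be an important signal of jailbreak intent. Conclusion: A fine-tuned BERT model is currently the most effective way to identify jailbreak prompts, and reflexivity in prompt structure can serve as a key indicator of jailbreak behavior, helping to strengthen LLM safety defenses. Abstract: Large Language Models (LLMs) suffer from a range of vulnerabilities that allow malicious users to solicit undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses acceptable to the developer's policies. In this study, we analyse the ability of different machine learning models to distinguish jailbreak prompts from genuine uses, including looking at our ability to identify jailbreaks that use previously unseen strategies. Our results indicate that using current datasets the best performance is achieved by fine-tuning a Bidirectional Encoder Representations from Transformers (BERT) model end-to-end for identifying jailbreaks. We visualise the keywords that distinguish jailbreak from genuine prompts and conclude that explicit reflexivity in prompt structure could be a signal of jailbreak intention.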

[53] Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention

Zhaoxin Feng,Jianfei Ma,Emmanuele Chersoni,Xiaojing Zhao,Xiaoyi Bao

Main category: cs.CL

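TL;DR: This paper explores whether enabling bidirectional attention and contrastive learning in the Llama architecture can overcome the limits that unidirectional attention places on autoregressive LLMs in text embedding tasks.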

Details

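Motivation: Autoregressive large language models excel at language understanding and generation, but the constraints of unidirectional attention have slowed their adoption in text embedding and semantic representation probing tasks. Method: Applies additional training steps to different variants of the Llama architecture, progressively enabling bidirectional attention and unsupervised/supervised contrastive learning. Result: Experiments test model behavior under the different configurations and explore how bidirectional attention improves semantic representation. Conclusion: Enabling bidirectional attention can ease the constraints of unidirectional attention and improve LLM performance on text embedding and semantic understanding tasks. Abstract: Autoregressive Large Language Models (LLMs) demonstrate exceptional performance in language understanding and generation. However, their application in text embedding tasks has been relatively slow, along with the analysis of their semantic representation in probing tasks, due to the constraints of the unidirectional attention mechanism. This paper aims to explore whether such constraints can be overcome by enabling bidirectional attention in LLMs. We tested different variants of the Llama architecture through additional training steps, progressively enabling bidirectional attention and unsupervised/supervised contrastive learning.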

[54] SoK: Measuring What Matters for Closed-Loop Security Agents

Mudita Khurana,Raunak Jain

Main category: cs.CL

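TL;DR: This paper proposes the CLASP framework and the CLC Score for systematically evaluating the capabilities of closed-loop autonomous security agents across the cybersecurity lifecycle, filling the gap left by the lack of a unified evaluation standard in this area.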

Details

Motivation: Cybersecurity research is currently scattered across isolated defensive functions and lacks a systematic evaluation framework and benchmark for autonomous closed-loop security agents, which makes capability blind spots hard to identify. Method: Proposes the CLASP framework, which aligns the security lifecycle with core agentic capabilities, and defines the CLC Score as a composite metric of loop closure and operational effectiveness; its usefulness is checked by analyzing 21 representative works. Result: Applying CLASP to 21 studies identifies where systems are strong and where capability gaps remain, clarifies what closed-loop security agents still need, and sets out the requirements for a closed-loop benchmark. Conclusion: CLASP and the CLC Score provide the vocabulary, diagnostics, and measurements needed to evaluate and advance closed-loop autonomous security systems, helping move cybersecurity toward automation and integration. Abstract: Cybersecurity is a relentless arms race, with AI-driven offensive systems evolving faster than traditional defenses can adapt. Research and tooling remain fragmented across isolated defensive functions, creating blind spots that adversaries exploit. Autonomous agents capable of integrating exploit confirmation, remediation, and validation into a single closed loop offer promise, but the field lacks three essentials: a framework defining the agentic capabilities of security systems across the security lifecycle, a principled method for evaluating closed-loop agents, and a benchmark for measuring their performance in practice. We introduce CLASP: the Closed-Loop Autonomous Security Performance framework, which aligns the security lifecycle (reconnaissance, exploitation, root cause analysis, patch synthesis, validation) with core agentic capabilities (planning, tool use, memory, reasoning, reflection & perception), providing a common vocabulary and rubric for assessing agentic capabilities in security tasks. By applying CLASP to 21 representative works, we map where systems demonstrate strengths, and where capability gaps persist. We then define the Closed-Loop Capability (CLC) Score, a composite metric quantifying both degree of loop closure and operational effectiveness, and outline the requirements for a closed-loop benchmark. Together, CLASP and the CLC Score provide the vocabulary, diagnostics, and measurements needed to advance both function-level performance and the measurement of closed-loop security agents.

[55] MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization

Yinhong Liu,Jianfeng He,Hang Su,Ruixue Lian,Yi Nian,Jake Vincent,Srikanth Vishnubhotla,Robinson Piramuthu,Saab Mansour

Main category: cs.CL

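TL;DR: This paper presents MDSEval, the first meta-evaluation benchmark for multimodal dialogue summarization, comprising image-sharing dialogues, summaries, and human ratings on eight quality aspects. It also proposes a filtering framework based on Mutually Exclusive Key Information (MEKI) across modalities, and shows that existing evaluation methods struggle to distinguish summaries generated by advanced MLLMs and are prone to bias.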

Details

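Motivation: Developing multimodal dialogue summarization (MDS) models requires reliable automatic evaluation methods, which in turn depend on high-quality meta-evaluation benchmarks grounded in human annotation. Existing work lacks systematic evaluation criteria tailored to MDS. Method: Builds the MDSEval benchmark of dialogues, summaries, and human ratings; proposes a filtering framework based on Mutually Exclusive Key Information (MEKI) to improve data quality; defines and formalizes several evaluation dimensions specific to MDS; and systematically benchmarks current state-of-the-art evaluation methods. Result: MDSEval is the first meta-evaluation benchmark for MDS; experiments show that existing evaluation methods struggle to distinguish summaries generated by advanced MLLMs and are susceptible to various biases. Conclusion: The paper establishes MDSEval, the first MDS meta-evaluation benchmark, proposes the MEKI data-filtering mechanism, and identifies key evaluation dimensions specific to MDS, laying groundwork for future research on MDS evaluation methods. Abstract: Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. To support the development of effective MDS models, robust automatic evaluation methods are essential for reducing both cost and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations. In this work, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting of image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richness, we propose a novel filtering framework leveraging Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art modal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various biases.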

[56] FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol

He Zhang,Anzhou Zhang,Jian Dai

Main category: cs.CL

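TL;DR: Proposes FOR-Prompting (From Objection to Revision Prompting), a protocol that divides work among Defender, Objectioner, and Host roles to achieve self-revision without tools or human supervision, significantly improving both large and small models on reasoning tasks.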

Details

Motivation: Existing reasoning protocols such as Chain of Thought and Tree of Thought lack an external questioning mechanism that triggers self-revision, which limits reasoning quality and interpretability. Method: Designs an asymmetric three-role prompting protocol: the Defender proposes an answer, the Objectioner raises objections as questions without offering fixes, and the Host enforces logical consistency and closure; the whole process is implemented at the prompt level with no fine-tuning. Result: On GSM8K, accuracy improves by about 22 percentage points over single-prompt and is on par with CoT, with more than 10% higher reasoning and coherence ratings from a GPT-4.1 judge; the small Llama3.2:1b model gains about 19% accuracy; the protocol corrects errors on tricky problems without outside help and encourages exploration and refinement on open-ended tasks. Conclusion: FOR-Prompting is a model-agnostic, purely prompt-level reasoning framework that supports local deployment and models of different sizes, shows strong potential for improving small-model reasoning, and suits on-device applications and large-scale study of objection-guided reasoning. Abstract: Reasoning protocols such as Chain of Thought (CoT) and Tree of Thought (ToT) organize internal deliberation but lack an explicit mechanism for external questioning that elicits self-revision. We present FOR-Prompting (From Objection to Revision Prompting), an asymmetric protocol where a Defender proposes an answer, an Objectioner raises question-style objections with no direct fixes, and a Host enforces consistency and closure. On GSM8K we observe a gain of about 22 percentage points over single-prompt and accuracy on par with CoT, with more than 10% higher ratings in reasoning and coherence from a uniform GPT 4.1 judge. FOR-Prompting also corrects mistakes without tools or human supervision on tricky queries, and improves performance for small-scale models (approximately 19% accuracy improvement on Llama3.2:1b for the GSM8K task), highlighting promise for small models and for use on personal devices. Beyond factual QA, qualitative analyses on open-ended tasks show enhanced exploration and refinement, with dialogue traces that make assumptions and trade-offs explicit. The protocol is model agnostic and operates purely at the prompt level through role-structured turns, so it works with hosted and local models of different sizes without retraining, and it supports large-scale study of objection-guided reasoning.
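
The three-role loop in this entry can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable signature, the role instructions, the `CLOSE:` / `CONTINUE` convention, and the `max_rounds` parameter are all hypothetical names introduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: an LLM call takes (role_instructions, transcript) and returns text.
LLM = Callable[[str, str], str]

# Illustrative role prompts (assumptions, not the paper's wording).
DEFENDER = "You are the Defender. Propose or revise an answer to the task."
OBJECTIONER = ("You are the Objectioner. Raise objections only as questions; "
               "do not supply a corrected answer.")
HOST = ("You are the Host. Check consistency. Reply 'CLOSE: <final answer>' "
        "when the objections are resolved, otherwise reply 'CONTINUE'.")


def for_prompting(task: str, llm: LLM, max_rounds: int = 3) -> Tuple[str, List[str]]:
    """Run Defender -> Objectioner -> Defender -> Host turns until the Host closes."""
    transcript = [f"Task: {task}"]
    answer = llm(DEFENDER, "\n".join(transcript))
    transcript.append(f"Defender: {answer}")
    for _ in range(max_rounds):
        # The Objectioner only questions; it never proposes a fix.
        objection = llm(OBJECTIONER, "\n".join(transcript))
        transcript.append(f"Objectioner: {objection}")
        # The Defender revises (or defends) its answer in light of the objection.
        answer = llm(DEFENDER, "\n".join(transcript))
        transcript.append(f"Defender: {answer}")
        # The Host decides whether the exchange can be closed.
        verdict = llm(HOST, "\n".join(transcript))
        transcript.append(f"Host: {verdict}")
        if verdict.startswith("CLOSE:"):
            return verdict[len("CLOSE:"):].strip(), transcript
    # No closure within the round budget: fall back to the Defender's last answer.
    return answer, transcript
```

Because the whole protocol lives in the prompts, `llm` can wrap a hosted API or a local model without any change to the loop.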

[57] How Do Language Models Compose Functions?

Apoorv Khandelwal,Ellie Pavlick

Main category: cs.CL

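TL;DR: The study examines whether large language models use compositional mechanisms when solving two-hop factual recall tasks. It finds a "compositionality gap" and identifies two processing mechanisms, compositional and direct, whose selection is related to embedding-space geometry.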

Details

Motivation: Investigates whether LLMs actually use compositional mechanisms when performing composite tasks, rather than relying only on surface associations. Method: Applies logit lens to residual stream activations to study the internal mechanisms models use on two-hop factual recall tasks, and analyzes the geometry of the embedding space. Result: Confirms the compositionality gap; identifies two mechanisms, compositional and direct; finds that the choice of mechanism is related to whether a linear mapping exists from input to output. Conclusion: LLMs do not necessarily solve problems compositionally; the mechanism they use is influenced by embedding-space structure, and in some cases they take a non-compositional "idiomatic" path. Abstract: While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as $g(f(x))$. We first confirm that modern LLMs continue to suffer from the "compositionality gap": i.e. their ability to compute both $z = f(x)$ and $y = g(z)$ does not entail their ability to compute the composition $y = g(f(x))$. Then, using logit lens on their residual stream activations, we identify two processing mechanisms, one which solves tasks $\textit{compositionally}$, computing $f(x)$ along the way to computing $g(f(x))$, and one which solves them $\textit{directly}$, without any detectable signature of the intermediate variable $f(x)$. Finally, we find that which mechanism is employed appears to be related to the embedding space geometry, with the direct ("idiomatic") mechanism being dominant in cases where there exists a linear mapping from $x$ to $g(f(x))$ in the embedding spaces. We fully release our data and code at: https://github.com/apoorvkh/composing-functions .
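
One common way to put a number on the compositionality gap is the share of two-hop questions a model gets wrong even though it answers both single hops correctly. The sketch below assumes that operationalization and per-example correctness flags; the function name and inputs are hypothetical, and the paper may measure the gap differently.

```python
from typing import Optional, Sequence


def compositionality_gap(
    hop1_ok: Sequence[bool],   # model computes z = f(x) correctly
    hop2_ok: Sequence[bool],   # model computes y = g(z) correctly
    comp_ok: Sequence[bool],   # model computes y = g(f(x)) correctly
) -> Optional[float]:
    """Fraction of examples with both hops correct but the composition wrong.

    Returns None when no example has both hops correct, because the gap is
    undefined in that case.
    """
    # Keep only the composition outcomes for examples where both hops succeed.
    both_hops = [c for a, b, c in zip(hop1_ok, hop2_ok, comp_ok) if a and b]
    if not both_hops:
        return None
    return sum(1 for c in both_hops if not c) / len(both_hops)
```

A gap of 0 would mean that knowing both hops always suffices for the composition; the abstract reports that modern LLMs still fall short of that.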

[58] Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation

Seungseop Lim,Gibaeg Kim,Wooseok Han,Jean Seo,Hyunkyung Lee,Jaehyo Yoo,Eunho Yang

Main category: cs.CL

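TL;DR: This paper proposes a data-driven method to mitigate "Format Inertia" in LLMs for medical pre-consultation dialogues: rebalancing the turn-count distribution of the training data improves generation quality in long dialogues.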

Details

Motivation: In the medical domain, supervised fine-tuning (SFT) is commonly used to adapt LLMs to multi-turn dialogue, but the training data usually has a skewed turn-count distribution. This leads to "Format Inertia", where the model generates repetitive questions that are correctly formatted but diagnostically uninformative. Method: Adopts a simple data-centric method that rebalances the turn-count distribution of the training data to reduce Format Inertia. Result: Experiments show that the method substantially alleviates Format Inertia in medical pre-consultation and improves the relevance and informativeness of generation in long dialogues. Conclusion: Adjusting the turn-count distribution of the training data effectively improves LLM performance in long medical dialogues, which shows the importance of data-distribution design when optimizing domain-specific dialogue systems. Abstract: Recent advances in Large Language Models (LLMs) have brought significant improvements to various service domains, including chatbots and medical pre-consultation applications. In the healthcare domain, the most common approach for adapting LLMs to multi-turn dialogue generation is Supervised Fine-Tuning (SFT). However, datasets for SFT in tasks like medical pre-consultation typically exhibit a skewed turn-count distribution. Training on such data induces a novel failure mechanism we term **Format Inertia**, where models tend to generate repetitive, format-correct, but diagnostically uninformative questions in long medical dialogues. To mitigate this observed failure mechanism, we adopt a simple, data-centric method that rebalances the turn-count distribution of the training dataset. Experimental results show that our approach substantially alleviates Format Inertia in medical pre-consultation.
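
As an illustration of what rebalancing a turn-count distribution can look like, the sketch below resamples dialogues so that every turn count contributes the same number of training examples. The abstract does not specify the resampling scheme, so the uniform target, the `per_bucket` parameter, and the function name are assumptions made for this example.

```python
import random
from collections import defaultdict
from typing import Dict, List

Dialogue = List[str]  # one string per turn (assumed representation)


def rebalance_by_turn_count(
    dialogues: List[Dialogue], per_bucket: int, seed: int = 0
) -> List[Dialogue]:
    """Return a dataset with exactly `per_bucket` dialogues for each turn count."""
    rng = random.Random(seed)
    buckets: Dict[int, List[Dialogue]] = defaultdict(list)
    for dialogue in dialogues:
        buckets[len(dialogue)].append(dialogue)

    balanced: List[Dialogue] = []
    for turn_count in sorted(buckets):
        group = buckets[turn_count]
        if len(group) >= per_bucket:
            # Over-represented turn count: downsample without replacement.
            balanced.extend(rng.sample(group, per_bucket))
        else:
            # Under-represented turn count: keep all, then upsample with replacement.
            balanced.extend(group)
            balanced.extend(rng.choices(group, k=per_bucket - len(group)))
    return balanced
```

Upsampling rare long dialogues by duplication is the simplest option; in practice one might instead cap short dialogues only, or collect more long ones.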

[59] What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?

Jiwan Chung,Neel Joshi,Pratyusha Sharma,Youngjae Yu,Vibhav Vineet

Main category: cs.CL

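TL;DR: This paper introduces the MathLens benchmark, which decomposes multimodal reasoning into subskills (perception, reasoning, integration) and evaluates how different training methods affect them. It finds that reinforcement learning mainly improves perception, while integration remains the weak point.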

Details

Motivation: Existing evaluation relies only on overall accuracy, which makes it hard to see which subskills multimodal reasoning models are improving on or still lack, so a finer-grained benchmark is needed. Method: Builds the MathLens benchmark, which decomposes performance on multimodal geometry problems into perception, reasoning, and integration, and evaluates them with visual diagrams, textual descriptions, controlled questions, and fine-grained perceptual probes, all derived from symbolic problem specifications to ensure consistency. Result: Reinforcement learning markedly improves perception (especially with textual supervision), and textual SFT indirectly improves perception through reflective reasoning; reasoning improves only in tandem with perception; integration remains the weakest capacity, with most residual errors concentrated there; RL improves consistency under diagram variation, whereas multimodal SFT reduces robustness through overfitting. Conclusion: The subskills of multimodal reasoning models develop unevenly; future work should focus on improving integration and pay attention to how training methods affect robustness. Abstract: Multimodal reasoning models have recently shown promise on challenging domains such as olympiad-level geometry, yet their evaluation remains dominated by aggregate accuracy, a single score that obscures where and how models are improving. We introduce MathLens, a benchmark designed to disentangle the subskills of multimodal reasoning while preserving the complexity of textbook-style geometry problems. The benchmark separates performance into three components: Perception: extracting information from raw inputs, Reasoning: operating on available information, and Integration: selecting relevant perceptual evidence and applying it within reasoning. To support each test, we provide annotations: visual diagrams, textual descriptions to evaluate reasoning in isolation, controlled questions that require both modalities, and probes for fine-grained perceptual skills, all derived from symbolic specifications of the problems to ensure consistency and robustness. Our analysis reveals that different training approaches have uneven effects: First, reinforcement learning chiefly strengthens perception, especially when supported by textual supervision, while textual SFT indirectly improves perception through reflective reasoning. Second, reasoning improves only in tandem with perception. Third, integration remains the weakest capacity, with residual errors concentrated there once other skills advance. Finally, robustness diverges: RL improves consistency under diagram variation, whereas multimodal SFT reduces it through overfitting. We will release all data and experimental logs.

[60] Machine-interpretable Engineering Design Standards for Valve Specification

Anders Gjerver,Rune Frostad,Vedrana Barisic,Melinda Hodkiewicz,Caitlin Woods,Mihaly Fekete,Arild Braathen Torjusen,Johan Wilhelm Kluwer

Main category: cs.CL

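TL;DR: This paper presents a method for transforming the information in engineering design standards into modular, reusable, machine-interpretable ontologies, and applies it to quality assurance in plant design and equipment selection.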

Details

Motivation: Although industry is committed to digitalization, product specifications and design standards remain document-centric and lack machine readability and interoperability, which limits automation and intelligent applications. Method: Uses modelling patterns to extract knowledge from the text and tables of international piping, material, and valve design standards, builds modular ontologies that are W3C-compliant and aligned with the top-level ontology ISO DIS 23726-3 (IDO), and tests them on a valve selection process. Result: Valves are instantiated in semantic asset models together with a semantic representation of their environmental conditions; valve data sheets and manufacturer product types are modelled with OWL individuals and classes; the approach supports automated validation that a specific VDS complies with industry standards and uses semantic reasoning to determine whether a product type meets the valve specification. Conclusion: Shared, reusable, IDO-based modular ontologies bring semantic reasoning to design standards, make equipment selection more intelligent, and show the potential for standards bodies to move toward digital "Smart Standards". Abstract: Engineering design processes use technical specifications and must comply with standards. Product specifications, product type data sheets, and design standards are still mainly document-centric despite the ambition to digitalize industrial work. In this paper, we demonstrate how to transform information held in engineering design standards into modular, reusable, machine-interpretable ontologies and use the ontologies in quality assurance of the plant design and equipment selection process. We use modelling patterns to create modular ontologies for knowledge captured in the text and in frequently referenced tables in International Standards for piping, material and valve design. These modules are exchangeable, as stored in a W3C-compliant format, and interoperable as they are aligned with the top-level ontology ISO DIS 23726-3: Industrial Data Ontology (IDO). We test these ontologies, created based on international material and piping standards and industry norms, on a valve selection process. Valves are instantiated in semantic asset models as individuals along with a semantic representation of the environmental condition at their location on the asset. We create "functional location tags" as OWL individuals that become instances of OWL class Valve Data Sheet (VDS) specified valves. Similarly, we create instances of manufacturer product type. Our approach enables automated validation that a specific VDS is compliant with relevant industry standards. Using semantic reasoning and executable design rules, we also determine whether the product type meets the valve specification.
Creation of shared, reusable IDO-based modular ontologies for design standards enables semantic reasoning to be applied to equipment selection processes and demonstrates the potential of this approach for Standards Bodies wanting to transition to digitized Smart Standards.

[61] Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

Wenbo Pan,Jie Xu,Qiguang Chen,Junhao Dong,Libo Qin,Xinfeng Li,Haining Yu,Xiaohua Jia

Main category: cs.CL

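TL;DR: This paper proposes a new metric, the Refusal Index (RI), to measure more accurately how well large language models refuse questions they do not know. RI is defined as Spearman's rank correlation between refusal probability and error probability, and is measured in practice with a lightweight two-pass evaluation method. Experiments show that RI evaluates refusal behavior stably and consistently, unaffected by changes in accuracy and refusal rate, and reveal that current models' refusal behavior on factual tasks can be unreliable.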

Details

Motivation: Existing metrics for an LLM's ability to refuse unknown questions are biased or indirect and do not faithfully reflect knowledge-aware refusal. A more reliable, unbiased, and direct metric is needed. Method: Proposes the Refusal Index (RI), the Spearman rank correlation between refusal probability and error probability, and designs a lightweight two-pass evaluation method that efficiently estimates RI from the refusal rates observed across two standard evaluation runs. Result: Experiments on 16 models and 5 datasets show that RI accurately quantifies knowledge-aware refusal in factual tasks; RI stays stable across refusal rates and yields consistent rankings independent of overall accuracy and refusal rate; current LLMs achieve high accuracy, yet their refusal behavior can be unreliable and fragile. Conclusion: RI is an effective, stable, and informative metric that complements traditional accuracy metrics, supports more comprehensive evaluation of LLM factuality, and underlines the importance of reliable refusal under uncertainty. Abstract: Large Language Models (LLMs) should refuse to answer questions beyond their knowledge. This capability, which we term knowledge-aware refusal, is crucial for factual reliability. However, existing metrics fail to faithfully measure this ability. On the one hand, simple refusal-based metrics are biased by refusal rates and yield inconsistent scores when models exhibit different refusal tendencies. On the other hand, existing calibration metrics are proxy-based, capturing the performance of auxiliary calibration processes rather than the model's actual refusal behavior. In this work, we propose the Refusal Index (RI), a principled metric that measures how accurately LLMs refuse questions they do not know. We define RI as Spearman's rank correlation between refusal probability and error probability. To make RI practically measurable, we design a lightweight two-pass evaluation method that efficiently estimates RI from observed refusal rates across two standard evaluation runs. Extensive experiments across 16 models and 5 datasets demonstrate that RI accurately quantifies a model's intrinsic knowledge-aware refusal capability in factual tasks. Notably, RI remains stable across different refusal rates and provides consistent model rankings independent of a model's overall accuracy and refusal rates. More importantly, RI provides insight into an important but previously overlooked aspect of LLM factuality: while LLMs achieve high accuracy on factual tasks, their refusal behavior can be unreliable and fragile.
This finding highlights the need to complement traditional accuracy metrics with the Refusal Index for comprehensive factuality evaluation.
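
The definition of RI as a Spearman rank correlation can be made concrete with a small standard-library sketch. It assumes per-question refusal and error probabilities are directly available, which is a simplification: the paper estimates RI from refusal rates observed in two evaluation runs, and that two-pass estimator is not reproduced here. All function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied run i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def refusal_index(refusal_prob: Sequence[float], error_prob: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(refusal_prob), average_ranks(error_prob))
```

A value near 1 means the model refuses most readily on exactly the questions it is most likely to get wrong; a value near 0 means refusals are unrelated to what the model actually knows.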

[62] Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction

Ivan Leonidovich Litvak,Anton Kostin,Fedor Lashkin,Tatiana Maksiyan,Sergey Lagutin

Main category: cs.CL

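TL;DR: The study evaluates 16 unsupervised metrics for assessing the quality of extracting seven semantic blocks from 1,000 Russian judicial decisions, using 7,168 expert ratings as the reference. Term Frequency Coherence and Coverage Ratio/Block Completeness align best with expert ratings, while Legal Term Density correlates negatively; the LLM Evaluation Score shows moderate alignment, indicating that unsupervised methods scale well but cannot fully replace human judgment in high-stakes legal settings.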

Details

Motivation: With the rapid progress of AI in legal natural language processing, scalable methods for evaluating text extraction are urgently needed, especially for measuring the quality of semantic information extracted from judicial decisions when annotated data is unavailable. Method: The study applies 16 unsupervised metrics covering document-based, semantic, structural, pseudo-ground-truth, and legal-specific categories to 1,000 anonymized Russian judicial decisions, and uses bootstrapped correlations, Lin's concordance correlation coefficient (CCC), and mean absolute error (MAE) to compare the metrics with 7,168 expert ratings on a 1-5 Likert scale. Result: Term Frequency Coherence (Pearson r=0.540, CCC=0.512, MAE=0.127) and Coverage Ratio/Block Completeness (r=0.513, CCC=0.443, MAE=0.139) align best with expert ratings; Legal Term Density shows a strong negative correlation (r=-0.479, CCC=-0.079); the LLM Evaluation Score (r=0.382, CCC=0.325) shows moderate alignment, pointing to its limitations on legal text. Conclusion: Unsupervised metrics, including LLM-based ones, can be used for large-scale screening of legal text extraction, but with moderate correlations and low concordance they cannot yet fully replace human judgment in high-stakes legal applications; the study provides annotation-free evaluation tools for judicial analytics and ethical AI deployment. Abstract: The rapid advancement of artificial intelligence in legal natural language processing demands scalable methods for evaluating text extraction from judicial decisions. This study evaluates 16 unsupervised metrics, including novel formulations, to assess the quality of extracting seven semantic blocks from 1,000 anonymized Russian judicial decisions, validated against 7,168 expert reviews on a 1-5 Likert scale. These metrics, spanning document-based, semantic, structural, pseudo-ground truth, and legal-specific categories, operate without pre-annotated ground truth. Bootstrapped correlations, Lin's concordance correlation coefficient (CCC), and mean absolute error (MAE) reveal that Term Frequency Coherence (Pearson $r = 0.540$, Lin CCC = 0.512, MAE = 0.127) and Coverage Ratio/Block Completeness (Pearson $r = 0.513$, Lin CCC = 0.443, MAE = 0.139) best align with expert ratings, while Legal Term Density (Pearson $r = -0.479$, Lin CCC = -0.079, MAE = 0.394) shows a strong negative correlation. The LLM Evaluation Score (mean = 0.849, Pearson $r = 0.382$, Lin CCC = 0.325, MAE = 0.197) showed moderate alignment, but its performance, using gpt-4.1-mini via g4f, suggests limited specialization for legal texts. These findings highlight that unsupervised metrics, including LLM-based approaches, enable scalable screening but, with moderate correlations and low CCC values, cannot fully replace human judgment in high-stakes legal contexts.
This work advances legal NLP by providing annotation-free evaluation tools, with implications for judicial analytics and ethical AI deployment.
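
The three agreement statistics named in this entry answer different questions, which a short sketch makes visible. The code below uses the standard textbook definitions (population variances for Lin's CCC); how the study scales metric values against the Likert ratings and how it bootstraps are not stated in the abstract, so those steps are omitted and the function names are illustrative.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear association only; insensitive to shifts and rescaling."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def lin_ccc(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's concordance: 2*cov / (var_x + var_y + (mean_x - mean_y)^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)


def mean_absolute_error(x: Sequence[float], y: Sequence[float]) -> float:
    """Average absolute gap between paired scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)
```

For example, scores that track the expert ratings perfectly but sit one point higher give a Pearson r of 1 while the CCC drops below 1, which is why a metric can show a moderate correlation and still a low concordance.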

[63] Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network

Xin Liu,Rongwu Xu,Xinyi Jia,Jason Liao,Jiao Sun,Ling Huang,Wei Xu

Main category: cs.CL

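TL;DR: This paper proposes FraudSquad, a hybrid detection model for identifying highly realistic spam reviews generated by large language models. The model combines text embeddings with a gated graph transformer and clearly outperforms existing methods on multiple datasets.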

Details

Motivation: Spam reviews generated by LLMs are highly persuasive and hard to detect, which threatens the credibility of online platforms, and existing detection systems struggle to cope. Method: Builds three spam review datasets generated by different LLMs and proposes the FraudSquad model, which fuses text embeddings from a pre-trained language model with a gated graph transformer for spam node classification, without manual feature engineering. Result: On the three LLM-generated datasets, FraudSquad improves on the best existing methods by up to 44.22% in precision and 43.01% in recall, and it also performs well on human-written spam datasets; the model is lightweight and needs little labeled data. Conclusion: FraudSquad is an efficient and practical spam review detection approach for the LLM era; the study stresses the urgency of adapting to threats from new generative models and contributes new synthetic datasets and an open-source framework. Abstract: The rise of large language models (LLMs) has enabled the generation of highly persuasive spam reviews that closely mimic human writing. These reviews pose significant challenges for existing detection systems and threaten the credibility of online platforms. In this work, we first create three realistic LLM-generated spam review datasets using three distinct LLMs, each guided by product metadata and genuine reference reviews. Evaluations by GPT-4.1 confirm the high persuasion and deceptive potential of these reviews. To address this threat, we propose FraudSquad, a hybrid detection model that integrates text embeddings from a pre-trained language model with a gated graph transformer for spam node classification. FraudSquad captures both semantic and behavioral signals without relying on manual feature engineering or massive training resources. Experiments show that FraudSquad outperforms state-of-the-art baselines by up to 44.22% in precision and 43.01% in recall on three LLM-generated datasets, while also achieving promising results on two human-written spam datasets. Furthermore, FraudSquad maintains a modest model size and requires minimal labeled training data, making it a practical solution for real-world applications. Our contributions include new synthetic datasets, a practical detection framework, and empirical evidence highlighting the urgency of adapting spam detection to the LLM era. Our code and datasets are available at: https://anonymous.4open.science/r/FraudSquad-5389/.

Dane Williamson,Yangfeng Ji,Matthew Dwyer

Main category: cs.CL

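TL;DR: Large language models perform well on mathematical problem solving but tend to fail when the syntax deviates from their training distribution. This paper introduces the notion of "syntactic blind spots", arguing that models misapply reasoning strategies because of a brittle coupling between surface form and internal representation. Rephrasing questions with syntactic templates drawn from correctly answered examples significantly improves accuracy, and syntactic complexity quantified with Dependency Locality Theory (DLT) correlates positively with error rate. The results indicate that many reasoning errors stem from structural misalignment rather than conceptual difficulty.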

Details

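Motivation: Although LLMs perform well on mathematical reasoning tasks, they are sensitive to syntactic variation and often fail when a problem is phrased in an unfamiliar way. The authors ask whether these errors come from over-reliance on surface syntactic form rather than from a real lack of mathematical ability. Method: Identifies the failure pattern on problems that are semantically simple but syntactically unfamiliar and names it "syntactic blind spots"; rephrases incorrectly answered questions using syntactic templates from correctly answered ones; introduces a syntactic complexity measure based on Dependency Locality Theory (DLT) and analyzes its relationship to model performance. Result: Syntactic rephrasing that preserves semantics significantly improves accuracy; higher DLT complexity scores are associated with higher error rates; many reasoning errors are shown to arise from structural misalignment rather than conceptual difficulty. Conclusion: LLM reasoning errors often stem from a brittle match between syntactic structure and internal representation rather than from a deficit in mathematical ability; syntax-aware interventions such as syntactic simplification can reveal and mitigate these inductive failures. Abstract: Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities but frequently fail on problems that deviate syntactically from their training distribution. We identify a systematic failure mode, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically straightforward but phrased in unfamiliar ways. These errors are not due to gaps in mathematical competence, but rather reflect a brittle coupling between surface form and internal representation. To test this, we rephrase incorrectly answered questions using syntactic templates drawn from correct examples. These rephrasings, which preserve semantics while reducing structural complexity, often lead to correct answers. We quantify syntactic complexity using a metric based on Dependency Locality Theory (DLT), and show that higher DLT scores are associated with increased failure rates across multiple datasets. Our findings suggest that many reasoning errors stem from structural misalignment rather than conceptual difficulty, and that syntax-aware interventions can reveal and mitigate these inductive failures.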

[65] SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning

Shicheng Liu,Kai Sun,Lisheng Fu,Xilun Chen,Xinyuan Zhang,Zhaojiang Lin,Rulin Shao,Yue Liu,Anuj Kumar,Wen-tau Yih,Xin Luna Dong

Main category: cs.CL

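TL;DR: This paper presents SCRIBES, a reinforcement learning framework for large-scale extraction of semi-structured web content. It exploits layout similarity across pages of the same site to generate reusable extraction scripts, improves further through iterative training on CommonCrawl data, and clearly outperforms existing methods in both script quality and downstream task accuracy.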

Details

Motivation: Semi-structured web content such as tables, lists, and infoboxes holds a large amount of factual data, but its formatting makes it hard to extract; existing methods either generalize poorly or are resource-intensive because they run LLM inference on every page. Method: Proposes the SCRIBES framework, which uses reinforcement learning with layout similarity across pages of the same site as the reward signal to generate extraction scripts that can be reused across structurally similar pages, and trains iteratively on synthetic annotations generated from in-the-wild CommonCrawl data. Result: Experiments show that the method exceeds strong baselines by more than 13% in script quality and raises GPT-4o's accuracy on downstream question answering by more than 4%. Conclusion: SCRIBES enables efficient, scalable, and resource-friendly large-scale web information extraction, and its reusable scripts and iterative synthetic training offer a new approach to semi-structured data extraction. Abstract: Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging. Existing methods either lack generalization or are resource-intensive due to per-page LLM inference. In this paper, we introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework that leverages layout similarity across webpages within the same site as a reward signal. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Our approach further improves by iteratively training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.

[66] Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models

Ece Takmaz,Lisa Bylinina,Jakub Dotlacil

Main category: cs.CL

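TL;DR: This paper explores how to develop language and multimodal models in low-resource settings, addressing the gap in data volume between current vision-and-language models and child language learning. It finds that multimodal models perform worse on language-only tasks, and therefore uses model merging (weighted linear interpolation) to combine multimodal models with language-only models, preserving language ability while maintaining multimodal performance.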

Details

Motivation: Current vision-and-language models rely on large numbers of parameters and amounts of data that far exceed what children are exposed to during language acquisition, so low-resource, developmentally plausible approaches to multimodal modelling are needed. Method: Builds language-only and multimodal models and trains them on low-resource, developmentally plausible datasets; merges the parameters of multimodal and language-only models using weighted linear interpolation. Result: Multimodal models underperform language-only models on language-only tasks, especially grammar-focused benchmarks; model merging partly alleviates this while maintaining performance on multimodal tasks. Conclusion: Model merging is an effective strategy for adding multimodal information while retaining a language model's linguistic ability, and offers a feasible path toward efficient AI systems that are closer to human learning. Abstract: State-of-the-art vision-and-language models consist of many parameters and learn from enormous datasets, surpassing the amounts of linguistic data that children are exposed to as they acquire a language. This paper presents our approach to the multimodal track of the BabyLM challenge addressing this discrepancy. We develop language-only and multimodal models in low-resource settings using developmentally plausible datasets, with our multimodal models outperforming previous BabyLM baselines. One finding in the multimodal language model literature is that these models tend to underperform in \textit{language-only} tasks. Therefore, we focus on maintaining language-only abilities in multimodal models. To this end, we experiment with \textit{model merging}, where we fuse the parameters of multimodal models with those of language-only models using weighted linear interpolation. Our results corroborate the findings that multimodal models underperform in language-only benchmarks that focus on grammar, and model merging with text-only models can help alleviate this problem to some extent, while maintaining multimodal performance.
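
Weighted linear interpolation of parameters is simple enough to show directly. The sketch below represents each model as a plain dictionary from parameter name to a flat list of floats and assumes both models share the same parameter names and shapes; the representation, the function name, and the convention that `alpha` weights the multimodal model are assumptions for illustration, and the abstract does not say how parameters without a counterpart (for example, vision-only weights) are handled.

```python
from typing import Dict, List

StateDict = Dict[str, List[float]]  # parameter name -> flattened weights (assumed format)


def merge_linear(multimodal: StateDict, text_only: StateDict, alpha: float) -> StateDict:
    """Return alpha * multimodal + (1 - alpha) * text_only for every shared parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if multimodal.keys() != text_only.keys():
        raise ValueError("both models must expose the same parameter names")

    merged: StateDict = {}
    for name, mm_weights in multimodal.items():
        txt_weights = text_only[name]
        if len(mm_weights) != len(txt_weights):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two weight vectors.
        merged[name] = [
            alpha * m + (1.0 - alpha) * t for m, t in zip(mm_weights, txt_weights)
        ]
    return merged
```

Setting `alpha` to 1 recovers the multimodal model and 0 recovers the text-only model, so sweeping `alpha` traces the trade-off between multimodal and language-only performance that the entry describes.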

[67] REPAIR: Robust Editing via Progressive Adaptive Intervention and Reintegration

Yisu Wang,Ming Wang,Haoyuan Song,Wenjie Huang,Chaozheng Wang,Yi Xie,Xuming Ran

Main category: cs.CL

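TL;DR: This paper presents REPAIR, a lifelong editing framework for large language models that uses progressive adaptive intervention and reintegration to deliver precise, low-cost model updates while reducing knowledge forgetting and side effects.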

Details

Motivation: Existing post-training methods for LLMs are limited by the high cost of updating knowledge and by the unintended side effects of retraining, and they lack effective support for long-term, sequential editing. Method: Proposes the REPAIR framework, which uses a closed-loop feedback mechanism and dynamic memory management to reduce the instability of large-scale sequential edits, and applies frequent knowledge fusion and strong locality guards to limit the ripple effects seen in traditional methods. Result: Experiments show that REPAIR improves editing accuracy by 10%-30% across multiple model families and significantly reduces knowledge forgetting. Conclusion: REPAIR offers a robust framework for building reliable, scalable, and continually evolving large language models. Abstract: Post-training for large language models (LLMs) is constrained by the high cost of acquiring new knowledge or correcting errors and by the unintended side effects that frequently arise from retraining. To address these issues, we introduce REPAIR (Robust Editing via Progressive Adaptive Intervention and Reintegration), a lifelong editing framework designed to support precise and low-cost model updates while preserving non-target knowledge. REPAIR mitigates the instability and conflicts of large-scale sequential edits through a closed-loop feedback mechanism coupled with dynamic memory management. Furthermore, by incorporating frequent knowledge fusion and enforcing strong locality guards, REPAIR effectively addresses the shortcomings of traditional distribution-agnostic approaches that often overlook unintended ripple effects. Our experiments demonstrate that REPAIR boosts editing accuracy by 10%-30% across multiple model families and significantly reduces knowledge forgetting. This work introduces a robust framework for developing reliable, scalable, and continually evolving LLMs.

[68] Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey

Qiyuan Liu,Hao Xu,Xuhong Chen,Wei Chen,Yee Whye Teh,Ning Miao

Main category: cs.CL

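TL;DR: This paper gives a systematic introduction to reward models (RMs) and their applications in large language model (LLM) reasoning, covering architectures, training methods, and evaluation techniques. It examines their key roles in guiding generation, data synthesis, and reinforcement learning fine-tuning, and sets out current open questions.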

Details

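Motivation: Reward models play an important role in improving LLM reasoning, but a systematic introduction and a comprehensive survey of their applications have been missing, so a thorough review is needed to support their effective deployment and development. Method: Reviews the fundamentals of RMs (architectures, training methods, and evaluation techniques) and groups their applications in LLM reasoning into three areas: guiding generation and selecting outputs at inference time, data synthesis and iterative self-improvement, and providing training signals in reinforcement learning. Result: Summarizes the core applications of RMs in LLM reasoning, identifies key challenges in selection, generalization, evaluation, and enhancement, and raises open questions informed by existing research and empirical findings. Conclusion: RMs are essential for improving LLM reasoning; future work needs to address their reliability, generalization, and evaluation standards to enable more effective deployment and continued progress. Abstract: Reward models (RMs) play a critical role in enhancing the reasoning performance of LLMs. For example, they can provide training signals to finetune LLMs during reinforcement learning (RL) and help select the best answer from multiple candidates during inference. In this paper, we provide a systematic introduction to RMs, along with a comprehensive survey of their applications in LLM reasoning. We first review fundamental concepts of RMs, including their architectures, training methodologies, and evaluation techniques. Then, we explore their key applications: (1) guiding generation and selecting optimal outputs during LLM inference, (2) facilitating data synthesis and iterative self-improvement for LLMs, and (3) providing training signals in RL-based finetuning. Finally, we address critical open questions regarding the selection, generalization, evaluation, and enhancement of RMs, based on existing research and our own empirical findings. Our analysis aims to provide actionable insights for the effective deployment and advancement of RMs for LLM reasoning.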
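
The first application listed, selecting the best answer from multiple candidates at inference time, is often called best-of-N selection. A minimal sketch follows; the `reward_fn` signature is a hypothetical stand-in for a trained reward model, and the toy reward used in the usage note is purely illustrative.

```python
from typing import Callable, Sequence

# Hypothetical interface: a reward model scores a (prompt, candidate) pair.
RewardFn = Callable[[str, str], float]


def best_of_n(prompt: str, candidates: Sequence[str], reward_fn: RewardFn) -> str:
    """Return the candidate with the highest reward-model score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # max() keeps the first candidate among those tied for the top score.
    return max(candidates, key=lambda candidate: reward_fn(prompt, candidate))


def toy_reward(prompt: str, candidate: str) -> float:
    """Toy stand-in: prefers answers ending in '4' and, among those, shorter ones."""
    return (1.0 if candidate.strip().endswith("4") else 0.0) - 0.001 * len(candidate)
```

With `toy_reward`, calling `best_of_n("What is 2 + 2?", ["It is 5", "It is 4", "The answer is 4"], toy_reward)` picks `"It is 4"`. In practice the candidates would be sampled from the LLM and the scores would come from a learned model.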

[69] Inverse Language Modeling towards Robust and Grounded LLMs

Davide Gabrielli,Simone Sestito,Iacopo Masi

Main category: cs.CL

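TL;DR: Proposes Inverse Language Modeling (ILM), a unified framework that improves LLM robustness to input perturbations and enables native grounding by inverting outputs to identify potentially harmful inputs, improving controllability and trustworthiness.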

Details

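Motivation: Current defensive mechanisms for LLMs are fragmented and immature, without the systematic defenses available for classifiers, so the robustness and safety of LLMs in adversarial settings need to be improved. Method: Proposes the Inverse Language Modeling (ILM) framework, which jointly optimizes robustness to input perturbations and inverts model outputs to identify the input triggers that may cause undesirable behavior, making the model both analyzable and robust. Result: ILM turns LLMs from static generators into analyzable, robust systems that can identify potentially toxic or unsafe inputs and support red teaming, laying the groundwork for more controllable and trustworthy next-generation LLMs. Conclusion: ILM provides a unified framework that improves adversarial robustness and enables native grounding, advancing safer, more interpretable, and more controllable LLMs. Abstract: The current landscape of defensive mechanisms for LLMs is fragmented and underdeveloped, unlike prior work on classifiers. To further promote adversarial robustness in LLMs, we propose Inverse Language Modeling (ILM), a unified framework that simultaneously 1) improves the robustness of LLMs to input perturbations, and, at the same time, 2) enables native grounding by inverting model outputs to identify potentially toxic or unsafe input triggers. ILM transforms LLMs from static generators into analyzable and robust systems, potentially helping red teaming. ILM can lay the foundation for next-generation LLMs that are not only robust and grounded but also fundamentally more controllable and trustworthy. The code is publicly available at github.com/davegabe/pag-llm.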

[70] Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning

Qi He,Cheng Qian,Xiusi Chen,Bingxiang He,Yi R. Fung,Heng Ji

Main category: cs.CL

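TL;DR: Proposes Veri-R1, an online reinforcement learning framework that lets a large language model interact with a search engine and uses reward signals to optimize its planning, retrieval, and reasoning, significantly improving claim verification accuracy and evidence scores.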

Details

Motivation: Existing claim verification methods mostly rely on prompt engineering or predesigned reasoning workflows and lack a unified training paradigm for improving a model's combined skills in iterative retrieval and reasoning. Method: Introduces Veri-R1, an online reinforcement learning framework in which the LLM interacts dynamically with a search engine and explicit reward signals shape its planning, retrieval, and reasoning behavior. Result: Experiments show that Veri-R1 improves joint accuracy by up to 30% and doubles the evidence score, often surpassing larger models; ablation studies reveal the impact of the reward components and the relationship between output logits and label accuracy. Conclusion: Online reinforcement learning effectively improves the precision and faithfulness of LLMs in claim verification and provides a workable path and foundation for future research. Abstract: Claim verification with large language models (LLMs) has recently attracted considerable attention, owing to their superior reasoning capabilities and transparent verification pathways compared to traditional answer-only judgments. Online claim verification requires iterative evidence retrieval and reasoning, yet existing approaches mainly rely on prompt engineering or predesigned reasoning workflows without offering a unified training paradigm to improve necessary skills. Therefore, we introduce Veri-R1, an online reinforcement learning (RL) framework that enables an LLM to interact with a search engine and to receive reward signals that explicitly shape its planning, retrieval, and reasoning behaviors. The dynamic interaction between models and retrieval systems more accurately reflects real-world verification scenarios and fosters comprehensive verification skills. Empirical results show that Veri-R1 improves joint accuracy by up to 30% and doubles evidence score, often surpassing larger-scale counterparts. Ablation studies further reveal the impact of reward components and the link between output logits and label accuracy. Our results highlight the effectiveness of online RL for precise and faithful claim verification and provide a foundation for future research. We release our code to support community progress in LLM-empowered claim verification.

[71] Taking a SEAT: Predicting Value Interpretations from Sentiment, Emotion, Argument, and Topic Annotations

Adina Nicola Dobrinoiu,Ana Cristiana Marcu,Amir Homayounirad,Luciano Cavalcante Siebert,Enrico Liscio

Main category: cs.CL

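TL;DR: This study examines whether a language model can predict an individual's value interpretations from multi-dimensional subjective annotations (sentiment, emotion, argument, and topic). Providing all of these dimensions together significantly improves prediction, and the results underline the importance of accounting for individual differences.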

Details

Motivation: Because the interpretation of values is shaped by sociocultural background and personal experience and is therefore subjective, it is important to develop AI systems that can adapt to diverse human perspectives and avoid bias toward majority views. Method: Runs experiments in zero-shot and few-shot settings to evaluate how well a language model predicts individual value interpretations from multi-dimensional subjective annotations along the four SEAT dimensions (Sentiment, Emotion, Argument, Topic). Result: Models given all SEAT dimensions at once outperform those given a single dimension and a baseline with no information about the individual; variation across annotators underlines the importance of incorporating subjective annotation behavior. Conclusion: The study is the first to explore, in a controlled setting, how annotation behavior affects value prediction, and it lays a foundation for future large-scale validation. Abstract: Our interpretation of value concepts is shaped by our sociocultural background and lived experiences, and is thus subjective. Recognizing individual value interpretations is important for developing AI systems that can align with diverse human perspectives and avoid bias toward majority viewpoints. To this end, we investigate whether a language model can predict individual value interpretations by leveraging multi-dimensional subjective annotations as a proxy for their interpretive lens. That is, we evaluate whether providing examples of how an individual annotates Sentiment, Emotion, Argument, and Topics (SEAT dimensions) helps a language model in predicting their value interpretations. Our experiment across different zero- and few-shot settings demonstrates that providing all SEAT dimensions simultaneously yields superior performance compared to individual dimensions and a baseline where no information about the individual is provided. Furthermore, individual variations across annotators highlight the importance of accounting for each annotator's subjectivity. To the best of our knowledge, this controlled setting, although small in size, is the first attempt to go beyond demographics and investigate the impact of annotation behavior on value prediction, providing a solid foundation for future large-scale validation.

[72] Exploring Database Normalization Effects on SQL Generation

Ryosuke Kohita

Main category: cs.CL

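TL;DR: This is the first systematic study of how database schema normalization affects natural language to SQL (NL2SQL) systems. Denormalized schemas perform better on simple retrieval queries, while normalized schemas have the advantage on aggregation queries, so the schema design should be chosen according to the query type.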

Details

Motivation: Existing NL2SQL research largely ignores the influence of schema design, especially normalization; models are usually evaluated on fixed schemas, and there is no systematic analysis of the effect of different normalization levels. Method: Builds synthetic datasets with formal normalization levels (1NF-3NF) and real-world academic paper datasets with practical schema designs, and evaluates eight leading LLMs across the schemas. Result: Denormalized schemas give high accuracy on simple retrieval queries and suit low-cost zero-shot settings; normalized schemas do better on aggregation queries because they avoid data duplication and NULL value problems, and the join errors and related problems they introduce can be mitigated with a few examples. Conclusion: The best schema design for an NL2SQL system depends on the target query types; the schema should be chosen adaptively for the application, and schema design deserves attention during system development. Abstract: Schema design, particularly normalization, is a critical yet often overlooked factor in natural language to SQL (NL2SQL) systems. Most prior research evaluates models on fixed schemas, overlooking the influence of design on performance. We present the first systematic study of schema normalization's impact, evaluating eight leading large language models on synthetic and real-world datasets with varied normalization levels. We construct controlled synthetic datasets with formal normalization (1NF-3NF) and real academic paper datasets with practical schemes. Our results show that denormalized schemas offer high accuracy on simple retrieval queries, even with cost-effective models in zero-shot settings. In contrast, normalized schemas (2NF/3NF) introduce challenges such as errors in base table selection and join type prediction; however, these issues are substantially mitigated by providing few-shot examples. For aggregation queries, normalized schemas yielded better performance, mainly due to their robustness against the data duplication and NULL value issues that cause errors in denormalized schemas. These findings suggest that the optimal schema design for NL2SQL applications depends on the types of queries to be supported. Our study demonstrates the importance of considering schema design when developing NL2SQL interfaces and integrating adaptive schema selection for real-world scenarios.
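The trade-off described in this entry can be illustrated with a toy example. The following is a minimal sketch using Python's built-in `sqlite3`; the tables (`paper_flat`, `paper`, `authorship`) and their rows are hypothetical and are not taken from the paper's datasets. It shows why a naive aggregation over a denormalized table can overcount because of duplicated rows, and why simple retrieval over the normalized layout needs a join.

```python
import sqlite3

# Hypothetical toy schema, for illustration only.
SCHEMA = """
-- Denormalized: one row per (paper, author) pair, so paper fields repeat.
CREATE TABLE paper_flat (paper_id INTEGER, title TEXT, year INTEGER, author TEXT);
INSERT INTO paper_flat VALUES
  (1, 'Paper A', 2024, 'Ito'),
  (1, 'Paper A', 2024, 'Sato'),
  (2, 'Paper B', 2024, 'Ito'),
  (3, 'Paper C', 2025, NULL);   -- unknown author becomes a NULL cell

-- Normalized: papers and authorship live in separate tables.
CREATE TABLE paper (paper_id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
INSERT INTO paper VALUES (1, 'Paper A', 2024), (2, 'Paper B', 2024), (3, 'Paper C', 2025);
CREATE TABLE authorship (paper_id INTEGER REFERENCES paper(paper_id), author TEXT);
INSERT INTO authorship VALUES (1, 'Ito'), (1, 'Sato'), (2, 'Ito');
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding both schema variants."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def count_2024_papers(conn: sqlite3.Connection) -> dict:
    """Answer 'how many papers were published in 2024?' three ways."""
    queries = {
        # Naive aggregation on the flat table counts Paper A twice.
        "flat_naive": "SELECT COUNT(*) FROM paper_flat WHERE year = 2024",
        # Correct on the flat table only if the query deduplicates.
        "flat_distinct": "SELECT COUNT(DISTINCT paper_id) FROM paper_flat WHERE year = 2024",
        # The normalized table has exactly one row per paper.
        "normalized": "SELECT COUNT(*) FROM paper WHERE year = 2024",
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in queries.items()}


def authors_of(conn: sqlite3.Connection, title: str) -> dict:
    """Simple retrieval: no join on the flat table, one join when normalized."""
    flat = conn.execute(
        "SELECT author FROM paper_flat WHERE title = ? ORDER BY author", (title,)
    ).fetchall()
    joined = conn.execute(
        "SELECT a.author FROM paper AS p JOIN authorship AS a ON a.paper_id = p.paper_id "
        "WHERE p.title = ? ORDER BY a.author",
        (title,),
    ).fetchall()
    return {"flat": [r[0] for r in flat], "normalized": [r[0] for r in joined]}
```

Calling `count_2024_papers(build_demo_db())` gives 3 for the naive flat query and 2 for the other two, which is the kind of duplication error the abstract attributes to denormalized schemas on aggregation queries.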

[73] LLM-Based Multi-Task Bangla Hate Speech Detection: Type, Severity, and Target

Md Arid Hasan,Firoj Alam,Md Fahad Hossain,Usman Naseem,Syed Ishtiaque Ahmed

Main category: cs.CL

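TL;DR: This paper introduces BanglaMultiHate, the first multi-task hate speech detection dataset for Bangla, and, through a comparison of multiple models, highlights the importance of culturally and linguistically grounded pretraining in low-resource settings.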

Details

Motivation: Existing Bangla hate speech detection work is mostly single-task with limited coverage and lacks a combined analysis of multi-facet signals (type, severity, target); moderation tools for low-resource languages are also scarce. Method: Builds BanglaMultiHate, a large manually annotated multi-task dataset, and systematically evaluates classical baselines, monolingual pretrained models, and LLMs under zero-shot prompting and LoRA fine-tuning. Result: Experiments show that although LoRA-tuned LLMs perform close to BanglaBERT, culturally and linguistically grounded pretraining remains critical for performance. Conclusion: The study provides a stronger benchmark for multi-task hate speech detection in low-resource languages and underlines the key role of cultural adaptation in building moderation tools. Abstract: Online social media platforms are central to everyday communication and information seeking. While these platforms serve positive purposes, they also provide fertile ground for the spread of hate speech, offensive language, and bullying content targeting individuals, organizations, and communities. Such content undermines safety, participation, and equity online. Reliable detection systems are therefore needed, especially for low-resource languages where moderation tools are limited. In Bangla, prior work has contributed resources and models, but most are single-task (e.g., binary hate/offense) with limited coverage of multi-facet signals (type, severity, target). We address these gaps by introducing the first multi-task Bangla hate-speech dataset, BanglaMultiHate, one of the largest manually annotated corpora to date. Building on this resource, we conduct a comprehensive, controlled comparison spanning classical baselines, monolingual pretrained models, and LLMs under zero-shot prompting and LoRA fine-tuning. Our experiments assess LLM adaptability in a low-resource setting and reveal a consistent trend: although LoRA-tuned LLMs are competitive with BanglaBERT, culturally and linguistically grounded pretraining remains critical for robust performance. Together, our dataset and findings establish a stronger benchmark for developing culturally aligned moderation tools in low-resource contexts. For reproducibility, we will release the dataset and all related scripts.

[74] Style Over Story: A Process-Oriented Study of Authorial Creativity in Large Language Models

Donghoon Jung,Jiwoo Choi,Songeun Chae,Seohyon Jung

Main category: cs.CL

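TL;DR: Taking a narratological perspective, this study uses a constraint-based decision-making framework to analyze the creative process of large language models (LLMs) as computational authors. The models consistently prioritize style over character, event, and setting, and different models show distinctive creative preference profiles.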

Details

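Motivation: Evaluations of LLM creativity mostly focus on output quality and overlook the generation process. This paper takes a process-oriented view, drawing on narratology, to explore a new systematic way of understanding LLMs' authorial creativity. Method: Introduces constraint-based decision-making as the analytical framework, assigns authorial personas through controlled prompting, and analyzes the models' preferences among narrative elements and the reasoning they give, in order to probe their creative decision process. Result: LLMs consistently emphasize Style first in creative decisions, ahead of Character, Event, and Setting; different models show recognizable creative preference patterns, and their self-explanations reveal a degree of consistency in reasoning. Conclusion: The method offers a novel, systematic tool for evaluating AI's authorial creativity and shows that attending to the generation process, not only the output, gives a deeper understanding of LLMs' creative behavior. Abstract: Evaluations of large language models (LLMs)' creativity have focused primarily on the quality of their outputs rather than the processes that shape them. This study takes a process-oriented approach, drawing on narratology to examine LLMs as computational authors. We introduce constraint-based decision-making as a lens for authorial creativity. Using controlled prompting to assign authorial personas, we analyze the creative preferences of the models. Our findings show that LLMs consistently emphasize Style over other elements, including Character, Event, and Setting. By also probing the reasoning the models provide for their choices, we show that distinctive profiles emerge across models and argue that our approach provides a novel systematic tool for analyzing AI's authorial creativity.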

[75] Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage

Siddhant Arora,Haidar Khan,Kai Sun,Xin Luna Dong,Sajal Choudhary,Seungwhan Moon,Xinyuan Zhang,Adithya Sagar,Surya Teja Appini,Kaushik Patnaik,Sanat Sharma,Shinji Watanabe,Anuj Kumar,Ahmed Aly,Yue Liu,Florian Metze,Zhaojiang Lin

Main category: cs.CL

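TL;DR: This paper proposes Streaming Retrieval-Augmented Generation (Streaming RAG), a framework for low-latency tool use in end-to-end spoken dialogue systems. By predicting and issuing tool queries while the user is still speaking, it substantially improves both question-answering accuracy and response speed.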

Details

Motivation: End-to-end spoken dialogue systems offer low latency and natural expression, but they are prone to hallucination because they lack factual grounding; text-based systems mitigate this by calling external tools, so tool use needs to be extended to speech systems. Method: Proposes the Streaming RAG framework, with a post-training procedure that teaches the model when to issue tool calls while the user is still speaking and how to fuse retrieved results with the audio query into a spoken response; also builds AudioCRAG, a spoken benchmark, for evaluation. Result: Experiments show a relative QA accuracy gain of up to 200% (from 11.1% to 34.2% absolute) and a 20% reduction in tool-use latency; the approach also applies to other input modalities. Conclusion: Streaming RAG enables efficient tool integration with low perceived latency in spoken dialogue systems, offering a viable path toward real-time, agentic AI assistants. Abstract: End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR-LLM-TTS pipelines, generating more natural, expressive responses with significantly lower latency. However, these systems remain prone to hallucinations due to limited factual grounding. While text-based dialogue systems address this challenge by integrating tools such as web search and knowledge graph APIs, we introduce the first approach to extend tool use directly into speech-in speech-out systems. A key challenge is that tool integration substantially increases response latency, disrupting conversational flow. To mitigate this, we propose Streaming Retrieval-Augmented Generation (Streaming RAG), a novel framework that reduces user-perceived latency by predicting tool queries in parallel with user speech, even before the user finishes speaking. Specifically, we develop a post-training pipeline that teaches the model when to issue tool calls during ongoing speech and how to generate spoken summaries that fuse audio queries with retrieved text results, thereby improving both accuracy and responsiveness. To evaluate our approach, we construct AudioCRAG, a benchmark created by converting queries from the publicly available CRAG dataset into speech form. Experimental results demonstrate that our streaming RAG approach increases QA accuracy by up to 200% relative (from 11.1% to 34.2% absolute) and further enhances user experience by reducing tool use latency by 20%.
Importantly, our streaming RAG approach is modality-agnostic and can be applied equally to typed input, paving the way for more agentic, real-time AI assistants.
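The abstract reports the accuracy gain in both relative and absolute terms, which are easy to confuse. A small sketch of the arithmetic follows; the helper name `improvement` is illustrative, and only the two accuracy figures come from the abstract.

```python
def improvement(before_pct: float, after_pct: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain as a fraction)."""
    absolute_points = after_pct - before_pct
    relative = absolute_points / before_pct
    return absolute_points, relative


# Figures quoted in the abstract: 11.1% -> 34.2% QA accuracy.
points, relative = improvement(11.1, 34.2)
# points is about 23.1 percentage points; relative is about 2.08,
# i.e. roughly the "200% relative" improvement the abstract cites.
```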

[76] From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens

Hala Sheta,Eric Huang,Shuyu Wu,Ilia Alenabi,Jiajun Hong,Ryker Lin,Ruoxi Ning,Daniel Wei,Jialin Yang,Jiawei Zhou,Ziqiao Ma,Freda Shi

Main category: cs.CL

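TL;DR: VLM-Lens is an open-source toolkit for systematic benchmarking, analysis, and interpretation of vision-language models (VLMs). It supports extracting intermediate outputs from any layer, offers a unified configuration interface, and works with many mainstream VLMs.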

Details

Motivation: Understanding the internal mechanisms of vision-language models requires a tool that can extract and analyze intermediate VLM representations in a unified, flexible way to support systematic cross-model research. Method: Designs and implements the VLM-Lens toolkit, which abstracts away model complexity through YAML configuration, supports extracting intermediate outputs from any layer during the forward pass, and integrates with a range of interpretability and analysis methods. Result: VLM-Lens supports 16 state-of-the-art base VLMs and more than 30 of their variants, demonstrates differences in hidden representations across layers and target concepts, and can be extended to new models with little effort. Conclusion: VLM-Lens provides a flexible, easy-to-use, and extensible platform that helps the community understand and improve vision-language models. Abstract: We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 16 state-of-the-art base VLMs and more than 30 of their variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate its usage with two simple analytical experiments, revealing systematic differences in the hidden representations of VLMs across layers and target concepts. VLM-Lens is released as an open-sourced project to accelerate community efforts in understanding and improving VLMs.

[77] Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems

Siddhant Arora,Jinchuan Tian,Hayato Futami,Jiatong Shi,Yosuke Kashiwagi,Emiru Tsunoo,Shinji Watanabe

Main category: cs.CL

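TL;DR: Proposes SCoT, a streaming chain-of-thought (CoT) framework for duplex spoken dialogue systems that processes user input and generates responses block by block, improving coherence and interpretability while supporting low-latency, overlapping interaction.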

Details

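Motivation: Existing end-to-end dialogue systems rely on voice activity detection (VAD) for turn-taking, but VAD struggles to tell pauses from turn completions; duplex models can predict output continuously, but they have complex architectures and weaker semantic reasoning. Method: Proposes the SCoT framework, a streaming chain-of-thought mechanism that alternates between processing fixed-duration user input and generating responses block by block, using frame-level alignments to create aligned intermediate targets (transcripts and responses) for each block. Result: Experiments show that the method produces more coherent and interpretable responses than existing duplex methods, and that it supports lower latency and overlapping user-system interaction compared with turn-by-turn systems. Conclusion: SCoT addresses the limitations of conventional VAD-based and duplex models, improving semantic reasoning and response quality while keeping latency low. Abstract: Most end-to-end (E2E) spoken dialogue systems (SDS) rely on voice activity detection (VAD) for turn-taking, but VAD fails to distinguish between pauses and turn completions. Duplex SDS models address this by predicting output continuously, including silence tokens, thus removing the need for explicit VAD. However, they often have complex dual-channel architecture and lag behind cascaded models in semantic reasoning. To overcome these challenges, we propose SCoT: a Streaming Chain-of-Thought (CoT) framework for Duplex SDS, alternating between processing fixed-duration user input and generating responses in a blockwise manner. Using frame-level alignments, we create intermediate targets (aligned user transcripts and system responses) for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods while supporting lower-latency and overlapping interactions compared to turn-by-turn systems.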

[78] The Disparate Impacts of Speculative Decoding

Jameson Sandler,Ahmet Üstün,Marco Romanelli,Sara Hooker,Ferdinando Fioretto

Main category: cs.CL

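TL;DR: This paper analyzes the speed-up that speculative decoding delivers across tasks and finds that it is smaller for under-fit and underrepresented tasks, which is a form of unfairness. Guided by a theoretical analysis, it proposes a mitigation strategy that improves the fairness metric by 12% on average across several model pairs.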

Details

Motivation: Speculative decoding accelerates LLM inference, but its speed-up may be uneven across tasks and may widen the performance gap for under-fit or underrepresented tasks, so this unfairness needs to be studied and mitigated. Method: Quantifies the speed-up differences across tasks through theoretical analysis, identifies the key factors behind the unfair speed-ups, proposes a mitigation strategy based on these insights to reduce the disparities, and validates it experimentally on several model pairs. Result: The speed-up from speculative decoding is uneven across tasks and smaller for under-fit tasks; the proposed strategy reduces this disparity, improving the fairness metric by 12% on average across several model pairs. Conclusion: The speed-up from speculative decoding is unfairly distributed across tasks, and the proposed mitigation strategy substantially improves the situation, pointing toward fairer and more efficient inference methods. Abstract: The practice of speculative decoding, whereby inference is probabilistically supported by a smaller, cheaper, "drafter" model, has become a standard technique for systematically reducing the decoding time of large language models. This paper conducts an analysis of speculative decoding through the lens of its potential disparate speed-up rates across tasks. Crucially, the paper shows that speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing for under-fit, and often underrepresented tasks. To better understand this phenomenon, we derive an analysis to quantify this observed "unfairness" and draw attention to the factors that motivate such disparate speed-ups to emerge. Further, guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, revealing on average a 12% improvement in our fairness metric.
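To give some intuition for why speed-up can differ across tasks, the sketch below uses the standard idealized model of speculative decoding, in which each drafted token is accepted independently with probability `alpha` and the drafter proposes `gamma` tokens per step. This model and the parameter names are assumptions brought in for illustration; they are not the paper's own analysis or fairness metric. Under that model the expected number of tokens produced per target-model call is (1 - alpha^(gamma+1)) / (1 - alpha), so tasks where the drafter agrees with the target less often gain less.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model call.

    Assumes an idealized i.i.d. acceptance rate `alpha` for drafted tokens
    and `gamma` drafted tokens per step (illustrative model only).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus one token from the target model.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates for a well-fit and an under-fit task.
well_fit = expected_tokens_per_step(alpha=0.8, gamma=4)   # about 3.36
under_fit = expected_tokens_per_step(alpha=0.4, gamma=4)  # about 1.65
```

With the same drafter and draft length, the hypothetical under-fit task yields roughly half as many tokens per target call, which is the shape of disparity the abstract describes.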

[79] RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization

Zhaoning Yu,Will Su,Leitian Tao,Haozhu Wang,Aashu Singh,Hanchao Yu,Jianyu Wang,Hongyang Gao,Weizhe Yuan,Jason Weston,Ping Yu,Jing Xu

Main category: cs.CL

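TL;DR: RESTRAIN is a reinforcement learning framework that needs no gold labels. It uses a self-penalization mechanism to improve LLM reasoning from unlabeled data, delivering large gains on several challenging reasoning benchmarks and approaching the results of gold-label training.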

Details

Motivation: Reinforcement learning on human-annotated data has improved LLM reasoning, but it depends on large amounts of labeled data and falters on harder tasks; a method is needed that learns continually from unlabeled data without human annotation. Method: Proposes RESTRAIN, which uses the model's own answer distribution as the learning signal, penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains, and integrates into policy optimization methods such as GRPO for continual self-improvement without supervision. Result: On challenging reasoning benchmarks, with Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, Pass@1 improves by up to +140.7% on AIME25, +36.2% on MMLU_STEM, and +19.6% on GPQA-Diamond, close to training with gold labels. Conclusion: RESTRAIN offers a scalable path to stronger reasoning without gold labels and shows the potential of unsupervised self-improvement on complex reasoning tasks. Abstract: Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLU_STEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.

[80] Learning to Reason for Hallucination Span Detection

Hsuan Su,Ting-Yao Hu,Hema Swetha Koppula,Kundan Krishna,Hadi Pouransari,Cheng-Yu Hsieh,Cem Koc,Joseph Yitan Cheng,Oncel Tuzel,Raviteja Vemulapalli

Main category: cs.CL

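TL;DR: Proposes RL4HS, a reinforcement learning framework for detecting hallucinated spans in LLM-generated content. With a fine-grained, span-level reward, it outperforms existing methods on the RAGTruth benchmark.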

Details

Motivation: Existing methods mostly treat hallucination detection as binary classification, which does not meet the practical need to locate hallucinated spans, so a finer-grained detection method is required. Method: Proposes the RL4HS framework, which combines Chain-of-Thought reasoning with reinforcement learning and trains the model with a span-level reward function, building on Group Relative Policy Optimization and adding Class-Aware Policy Optimization. Result: Experiments on several RAGTruth tasks show that RL4HS outperforms pretrained reasoning models and supervised fine-tuning, confirming the value of span-level rewards for hallucination span detection. Conclusion: Explicit reasoning combined with span-level reinforcement learning improves hallucination span detection, and RL4HS offers a new approach to complex hallucination detection tasks. Abstract: Large language models (LLMs) often generate hallucinations -- unsupported content that undermines reliability. While most prior works frame hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision making process. This naturally raises the question of whether explicit reasoning can help the complex task of detecting hallucination spans. To answer this question, we first evaluate pretrained models with and without Chain-of-Thought (CoT) reasoning, and show that CoT reasoning has the potential to generate at least one correct answer when sampled multiple times. Motivated by this, we propose RL4HS, a reinforcement learning framework that incentivizes reasoning with a span-level reward function. RL4HS builds on Group Relative Policy Optimization and introduces Class-Aware Policy Optimization to mitigate the reward imbalance issue. Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the necessity of reinforcement learning with span-level rewards for detecting hallucination spans.
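The abstract does not spell out the reward, so the following is only a hedged sketch of what a span-level reward combined with group-relative normalization could look like. The character-overlap F1 used as the reward, the function names, and the use of the population standard deviation are all assumptions made for illustration; the Class-Aware Policy Optimization component is not modeled here.

```python
from statistics import mean, pstdev


def span_f1(predicted: list, gold: list) -> float:
    """Character-overlap F1 between two lists of half-open (start, end) spans.

    Hypothetical span-level reward; the paper's exact reward may differ.
    """
    pred_chars = {i for start, end in predicted for i in range(start, end)}
    gold_chars = {i for start, end in gold for i in range(start, end)}
    if not pred_chars and not gold_chars:
        return 1.0  # correctly predicting "no hallucination"
    overlap = len(pred_chars & gold_chars)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards: list) -> list:
    """Normalize rewards within one group of sampled responses (GRPO-style).

    Uses the population standard deviation; this choice is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]
```

For example, a prediction `[(0, 10)]` scored against the gold span `[(5, 15)]` overlaps on 5 of 10 characters on each side, giving a reward of 0.5.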

[81] ARUQULA -- An LLM based Text2SPARQL Approach using ReAct and Knowledge Graph Exploration Utilities

Felix Brei,Lorenz Bühmann,Johannes Frey,Daniel Gerber,Lars-Peter Meyer,Claus Stadler,Kirill Bulert

Main category: cs.CL

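TL;DR: This paper presents a generalized method based on SPINACH that uses an LLM to translate natural language questions into SPARQL queries iteratively, lowering the barrier to querying knowledge graphs.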

Details

Motivation: SPARQL is hard to use for people without a computer science background, so a way to lower the barrier is needed. LLMs can help through Text2SPARQL translation, and this work was motivated by the Text2SPARQL challenge and aims to advance the field. Method: Uses the LLM-backed agent SPINACH to translate natural language questions into SPARQL queries as an iterative process of exploration and execution, not in a single shot. The paper describes the system architecture and the reasoning behind the design decisions. Result: Provides a thorough analysis of the agent's behavior, showing how it performs on the Text2SPARQL task and identifying areas for future improvement. Conclusion: The iterative approach lowers the difficulty of interacting with knowledge graphs, shows the potential of LLMs for semantic parsing, and provides a basis for further optimization. Abstract: Interacting with knowledge graphs can be a daunting task for people without a background in computer science since the query language that is used (SPARQL) has a high barrier of entry. Large language models (LLMs) can lower that barrier by providing support in the form of Text2SPARQL translation. In this paper we introduce a generalized method based on SPINACH, an LLM backed agent that translates natural language questions to SPARQL queries not in a single shot, but as an iterative process of exploration and execution. We describe the overall architecture and reasoning behind our design decisions, and also conduct a thorough analysis of the agent behavior to gain insights into future areas for targeted improvements. This work was motivated by the Text2SPARQL challenge, a challenge that was held to facilitate improvements in the Text2SPARQL domain.

[82] Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents

Lingzhong Dong,Ziqi Zhou,Shuaibo Yang,Haiyue Sheng,Pengzhou Cheng,Zongru Wu,Zheng Wu,Gongshen Liu,Zhuosheng Zhang

Main category: cs.CL

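TL;DR: This paper proposes a new evaluation framework for diagnosing reasoning-execution gaps in mobile-use agents powered by vision-language models. Its core is the Ground-Truth Alignment (GTA) metric, which measures whether chain-of-thought reasoning agrees with the ground-truth action; combined with Exact Match (EM), it reveals execution gaps and reasoning gaps that are widespread and persist even as model size grows.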

Details

Motivation: Existing evaluations of mobile agents focus only on execution accuracy and ignore whether chain-of-thought (CoT) reasoning aligns with the ground-truth action. Users may therefore over-trust reasoning that looks plausible but is wrong, which creates safety risks, so a framework is needed that evaluates the consistency of reasoning and execution together. Method: Proposes the Ground-Truth Alignment (GTA) metric, which checks whether the action implied by the CoT matches the ground-truth action, and combines it with the standard Exact Match (EM) metric to identify two kinds of reasoning-execution gap: the Execution Gap (EG) and the Reasoning Gap (RG). Result: Reasoning-execution gaps are widespread, and execution gaps are more frequent than reasoning gaps; scaling up model size reduces the overall gap, but sizable execution gaps remain in large models; the framework reliably reflects systematic EG/RG patterns in state-of-the-art models. Conclusion: The proposed framework helps diagnose inconsistencies between reasoning and execution in mobile agents and provides a basis for building more trustworthy agents. Abstract: Mobile-use agents powered by vision-language models (VLMs) have shown great potential in interpreting natural language instructions and generating corresponding actions based on the mobile graphical user interface. Recent studies suggest that incorporating chain-of-thought (CoT) reasoning tends to improve the execution accuracy. However, existing evaluations emphasize execution accuracy while neglecting whether CoT reasoning aligns with ground-truth actions. This oversight fails to assess potential reasoning-execution gaps, which in turn foster over-trust: users relying on seemingly plausible CoTs may unknowingly authorize harmful actions, potentially resulting in financial loss or a trust crisis. In this work, we introduce a new evaluation framework to diagnose reasoning-execution gaps. At its core lies Ground-Truth Alignment (GTA), which measures whether the action implied by a CoT matches the ground-truth action. By combining GTA with the standard Exact Match (EM) metric, we jointly assess both the reasoning accuracy and execution accuracy. This joint perspective reveals two types of reasoning-execution gaps: (i) Execution Gap (EG), where the reasoning correctly identifies the correct action but execution fails, and (ii) Reasoning Gap (RG), where execution succeeds but the reasoning process conflicts with the actual execution. Experimental results across a wide range of mobile interaction tasks reveal that reasoning-execution gaps are prevalent, with execution gaps occurring more frequently than reasoning gaps.
Moreover, while scaling up model size reduces the overall gap, sizable execution gaps persist even in the largest models. Further analysis shows that our framework reliably reflects systematic EG/RG patterns in state-of-the-art models. These findings offer concrete diagnostics and support the development of more trustworthy mobile-use agents.
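The two gap types follow directly from the pair of per-step judgments described in the abstract: GTA (does the action implied by the CoT match the ground truth?) and EM (does the executed action match the ground truth?). The sketch below shows that bookkeeping; the function names and the labels for the two non-gap cases are illustrative, and how GTA itself is judged is outside the scope of this example.

```python
from collections import Counter


def classify_step(gta: bool, em: bool) -> str:
    """Map one step's (GTA, EM) pair to an outcome category."""
    if gta and em:
        return "aligned_correct"   # reasoning and execution both right
    if gta and not em:
        return "execution_gap"     # EG: right reasoning, failed execution
    if em and not gta:
        return "reasoning_gap"     # RG: right execution, conflicting reasoning
    return "both_wrong"


def gap_rates(steps: list) -> dict:
    """Fraction of steps in each category, given a list of (gta, em) pairs."""
    categories = ("aligned_correct", "execution_gap", "reasoning_gap", "both_wrong")
    counts = Counter(classify_step(gta, em) for gta, em in steps)
    total = len(steps)
    return {name: (counts[name] / total if total else 0.0) for name in categories}
```

For instance, `gap_rates([(True, True), (True, False), (False, True), (True, False)])` reports an execution-gap rate of 0.5 and a reasoning-gap rate of 0.25 on that made-up sample.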

[83] More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration

Xiaoyang Yuan,Yujuan Ding,Yi Bin,Wenqi Shao,Jinyu Cai,Jingkuan Song,Yang Yang,Hengtao Shen

Main category: cs.CL

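TL;DR: Proposes AMPO, an adaptive multi-guidance policy optimization framework that draws guidance from multiple teacher models only when needed, improving the reasoning ability and generalization of large language models.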

Details

Motivation: Existing reinforcement learning methods rely on a single teacher or on self-exploration to elicit long chains of thought, which can introduce model bias and limit exploration diversity, hurting reasoning performance. Method: Introduces multi-teacher guidance with a "guidance-on-demand" strategy: when the model fails, it adaptively learns from the reasoning paths of several proficient teachers that it is most likely to comprehend, using a comprehension-based selection mechanism to balance exploration and exploitation. Result: AMPO improves on the GRPO baseline by 4.3% on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, and significantly raises Pass@k performance and reasoning diversity; with four peer-sized teachers it matches approaches that use a single, more powerful teacher. Conclusion: AMPO offers a more efficient and scalable path to better reasoning and generalization in LLMs and confirms the value of cooperative multi-teacher guidance in reinforcement learning. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm for enhancing the reasoning ability in Large Language Models (LLMs). However, prevailing methods primarily rely on self-exploration or a single off-policy teacher to elicit long chain-of-thought (LongCoT) reasoning, which may introduce intrinsic model biases and restrict exploration, ultimately limiting reasoning diversity and performance. Drawing inspiration from multi-teacher strategies in knowledge distillation, we introduce Adaptive Multi-Guidance Policy Optimization (AMPO), a novel framework that adaptively leverages guidance from multiple proficient teacher models, but only when the on-policy model fails to generate correct solutions. This "guidance-on-demand" approach expands exploration while preserving the value of self-discovery. Moreover, AMPO incorporates a comprehension-based selection mechanism, prompting the student to learn from the reasoning paths that it is most likely to comprehend, thus balancing broad exploration with effective exploitation. Extensive experiments show AMPO substantially outperforms a strong baseline (GRPO), with a 4.3% improvement on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, while significantly boosting Pass@k performance and enabling more diverse exploration. Notably, using four peer-sized teachers, our method achieves comparable results to approaches that leverage a single, more powerful teacher (e.g., DeepSeek-R1) with more data. These results demonstrate a more efficient and scalable path to superior reasoning and generalizability. Our code is available at https://github.com/SII-Enigma/AMPO.

[84] Enhanced Arabic-language cyberbullying detection: deep embedding and transformer (BERT) approaches

Ebtesam Jaber Aljohani,Wael M. S. Yafoo

Main category: cs.CL

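TL;DR: This paper proposes and evaluates several deep learning models for detecting Arabic-language cyberbullying. Using a dataset of 10,662 posts from X and combining various word embeddings with pre-trained BERT models, it finds that Bi-LSTM with FastText performs best, reaching 98% accuracy.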

Details

Motivation: Research on Arabic-language cyberbullying detection is scarce, while social media threatens the emotional well-being of young people, so effective automated detection methods are urgently needed. Method: Collects and pre-processes 10,662 Arabic-language X posts and uses the kappa tool to improve annotation quality; the experiments compare LSTM and Bi-LSTM models combined with different word embeddings (such as FastText) and with BERT. Result: The LSTM-BERT and Bi-LSTM-BERT models reach 97% accuracy; Bi-LSTM with FastText word embeddings performs best at 98% accuracy. Conclusion: Bi-LSTM combined with FastText is currently the most effective method for detecting Arabic-language cyberbullying, and the results generalize well. Abstract: Recent technological advances in smartphones and communications, including the growth of massive social media networks such as X (formerly known as Twitter), endanger young people and their emotional well-being by exposing them to cyberbullying, taunting, and bullying content. Most proposed approaches for automatically detecting cyberbullying have been developed around the English language, and methods for detecting Arabic-language cyberbullying are especially scarce. This paper aims to enhance the effectiveness of methods for detecting cyberbullying in Arabic-language content. We assembled a dataset of 10,662 X posts, pre-processed the data, and used the kappa tool to verify and enhance the quality of our annotations. We conducted four experiments to test numerous deep learning models for automatically detecting Arabic-language cyberbullying. We first tested a long short-term memory (LSTM) model and a bidirectional long short-term memory (Bi-LSTM) model with several experimental word embeddings. We also tested the LSTM and Bi-LSTM models with a novel pre-trained Bidirectional Encoder Representations from Transformers (BERT) model and then tested them again with a different experimental BERT model. LSTM-BERT and Bi-LSTM-BERT demonstrated a 97% accuracy. Bi-LSTM with FastText word embeddings performed even better, achieving 98% accuracy. As a result, the outcomes are generalize
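The abstract mentions using "the kappa tool" to check annotation quality without saying which kappa statistic was used. As a point of reference, here is a minimal sketch of Cohen's kappa for two annotators, assuming that is the statistic intended; the labels in the usage note are made up.

```python
from collections import Counter


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[l] * freq_b[l] for l in set(freq_a) | set(freq_b)) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)
```

With hypothetical labels `[1, 1, 0, 0]` and `[1, 0, 0, 0]` (1 = bullying, 0 = not), observed agreement is 0.75, expected agreement is 0.5, and kappa is 0.5.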

[85] AccurateRAG: A Framework for Building Accurate Retrieval-Augmented Question-Answering Applications

Linh The Nguyen,Chi Tran,Dung Ngoc Nguyen,Van-Cuong Pham,Hoang Ngo,Dat Quoc Nguyen

Main category: cs.CL

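TL;DR: Proposes AccurateRAG, a framework for building high-performance retrieval-augmented question-answering systems that achieves state-of-the-art results on several benchmark datasets.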

Details

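Motivation: To improve the performance and development efficiency of retrieval-augmented generation (RAG) for question answering. Method: Designs a complete, local RAG development pipeline with modules for data processing, fine-tuning data generation, text embedding, LLM fine-tuning, and output evaluation. Result: Experiments show that the framework outperforms previous strong baselines and sets new state-of-the-art question-answering performance on several benchmark datasets. Conclusion: AccurateRAG is an effective and complete solution for efficiently building high-performance RAG question-answering systems. Abstract: We introduce AccurateRAG -- a novel framework for constructing high-performance question-answering applications based on retrieval-augmented generation (RAG). Our framework offers a pipeline for development efficiency with tools for raw dataset processing, fine-tuning data generation, text embedding & LLM fine-tuning, output evaluation, and building RAG systems locally. Experimental results show that our framework outperforms previous strong baselines and obtains new state-of-the-art question-answering performance on benchmark datasets.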

[86] Explore Briefly, Then Decide: Mitigating LLM Overthinking via Cumulative Entropy Regulation

Tianyi Jiang,Yi Bin,Yujuan Ding,Kainian Zhu,Fei Ma,Jingkuan Song,Heng Tao Shen

Main category: cs.CL

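TL;DR: Proposes a new reasoning paradigm, "Explore Briefly, Then Decide", together with a Cumulative Entropy Regulation (CER) mechanism that uses the TECA metric to control reasoning depth dynamically. The approach mitigates LLM overthinking on simple problems, substantially shortening responses while preserving problem-solving ability.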

Details

Motivation: LLMs often overthink simple problems, which is inefficient and makes it hard for them to adapt reasoning depth to problem complexity. Method: Introduces Token Entropy Cumulative Average (TECA) as a measure of how much exploration occurs during reasoning, and combines it with the explore-then-decide paradigm and the Cumulative Entropy Regulation (CER) mechanism to decide dynamically when to stop reasoning. Result: Across several mathematical benchmarks, the method substantially reduces overthinking while preserving problem-solving performance, cutting average response length by up to 71% on simpler datasets. Conclusion: TECA and CER enable adaptive control of the reasoning process and improve the efficiency and adaptability of LLM reasoning across problems of different complexity. Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities on complex problems using long Chain-of-Thought (CoT) reasoning. However, they often suffer from overthinking, meaning generating unnecessarily lengthy reasoning steps for simpler problems. This issue may degrade the efficiency of the models and make it difficult for them to adapt the reasoning depth to the complexity of problems. To address this, we introduce a novel metric Token Entropy Cumulative Average (TECA), which measures the extent of exploration throughout the reasoning process. We further propose a novel reasoning paradigm -- Explore Briefly, Then Decide -- with an associated Cumulative Entropy Regulation (CER) mechanism. This paradigm leverages TECA to help the model dynamically determine the optimal point to conclude its thought process and provide a final answer, thus achieving efficient reasoning. Experimental results across diverse mathematical benchmarks show that our approach substantially mitigates overthinking without sacrificing problem-solving ability. With our thinking paradigm, the average response length decreases by up to 71% on simpler datasets, demonstrating the effectiveness of our method in creating a more efficient and adaptive reasoning process.
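The abstract names the TECA metric but does not give its formula. Reading the name literally, a plausible interpretation is the running mean of per-token predictive entropy over the reasoning trace. The sketch below implements that reading as an assumption; the use of natural logarithms and the function names are illustrative, and the paper's exact definition and the CER mechanism built on it may differ.

```python
import math


def token_entropy(probs: list) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def teca_trace(step_distributions: list) -> list:
    """Running average of token entropy after each generated token.

    Assumed reading of "Token Entropy Cumulative Average": the value at
    step t is the mean entropy of tokens 1..t.
    """
    trace = []
    running_sum = 0.0
    for t, probs in enumerate(step_distributions, start=1):
        running_sum += token_entropy(probs)
        trace.append(running_sum / t)
    return trace
```

Under this reading, a trace that starts uncertain and then becomes confident, for example `teca_trace([[0.5, 0.5], [1.0]])`, falls from ln 2 to (ln 2)/2, which is the kind of signal a stopping rule could monitor.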

[87] InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents

Yaxin Du,Yuanshuo Zhang,Xiyuan Yang,Yifan Zhou,Cheng Wang,Gongyi Zou,Xianghe Pang,Wenhao Wang,Menglan Chen,Shuo Tang,Zhiyu Li,Siheng Chen

Main category: cs.CL

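TL;DR: This paper introduces InfoMosaic-Bench, the first benchmark for evaluating multi-source information seeking in tool-augmented agents, covering six domains including medicine, finance, and maps. Experiments show that current LLMs still fall well short in tool use and in integrating information from multiple sources.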

Details

Motivation: Existing LLM agents rely on open-web search, which is noisy and lacks specialized domain knowledge, and it is unclear whether agents can effectively combine general-purpose search with specialized tools to solve complex tasks. Method: Proposes the InfoMosaic-Bench benchmark and the InfoMosaic-Flow generation pipeline, which builds reliable, non-trivial multi-source tasks by grounding them in verified tool outputs, enforcing cross-source dependencies, and filtering out trivial cases. Result: Experiments with 14 state-of-the-art LLM agents show that web information alone is insufficient (GPT-5 reaches 38.2% accuracy), that domain tools bring limited and inconsistent benefits, and that 22.4% of failures come from incorrect tool selection or usage. Conclusion: Current tool-augmented LLM agents still face major challenges in multi-source information integration and tool calling, and further research is needed to improve their tool use and decision reliability. Abstract: Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open-web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real-world tasks require precise, domain-specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools -- and more importantly, whether they can integrate them with general-purpose search to solve complex tasks. Therefore, we introduce InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Covering six representative domains (medicine, finance, maps, video, web, and multi-domain integration), InfoMosaic-Bench requires agents to combine general-purpose search with domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non-triviality.
Experiments with 14 state-of-the-art LLM agents reveal three findings: (i) web information alone is insufficient, with GPT-5 achieving only 38.2% accuracy and 67.5% pass rate; (ii) domain tools provide selective but inconsistent benefits, improving some domains while degrading others; and (iii) 22.4% of failures arise from incorrect tool usage or selection, highlighting that current LLMs still struggle with even basic tool handling.

[88] Parallel Scaling Law: Unveiling Reasoning Generalization through A Cross-Linguistic Perspective

Wen Yang,Junhong Wu,Chong Li,Chengqing Zong,Jiajun Zhang

Main category: cs.CL

TL;DR: 本研究从跨语言视角探讨了基于强化学习的推理泛化能力，发现英语中心的大型推理模型在跨语言迁移中表现差异显著，并提出了衡量跨语言可迁移性的新指标。实验表明，引入单一平行语言即可带来性能跃升（First-Parallel Leap），且跨语言迁移遵循平行扩展定律；同时揭示了单语泛化差距，挑战了当前推理模型模拟人类认知的假设。

Details

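Motivation: Existing work mainly studies how reinforcement post-training generalizes across tasks or modalities and overlooks cross-lingual reasoning generalization. This paper asks whether reasoning ability trained in English transfers effectively to other languages, with the aim of advancing more language-agnostic large reasoning models. Method: Systematically evaluates English-centric large reasoning models on multilingual reasoning benchmarks, proposes a metric that quantifies cross-lingual transferability, and runs interventional and parallel training studies to analyze transfer across models, languages, and training paradigms. Result: Cross-lingual transfer depends strongly on the initial model, the target language, and the training paradigm; models with stronger English capabilities rely more on English-specific patterns, which reduces cross-lingual generalization; adding the first parallel language brings a substantial performance gain (First-Parallel Leap); cross-lingual reasoning transfer follows a power law (Parallel Scaling Law); and a Monolingual Generalization Gap shows that current models do not fully generalize across languages. Conclusion: English-centric reasoning training limits cross-lingual generalization, and English-only reinforcement post-training is not enough to build truly language-agnostic reasoning systems. Adding multilingual parallel data markedly improves transfer, and future work should emphasize language-balanced training to approach more human-like general reasoning. Abstract: Recent advancements in Reinforcement Post-Training (RPT) have significantly enhanced the capabilities of Large Reasoning Models (LRMs), sparking increased interest in the generalization of RL-based reasoning. While existing work has primarily focused on investigating its generalization across tasks or modalities, this study proposes a novel cross-linguistic perspective to investigate reasoning generalization. This raises a crucial question: "Does the reasoning capability achieved from English RPT effectively transfer to other languages?" We address this by systematically evaluating English-centric LRMs on multilingual reasoning benchmarks and introducing a metric to quantify cross-lingual transferability. Our findings reveal that cross-lingual transferability varies significantly across initial model, target language, and training paradigm. Through interventional studies, we find that models with stronger initial English capabilities tend to over-rely on English-specific patterns, leading to diminished cross-lingual generalization. To address this, we conduct a thorough parallel training study. Experimental results yield three key findings: First-Parallel Leap, a substantial leap in performance when transitioning from monolingual to just a single parallel language, and a predictable Parallel Scaling Law, revealing that cross-lingual reasoning transfer follows a power-law with the number of training parallel languages.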
Moreover, we identify the discrepancy between actual monolingual performance and the power-law prediction as $\textbf{Monolingual Generalization Gap}$, indicating that English-centric LRMs fail to fully generalize across languages. Our study challenges the assumption that LRM reasoning mirrors human cognition, providing critical insights for the development of more language-agnostic LRMs.
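The power-law relationship and the gap measured against its prediction can be illustrated with a small sketch. The functional form `y = a * x**b`, the function names, and the numbers below are assumptions made for illustration; the paper's exact parameterization and data are not given in this digest.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b, done as a line fit in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx                      # exponent = slope in log-log space
    a = math.exp(mean_y - b * mean_x)  # prefactor = exp(intercept)
    return a, b

def generalization_gap(a, b, x, observed):
    """Power-law prediction at x minus the observed score (sign is an assumed convention)."""
    return a * x ** b - observed

# Synthetic, noise-free values generated from y = 2 * x**0.5 (not real results).
num_parallel_languages = [1, 2, 4, 8]
scores = [2.0 * n ** 0.5 for n in num_parallel_languages]
a, b = fit_power_law(num_parallel_languages, scores)
```

On real measurements the fit would be noisy, and the gap would be read off by comparing an observed monolingual score with the fitted curve's prediction.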

[89] F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

Ziyin Zhang,Zihan Liao,Hang Yu,Peng Di,Rui Wang

Main category: cs.CL

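TL;DR: F2LLM is a family of efficient embedding models finetuned directly from foundation models, available in three sizes (0.6B, 1.7B, and 4B). The models perform strongly on the MTEB leaderboard, are cheap to train and reproducible, and the models, dataset, and code are all open-sourced.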

Details

Motivation: Existing high-performing embedding models depend on large-scale contrastive pretraining and expensive synthetic data, which makes them costly to train and hard to reproduce, so a cheaper and more efficient alternative is needed. Method: F2LLM is finetuned directly from foundation LLMs on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, avoiding complex training pipelines and synthetic data generation. Result: On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, and F2LLM-1.7B ranks 1st among models in the 1B-2B size range. Conclusion: F2LLM balances performance, model size, and training cost well, and serves as a strong, reproducible, and budget-friendly baseline that supports further research on and application of embedding models. Abstract: We introduce F2LLM - Foundation to Feature Large Language Models, a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is directly finetuned from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future work.

[90] Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation

Raphael Tang,Crystina Zhang,Wenyan Li,Carmen Lai,Pontus Stenetorp,Yao Lu

Main category: cs.CL

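TL;DR: This paper questions the common practice in arena-style LLM evaluation of treating a draw as evidence that two models are equally capable, and argues that draws mostly reflect query difficulty. Experiments on three real-world datasets show that skipping rating updates for draws gives a 1-3% relative increase in prediction accuracy for all four rating systems studied. Draws are also more common on very easy or highly objective queries, so future rating systems should reconsider draw semantics and account for query properties.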

Details

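Motivation: Current LLM evaluation treats a draw as meaning that two models are equally capable and handles it with chess-style Elo rating systems, an assumption that may not hold. This paper examines whether a draw really indicates comparable ability or instead reflects properties of the query itself, such as difficulty or objectivity, and thereby challenges the prevailing rating paradigm. Method: On three real-world LLM arena datasets, the authors compare rating updates with and without draws and measure the effect on battle outcome prediction accuracy (covering wins, losses, and draws). They also analyze how draw occurrence relates to query difficulty and to subjective versus objective queries, quantifying these relationships with risk ratios. Result: For all four rating systems, ignoring rating updates for draws yields a 1-3% relative increase in prediction accuracy. Further analysis shows that draws are more likely on queries rated as very easy (risk ratio 1.37) and highly objective (risk ratio 1.35). Conclusion: A draw should not be read simply as equal model ability; it is more likely the result of a query that is too easy or highly objective. Elo-based rating systems should therefore revisit the semantics of draws and incorporate query properties into rating updates to improve evaluation accuracy. Abstract: In arena-style evaluation of large language models (LLMs), two LLMs respond to a user query, and the user chooses the winning response or deems the "battle" a draw, resulting in an adjustment to the ratings of both models. The prevailing approach for modeling these rating dynamics is to view battles as two-player game matches, as in chess, and apply the Elo rating system and its derivatives. In this paper, we critically examine this paradigm. Specifically, we question whether a draw genuinely means that the two models are equal and hence whether their ratings should be equalized. Instead, we conjecture that draws are more indicative of query difficulty: if the query is too easy, then both models are more likely to succeed equally. On three real-world arena datasets, we show that ignoring rating updates for draws yields a 1-3% relative increase in battle outcome prediction accuracy (which includes draws) for all four rating systems studied. Further analyses suggest that draws occur more for queries rated as very easy and those rated as highly objective, with risk ratios of 1.37 and 1.35, respectively. We recommend that future rating systems reconsider existing draw semantics and account for query properties in rating updates.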
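To make the idea of ignoring rating updates for draws concrete, here is a minimal sketch using the standard Elo formulas. The `skip_draws` flag, the K-factor of 32, and the function names are illustrative assumptions; the paper studies four rating systems whose exact configurations are not given here.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, outcome, k=32.0, skip_draws=True):
    """Return updated (r_a, r_b).

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a draw.
    With skip_draws=True a draw leaves both ratings untouched,
    instead of pulling the two ratings toward each other.
    """
    if skip_draws and outcome == 0.5:
        return r_a, r_b
    delta = k * (outcome - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta
```

Under the conventional update, a draw between a 1600-rated and a 1400-rated model lowers the first rating and raises the second; with `skip_draws=True` both stay where they are.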

cs.CV

[91] LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration

Alessio Spagnoletti,Andrés Almansa,Marcelo Pereyra

Main category: cs.CV

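TL;DR: This paper proposes LVTINO, the first zero-shot high-definition video restoration method built on Video Consistency Models (VCMs). By using VCMs to capture temporal causality, it achieves state-of-the-art video reconstruction quality while maintaining measurement consistency and smooth temporal transitions.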

Details

Motivation: Image-based latent diffusion models applied frame by frame produce temporally inconsistent video, and so cannot meet the dual demands of high-definition video restoration: fine spatial detail and subtle temporal dependencies. Method: The authors build on recent Video Consistency Models (VCMs), which distill video latent diffusion models into fast generators, and propose the LVTINO framework, whose conditioning mechanism needs no automatic differentiation and restores video efficiently with only a few neural function evaluations. Result: Extensive experiments on a range of video inverse problems show that LVTINO clearly outperforms current state-of-the-art frame-by-frame image LDM methods in perceptual quality, reconstruction fidelity, and computational efficiency, and produces more temporally coherent reconstructions. Conclusion: LVTINO extends zero-shot diffusion priors to high-definition video restoration. By explicitly modeling temporal causality, it offers an efficient, high-quality solver for video inverse problems and sets a new performance benchmark. Abstract: Computational imaging methods increasingly rely on powerful generative diffusion models to tackle challenging image restoration tasks. In particular, state-of-the-art zero-shot image inverse solvers leverage distilled text-to-image latent diffusion models (LDMs) to achieve unprecedented accuracy and perceptual quality with high computational efficiency. However, extending these advances to high-definition video restoration remains a significant challenge, due to the need to recover fine spatial detail while capturing subtle temporal dependencies. Consequently, methods that naively apply image-based LDM priors on a frame-by-frame basis often result in temporally inconsistent reconstructions. We address this challenge by leveraging recent advances in Video Consistency Models (VCMs), which distill video latent diffusion models into fast generators that explicitly capture temporal causality. Building on this foundation, we propose LVTINO, the first zero-shot or plug-and-play inverse solver for high definition video restoration with priors encoded by VCMs. Our conditioning mechanism bypasses the need for automatic differentiation and achieves state-of-the-art video reconstruction quality with only a few neural function evaluations, while ensuring strong measurement consistency and smooth temporal transitions across frames. Extensive experiments on a diverse set of video inverse problems show significant perceptual improvements over current state-of-the-art methods that apply image LDMs frame by frame, establishing a new benchmark in both reconstruction fidelity and computational efficiency.

[92] Image Generation Based on Image Style Extraction

Shuochen Chang

Main category: cs.CV

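TL;DR: Proposes a style-extraction-based image generation method with three-stage training that uses a single style reference image to enable finely controlled text-to-image generation.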

Details

Motivation: Existing text-to-image generation models struggle to describe and control fine-grained styles precisely, and the style information in a reference image is hard to align with textual conditions. Method: A three-stage training method uses a style encoder and a style projection layer to align the style representation extracted from a single reference image with the text representation, then injects it into the generative model without changing the model's structure. Result: The method achieves fine-grained style-controlled image generation without modifying the structure of the downstream generative model, and the authors construct the Style30k-captions dataset for training and validation. Conclusion: The method improves the precision of style control in text-guided image generation and generates well even when only one reference image is available. Abstract: Text-to-image generation models face a practical limitation: fine-grained styles cannot be precisely described and controlled in natural language, while the guidance information in stylized reference images is difficult to align directly with the textual conditions of conventional text-guided generation. This study focuses on how to maximize the generative capability of the pretrained generative model by obtaining fine-grained stylistic representations from a single given stylistic reference image and injecting these representations into the generative model without changing the structural framework of the downstream generative model, so as to achieve fine-grained controlled stylized image generation. We propose a style-extraction-based image generation method with three-stage training, which uses a style encoder and a style projection layer to align the style representations with the textual representations and thereby realize style-guided generation driven by fine-grained textual cues. In addition, this study constructs the Style30k-captions dataset, whose samples are triplets of images, style labels, and text descriptions, to train the style encoder and style projection layer.

[93] EvoStruggle: A Dataset Capturing the Evolution of Struggle across Activities and Skill Levels

Shijia Feng,Michael Wray,Walterio Mayol-Cuevas

Main category: cs.CV

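TL;DR: This paper introduces and releases EvoStruggle, a dataset for determining struggle during skill learning. It contains 61.68 hours of video, 2,793 videos, and 5,385 annotated struggle segments, covering five repetitions per task by 76 participants across 18 tasks. The authors frame struggle determination as temporal action localization and show that existing models can detect struggle on unseen tasks or activities, with an average mAP of 34.56% across tasks and 19.24% across activities, indicating that struggle is transferable but still challenging to detect.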

Details

Motivation: Accurately identifying struggle during skill acquisition is essential for optimizing human learning and for building effective assistive systems. As skills develop, the form and frequency of struggle change, and understanding this evolution helps determine a learner's stage. Existing manipulation datasets, however, do not address how struggle evolves over time. Method: The authors collect EvoStruggle, a large multi-task video dataset with 18 tasks in four activities (tying knots, origami, tangram puzzles, and shuffling cards), with 76 participants repeating each task five times. They define struggle determination as a temporal action localization task and run experiments with Temporal Action Localization models to assess generalization across tasks and across activities. Result: The models reach an average mAP of 34.56% when generalizing across tasks and 19.24% across activities, showing that struggle detection transfers to some degree but leaves room for improvement. The models capture struggle cues across different skill tasks, even for unseen tasks or activity types. Conclusion: Struggle is a concept that transfers across skill-based tasks. By building a large longitudinal dataset, this work advances automatic struggle recognition and provides a data foundation and technical validation for personalized learning-assistance systems. Abstract: The ability to determine when a person struggles during skill acquisition is crucial for both optimizing human learning and enabling the development of effective assistive systems. As skills develop, the type and frequency of struggles tend to change, and understanding this evolution is key to determining the user's current stage of learning. However, existing manipulation datasets have not focused on how struggle evolves over time. In this work, we collect a dataset for struggle determination, featuring 61.68 hours of video recordings, 2,793 videos, and 5,385 annotated temporal struggle segments collected from 76 participants. The dataset includes 18 tasks grouped into four diverse activities -- tying knots, origami, tangram puzzles, and shuffling cards, representing different task variations. In addition, participants repeated the same task five times to capture their evolution of skill. We define the struggle determination problem as a temporal action localization task, focusing on identifying and precisely localizing struggle segments with start and end times. Experimental results show that Temporal Action Localization models can successfully learn to detect struggle cues, even when evaluated on unseen tasks or activities. The models attain an overall average mAP of 34.56% when generalizing across tasks and 19.24% across activities, indicating that struggle is a transferable concept across various skill-based tasks while still posing challenges for further improvement in struggle detection. Our dataset is available at https://github.com/FELIXFENG2019/EvoStruggle.
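Temporal action localization is usually scored by matching predicted segments to annotated ones via temporal intersection-over-union, on which mAP is then computed. The sketch below shows only that overlap measure; the function name and the (start, end) tuple layout are assumptions, not the dataset's actual annotation format.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two time segments given as (start, end) in seconds."""
    (p_start, p_end), (g_start, g_end) = pred, gt
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union if union > 0 else 0.0
```

A predicted struggle segment would typically count as a true positive when its temporal IoU with an annotated segment exceeds a chosen threshold.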

[94] SPUS: A Lightweight and Parameter-Efficient Foundation Model for PDEs

Abu Bucker Siddik,Diane Oyen,Alexander Most,Michal Kucer,Ayan Biswas

Main category: cs.CV

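TL;DR: Proposes the Small PDE U-Net Solver (SPUS), a compact and efficient foundation model for solving many kinds of partial differential equations (PDEs) in a unified way. It uses a lightweight residual U-Net architecture and an auto-regressive pretraining strategy, and achieves state-of-the-art generalization with fewer parameters and little fine-tuning data.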

Details

Motivation: Existing PDE foundation models built on large transformer architectures carry high computational and parameter overhead, and an efficient, general neural operator model is lacking, so a lighter and more efficient unified framework for solving a broad range of PDEs is needed. Method: The authors design SPUS, a neural operator model based on a lightweight residual U-Net, and pretrain it with an auto-regressive strategy that mimics the behavior of numerical solvers. The model is pretrained on diverse fluid dynamics PDE datasets and then evaluated on several unseen PDE tasks. Result: SPUS generalizes well on 6 challenging downstream PDE tasks, reaching state-of-the-art results with far fewer parameters and very little fine-tuning data. Conclusion: SPUS demonstrates the potential of U-Net-style architectures as PDE foundation models and offers a parameter-efficient, easily deployed general PDE solver, pointing to a new direction for lightweight scientific computing models. Abstract: We introduce Small PDE U-Net Solver (SPUS), a compact and efficient foundation model (FM) designed as a unified neural operator for solving a wide range of partial differential equations (PDEs). Unlike existing state-of-the-art PDE FMs, which are primarily based on large complex transformer architectures with high computational and parameter overhead, SPUS leverages a lightweight residual U-Net-based architecture that has been largely underexplored as a foundation model architecture in this domain. To enable effective learning in this minimalist framework, we utilize a simple yet powerful auto-regressive pretraining strategy which closely replicates the behavior of numerical solvers to learn the underlying physics. SPUS is pretrained on a diverse set of fluid dynamics PDEs and evaluated across 6 challenging unseen downstream PDEs spanning various physical systems. Experimental results demonstrate that SPUS using residual U-Net based architecture achieves state-of-the-art generalization on these downstream tasks while requiring significantly fewer parameters and minimal fine-tuning data, highlighting its potential as a highly parameter-efficient FM for solving diverse PDE systems.

[95] DisCo: Reinforcement with Diversity Constraints for Multi-Human Generation

Shubhankar Borse,Farzad Farhadzadeh,Munawar Hayat,Fatih Porikli

Main category: cs.CV

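TL;DR: DisCo is a reinforcement learning framework that uses diversity constraints to optimize identity diversity in multi-human image generation, addressing the duplicated identities, similar faces, and wrong person counts that existing text-to-image models produce in multi-person scenes.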

Details

Motivation: Existing text-to-image models merge identities, duplicate faces, and miscount people when generating multiple humans, and they lack effective control over identity diversity. Method: DisCo fine-tunes flow-matching models with Group-Relative Policy Optimization (GRPO), using a compositional reward that penalizes facial similarity within an image, discourages identity repetition across samples, enforces accurate person counts, and preserves visual fidelity. A single-stage curriculum stabilizes training. Result: On the DiverseHumans Testset, DisCo achieves a Unique Face Accuracy of 98.6 and a near-perfect Global Identity Spread, surpassing both open-source and proprietary methods while maintaining good visual quality. Conclusion: DisCo is the first RL framework to directly optimize identity diversity in multi-human generation. It needs no extra annotations, scales well, resolves the long-standing identity crisis in generative models, and sets a new benchmark for compositional multi-human generation. Abstract: State-of-the-art text-to-image models excel at realism but collapse on multi-human prompts - duplicating faces, merging identities, and miscounting individuals. We introduce DisCo (Reinforcement with Diversity Constraints), the first RL-based framework to directly optimize identity diversity in multi-human generation. DisCo fine-tunes flow-matching models via Group-Relative Policy Optimization (GRPO) with a compositional reward that (i) penalizes intra-image facial similarity, (ii) discourages cross-sample identity repetition, (iii) enforces accurate person counts, and (iv) preserves visual fidelity through human preference scores. A single-stage curriculum stabilizes training as complexity scales, requiring no extra annotations. On the DiverseHumans Testset, DisCo achieves 98.6 Unique Face Accuracy and near-perfect Global Identity Spread - surpassing both open-source and proprietary methods (e.g., Gemini, GPT-Image) while maintaining competitive perceptual quality. Our results establish DisCo as a scalable, annotation-free solution that resolves the long-standing identity crisis in generative models and sets a new benchmark for compositional multi-human generation.
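The core of GRPO is that each sample's reward is normalized against the other samples generated for the same prompt. The sketch below shows that normalization together with a toy four-term reward. The equal weights, argument names, and the linear combination are assumptions made for illustration; the paper's actual reward terms and their scaling are not specified in this digest.

```python
import statistics

def composite_reward(face_similarity, identity_repeat, count_correct, preference,
                     weights=(1.0, 1.0, 1.0, 1.0)):
    """Toy combination of the four reward terms (hypothetical weights and form).

    The two similarity terms are penalties; the count and preference terms are bonuses.
    """
    w_sim, w_rep, w_cnt, w_pref = weights
    return (-w_sim * face_similarity - w_rep * identity_repeat
            + w_cnt * count_correct + w_pref * preference)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Samples that score above their group's mean receive a positive advantage and are reinforced; those below the mean are discouraged.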

[96] GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings

Angel Daruna,Nicholas Meegan,Han-Pang Chiu,Supun Samarasekera,Rakesh Kumar

Main category: cs.CV

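TL;DR: This paper proposes a new geographic representation that aligns the visual representation of a query image with a hierarchy of geographic embeddings, and combines appearance features with semantic segmentation maps to improve worldwide visual geo-localization. It surpasses the previous state of the art on 22 of 25 metrics across five benchmark datasets.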

Details

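Motivation: Existing visual geo-localization methods still fall short in learning geographic representations and struggle to match images to their locations accurately on a global scale, so a more effective mechanism for aligning geographic and visual representations is needed. Method: Geo-localization is formulated as aligning a visual representation with hierarchical geographic embeddings, and the authors introduce a method that efficiently fuses the query image's appearance features with its semantic segmentation map to build a robust visual representation. Result: The method achieves the best results on 22 of 25 metrics across five benchmark datasets, outperforming prior SOTA methods and recent Large Vision-Language Models (LVLMs). Conclusion: Experiments show that the proposed hierarchical geographic representation, together with the enhanced fused visual representation, significantly improves the accuracy of worldwide visual geo-localization, with the gains driven mainly by the combination of geographic and visual representations. Abstract: Worldwide visual geo-localization seeks to determine the geographic location of an image anywhere on Earth using only its visual content. Learned representations of geography for visual geo-localization remain an active research topic despite much progress. We formulate geo-localization as aligning the visual representation of the query image with a learned geographic representation. Our novel geographic representation explicitly models the world as a hierarchy of geographic embeddings. Additionally, we introduce an approach to efficiently fuse the appearance features of the query image with its semantic segmentation map, forming a robust visual representation. Our main experiments demonstrate improved all-time bests in 22 out of 25 metrics measured across five benchmark datasets compared to prior state-of-the-art (SOTA) methods and recent Large Vision-Language Models (LVLMs). Additional ablation studies support the claim that these gains are primarily driven by the combination of geographic and visual representations.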

Nilay Naharas,Dang Nguyen,Nesihan Bulut,Mohammadhossein Bateni,Vahab Mirrokni,Baharan Mirzasoleiman

Main category: cs.CV

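TL;DR: This paper proposes XMAS, a new data selection method based on the similarity of cross-modal attention matrices, for efficient fine-tuning of Large Vision-Language Models (LVLMs). It clusters the singular-value trajectories of attention matrices to identify and remove redundant training examples, substantially reducing data volume and training time while fully preserving performance.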

Details

Motivation: Existing data selection methods perform poorly on LVLMs and cannot even outperform random selection, so a principled method tailored to LVLMs is needed for data-efficient training. Method: XMAS first fine-tunes a small proxy LVLM, extracts the trajectories of the top singular values of the cross-modal attention matrices during training, clusters the examples by these trajectories, and finally samples a balanced subset from the clusters. Result: XMAS can discard 50% of the LLaVA-665k dataset and 85% of the Vision-Flan dataset while fully preserving the performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up its training by 1.2x. This is 30% more data reduction than the best baseline on LLaVA-665k. Conclusion: XMAS is the first effective data selection method for LVLM instruction tuning. It removes redundancy efficiently by exploiting the fact that examples with similar cross-modal attention matrices have similar gradients, and offers a new direction for efficient training of large multimodal models. Abstract: Data-efficient learning aims to eliminate redundancy in large training datasets by training models on smaller subsets of the most informative examples. While data selection has been extensively explored for vision models and large language models (LLMs), it remains underexplored for Large Vision-Language Models (LVLMs). Notably, none of the existing methods can outperform random selection at different subset sizes. In this work, we propose the first principled method for data-efficient instruction tuning of LVLMs. We prove that examples with similar cross-modal attention matrices during instruction tuning have similar gradients. Thus, they influence model parameters in a similar manner and convey the same information to the model during training. Building on this insight, we propose XMAS, which clusters examples based on the trajectories of the top singular values of their attention matrices obtained from fine-tuning a small proxy LVLM. By sampling a balanced subset from these clusters, XMAS effectively removes redundancy in large-scale LVLM training data. Extensive experiments show that XMAS can discard 50% of the LLaVA-665k dataset and 85% of the Vision-Flan dataset while fully preserving performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up its training by 1.2x. This is 30% more data reduction compared to the best baseline for LLaVA-665k. The project's website can be found at https://bigml-cs-ucla.github.io/XMAS-project-page/.

[98] Purrception: Variational Flow Matching for Vector-Quantized Image Generation

Răzvan-Andrei Matişan,Vincent Tao Hu,Grigory Bartosh,Björn Ommer,Cees G. M. Snoek,Max Welling,Jan-Willem van de Meent,Mohammad Mahdi Derakhshani,Floor Eijkelboom

Main category: cs.CV

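TL;DR: Purrception is a variational flow matching method for vector-quantized image generation. It learns categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space, combining discrete supervision with continuous transport dynamics.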

Details

Motivation: Existing image generation methods face a trade-off in efficiency and performance between continuous and discrete flow matching, and lack a way to combine the strengths of both. Method: Variational Flow Matching is applied to vector-quantized latents: the model learns a categorical posterior over codebook indices while computing velocity fields in the continuous embedding space, fusing continuous dynamics with discrete supervision. Result: On ImageNet-1k 256x256 generation, Purrception converges faster than both continuous and discrete flow matching baselines and achieves FID scores competitive with state-of-the-art models. Conclusion: Variational Flow Matching can effectively combine continuous transport dynamics with discrete supervision, improving training efficiency in image generation while keeping generation quality competitive. Abstract: We introduce Purrception, a variational flow matching approach for vector-quantized image generation that provides explicit categorical supervision while maintaining continuous transport dynamics. Our method adapts Variational Flow Matching to vector-quantized latents by learning categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space. This combines the geometric awareness of continuous methods with the discrete supervision of categorical approaches, enabling uncertainty quantification over plausible codes and temperature-controlled generation. We evaluate Purrception on ImageNet-1k 256x256 generation. Training converges faster than both continuous flow matching and discrete flow matching baselines while achieving competitive FID scores with state-of-the-art models. This demonstrates that Variational Flow Matching can effectively bridge continuous transport and discrete supervision for improved training efficiency in image generation.

[99] AortaDiff: A Unified Multitask Diffusion Framework For Contrast-Free AAA Imaging

Yuxuan Ou,Ning Bi,Jiazhen Pan,Jiancheng Yang,Boliang Yu,Usama Zidan,Regent Lee,Vicente Grau

Main category: cs.CV

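TL;DR: Proposes a multitask learning framework based on conditional diffusion models that generates synthetic contrast-enhanced CT from non-contrast CT while simultaneously segmenting the aortic lumen and thrombus. The end-to-end joint optimization outperforms existing methods in both image quality and segmentation accuracy.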

Details

Motivation: To reduce the nephrotoxicity, allergic reactions, and environmental harm associated with contrast agents, and to overcome the error accumulation and unshared semantic structure of conventional multi-stage methods. Method: The framework combines conditional diffusion models with multitask learning, shares encoder and decoder parameters across tasks, trains with a semi-supervised strategy, and needs no initial segmentation prediction, jointly optimizing synthetic CECT generation and lumen/thrombus segmentation. Result: Validated on data from 264 patients, the model reaches a PSNR of 25.61 dB (versus 23.80 dB for a single-task CDM), a lumen Dice of 0.89 (up from 0.87), and a thrombus Dice of 0.53 (up from 0.48), with lower clinical measurement errors. Conclusion: The unified framework generates high-quality synthetic CECT and accurately segments key structures, reducing reliance on real contrast-enhanced scans and showing potential for clinical use. Abstract: While contrast-enhanced CT (CECT) is standard for assessing abdominal aortic aneurysms (AAA), the required iodinated contrast agents pose significant risks, including nephrotoxicity, patient allergies, and environmental harm. To reduce contrast agent use, recent deep learning methods have focused on generating synthetic CECT from non-contrast CT (NCCT) scans. However, most adopt a multi-stage pipeline that first generates images and then performs segmentation, which leads to error accumulation and fails to leverage shared semantic and anatomical structures. To address this, we propose a unified deep learning framework that generates synthetic CECT images from NCCT scans while simultaneously segmenting the aortic lumen and thrombus. Our approach integrates conditional diffusion models (CDM) with multi-task learning, enabling end-to-end joint optimization of image synthesis and anatomical segmentation. Unlike previous multitask diffusion models, our approach requires no initial predictions (e.g., a coarse segmentation mask), shares both encoder and decoder parameters across tasks, and employs a semi-supervised training strategy to learn from scans with missing segmentation labels, a common constraint in real-world clinical data. We evaluated our method on a cohort of 264 patients, where it consistently outperformed state-of-the-art single-task and multi-stage models. For image synthesis, our model achieved a PSNR of 25.61 dB, compared to 23.80 dB from a single-task CDM. For anatomical segmentation, it improved the lumen Dice score to 0.89 from 0.87 and the challenging thrombus Dice score to 0.53 from 0.48 (nnU-Net). These segmentation enhancements led to more accurate clinical measurements, reducing the lumen diameter MAE to 4.19 mm from 5.78 mm and the thrombus area error to 33.85% from 41.45% when compared to nnU-Net. Code is available at https://github.com/yuxuanou623/AortaDiff.git.
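For readers less familiar with the two headline metrics, the sketch below computes a Dice score on flattened binary masks and PSNR on flattened intensity values. The function names, the flat-list inputs, and the `max_val` default are illustrative assumptions; the paper's evaluation code may define intensity ranges and mask handling differently.

```python
import math

def dice_score(pred, target):
    """Dice overlap of two binary masks given as flat lists of 0/1 values."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2.0 * intersection / total if total else 1.0  # two empty masks agree

def psnr(image_a, image_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB for two flat lists of pixel intensities."""
    mse = sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)
```

Higher is better for both: Dice ranges from 0 to 1, and PSNR grows as the mean squared error between synthetic and real images shrinks.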

[100] From Videos to Indexed Knowledge Graphs -- Framework to Marry Methods for Multimodal Content Analysis and Understanding

Basem Rizk,Joel Walsh,Mark Core,Benjamin Nye

Main category: cs.CV

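TL;DR: Proposes a framework for efficiently building multimodal content analysis pipelines that convert videos into temporal semi-structured data and then into a queryable, frame-level knowledge graph that supports continual learning.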

Details

Motivation: Multimodal content analysis is complex and computationally expensive, and applying pretrained models to complex data such as video raises integration difficulties. Method: A pipeline combines multiple pretrained models to convert videos into temporal semi-structured data, which is then translated into a frame-level indexed knowledge graph. Result: The outcome is a queryable video knowledge representation that supports continual learning and makes it easy to incorporate domain knowledge dynamically. Conclusion: The framework lowers the engineering effort of multimodal video analysis and improves flexibility and extensibility. Abstract: Analysis of multi-modal content can be tricky, computationally expensive, and require a significant amount of engineering effort. Lots of work with pre-trained models on static data is out there, yet fusing these open-source models and methods with complex data such as videos is relatively challenging. In this paper, we present a framework that enables efficiently prototyping pipelines for multi-modal content analysis. We craft a candidate recipe for a pipeline, marrying a set of pre-trained models, to convert videos into a temporal semi-structured data format. We translate this structure further to a frame-level indexed knowledge graph representation that is query-able and supports continual learning, enabling the dynamic incorporation of new domain-specific knowledge through an interactive medium.

[101] WALT: Web Agents that Learn Tools

Viraj Prabhu,Yutong Dai,Matthew Fernandez,Jing Gu,Krithika Ramakrishnan,Yanqi Luo,Silvio Savarese,Caiming Xiong,Junnan Li,Zeyuan Chen,Ran Xu

Main category: cs.CV

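TL;DR: WALT is a new web agent framework that reverse-engineers the functionality built into websites and wraps it as invocable tools, enabling more robust and efficient browser automation.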

Details

Motivation: Existing web agent methods rely on fine-grained UI actions and heavy LLM reasoning, which makes them brittle under dynamic layouts and long-horizon tasks. Humans instead work efficiently through high-level functions that websites provide, such as search and filter, so a more stable and abstract automation paradigm is needed. Method: The WALT framework analyzes website behavior and structure to automatically discover and wrap reusable high-level tools (such as search, filter, and create), and agents complete tasks by calling these tools instead of issuing low-level clicks. Result: On the VisualWebArena and WebArena benchmarks, WALT achieves higher task success than existing methods with fewer steps and less reliance on the LLM. Conclusion: By abstracting built-in website functionality into reliable tools, WALT shifts automation from fragile step-by-step reasoning to robust tool invocation, offering a more generalizable and robust paradigm for browser automation. Abstract: Web agents promise to automate complex browser tasks, but current methods remain brittle -- relying on step-by-step UI interactions and heavy LLM reasoning that break under dynamic layouts and long horizons. Humans, by contrast, exploit website-provided functionality through high-level operations like search, filter, and sort. We introduce WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into reusable invocable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites -- spanning discovery (search, filter, sort), communication (post, comment, upvote), and content management (create, edit, delete). Tools abstract away low-level execution: instead of reasoning about how to click and type, agents simply call search(query) or create(listing). This shifts the computational burden from fragile step-by-step reasoning to reliable tool invocation. On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning, establishing a robust and generalizable paradigm for browser automation.

[102] MATCH: Multi-faceted Adaptive Topo-Consistency for Semi-Supervised Histopathology Segmentation

Meilong Xu,Xiaoling Hu,Shahira Abousamra,Chen Li,Chao Chen

Main category: cs.CV

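TL;DR: Proposes a semi-supervised segmentation framework that uses multiple perturbed predictions and a topological consistency constraint to reduce topological errors in histopathology image segmentation.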

Details

Motivation: In histopathology image analysis, extracting semantic structure from unlabeled data is challenging, especially when objects are densely distributed, and meaningful topological features must be captured effectively. Method: Multiple perturbed predictions are generated through stochastic dropout and temporal training snapshots, and a new matching strategy that combines spatial overlap with global structural alignment enforces topological consistency across the predictions. Result: Experiments show that the method markedly reduces topological errors and improves the robustness and accuracy of segmentation. Conclusion: The method preserves key topological features, improves semi-supervised segmentation performance, and is suitable for downstream biomedical analysis. Abstract: In semi-supervised segmentation, capturing meaningful semantic structures from unlabeled data is essential. This is particularly challenging in histopathology image analysis, where objects are densely distributed. To address this issue, we propose a semi-supervised segmentation framework designed to robustly identify and preserve relevant topological features. Our method leverages multiple perturbed predictions obtained through stochastic dropouts and temporal training snapshots, enforcing topological consistency across these varied outputs. This consistency mechanism helps distinguish biologically meaningful structures from transient and noisy artifacts. A key challenge in this process is to accurately match the corresponding topological features across the predictions in the absence of ground truth. To overcome this, we introduce a novel matching strategy that integrates spatial overlap with global structural alignment, minimizing discrepancies among predictions. Extensive experiments demonstrate that our approach effectively reduces topological errors, resulting in more robust and accurate segmentations essential for reliable downstream analysis. Code is available at https://github.com/Melon-Xu/MATCH.

[103] Towards Better Optimization For Listwise Preference in Diffusion Models

Jiamu Bai,Xin Yu,Meilong Xu,Weitao Lu,Xin Pan,Kiwan Maeng,Daniel Kifer,Jian Wang,Yu Wang

Main category: cs.CV

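TL;DR: This paper proposes Diffusion-LPO, a simple and effective framework for listwise preference optimization in diffusion models. It extends the DPO objective under the Plackett-Luce model and uses ranked lists of images to improve visual quality and preference alignment in text-to-image generation, image editing, and personalized alignment.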

Details

Motivation: Existing RLHF methods for diffusion models rely mainly on pairwise preferences and make poor use of implicit ranking information, even though real human feedback often carries more precise ranked preferences. A method that optimizes listwise preferences precisely is therefore needed. Method: Diffusion-LPO aggregates user feedback into a ranked list of images and derives a listwise extension of the DPO objective under the Plackett-Luce model, enforcing consistency across the whole ranking by encouraging each sample to be preferred over all lower-ranked samples. Result: Diffusion-LPO is validated on text-to-image generation, image editing, and personalized preference alignment, where it outperforms pairwise DPO baselines on both visual quality and preference alignment. Conclusion: Diffusion-LPO offers a simple and effective way to use listwise human preference data, outperforming conventional pairwise methods in diffusion models and advancing alignment from human feedback. Abstract: Reinforcement learning from human feedback (RLHF) has proven effective for aligning text-to-image (T2I) diffusion models with human preferences. Although Direct Preference Optimization (DPO) is widely adopted for its computational efficiency and avoidance of explicit reward modeling, its applications to diffusion models have primarily relied on pairwise preferences. The precise optimization of listwise preferences remains largely unaddressed. In practice, human feedback on image preferences often contains implicit ranked information, which conveys more precise human preferences than pairwise comparisons. In this work, we propose Diffusion-LPO, a simple and effective framework for Listwise Preference Optimization in diffusion models with listwise data. Given a caption, we aggregate user feedback into a ranked list of images and derive a listwise extension of the DPO objective under the Plackett-Luce model. Diffusion-LPO enforces consistency across the entire ranking by encouraging each sample to be preferred over all of its lower-ranked alternatives. We empirically demonstrate the effectiveness of Diffusion-LPO across various tasks, including text-to-image generation, image editing, and personalized preference alignment. Diffusion-LPO consistently outperforms pairwise DPO baselines on visual quality and preference alignment.
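The Plackett-Luce model underlying the listwise objective can be sketched as a negative log-likelihood over a ranked list of scores. This is the generic Plackett-Luce loss, not the paper's derived objective: how Diffusion-LPO obtains per-image scores from the diffusion model is not covered here, and the function name is hypothetical.

```python
import math

def plackett_luce_nll(scores):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    scores: one real-valued score per item, ordered from most to least preferred.
    At each position, the item must win a softmax against itself and all
    lower-ranked items.
    """
    nll = 0.0
    for i in range(len(scores) - 1):
        tail = scores[i:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        nll -= scores[i] - log_z
    return nll
```

With a list of two items this reduces to the pairwise logistic (Bradley-Terry) loss, which is the form pairwise DPO builds on.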

[104] Growing Visual Generative Capacity for Pre-Trained MLLMs

Hanyu Wang,Jiaming Han,Ziyan Yang,Qi Zhao,Shanchuan Lin,Xiangyu Yue,Abhinav Shrivastava,Zhenheng Yang,Hao Chen

Main category: cs.CV

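TL;DR: This paper presents Bridge, a pure autoregressive unified multimodal large language model. Using a Mixture-of-Transformers architecture and a semantic-to-pixel discrete representation, it performs both image understanding and generation within a single next-token prediction framework, and performs strongly on a range of multimodal benchmarks while requiring less training data and time.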

Details

Motivation: Existing multimodal LLMs face difficulties in unifying understanding and generation: hybrid approaches generate high-quality images but break the autoregressive paradigm, while pure autoregressive approaches trade off semantic alignment against pixel-level fidelity. A unified model is needed that stays autoregressive while delivering both generation quality and semantic alignment. Method: Bridge uses a Mixture-of-Transformers architecture to extend pretrained visual understanding models into a unified model with generative ability, and introduces a semantic-to-pixel discrete representation that combines compact semantic tokens with fine-grained pixel tokens to improve the fidelity of visual generation. Result: Bridge achieves competitive or superior results on multiple multimodal benchmarks with only a 7.9% increase in sequence length, while requiring less training data and training time. Conclusion: Bridge unifies multimodal understanding and generation within a pure autoregressive framework, balances semantic alignment with pixel-level detail, and is an efficient, high-performing unified multimodal LLM. Abstract: Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models remains challenging: hybrid approaches combine continuous embeddings with diffusion or flow-based objectives, producing high-quality images but breaking the autoregressive paradigm, while pure autoregressive approaches unify text and image prediction over discrete visual tokens but often face trade-offs between semantic alignment and pixel-level fidelity. In this work, we present Bridge, a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability through a Mixture-of-Transformers architecture, enabling both image understanding and generation within a single next-token prediction framework. To further improve visual generation fidelity, we propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens, achieving strong language alignment and precise description of visual details with only a 7.9% increase in sequence length. Extensive experiments across diverse multimodal benchmarks demonstrate that Bridge achieves competitive or superior results in both understanding and generation benchmarks, while requiring less training data and reduced training time compared to prior unified MLLMs.

[105] Robust Classification of Oral Cancer with Limited Training Data

Akshay Bhagwan Sonawane,Lena D. Swamikannan,Lakshman Tamil

Main category: cs.CV

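TL;DR: Proposes a hybrid model combining a CNN with Bayesian deep learning for oral cancer classification from small training sets. Variational inference provides uncertainty quantification, which markedly improves reliability and generalization when data are scarce.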

Details

Motivation: 传统深度学习模型在小样本数据下易过拟合且缺乏可靠性，难以满足医疗资源匮乏地区早期口腔癌诊断的需求。 Method: 结合卷积神经网络（CNN）与贝叶斯深度学习，采用变分推断进行不确定性量化，使用智能手机拍摄的彩色照片训练模型，并在三个不同测试集上评估性能。 Result: 在训练分布相似的测试集上达到94%准确率；在真实世界图像数据上，相比传统CNN的72.94%，该模型在多样化数据集上实现88%准确率，并表现出良好的置信度与不确定性对应关系。 Conclusion: 贝叶斯深度学习能有效提升小样本条件下口腔癌分类模型的可靠性与泛化性能，适用于医疗资源有限地区的早期诊断应用。 Abstract: Oral cancer ranks among the most prevalent cancers globally, with a particularly high mortality rate in regions lacking adequate healthcare access. Early diagnosis is crucial for reducing mortality; however, challenges persist due to limited oral health programs, inadequate infrastructure, and a shortage of healthcare practitioners. Conventional deep learning models, while promising, often rely on point estimates, leading to overconfidence and reduced reliability. Critically, these models require large datasets to mitigate overfitting and ensure generalizability, an unrealistic demand in settings with limited training data. To address these issues, we propose a hybrid model that combines a convolutional neural network (CNN) with Bayesian deep learning for oral cancer classification using small training sets. This approach employs variational inference to enhance reliability through uncertainty quantification. The model was trained on photographic color images captured by smartphones and evaluated on three distinct test datasets. The proposed method achieved 94% accuracy on a test dataset with a distribution similar to that of the training data, comparable to traditional CNN performance. Notably, for real-world photographic image data, despite limitations and variations differing from the training dataset, the proposed model demonstrated superior generalizability, achieving 88% accuracy on diverse datasets compared to 72.94% for traditional CNNs, even with a smaller dataset. Confidence analysis revealed that the model exhibits low uncertainty (high confidence) for correctly classified samples and high uncertainty (low confidence) for misclassified samples. 
These results underscore the effectiveness of Bayesian inference in data-scarce environments in enhancing early oral cancer diagnosis by improving model reliability and generalizability.
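
The link between uncertainty and confidence described in this abstract can be illustrated with a small sketch. It assumes, hypothetically, that the Bayesian model is queried several times with different sampled weights and that the per-class probabilities are averaged; the function names and the example probabilities below are illustrative and are not taken from the paper.

```python
import math

def predictive_mean(sample_probs):
    """Average class probabilities over several stochastic forward passes."""
    n = len(sample_probs)
    k = len(sample_probs[0])
    return [sum(p[c] for p in sample_probs) / n for c in range(k)]

def predictive_entropy(mean_probs):
    """Shannon entropy (in nats) of the averaged predictive distribution."""
    return -sum(p * math.log(p) for p in mean_probs if p > 0.0)

# Hypothetical softmax outputs from three weight samples, two classes.
confident = [[0.95, 0.05], [0.97, 0.03], [0.96, 0.04]]
uncertain = [[0.90, 0.10], [0.30, 0.70], [0.45, 0.55]]

h_confident = predictive_entropy(predictive_mean(confident))
h_uncertain = predictive_entropy(predictive_mean(uncertain))
```

When the sampled predictions agree, the entropy is low; when they disagree, it is higher, which is the kind of signal that could flag a sample for human review.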

[106] Consistent Assistant Domains Transformer for Source-free Domain Adaptation

Renrong Shao,Wei Zhang,Kangyang Luo,Qin Li,Jun Wang

Main category: cs.CV

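TL;DR: This paper proposes the Consistent Assistant Domains Transformer (CADTrans) for source-free domain adaptation (SFDA), which addresses the inaccessibility of source-domain data by constructing domain-consistent invariant feature representations.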

Details

Motivation: Because source-domain data cannot be accessed, existing methods struggle to obtain deterministic invariant features and are vulnerable to hard samples and domain shift, so a new method is needed that extracts invariant features effectively and separates easy samples from hard ones. Method: Designs an assistant domain module that uses intermediate aggregated global attention to generate diversified representations; obtains invariant feature representations from the assistant and target domains through multiple consistency strategies; introduces a conditional multi-kernel maximum mean discrepancy (CMK-MMD) to align hard samples with easy samples. Result: Extensive experiments on benchmarks including Office-31, Office-Home, VISDA-C, and DomainNet-126 show that the proposed method significantly improves performance. Conclusion: By building consistent invariant feature representations and an effective sample alignment strategy, CADTrans significantly improves domain adaptation performance without source-domain data and shows strong robustness and generalization. Abstract: Source-free domain adaptation (SFDA) aims to address the challenge of adapting to a target domain without accessing the source domain directly. However, due to the inaccessibility of source domain data, deterministic invariable features cannot be obtained. Current mainstream methods primarily focus on evaluating invariant features in the target domain that closely resemble those in the source domain, subsequently aligning the target domain with the source domain. However, these methods are susceptible to hard samples and influenced by domain bias. In this paper, we propose a Consistent Assistant Domains Transformer for SFDA, abbreviated as CADTrans, which solves the issue by constructing invariable feature representations of domain consistency. Concretely, we develop an assistant domain module for CADTrans to obtain diversified representations from the intermediate aggregated global attentions, which addresses the limitation of existing methods in adequately representing diversity. Based on assistant and target domains, invariable feature representations are obtained by multiple consistent strategies, which can be used to distinguish easy and hard samples. Finally, to align the hard samples to the corresponding easy samples, we construct a conditional multi-kernel max mean discrepancy (CMK-MMD) strategy to distinguish between samples of the same category and those of different categories. Extensive experiments are conducted on various benchmarks such as Office-31, Office-Home, VISDA-C, and DomainNet-126, proving the significant performance improvements achieved by our proposed approaches. 
Code is available at https://github.com/RoryShao/CADTrans.git.
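
For readers unfamiliar with the discrepancy measure named in this abstract, the following is a minimal sketch of a multi-kernel maximum mean discrepancy with a per-class ("conditional") average. It is not the CADTrans implementation: the Gaussian bandwidths, the function names, and the choice to average only over shared classes are assumptions made for illustration, and the sketch covers same-class alignment only.

```python
import math

def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def multi_kernel(x, y, bandwidths):
    """Average of Gaussian kernels with several bandwidths."""
    return sum(gaussian_kernel(x, y, b) for b in bandwidths) / len(bandwidths)

def mmd_squared(xs, ys, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of the squared MMD between two sample sets."""
    def mean_kernel(a, b):
        total = sum(multi_kernel(u, v, bandwidths) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

def conditional_mmd(easy_by_class, hard_by_class, bandwidths=(0.5, 1.0, 2.0)):
    """Average same-class MMD over the classes present in both groups (assumed form)."""
    shared = [c for c in easy_by_class if c in hard_by_class]
    return sum(
        mmd_squared(easy_by_class[c], hard_by_class[c], bandwidths) for c in shared
    ) / len(shared)

# Hypothetical 2-D features for easy and hard samples of one class.
easy = {"chair": [(0.0, 0.0), (0.1, 0.0)]}
hard = {"chair": [(2.0, 2.0), (2.1, 2.0)]}
gap = conditional_mmd(easy, hard)
```

Minimizing a quantity like `gap` would pull hard-sample features toward easy-sample features of the same class.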

[107] Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations

Ricardo Gonzalez Penuela,Felipe Arias-Russi,Victor Capriles

Main category: cs.CV

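TL;DR: Proposes a context-aware system based on historical questions that guides multimodal large language models to generate image descriptions better matched to the needs of blind and low vision users, improving information relevance and user experience.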

Details

Motivation: When describing images for blind and low vision users, existing MLLMs often produce long, irrelevant text that does not target what users actually need, which makes interactions inefficient. Method: Uses past questions from blind users in the VizWiz-LF dataset to identify historical visual contexts similar to the input image, and uses them to guide the MLLM toward more contextually relevant descriptions. Result: In an evaluation of 92 descriptions, context-aware descriptions anticipated and answered users' questions in 76.1% of cases (70/92) and were preferred by evaluators in 54.4% of comparisons (50/92). Conclusion: Guiding MLLM descriptions with users' historical questions improves the relevance and usefulness of information for BLV users and makes visual assistance systems more efficient to interact with. Abstract: Multimodal large language models (MLLMs) have been integrated into visual interpretation applications to support Blind and Low Vision (BLV) users because of their accuracy and ability to provide rich, human-like interpretations. However, these applications often default to comprehensive, lengthy descriptions regardless of context. This leads to inefficient exchanges, as users must go through irrelevant details rather than receiving the specific information they are likely to seek. To deliver more contextually-relevant information, we developed a system that draws on historical BLV users' questions. When given an image, our system identifies similar past visual contexts from the VizWiz-LF dataset and uses the associated questions to guide the MLLM to generate descriptions more relevant to BLV users. An evaluation with three human labelers who revised 92 context-aware and context-free descriptions showed that context-aware descriptions anticipated and answered users' questions in 76.1% of cases (70 out of 92) and were preferred in 54.4% of comparisons (50 out of 92). Our paper reviews and data analysis are publicly available in a GitHub repository at https://github.com/rgonzalezp/guiding-multimodal-large-language-models-with-blind-and-low-vision-people-visual-questions .

[108] ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models

Krishna Teja Chitty-Venkata,Murali Emani

Main category: cs.CV

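TL;DR: This paper presents ImageNet-Think, a multimodal reasoning dataset built on ImageNet21k images, in which two state-of-the-art vision language models generate structured thinking tokens and answers, with the aim of advancing vision language models that reason explicitly.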

Details

Motivation: Improving the explicit reasoning ability of vision language models (VLMs) and advancing the understanding of multimodal reasoning mechanisms requires a large-scale dataset that contains structured reasoning traces. Method: Starting from 250,000 ImageNet21k images, two state-of-the-art VLMs, GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506, generate two thinking-answer sequences per image, yielding a synthetic dataset with step-by-step reasoning and final answers. Result: The resulting ImageNet-Think dataset contains rich structured thinking tokens and answers and can be used to train and evaluate multimodal reasoning models. Conclusion: ImageNet-Think is a resource for developing stronger VLMs with explicit reasoning capabilities and should help advance research on multimodal reasoning. Abstract: We develop ImageNet-Think, a multimodal reasoning dataset designed to aid the development of Vision Language Models (VLMs) with explicit reasoning capabilities. Our dataset is built on 250,000 images from the ImageNet21k dataset, providing structured thinking tokens and corresponding answers. Our synthetic dataset is generated by two state-of-the-art VLMs: GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506. Each image is accompanied by two pairs of thinking-answer sequences, creating a resource for training and evaluating multimodal reasoning models. We capture the step-by-step reasoning process of VLMs and the final descriptive answers. Our goal with this dataset is to enable the development of more robust VLMs while contributing to the broader understanding of multimodal reasoning mechanisms. The dataset and evaluation benchmarks will be publicly available to aid research in reasoning/thinking multimodal VLMs.

[109] NPN: Non-Linear Projections of the Null-Space for Imaging Inverse Problems

Roman Jacome,Romario Gualdrón-Hurtado,Leon Suarez,Henry Arguello

Main category: cs.CV

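TL;DR: Proposes a new regularization method, Non-Linear Projections of the Null-Space (NPN), which uses a neural network to constrain solutions to a low-dimensional projection of the sensing matrix's null-space, improving reconstruction accuracy across a range of imaging inverse problems.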

Details

Motivation: Conventional priors usually ignore the task-specific structure of the null-space, which limits reconstruction performance, so a new regularizer that exploits null-space structure is needed. Method: Proposes NPN, which uses a neural network to learn a low-dimensional projection of the sensing matrix's null-space and adds it to the reconstruction as a regularization term; the method applies to many inverse problems and existing frameworks. Result: Convergence and reconstruction accuracy are established theoretically, and NPN significantly improves reconstruction quality in compressive sensing, deblurring, super-resolution, CT, and MRI. Conclusion: NPN is an interpretable, flexible, and effective regularization strategy that makes full use of null-space structure to strengthen reconstruction in a wide range of imaging inverse problems. Abstract: Imaging inverse problems aim to recover high-dimensional signals from undersampled, noisy measurements, a fundamentally ill-posed task with infinite solutions in the null-space of the sensing operator. To resolve this ambiguity, prior information is typically incorporated through handcrafted regularizers or learned models that constrain the solution space. However, these priors typically ignore the task-specific structure of that null-space. In this work, we propose Non-Linear Projections of the Null-Space (NPN), a novel class of regularization that, instead of enforcing structural constraints in the image domain, promotes solutions that lie in a low-dimensional projection of the sensing matrix's null-space with a neural network. Our approach has two key advantages: (1) Interpretability: by focusing on the structure of the null-space, we design sensing-matrix-specific priors that capture information orthogonal to the signal components that are fundamentally blind to the sensing process. (2) Flexibility: NPN is adaptable to various inverse problems, compatible with existing reconstruction frameworks, and complementary to conventional image-domain priors. We provide theoretical guarantees on convergence and reconstruction accuracy when used within plug-and-play methods. Empirical results across diverse sensing matrices demonstrate that NPN priors consistently enhance reconstruction fidelity in various imaging inverse problems, such as compressive sensing, deblurring, super-resolution, computed tomography, and magnetic resonance imaging, with plug-and-play methods, unrolling networks, deep image prior, and diffusion models.
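
To make the notion of a sensing operator's null-space concrete, here is a small linear sketch. It does not implement NPN, which learns a non-linear projection with a neural network; it only shows, for a hypothetical subsampling matrix with orthonormal rows, which part of a signal the measurements cannot see. The helper names and the example values are assumptions.

```python
def apply(A, x):
    """Compute the measurements y = A x for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def null_space_component(A, x):
    """Return x minus its projection onto the row space of A.

    Assumes the rows of A are orthonormal, so A^T A is the row-space projector.
    """
    coeffs = apply(A, x)
    row_part = [
        sum(c * row[j] for c, row in zip(coeffs, A)) for j in range(len(x))
    ]
    return [xj - rj for xj, rj in zip(x, row_part)]

# Hypothetical sensing matrix that keeps entries 0 and 2 of a length-4 signal.
A = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
x = [1.0, 2.0, 3.0, 4.0]

x_null = null_space_component(A, x)   # the part of x invisible to A
y_null = apply(A, x_null)             # measuring it gives zeros
```

Any vector like `x_null` can be added to a candidate solution without changing the measurements, which is why a prior on this component can resolve the ambiguity.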

[110] Automated Genomic Interpretation via Concept Bottleneck Models for Medical Robotics

Zijun Li,Jinchang Zhang,Ming Zhang,Guoyu Lu

Main category: cs.CV

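TL;DR: Proposes an automated genomic interpretation module that turns DNA sequences into actionable decisions; it combines CGR with a concept bottleneck model to achieve accurate, interpretable HIV subtype classification and optimizes clinical decisions through cost-aware recommendations.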

Details

Motivation: Genomic medicine needs interpretation methods that are reliable, interpretable, and easy to integrate into automated systems, whereas conventional models lack biological interpretability and clinical practicality. Method: Extracts sequence features with Chaos Game Representation (CGR) and combines them with a Concept Bottleneck Model (CBM) built on biological concepts such as GC content, CpG density, and k-mers; reliability is improved through concept fidelity supervision, prior consistency alignment, KL distribution matching, and uncertainty calibration. Result: Achieves state-of-the-art HIV subtype classification on in-house and LANL datasets, significantly improves concept prediction fidelity and decision cost-effectiveness, and reduces unnecessary retests. Conclusion: The framework connects interpretable genomic modeling with automated decision-making and provides a reliable basis for genomic analysis in robotic and clinical automation. Abstract: We propose an automated genomic interpretation module that transforms raw DNA sequences into actionable, interpretable decisions suitable for integration into medical automation and robotic systems. Our framework combines Chaos Game Representation (CGR) with a Concept Bottleneck Model (CBM), enforcing predictions to flow through biologically meaningful concepts such as GC content, CpG density, and k-mer motifs. To enhance reliability, we incorporate concept fidelity supervision, prior consistency alignment, KL distribution matching, and uncertainty calibration. Beyond accurate classification of HIV subtypes across both in-house and LANL datasets, our module delivers interpretable evidence that can be directly validated against biological priors. A cost-aware recommendation layer further translates predictive outputs into decision policies that balance accuracy, calibration, and clinical utility, reducing unnecessary retests and improving efficiency. Extensive experiments demonstrate that the proposed system achieves state-of-the-art classification performance, superior concept prediction fidelity, and more favorable cost-benefit trade-offs compared to existing baselines. By bridging the gap between interpretable genomic modeling and automated decision-making, this work establishes a reliable foundation for robotic and clinical automation in genomic medicine.
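
As a rough illustration of the inputs this abstract mentions, the sketch below computes Chaos Game Representation points and two of the named concepts for a DNA string. The corner assignment for the four bases is one common convention, and the CpG density definition (CG dinucleotides divided by the number of dinucleotides) is an assumption; the paper may define both differently.

```python
# One common corner layout for CGR; other papers permute these assignments.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Chaos Game Representation: move halfway toward each base's corner."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points

def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return sum(1 for b in seq if b in "GC") / len(seq)

def cpg_density(seq):
    """Fraction of adjacent base pairs that are the dinucleotide CG (assumed definition)."""
    seq = seq.upper()
    pairs = len(seq) - 1
    return sum(1 for i in range(pairs) if seq[i:i + 2] == "CG") / pairs

points = cgr_points("AG")
```

In a concept bottleneck setting, values such as `gc_content` would serve as intermediate, human-checkable quantities between the sequence encoding and the final class prediction.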

[111] VLA-R1: Enhancing Reasoning in Vision-Language-Action Models

Angen Ye,Zeyu Zhang,Boyuan Wang,Xiaofeng Wang,Dapeng Zhang,Zheng Zhu

Main category: cs.CV

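TL;DR: This paper presents VLA-R1, a reasoning-enhanced vision-language-action (VLA) model that combines Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO) and is trained on the newly built high-quality dataset VLA-CoT-13K, markedly improving cross-task and cross-scene generalization and performance on real-world robot platforms.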

Details

Motivation: Existing VLA models lack explicit step-by-step reasoning, and their post-training pipelines do little to improve reasoning quality, so they struggle to account for affordance and geometric constraints. Method: Proposes VLA-R1, which uses Reinforcement Learning from Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO), designs verifiable rewards for region alignment, trajectory consistency, and output formatting, and introduces the VLA-CoT-13K dataset with chain-of-thought supervision to strengthen reasoning and execution. Result: In extensive evaluations on in-domain, out-of-domain, simulation, and real-robot platforms, VLA-R1 outperforms prior methods in reasoning robustness, execution accuracy, and generalization. Conclusion: By introducing reinforcement learning with verifiable rewards and high-quality chain-of-thought data, VLA-R1 strengthens the reasoning and execution abilities of VLA models and advances the practical use of embodied agents in complex environments. Abstract: Vision-Language-Action (VLA) models aim to unify perception, language understanding, and action generation, offering strong cross-task and cross-scene generalization with broad impact on embodied AI. However, current VLA models often lack explicit step-by-step reasoning, instead emitting final actions without considering affordance constraints or geometric relations. Their post-training pipelines also rarely reinforce reasoning quality, relying primarily on supervised fine-tuning with weak reward design. To address these challenges, we present VLA-R1, a reasoning-enhanced VLA that integrates Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO) to systematically optimize both reasoning and execution. Specifically, we design an RLVR-based post-training strategy with verifiable rewards for region alignment, trajectory consistency, and output formatting, thereby strengthening reasoning robustness and execution accuracy. Moreover, we develop VLA-CoT-13K, a high-quality dataset that provides chain-of-thought supervision explicitly aligned with affordance and trajectory annotations. Furthermore, extensive evaluations on in-domain, out-of-domain, simulation, and real-robot platforms demonstrate that VLA-R1 achieves superior generalization and real-world performance compared to prior VLA methods. We plan to release the model, code, and dataset following the publication of this work. Code: https://github.com/GigaAI-research/VLA-R1. Website: https://gigaai-research.github.io/VLA-R1.
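
The "group relative" part of GRPO can be sketched in a few lines: several responses are sampled for the same prompt, each receives a scalar reward, and each reward is normalized against the group's mean and standard deviation. The sketch below shows only this normalization step, as commonly described for GRPO; the reward values are hypothetical and the way VLA-R1 combines its region, trajectory, and format rewards is not shown.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical verifiable rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average get a positive advantage and are reinforced, so no separate learned value function is needed for the baseline.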

[112] Joint Deblurring and 3D Reconstruction for Macrophotography

Yifan Zhao,Liangchen Li,Yuqi Zhou,Kai Wang,Yan Liang,Juyong Zhang

Main category: cs.CV

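TL;DR: Proposes a joint deblurring and 3D reconstruction method for macrophotography that uses differentiable rendering for self-supervised optimization and needs only a small number of multi-view blurry images to achieve high-quality deblurring and high-fidelity 3D reconstruction.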

Details

Motivation: Defocus blur is a long-standing problem in macrophotography that severely degrades image sharpness and high-quality 3D reconstruction; traditional deblurring methods depend on large numbers of images and annotations, and no multi-view 3D reconstruction method exists for macrophotography. Method: Starting from multi-view blurry images, jointly optimizes a sharp 3D model of the object and the defocus blur kernel of each pixel, using differentiable rendering for self-supervised optimization. Result: Experiments show that the method needs only a small number of multi-view images to achieve high-quality image deblurring and high-fidelity 3D appearance reconstruction. Conclusion: The proposed method addresses defocus blur in macrophotography effectively, performs well in both deblurring and 3D reconstruction, and has practical potential. Abstract: Macro lenses have the advantages of high resolution and large magnification, and 3D modeling of small and detailed objects can provide richer information. However, defocus blur in macrophotography is a long-standing problem that heavily hinders the clear imaging of the captured objects and high-quality 3D reconstruction of them. Traditional image deblurring methods require a large number of images and annotations, and there is currently no multi-view 3D reconstruction method for macrophotography. In this work, we propose a joint deblurring and 3D reconstruction method for macrophotography. Starting from captured multi-view blurry images, we jointly optimize the clear 3D model of the object and the defocus blur kernel of each pixel. The entire framework adopts a differentiable rendering method to self-supervise the optimization of the 3D model and the defocus blur kernel. Extensive experiments show that from a small number of multi-view images, our proposed method can not only achieve high-quality image deblurring but also recover high-fidelity 3D appearance.

[113] FideDiff: Efficient Diffusion Model for High-Fidelity Image Motion Deblurring

Xiaoyang Liu,Zhengyan Zhou,Zihang Xu,Jiezhang Cao,Zheng Chen,Yulun Zhang

Main category: cs.CV

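TL;DR: This paper proposes FideDiff, a novel single-step diffusion model for high-fidelity image deblurring. It reformulates motion deblurring as a diffusion process and combines Kernel ControlNet with adaptive timestep prediction to deliver fast, high-quality deblurring.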

Details

Motivation: Diffusion models show strong generative ability in image restoration, but long inference times and insufficient fidelity limit their use, so an efficient, high-fidelity deblurring method is needed. Method: Models deblurring as a diffusion process in which each timestep represents a progressively blurred image; trains a consistency model that aligns all timesteps to the same clean image and learns temporal consistency from reconstructed training data, enabling single-step deblurring. Kernel ControlNet is introduced to estimate the blur kernel, and adaptive timestep prediction further improves performance. Result: FideDiff outperforms previous diffusion-based methods on full-reference metrics, matches state-of-the-art non-diffusion models, and shortens inference time substantially while preserving high fidelity. Conclusion: FideDiff offers a new direction for applying pre-trained diffusion models to high-fidelity image restoration and establishes a robust baseline for real-world industrial applications. Abstract: Recent advancements in image motion deblurring, driven by CNNs and transformers, have made significant progress. Large-scale pre-trained diffusion models, which are rich in true-world modeling, have shown great promise for high-quality image restoration tasks such as deblurring, demonstrating stronger generative capabilities than CNN and transformer-based methods. However, challenges such as unbearable inference time and compromised fidelity still limit the full potential of the diffusion models. To address this, we introduce FideDiff, a novel single-step diffusion model designed for high-fidelity deblurring. We reformulate motion deblurring as a diffusion-like process where each timestep represents a progressively blurred image, and we train a consistency model that aligns all timesteps to the same clean image. By reconstructing training data with matched blur trajectories, the model learns temporal consistency, enabling accurate one-step deblurring. We further enhance model performance by integrating Kernel ControlNet for blur kernel estimation and introducing adaptive timestep prediction. Our model achieves superior performance on full-reference metrics, surpassing previous diffusion-based methods and matching the performance of other state-of-the-art models. FideDiff offers a new direction for applying pre-trained diffusion models to high-fidelity image restoration tasks, establishing a robust baseline for further advancing diffusion models in real-world industrial applications. Our dataset and code will be available at https://github.com/xyLiu339/FideDiff.

[114] LadderMoE: Ladder-Side Mixture of Experts Adapters for Bronze Inscription Recognition

Rixin Zhou,Peiqiang Qiu,Qian Zhang,Chuntao Li,Xi Yang

Main category: cs.CV

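TL;DR: This paper proposes a two-stage detection-recognition pipeline enhanced with LadderMoE to tackle cross-domain variation and the long-tailed distribution in automatic bronze inscription recognition, and builds a large-scale dataset to validate it.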

Details

Motivation: Bronze inscriptions are an important form of early Chinese writing, but automatic recognition faces severe degradation, multi-domain variation, and a long-tailed character distribution that existing methods handle poorly. Method: Builds a large-scale dataset of 22454 full-page images and 198598 annotated characters, adopts a two-stage detection-recognition pipeline, and augments a pretrained CLIP encoder with LadderMoE to enable dynamic expert specialization and improve cross-domain robustness. Result: Substantially outperforms state-of-the-art scene text recognition methods on single-character and full-page recognition, with higher accuracy across head, mid, and tail character categories and all acquisition modalities. Conclusion: The proposed method establishes a strong baseline for bronze inscription recognition and supports downstream archaeological analysis. Abstract: Bronze inscriptions (BI), engraved on ritual vessels, constitute a crucial stage of early Chinese writing and provide indispensable evidence for archaeological and historical studies. However, automatic BI recognition remains difficult due to severe visual degradation, multi-domain variability across photographs, rubbings, and tracings, and an extremely long-tailed character distribution. To address these challenges, we curate a large-scale BI dataset comprising 22454 full-page images and 198598 annotated characters spanning 6658 unique categories, enabling robust cross-domain evaluation. Building on this resource, we develop a two-stage detection-recognition pipeline that first localizes inscriptions and then transcribes individual characters. To handle heterogeneous domains and rare classes, we equip the pipeline with LadderMoE, which augments a pretrained CLIP encoder with ladder-style MoE adapters, enabling dynamic expert specialization and stronger robustness. Comprehensive experiments on single-character and full-page recognition tasks demonstrate that our method substantially outperforms state-of-the-art scene text recognition baselines, achieving superior accuracy across head, mid, and tail categories as well as all acquisition modalities. These results establish a strong foundation for bronze inscription recognition and downstream archaeological analysis.

[115] VirDA: Reusing Backbone for Unsupervised Domain Adaptation with Visual Reprogramming

Duy Nguyen,Dat Nguyen

Main category: cs.CV

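TL;DR: Proposes VirDA, which performs domain adaptation without fine-tuning backbone parameters by prepending a domain-specific visual reprogramming layer to the backbone, sharply reducing trainable parameters while maintaining high performance.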

Details

Motivation: Existing UDA methods fine-tune the entire backbone for every new source-target pair, so parameter and storage costs grow linearly and the pre-trained backbone cannot be reused. Motivated by evidence that backbones have textural biases, the authors aim to exploit domain-specific textural bias for efficient domain adaptation. Method: VirDA prepends a domain-specific visual reprogramming layer that generates visual prompts; these act as a textural bias on the input image and adapt its style to the target domain. Multiple objective functions optimize intra- and inter-domain distribution differences, and the backbone parameters are never modified. Result: Reaches 92.8% mean accuracy on Office-31 with only 1.5M trainable parameters; improves on PDA by 1.6% accuracy with 46% of its parameters; improves on the full fine-tuning methods CDTrans and FixBi by 0.2% and 1.4% while needing only 1.7% and 2.8% of their parameters; gives up only 2.2% and 1.1% accuracy relative to PMTrans and TVT with roughly 1.7% of their parameters. Conclusion: VirDA uses visual reprogramming to adapt across domains without fine-tuning the backbone, stays competitive while greatly reducing parameter count, supports backbone reuse across domains, and suits resource-constrained settings. Abstract: Existing UDA pipelines fine-tune already well-trained backbone parameters for every new source-and-target pair, resulting in the number of training parameters and storage memory growing linearly with each new pair, and also preventing the reuse of these well-trained backbone parameters. Inspired by recent implications that existing backbones have textural biases, we propose making use of domain-specific textural bias for domain adaptation via visual reprogramming, namely VirDA. Instead of fine-tuning the full backbone, VirDA prepends a domain-specific visual reprogramming layer to the backbone. This layer produces visual prompts that act as an added textural bias to the input image, adapting its "style" to a target domain. To optimize these visual reprogramming layers, we use multiple objective functions that optimize the intra- and inter-domain distribution differences when domain-adapting visual prompts are applied. This process does not require modifying the backbone parameters, allowing the same backbone to be reused across different domains. We evaluate VirDA on Office-31 and obtain 92.8% mean accuracy with only 1.5M trainable parameters. VirDA surpasses PDA, the state-of-the-art parameter-efficient UDA baseline, by +1.6% accuracy while using just 46% of its parameters. Compared with full-backbone fine-tuning, VirDA outperforms CDTrans and FixBi by +0.2% and +1.4%, respectively, while requiring only 1.7% and 2.8% of their trainable parameters. 
Relative to the strongest current methods (PMTrans and TVT), VirDA uses ~1.7% of their parameters and trades off only 2.2% and 1.1% accuracy, respectively.

[116] Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery

Minh Tran,Maksim Siniukov,Zhangyu Jin,Mohammad Soleymani

Main category: cs.CV

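TL;DR: This paper proposes Discrete Facial Encoding (DFE), an unsupervised, data-driven method that learns a compact and interpretable dictionary of facial expressions from 3D mesh sequences using a Residual Vector Quantized Variational Autoencoder (RVQ-VAE); it outperforms FACS and other existing methods on psychological tasks such as stress detection, personality prediction, and depression detection.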

Details

Motivation: Existing facial expression coding systems such as FACS are limited by narrow coverage and costly manual annotation, which makes them hard to use for large-scale psychological and affective computing applications. Method: First extracts identity-invariant expression features from images with a 3D Morphable Model (3DMM), separating out factors such as head pose and facial geometry; then encodes these features with an RVQ-VAE to produce a sequence of discrete tokens from a shared codebook, where each token represents a reusable facial deformation pattern. Result: Experiments show that DFE captures facial behavior more precisely than FACS and other facial encoding methods; on three high-level psychological tasks (stress detection, personality prediction, and depression detection), a Bag-of-Words model built on DFE consistently outperforms FACS baselines and strong representation learning models such as Masked Autoencoders; analysis also shows that DFE covers a wider variety of facial displays. Conclusion: DFE is a scalable and effective alternative to FACS with potential for broad use in psychology and affective computing. Abstract: Facial expression analysis is central to understanding human behavior, yet existing coding systems such as the Facial Action Coding System (FACS) are constrained by limited coverage and costly manual annotation. In this work, we introduce Discrete Facial Encoding (DFE), an unsupervised, data-driven alternative that learns a compact and interpretable dictionary of facial expressions from 3D mesh sequences through a Residual Vector Quantized Variational Autoencoder (RVQ-VAE). Our approach first extracts identity-invariant expression features from images using a 3D Morphable Model (3DMM), effectively disentangling factors such as head pose and facial geometry. We then encode these features using an RVQ-VAE, producing a sequence of discrete tokens from a shared codebook, where each token captures a specific, reusable facial deformation pattern that contributes to the overall expression. Through extensive experiments, we demonstrate that Discrete Facial Encoding captures more precise facial behaviors than FACS and other facial encoding alternatives. We evaluate the utility of our representation across three high-level psychological tasks: stress detection, personality prediction, and depression detection. Using a simple Bag-of-Words model built on top of the learned tokens, our system consistently outperforms both FACS-based pipelines and strong image and video representation learning models such as Masked Autoencoders. 
Further analysis reveals that our representation covers a wider variety of facial displays, highlighting its potential as a scalable and effective alternative to FACS for psychological and affective computing applications.
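
The two mechanisms named in this abstract, residual vector quantization against a shared codebook and a Bag-of-Words summary of the resulting tokens, can be sketched as follows. This is a toy illustration under stated assumptions: the codebook is fixed rather than learned, the vectors are 2-D, and the function names are hypothetical; the paper's RVQ-VAE additionally trains an encoder and decoder.

```python
def nearest_code(codebook, v):
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def rvq_encode(v, codebook, num_stages):
    """Quantize v in stages; each stage encodes the residual left by the previous one."""
    residual = list(v)
    tokens = []
    for _ in range(num_stages):
        idx = nearest_code(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual

def bag_of_words(token_sequences, codebook_size):
    """Histogram of token counts across all frames of a clip."""
    counts = [0] * codebook_size
    for tokens in token_sequences:
        for t in tokens:
            counts[t] += 1
    return counts

# Toy shared codebook and one hypothetical expression feature vector.
codebook = [(1.0, 0.0), (0.0, 1.0), (0.25, 0.25), (0.0, 0.0)]
tokens, residual = rvq_encode((1.2, 0.3), codebook, num_stages=2)
histogram = bag_of_words([tokens], len(codebook))
```

The histogram is a fixed-length descriptor that a simple classifier could consume, which matches the spirit of the Bag-of-Words evaluation described above.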

[117] Non-Rigid Structure-from-Motion via Differential Geometry with Recoverable Conformal Scale

Yongbo Chen,Yanhao Zhang,Shaifali Parashar,Liang Zhao,Shoudong Huang

Main category: cs.CV

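TL;DR: This paper proposes Con-NRSfM, a new method for non-rigid structure-from-motion (NRSfM) under conformal deformations. It removes the restrictive assumptions of earlier methods while accurately estimating local conformal scale and depth, and it adds self-supervised learning to generate dense textured 3D point clouds; experiments show better reconstruction accuracy and robustness than existing methods.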

Details

Motivation: Existing NRSfM methods rely on strong assumptions such as locally planar surfaces or locally linear deformations and cannot recover the conformal scale, which limits accuracy and applicability in monocular visual deformable SLAM. Method: Proposes Con-NRSfM, which performs point-wise reconstruction from 2D image warps optimized in a graph-based framework, decouples the constraints on depth and conformal scale, and uses a parallel separable iterative optimization strategy; an encoder-decoder network provides self-supervised dense 3D reconstruction. Result: Simulations and experiments on synthetic and real datasets show that the method outperforms existing approaches in reconstruction accuracy and robustness and estimates local conformal scale and depth accurately. Conclusion: Con-NRSfM overcomes the limitations of traditional NRSfM methods under conformal deformations, improves reconstruction through decoupled constraints and its optimization strategy, and achieves high-quality dense textured 3D reconstruction with self-supervised learning. Abstract: Non-rigid structure-from-motion (NRSfM), a promising technique for addressing the mapping challenges in monocular visual deformable simultaneous localization and mapping (SLAM), has attracted growing attention. We introduce a novel method, called Con-NRSfM, for NRSfM under conformal deformations, encompassing isometric deformations as a subset. Our approach performs point-wise reconstruction using 2D selected image warps optimized through a graph-based framework. Unlike existing methods that rely on strict assumptions, such as locally planar surfaces or locally linear deformations, and fail to recover the conformal scale, our method eliminates these constraints and accurately computes the local conformal scale. Additionally, our framework decouples constraints on depth and conformal scale, which are inseparable in other approaches, enabling more precise depth estimation. To address the sensitivity of the formulated problem, we employ a parallel separable iterative optimization strategy. Furthermore, a self-supervised learning framework, utilizing an encoder-decoder network, is incorporated to generate dense 3D point clouds with texture. Simulation and experimental results using both synthetic and real datasets demonstrate that our method surpasses existing approaches in terms of reconstruction accuracy and robustness. The code for the proposed method will be made publicly available on the project website: https://sites.google.com/view/con-nrsfm.

[118] UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction

Jin Cao,Hongrui Wu,Ziyong Feng,Hujun Bao,Xiaowei Zhou,Sida Peng

Main category: cs.CV

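TL;DR: This paper proposes UniVerse, a unified framework for robust 3D scene reconstruction from inconsistent multi-view images. It decouples the task into restoration and reconstruction and uses a video diffusion model to learn a general scene prior from large-scale data, which lets it handle many kinds of image inconsistency and generalize well on synthetic and real datasets.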

Details

Motivation: Existing neural 3D scene representation methods depend on dense observations when handling inconsistent multi-view images, which makes robust parameter optimization difficult, so a more robust and general approach to image inconsistencies is needed. Method: Proposes UniVerse, which first converts inconsistent images into initial videos, then restores consistent images with a specially designed video diffusion model, and finally reconstructs the 3D scene from the restored images. The diffusion model learns a general scene prior from large-scale data, avoiding the limits of per-view degradation modeling. Result: Experiments on synthetic and real-world datasets show strong generalization and better robust-reconstruction performance than existing methods, along with control over the style of the reconstructed 3D scene. Conclusion: By decoupling restoration from reconstruction and introducing a general prior based on a video diffusion model, UniVerse improves the robustness and flexibility of 3D scene reconstruction from inconsistent inputs. Abstract: This paper tackles the challenge of robust reconstruction, i.e., the task of reconstructing a 3D scene from a set of inconsistent multi-view images. Some recent works have attempted to simultaneously remove image inconsistencies and perform reconstruction by integrating image degradation modeling into neural 3D scene representations. However, these methods rely heavily on dense observations for robustly optimizing model parameters. To address this issue, we propose to decouple robust reconstruction into two subtasks: restoration and reconstruction, which naturally simplifies the optimization process. To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images, and finally reconstructs the 3D scenes from these restored images. Compared with case-by-case per-view degradation modeling, the diffusion model learns a general scene prior from large-scale data, making it applicable to diverse image inconsistencies. Extensive experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction. Moreover, UniVerse can control the style of the reconstructed 3D scene. Project page: https://jin-cao-tma.github.io/UniVerse.github.io/

[119] An Efficient Deep Template Matching and In-Plane Pose Estimation Method via Template-Aware Dynamic Convolution

Ke Jia,Ji Zhou,Hanxin Li,Zhigan Zhou,Haojie Chu,Xiaojie Li

Main category: cs.CV

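TL;DR: Proposes a lightweight end-to-end template matching framework that recasts matching as joint localization and geometric regression, outputting the target's center coordinates, rotation angle, and independent scales; template-aware dynamic convolution and a training strategy that needs no geometric annotations make matching in industrial scenes efficient and precise.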

Details

Motivation: Traditional methods rely on exhaustive search over angles and scales and are inefficient; existing deep learning methods do not model geometric pose explicitly and fall short of real industrial deployment needs. Method: Reformulates template matching as a joint localization and geometric regression task, designs a Template-Aware Dynamic Convolution Module (TDCM) that injects template features at inference, uses depthwise separable convolutions and pixel shuffle for efficiency, trains without geometric annotations through rotation-shear-based augmentation and structure-aware pseudo labels, and adds a lightweight refinement module to improve angle and scale precision. Result: The model has only 3.07M parameters, runs inference in 14ms, stays accurate under compound transformations, and is robust in small-template and multi-object scenarios. Conclusion: The proposed method balances efficiency, accuracy, and generalization well and suits real-time industrial inspection and component alignment. Abstract: In industrial inspection and component alignment tasks, template matching requires efficient estimation of a target's position and geometric state (rotation and scaling) under complex backgrounds to support precise downstream operations. Traditional methods rely on exhaustive enumeration of angles and scales, leading to low efficiency under compound transformations. Meanwhile, most deep learning-based approaches only estimate similarity scores without explicitly modeling geometric pose, making them inadequate for real-world deployment. To overcome these limitations, we propose a lightweight end-to-end framework that reformulates template matching as joint localization and geometric regression, outputting the center coordinates, rotation angle, and independent horizontal and vertical scales. A Template-Aware Dynamic Convolution Module (TDCM) dynamically injects template features at inference to guide generalizable matching. The compact network integrates depthwise separable convolutions and pixel shuffle for efficient matching. To enable geometric-annotation-free training, we introduce a rotation-shear-based augmentation strategy with structure-aware pseudo labels. A lightweight refinement module further improves angle and scale precision via local optimization. Experiments show our 3.07M model achieves high precision and 14ms inference under compound transformations. It also demonstrates strong robustness in small-template and multi-object scenarios, making it highly suitable for deployment in real-time industrial applications. The code is available at: https://github.com/ZhouJ6610/PoseMatch-TDCM.
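
The regression output described here (center coordinates, rotation angle, and independent horizontal and vertical scales) fully determines where the template lands in the image. The sketch below converts such a pose into the four corner points of the matched region. The function name, the scale-then-rotate order, and the counter-clockwise angle convention in image coordinates are assumptions for illustration and may differ from the released code.

```python
import math

def template_corners(cx, cy, angle_deg, sx, sy, template_w, template_h):
    """Corners of the matched template region for a given in-plane pose.

    Assumed convention: scale the template about its center, then rotate
    by angle_deg, then translate so the center sits at (cx, cy).
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_w = sx * template_w / 2.0
    half_h = sy * template_h / 2.0
    offsets = ((-half_w, -half_h), (half_w, -half_h),
               (half_w, half_h), (-half_w, half_h))
    return [
        (cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t)
        for dx, dy in offsets
    ]

# Hypothetical prediction: a 4x2 template found at (10, 20), stretched 2x horizontally.
corners = template_corners(10.0, 20.0, 0.0, 2.0, 0.5, 4.0, 2.0)
```

A downstream alignment step could consume these corners directly, which is why regressing pose explicitly is more useful than a similarity score alone.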

[120] Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning

Xuchen Li,Xuzhao Li,Jiahui Gao,Renjie Pi,Shiyu Hu,Wentao Zhang

Main category: cs.CV

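TL;DR: Proposes an adaptive pixel reasoning framework that uses operation-aware supervised fine-tuning and rollout-guided reinforcement learning to decide dynamically when to invoke pixel-level operations, improving performance while cutting unnecessary visual operations substantially.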

Details

Motivation: Existing vision-language models perform poorly on tasks that need fine-grained visual understanding, mainly because of information loss during image encoding or insufficient attention to critical regions, and adding pixel-level information often causes inefficiency and distraction. Method: First applies operation-aware supervised fine-tuning to establish baseline competence in textual reasoning and visual operations, then designs a rollout-guided reinforcement learning framework based on feedback from the model's own responses so that it decides, according to query difficulty, whether to invoke pixel operations. Result: Performs strongly on several multimodal reasoning benchmarks, reaching 73.4% accuracy on HR-Bench 4K with a tool usage ratio of only 20.1%, which improves accuracy over previous methods while reducing tool usage by 66.5%. Conclusion: The adaptive pixel reasoning framework balances performance and computational efficiency and significantly improves VLMs on fine-grained visual tasks. Abstract: Vision-Language Models (VLMs) excel at many multimodal tasks, yet they frequently struggle with tasks requiring precise understanding and handling of fine-grained visual elements. This is mainly due to information loss during image encoding or insufficient attention to critical regions. Recent work has shown promise by incorporating pixel-level visual information into the reasoning process, enabling VLMs to access high-resolution visual details during their thought process. However, this pixel-level information is often overused, leading to inefficiency and distraction from irrelevant visual details. To address these challenges, we propose the first framework for adaptive pixel reasoning that dynamically determines necessary pixel-level operations based on the input query. Specifically, we first apply operation-aware supervised fine-tuning to establish baseline competence in textual reasoning and visual operations, then design a novel rollout-guided reinforcement learning framework relying on feedback of the model's own responses, which enables the VLM to determine when pixel operations should be invoked based on query difficulty. Experiments on extensive multimodal reasoning benchmarks show that our model achieves superior performance while significantly reducing unnecessary visual operations. Impressively, our model achieves 73.4% accuracy on HR-Bench 4K while maintaining a tool usage ratio of only 20.1%, improving accuracy and simultaneously reducing tool usage by 66.5% compared to the previous methods.

[121] Uncovering Overconfident Failures in CXR Models via Augmentation-Sensitivity Risk Scoring

Han-Jay Shu,Wei-Ning Chiu,Shun-Ting Chang,Meng-Ping Huang,Takeshi Tohyama,Ahram Han,Po-Chih Kuo

Main category: cs.CV

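TL;DR: Proposes an augmentation-sensitivity risk scoring (ASRS) framework that identifies error-prone chest radiographs by measuring how image embeddings change under clinically plausible rotations, improving the fairness and safety of medical AI.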

Details

Motivation: Deep learning models for chest radiograph interpretation show uneven accuracy across patient subgroups; conventional error detection methods miss subtle within-distribution errors, and effective label-free error identification is lacking. Method: Proposes the ASRS framework, which applies ±15°/±30° rotation augmentations to chest radiographs, measures the resulting shifts in embedding space with the RAD-DINO encoder, computes sensitivity scores, and stratifies samples into stability quartiles to identify high-risk cases. Result: Highly sensitive samples show substantially lower recall (-0.2 to -0.3) even though overall AUROC and confidence are high; ASRS identifies error-prone cases effectively and supports selective prediction and clinician review. Conclusion: ASRS offers a label-free error detection method that improves the fairness and reliability of medical AI systems without harming overall performance and has potential for clinical deployment. Abstract: Deep learning models achieve strong performance in chest radiograph (CXR) interpretation, yet fairness and reliability concerns persist. Models often show uneven accuracy across patient subgroups, leading to hidden failures not reflected in aggregate metrics. Existing error detection approaches -- based on confidence calibration or out-of-distribution (OOD) detection -- struggle with subtle within-distribution errors, while image- and representation-level consistency-based methods remain underexplored in medical imaging. We propose an augmentation-sensitivity risk scoring (ASRS) framework to identify error-prone CXR cases. ASRS applies clinically plausible rotations (±15°/±30°) and measures embedding shifts with the RAD-DINO encoder. Sensitivity scores stratify samples into stability quartiles, where highly sensitive cases show substantially lower recall (-0.2 to -0.3) despite high AUROC and confidence. ASRS provides a label-free means for selective prediction and clinician review, improving fairness and safety in medical AI.
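
The scoring pipeline described here can be sketched without any imaging library by starting from precomputed embeddings. The sketch assumes, hypothetically, that the embedding shift is measured as the mean cosine distance between the original image's embedding and the embeddings of its rotated copies, and that quartiles are assigned by rank; the abstract does not specify either choice, and the encoder call itself is left out.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def sensitivity_score(base_embedding, rotated_embeddings):
    """Mean embedding shift across the rotated copies of one image (assumed metric)."""
    shifts = [cosine_distance(base_embedding, e) for e in rotated_embeddings]
    return sum(shifts) / len(shifts)

def stability_quartiles(scores):
    """Assign each sample a quartile 0..3 by rank; 3 marks the most sensitive cases."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    quartile = [0] * n
    for rank, idx in enumerate(order):
        quartile[idx] = rank * 4 // n
    return quartile

# Hypothetical sensitivity scores for eight radiographs.
scores = [0.1, 0.4, 0.2, 0.3, 0.8, 0.5, 0.6, 0.7]
quartiles = stability_quartiles(scores)
```

Cases in the top quartile would be the ones routed to clinician review or withheld from automatic prediction, and no ground-truth labels are needed to compute the ranking.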

[122] FreeViS: Training-free Video Stylization with Inconsistent References

Jiacong Xu,Yiqun Mei,Ke Zhang,Vishal M. Patel

Main category: cs.CV

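TL;DR: Proposes FreeViS, a training-free video stylization framework that integrates multiple stylized reference images into a pretrained image-to-video model to produce high-quality, temporally coherent stylized videos.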

Details

Motivation: Existing video stylization methods fall short in temporal consistency or style richness, and dedicated models are expensive to train and depend on paired data, so a training-free, efficient, high-quality solution is needed. Method: FreeViS integrates multiple stylized reference images into a pretrained image-to-video generation model, uses high-frequency compensation to constrain content layout and motion, and uses flow-based motion cues to preserve style textures in low-saliency regions, improving temporal consistency and style detail. Result: FreeViS markedly improves temporal coherence while maintaining high stylization fidelity, outperforms recent baselines, and is preferred in human preference tests. Conclusion: FreeViS offers a practical, economical solution for high-quality video stylization without additional training and has good application prospects. Abstract: Video stylization plays a key role in content creation, but it remains a challenging problem. Naïvely applying image stylization frame-by-frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references into a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works, without introducing flickers and stutters. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. Through extensive evaluations, FreeViS delivers higher stylization fidelity and superior temporal consistency, outperforming recent baselines and achieving strong human preference. Our training-free pipeline offers a practical and economic solution for high-quality, temporally coherent video stylization. The code and videos can be accessed via https://xujiacong.github.io/FreeViS/

[123] MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs

Jiyao Liu,Jinjie Wei,Wanying Qu,Chenglong Ma,Junzhi Ning,Yunheng Li,Ying Chen,Xinzhe Luo,Pengcheng Chen,Xin Gao,Ming Hu,Huihui Xu,Xin Wang,Shujian Gao,Dingkang Yang,Zhongying Deng,Jin Ye,Lihao Liu,Junjun He,Ningsheng Xu

Main category: cs.CV

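TL;DR: Introduces MedQ-Bench, a new paradigm for medical image quality assessment based on multimodal large language models (MLLMs) that covers both perception and reasoning tasks; validated with a multi-dimensional judging protocol and a comparison against clinicians, it shows that current MLLMs are not yet stable at medical image quality assessment.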

Details

Motivation: Existing medical image quality assessment methods rely on scalar score-based metrics and cannot reflect the human reasoning process in expert evaluation, and there is no benchmark for systematically evaluating MLLMs in this area. Method: Builds the MedQ-Bench benchmark, comprising MedQ-Perception (low-level perceptual questions) and MedQ-Reasoning (no-reference and comparison reasoning tasks), spanning five imaging modalities and more than forty quality attributes; proposes a four-axis judging protocol and validates the agreement between LLM-based judgement and radiologists. Result: An evaluation of 14 state-of-the-art MLLMs finds preliminary but unstable performance on perception and reasoning tasks, with accuracy insufficient for reliable clinical use; the validation shows a gap between AI and human judgement. Conclusion: MedQ-Bench establishes a new paradigm for language-based evaluation of medical image quality, exposes the limitations of current MLLMs on such tasks, and calls for MLLMs to be optimized specifically for medical image quality assessment. Abstract: Medical Image Quality Assessment (IQA) serves as the first-mile safety gate for clinical AI, yet existing approaches remain constrained by scalar, score-based metrics and fail to reflect the descriptive, human-like reasoning process central to expert evaluation. To address this gap, we introduce MedQ-Bench, a comprehensive benchmark that establishes a perception-reasoning paradigm for language-based evaluation of medical image quality with Multi-modal Large Language Models (MLLMs). MedQ-Bench defines two complementary tasks: (1) MedQ-Perception, which probes low-level perceptual capability via human-curated questions on fundamental visual attributes; and (2) MedQ-Reasoning, encompassing both no-reference and comparison reasoning tasks, aligning model evaluation with human-like reasoning on image quality. The benchmark spans five imaging modalities and over forty quality attributes, totaling 2,600 perceptual queries and 708 reasoning assessments, covering diverse image sources including authentic clinical acquisitions, images with simulated degradations via physics-based reconstructions, and AI-generated images. To evaluate reasoning ability, we propose a multi-dimensional judging protocol that assesses model outputs along four complementary axes. We further conduct rigorous human-AI alignment validation by comparing LLM-based judgement with radiologists. Our evaluation of 14 state-of-the-art MLLMs demonstrates that models exhibit preliminary but unstable perceptual and reasoning skills, with insufficient accuracy for reliable clinical use. These findings highlight the need for targeted optimization of MLLMs in medical IQA. We hope that MedQ-Bench will catalyze further exploration and unlock the untapped potential of MLLMs for medical image quality evaluation.

[124] Holistic Order Prediction in Natural Scenes

Pierre Musacchio,Hyunmin Lee,Jaesik Park

Main category: cs.CV

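TL;DR: Proposes InstaFormer, a network that predicts the occlusion and depth orderings of all instances in a scene from an RGB image in a single forward pass, substantially reducing computational cost and input requirements.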

Details

Motivation: Existing methods depend on expensive input formats (such as category labels and binary segmentation masks) and have a high inference cost (a quadratic number of forward passes), which limits their practical use. Method: InstaFormer relies on interactions between object queries and latent mask descriptors, which represent the same objects while carrying complementary information, to predict the complete ordering relations among instances end to end. Result: Comprehensive benchmarks and ablations across several datasets validate the approach, which predicts the occlusion and depth orderings of all instances in a single forward pass. Conclusion: InstaFormer overcomes the input and compute limitations of earlier methods and offers an efficient, practical solution for instance-level geometric understanding. Abstract: Even in controlled settings, understanding instance-wise geometries is a challenging task for a wide range of visual models. Although specialized systems exist, state-of-the-art methods rely on expensive input formats (category labels, binary segmentation masks) and inference costs (a quadratic amount of forward passes). We mitigate these limitations by proposing InstaFormer, a network capable of holistic order prediction. That is, solely given an input RGB image, InstaFormer returns the full occlusion and depth orderings for all the instances in the scene in a single forward pass. At its core, InstaFormer relies on interactions between object queries and latent mask descriptors that semantically represent the same objects while carrying complementary information. We comprehensively benchmark and ablate our approach to highlight its effectiveness. Our code and models are open-source and available at this URL: https://github.com/SNU-VGILab/InstaOrder.
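
To make the cost argument concrete, the sketch below counts forward passes under the assumption that a pairwise method needs one pass per unordered instance pair (the abstract says only "quadratic"), and shows how a full pairwise "in front of" matrix, the kind of output a holistic predictor returns at once, can be turned into a depth ranking. The function names and the win-counting rule are hypothetical and are not taken from InstaFormer.

```python
def pairwise_passes(num_instances):
    """Forward passes if every unordered instance pair needs its own pass."""
    return num_instances * (num_instances - 1) // 2


def depth_ranking(closer):
    """Order instances from nearest to farthest.

    closer[i][j] is True when instance i is in front of instance j.
    Assumes the matrix is consistent (it describes a strict total order).
    """
    wins = [sum(row) for row in closer]
    return sorted(range(len(closer)), key=lambda i: -wins[i])
```

Under that assumption, a scene with 10 instances needs 45 pairwise passes, against a single pass for a holistic predictor.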

[125] PyramidStyler: Transformer-Based Neural Style Transfer with Pyramidal Positional Encoding and Reinforcement Learning

Raahul Krishna Durairaju,K. Saruladha

Main category: cs.CV

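TL;DR: Proposes PyramidStyler, a neural style transfer framework based on pyramidal positional encoding and reinforcement learning that markedly improves generation efficiency and quality for high-resolution images and complex styles.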

Details

Motivation: Existing CNN- and transformer-based neural style transfer models scale poorly and are computationally expensive on complex styles and high-resolution images, which makes real-time, high-quality rendering hard to achieve. Method: PyramidStyler introduces Pyramidal Positional Encoding (PPE), which captures local detail and global context at multiple scales, and uses reinforcement learning to dynamically optimize stylization and accelerate convergence. The model is trained on Microsoft COCO and WikiArt. Result: After 4000 epochs, content loss falls by 62.6% to 2.07 and style loss by 57.4% to 0.86, with an inference time of 1.39 s; adding reinforcement learning improves these further to a content loss of 2.03 and a style loss of 0.75, with an inference time of 1.40 s. Conclusion: PyramidStyler achieves efficient, high-quality, real-time artistic style transfer, with broad applications in media and design. Abstract: Neural Style Transfer (NST) has evolved from Gatys et al.'s (2015) CNN-based algorithm, enabling AI-driven artistic image synthesis. However, existing CNN and transformer-based models struggle to scale efficiently to complex styles and high-resolution inputs. We introduce PyramidStyler, a transformer framework with Pyramidal Positional Encoding (PPE): a hierarchical, multi-scale encoding that captures both local details and global context while reducing computational load. We further incorporate reinforcement learning to dynamically optimize stylization, accelerating convergence. Trained on Microsoft COCO and WikiArt, PyramidStyler reduces content loss by 62.6% (to 2.07) and style loss by 57.4% (to 0.86) after 4000 epochs--achieving 1.39 s inference--and yields further improvements (content 2.03; style 0.75) with minimal speed penalty (1.40 s) when using RL. These results demonstrate real-time, high-quality artistic rendering, with broad applications in media and design.
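
The reported numbers can be cross-checked with a little arithmetic. The sketch below assumes that "reduces by X% (to Y)" means a relative reduction from an unreported starting loss; the starting losses it derives are therefore implied values, not figures stated in the abstract, and the helper names are hypothetical.

```python
def baseline_from_reduction(final_loss, reduction_pct):
    """Starting loss implied by a relative reduction to final_loss."""
    return final_loss / (1.0 - reduction_pct / 100.0)


def relative_reduction_pct(before, after):
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100.0


# Implied starting losses (derived here, not reported in the abstract).
implied_content = baseline_from_reduction(2.07, 62.6)  # about 5.53
implied_style = baseline_from_reduction(0.86, 57.4)    # about 2.02

# Additional gain from reinforcement learning, relative to the non-RL model.
rl_content_gain = relative_reduction_pct(2.07, 2.03)   # about 1.9 %
rl_style_gain = relative_reduction_pct(0.86, 0.75)     # about 12.8 %
```

Read this way, the RL variant mainly improves the style loss, at a cost of 0.01 s of extra inference time.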

[126] LoBE-GS: Load-Balanced and Efficient 3D Gaussian Splatting for Large-Scale Scene Reconstruction

Sheng-Hsiang Hung,Ting-Yu Yen,Wei-Fang Sun,Simon See,Shih-Hsuan Hung,Hung-Kuo Chu

Main category: cs.CV

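TL;DR: Proposes LoBE-GS, a load-balanced and efficient 3D Gaussian Splatting framework for large-scale scenes that uses depth-aware partitioning, optimization-based assignment, and lightweight training techniques to substantially improve training efficiency and support reconstruction of larger scenes.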

Details

Motivation: Existing 3D Gaussian Splatting methods suffer from memory pressure, imbalanced partition loads, and inefficient coarse-to-fine pipelines on large or unbounded scenes, which makes scaling to city-level scenes difficult. Method: The LoBE-GS framework combines a depth-aware scene partitioning method, an optimization-based strategy that balances visible Gaussians across blocks, and two lightweight techniques, visibility cropping and selective densification, to improve training efficiency and load balance. Result: Experiments on large-scale urban and outdoor datasets show that LoBE-GS achieves up to 2× faster end-to-end training than state-of-the-art methods while maintaining reconstruction quality, and handles large-scale scenes that vanilla 3DGS cannot. Conclusion: LoBE-GS addresses the scalability and efficiency bottlenecks of 3D Gaussian Splatting on large-scale scenes and offers a workable route to real-time, high-fidelity reconstruction of large scenes. Abstract: 3D Gaussian Splatting (3DGS) has established itself as an efficient representation for real-time, high-fidelity 3D scene reconstruction. However, scaling 3DGS to large and unbounded scenes such as city blocks remains difficult. Existing divide-and-conquer methods alleviate memory pressure by partitioning the scene into blocks, but introduce new bottlenecks: (i) partitions suffer from severe load imbalance since uniform or heuristic splits do not reflect actual computational demands, and (ii) coarse-to-fine pipelines fail to exploit the coarse stage efficiently, often reloading the entire model and incurring high overhead. In this work, we introduce LoBE-GS, a novel Load-Balanced and Efficient 3D Gaussian Splatting framework, that re-engineers the large-scale 3DGS pipeline. LoBE-GS introduces a depth-aware partitioning method that reduces preprocessing from hours to minutes, an optimization-based strategy that balances visible Gaussians -- a strong proxy for computational load -- across blocks, and two lightweight techniques, visibility cropping and selective densification, to further reduce training cost. Evaluations on large-scale urban and outdoor datasets show that LoBE-GS consistently achieves up to $2\times$ faster end-to-end training time than state-of-the-art baselines, while maintaining reconstruction quality and enabling scalability to scenes infeasible with vanilla 3DGS.

[127] Pack and Force Your Memory: Long-form and Consistent Video Generation

Xiaofei Wu,Guozhen Zhang,Zhiyong Xu,Yuan Zhou,Qinglin Lu,Xuming He

Main category: cs.CV

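TL;DR: Proposes MemoryPack and Direct Forcing to address long-range dependency modeling and error accumulation in autoregressive decoding for long-form video generation, markedly improving the temporal consistency and reliability of the generated videos.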

Details

Motivation: Long-form video generation faces two main challenges: modeling long-range temporal dependencies effectively, and avoiding the steady accumulation of errors during autoregressive decoding. Existing methods fall short on both, which limits generation quality and practicality. Method: MemoryPack is a learnable context-retrieval mechanism that uses textual and image information as global guidance to jointly model short- and long-term dependencies; Direct Forcing is an efficient single-step approximation strategy that improves training-inference alignment and reduces error propagation during inference. Result: MemoryPack achieves minute-level temporal consistency while keeping computational complexity linear; Direct Forcing effectively suppresses error accumulation. Together they markedly improve the context consistency and reliability of long-form video generation. Conclusion: MemoryPack and Direct Forcing jointly strengthen autoregressive video generation models on long-video tasks and improve their practical usability. Abstract: Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. To address these challenges, we make two contributions. First, for dynamic context modeling, we propose MemoryPack, a learnable context-retrieval mechanism that leverages both textual and image information as global guidance to jointly model short- and long-term dependencies, achieving minute-level temporal consistency. This design scales gracefully with video length, preserves computational efficiency, and maintains linear complexity. Second, to mitigate error accumulation, we introduce Direct Forcing, an efficient single-step approximating strategy that improves training-inference alignment and thereby curtails error propagation during inference. Together, MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation, advancing the practical usability of autoregressive video models.

[128] Calibrating the Full Predictive Class Distribution of 3D Object Detectors for Autonomous Driving

Cornelius Schröder,Marius-Raphael Schlüter,Markus Lienkamp

Main category: cs.CV

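TL;DR: Studies confidence calibration for the classification task of 3D object detectors, proposes two regularizing loss terms that improve calibration during training, and evaluates them together with isotonic regression on several models.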

Details

Motivation: Precise object detection and uncertainty estimation are critical for autonomous driving systems, and existing methods calibrate prediction confidence inadequately, in particular for dominant and secondary class predictions. Method: Proposes two auxiliary regularizing loss terms, one calibrating the dominant prediction and one calibrating the full prediction vector, and evaluates them alongside post-hoc methods such as isotonic regression on CenterPoint, PillarNet, and DSVT-Pillar. Result: Combining the loss term for full class prediction calibration with isotonic regression gives the best calibration of dominant and secondary class predictions on CenterPoint and PillarNet; DSVT-Pillar, however, cannot be calibrated for both kinds of prediction with the same method. Conclusion: Calibrating the full predictive confidence distribution is necessary and effective for improving the reliability of 3D object detectors, and the proposed methods markedly improve confidence calibration. Abstract: In autonomous systems, precise object detection and uncertainty estimation are critical for self-aware and safe operation. This work addresses confidence calibration for the classification task of 3D object detectors. We argue that it is necessary to regard the calibration of the full predictive confidence distribution over all classes and deduce a metric which captures the calibration of dominant and secondary class predictions. We propose two auxiliary regularizing loss terms which introduce either calibration of the dominant prediction or the full prediction vector as a training goal. We evaluate a range of post-hoc and train-time methods for CenterPoint, PillarNet and DSVT-Pillar and find that combining our loss term, which regularizes for calibration of the full class prediction, and isotonic regression leads to the best calibration of CenterPoint and PillarNet with respect to both dominant and secondary class predictions. We further find that DSVT-Pillar cannot be jointly calibrated for dominant and secondary predictions using the same method.
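
For readers unfamiliar with the two standard building blocks named in the abstract, the sketch below shows a binned expected calibration error (ECE) on the dominant class and a pool-adjacent-violators fit, the core of isotonic regression. Both are textbook versions written for illustration: this is not the paper's metric for dominant and secondary predictions, nor its loss terms, and the function names and the default of 10 bins are assumptions.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: weighted mean of |average confidence - accuracy| per bin."""
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


def isotonic_fit(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit.

    `values` are the 0/1 outcomes ordered by increasing raw confidence.
    """
    blocks = []  # each block stores [sum, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted
```

The fitted values act as calibrated confidences for the corresponding raw scores, which is why isotonic regression is applied after training rather than during it.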

[129] Leveraging Prior Knowledge of Diffusion Model for Person Search

Giyeol Kim,Sooyoung Yang,Jihyong Oh,Myungjoo Kang,Chanho Eom

Main category: cs.CV

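TL;DR: Proposes DiffPS, a framework that uses a pre-trained diffusion model to resolve the optimization conflict between detection and re-identification in person search, improves performance with three specialized modules, and reaches state-of-the-art results on CUHK-SYSU and PRW.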

Details

Motivation: Existing methods use ImageNet pre-trained backbones and share features between detection and re-identification, which creates conflicting optimization objectives and makes it hard to capture complex spatial context and fine-grained identity cues. Method: The DiffPS framework exploits diffusion prior knowledge through three modules: a Diffusion-Guided Region Proposal Network (DGRPN) for better localization, a Multi-Scale Frequency Refinement Network (MSFRN) to mitigate shape bias, and a Semantic-Adaptive Feature Aggregation Network (SFAN) to leverage text-aligned diffusion features. Result: Sets a new state of the art on the two mainstream person search datasets, CUHK-SYSU and PRW. Conclusion: DiffPS resolves the optimization conflict between the sub-tasks, makes full use of the prior knowledge in diffusion models, and markedly improves joint detection and re-identification performance in person search. Abstract: Person search aims to jointly perform person detection and re-identification by localizing and identifying a query person within a gallery of uncropped scene images. Existing methods predominantly utilize ImageNet pre-trained backbones, which may be suboptimal for capturing the complex spatial context and fine-grained identity cues necessary for person search. Moreover, they rely on a shared backbone feature for both person detection and re-identification, leading to suboptimal features due to conflicting optimization objectives. In this paper, we propose DiffPS (Diffusion Prior Knowledge for Person Search), a novel framework that leverages a pre-trained diffusion model while eliminating the optimization conflict between the two sub-tasks. We analyze key properties of diffusion priors and propose three specialized modules: (i) Diffusion-Guided Region Proposal Network (DGRPN) for enhanced person localization, (ii) Multi-Scale Frequency Refinement Network (MSFRN) to mitigate shape bias, and (iii) Semantic-Adaptive Feature Aggregation Network (SFAN) to leverage text-aligned diffusion features. DiffPS sets a new state-of-the-art on CUHK-SYSU and PRW.

[130] Flow-Matching Guided Deep Unfolding for Hyperspectral Image Reconstruction

Yi Ai,Yuanhao Cai,Yulun Zhang,Xiaokang Yang

Main category: cs.CV

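TL;DR: Proposes FMU, a new hyperspectral image reconstruction method that is the first to combine flow matching with a deep unfolding network; a mean velocity loss strengthens the global consistency of the flow, and the method significantly outperforms existing approaches on simulated and real datasets.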

Details

Motivation: Hyperspectral imaging provides rich spatial-spectral information but is costly to acquire because of hardware limitations and the difficulty of reconstructing three-dimensional data from compressed measurements; existing compressive sensing systems such as CASSI still suffer from severe degradation and loss of spectral detail during reconstruction. Method: The Flow-Matching-guided Unfolding network (FMU) embeds the generative prior of flow matching in a deep unfolding framework and adds a mean velocity loss to strengthen the global consistency of the flow, combining the interpretability of optimization-based methods with the generative capacity of flow matching. Result: Experiments on simulated and real datasets show that FMU significantly outperforms existing methods in reconstruction quality. Conclusion: By fusing flow matching with an unfolding network, FMU improves the accuracy and robustness of hyperspectral image reconstruction and offers a new direction for hyperspectral imaging under compressive sensing. Abstract: Hyperspectral imaging (HSI) provides rich spatial-spectral information but remains costly to acquire due to hardware limitations and the difficulty of reconstructing three-dimensional data from compressed measurements. Although compressive sensing systems such as CASSI improve efficiency, accurate reconstruction is still challenged by severe degradation and loss of fine spectral details. We propose the Flow-Matching-guided Unfolding network (FMU), which, to our knowledge, is the first to integrate flow matching into HSI reconstruction by embedding its generative prior within a deep unfolding framework. To further strengthen the learned dynamics, we introduce a mean velocity loss that enforces global consistency of the flow, leading to a more robust and accurate reconstruction. This hybrid design leverages the interpretability of optimization-based methods and the generative capacity of flow matching. Extensive experiments on both simulated and real datasets show that FMU significantly outperforms existing approaches in reconstruction quality. Code and models will be available at https://github.com/YiAi03/FMU.

[131] Automated Defect Detection for Mass-Produced Electronic Components Based on YOLO Object Detection Models

Wei-Lung Mao,Chun-Chi Wang,Po-Heng Chou,Yen-Ting Liu

Main category: cs.CV

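TL;DR: Proposes an automated defect detection system for dual in-line packages (DIP) based on deep learning and generative adversarial networks; combining YOLO models with ConSinGAN data augmentation, it achieves high accuracy (95.50%) and fast detection (285 ms) on surface and pin-leg defects, is integrated with a SCADA system, and suits industrial settings where defect samples are scarce.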

Details

Motivation: Conventional defect detection for industrial components is time-consuming and labor-intensive, which burdens inspection staff and makes quality control difficult, so an efficient, automated method is needed. Method: Images are captured with digital camera optics; ConSinGAN generates defect samples to address the shortage of data; four YOLO models (v3, v4, v7, and v9) are compared with and without ConSinGAN augmentation; and the resulting automated detection system is integrated with a SCADA monitoring architecture. Result: YOLOv7 with ConSinGAN reaches 95.50% accuracy with a detection time of 285 ms, outperforming the other YOLO versions and conventional threshold-based methods by a wide margin; the system extends to many defect types and to data-scarce settings. Conclusion: The proposed ConSinGAN-augmented YOLOv7 model detects DIP component defects efficiently and accurately and has good prospects for industrial adoption. Abstract: Since the defect detection of conventional industry components is time-consuming and labor-intensive, it places a significant burden on quality inspection personnel and makes it difficult to manage product quality. In this paper, we propose an automated defect detection system for the dual in-line package (DIP) that is widely used in industry, using digital camera optics and a deep learning (DL)-based model. The two most common defect categories of DIP are examined: (1) surface defects, and (2) pin-leg defects. However, the lack of defective component images poses a challenge for detection tasks. To solve this problem, the ConSinGAN is used to generate a suitable-sized dataset for training and testing. Four varieties of the YOLO model are investigated (v3, v4, v7, and v9), both in isolation and with the ConSinGAN augmentation. The proposed YOLOv7 with ConSinGAN outperforms the other YOLO versions, with an accuracy of 95.50% and a detection time of 285 ms, and is far superior to threshold-based approaches. In addition, the supervisory control and data acquisition (SCADA) system is developed, and the associated sensor architecture is described. The proposed automated defect detection can be easily established with numerous types of defects or insufficient defect data.

[132] Foundation Visual Encoders Are Secretly Few-Shot Anomaly Detectors

Guangyao Zhai,Yue Zhou,Xinyan Deng,Lars Heckler,Nassir Navab,Benjamin Busam

Main category: cs.CV

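TL;DR: Proposes FoundAD, a few-shot anomaly detection method that exploits the distribution of normal images learned by large-scale pre-trained visual encoders; a nonlinear projection operator maps features onto the natural image manifold to identify anomalous regions in an image.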

Details

Motivation: Accurately separating normal from abnormal features is challenging with few samples, especially under category-agnostic conditions, and existing methods are limited by large parameter counts or weak generalization. Method: Building on large-scale pre-trained visual encoders, the method learns a nonlinear projection operator that maps the learned embeddings onto the natural image manifold and uses the correlation between the amount of anomaly and the difference in embeddings to detect anomalies. Result: The method performs well on multi-class anomaly detection, matching or exceeding prior methods with fewer parameters, and is validated with several foundation encoders, including the recent DINOv3. Conclusion: FoundAD offers a new perspective on using foundation model features for few-shot anomaly detection and advances the field. Abstract: Few-shot anomaly detection streamlines and simplifies industrial safety inspection. However, limited samples make accurate differentiation between normal and abnormal features challenging, and even more so under category-agnostic conditions. Large-scale pre-training of foundation visual encoders has advanced many fields, as the enormous quantity of data helps to learn the general distribution of normal images. We observe that the anomaly amount in an image directly correlates with the difference in the learnt embeddings and utilize this to design a few-shot anomaly detector termed FoundAD. This is done by learning a nonlinear projection operator onto the natural image manifold. The simple operator acts as an effective tool for anomaly detection to characterize and identify out-of-distribution regions in an image. Extensive experiments show that our approach supports multi-class detection and achieves competitive performance while using substantially fewer parameters than prior methods. Backed up by evaluations with multiple foundation encoders, including the recent DINOv3, we believe this idea broadens the perspective on foundation features and advances the field of few-shot anomaly detection.

[133] ClustViT: Clustering-based Token Merging for Semantic Segmentation

Fabio Montello,Ronja Güldenring,Lazaros Nalpantidis

Main category: cs.CV

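TL;DR: Proposes ClustViT, which merges similar tokens with a trainable clustering module and restores detail with a regeneration module, substantially reducing computation and inference time while maintaining semantic segmentation accuracy.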

Details

Motivation: The use of Vision Transformers in robotic systems is limited by their quadratic attention complexity, and existing token merging methods are poorly suited to dense prediction tasks. Method: A trainable Cluster module added to the ViT backbone merges similar tokens, guided by pseudo-clusters derived from segmentation masks, and a Regenerator module restores fine details. Result: Achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three datasets with comparable segmentation accuracy. Conclusion: ClustViT balances computational efficiency against segmentation performance and suits resource-constrained, real-world robotic systems. Abstract: Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited due to their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, where we expand upon the Vision Transformer (ViT) backbone and address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network guided by pseudo-clusters from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three different datasets, with comparable segmentation accuracy. Our code and models will be made publicly available.
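
The merge-then-restore idea can be illustrated without any learning. In the sketch below, tokens that share a cluster id are averaged into one token, and the full-length sequence is later rebuilt by copying each merged token back to its original positions. The paper's Cluster and Regenerator modules are trainable; the fixed averaging, the copy-back, and the function names used here are simplifying assumptions.

```python
def merge_tokens(tokens, cluster_ids):
    """Average tokens that share a cluster id.

    Returns the merged tokens and the sorted cluster ids they belong to.
    """
    groups = {}
    for tok, cid in zip(tokens, cluster_ids):
        groups.setdefault(cid, []).append(tok)
    ids = sorted(groups)
    merged = [
        [sum(col) / len(groups[cid]) for col in zip(*groups[cid])]
        for cid in ids
    ]
    return merged, ids


def regenerate_tokens(merged, ids, cluster_ids):
    """Rebuild a full-length sequence by copying each merged token back."""
    lookup = dict(zip(ids, merged))
    return [list(lookup[cid]) for cid in cluster_ids]
```

Because self-attention cost grows quadratically with sequence length, processing the shorter merged sequence in the intermediate layers is where the savings would come from.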

Yongyi Su,Haojie Zhang,Shijie Li,Nanqing Liu,Jingyi Liao,Junyi Pan,Yuan Liu,Xiaofen Xing,Chong Sun,Chen Li,Nancy F. Chen,Shuicheng Yan,Xulei Yang,Xun Xu

Main category: cs.CV

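TL;DR: Proposes Patch-as-Decodable Token (PaDT), a unified paradigm that lets multimodal large language models (MLLMs) directly generate text and a variety of visual outputs; with Visual Reference Tokens (VRTs) and a lightweight decoder, it achieves state-of-the-art performance on detection, segmentation, and grounding tasks.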

Details

Motivation: Existing MLLMs rely on indirect representations for vision tasks (such as generating coordinates as text), which limits performance and prevents dense prediction tasks such as segmentation, so a unified framework that directly generates diverse visual outputs is needed. Method: PaDT derives Visual Reference Tokens (VRTs) from image patch embeddings, interleaves them with the LLM's textual output tokens, and uses a lightweight decoder to turn the LLM's outputs into detection, segmentation, and grounding results; VRTs are processed independently at each forward pass and the embedding table is expanded dynamically to improve localization and differentiation; training uses supervised fine-tuning on randomly selected VRTs and a robust per-token cross-entropy loss. Result: Experiments on four visual perception and understanding tasks show that PaDT consistently reaches state-of-the-art performance, even against much larger MLLMs. Conclusion: PaDT offers an effective route to unified multimodal generation in MLLMs, markedly improves dense visual prediction, and shows good scalability and application potential. Abstract: Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches for vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and prevents dense prediction tasks like segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images and interleaved seamlessly with the LLM's output textual tokens. A lightweight decoder then transforms the LLM's outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks suggest that PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLM models. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.

[135] TriAlignXA: An Explainable Trilemma Alignment Framework for Trustworthy Agri-product Grading

Jianfei Xie,Ziyang Li

Main category: cs.CV

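TL;DR: Proposes TriAlignXA, an explainable AI framework that uses three optimization engines and a pre-mapping mechanism to balance the 'impossible triangle' of agricultural product quality, timeliness, and economic viability, strengthening consumer trust in online fruit and vegetable e-commerce.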

Details

Motivation: Online fruit and vegetable e-commerce suffers from a 'trust deficit': digital transactions offer no direct sensory perception of product quality, so consumer trust is low. Method: Builds a 'Trust Pyramid' model, proposes the 'Triangular Trust Index' (TTI), and designs the explainable AI framework TriAlignXA, which comprises a Bio-Adaptive Engine, a Timeliness Optimization Engine, and an Economic Optimization Engine, together with a Pre-Mapping Mechanism that encodes process data into QR codes for transparency. Result: Experiments show that the framework significantly outperforms baseline models on grading tasks, confirm its ability to balance the 'impossible triangle', and support its contribution to trust building through empirical and theoretical analysis. Conclusion: TriAlignXA provides end-to-end support, from theory to practice, for building a trustworthy online produce ecosystem and establishes a pathway from algorithmic decision-making to consumer trust. Abstract: The 'trust deficit' in online fruit and vegetable e-commerce stems from the inability of digital transactions to provide direct sensory perception of product quality. This paper constructs a 'Trust Pyramid' model through 'dual-source verification' of consumer trust. Experiments confirm that quality is the cornerstone of trust. The study reveals an 'impossible triangle' in agricultural product grading, comprising biological characteristics, timeliness, and economic viability, highlighting the limitations of traditional absolute grading standards. To quantitatively assess this trade-off, we propose the 'Triangular Trust Index' (TTI). We redefine the role of algorithms from 'decision-makers' to 'providers of transparent decision-making bases', designing the explainable AI framework--TriAlignXA. This framework supports trustworthy online transactions within agricultural constraints through multi-objective optimization. Its core relies on three engines: the Bio-Adaptive Engine for granular quality description; the Timeliness Optimization Engine for processing efficiency; and the Economic Optimization Engine for cost control. Additionally, the 'Pre-Mapping Mechanism' encodes process data into QR codes, transparently conveying quality information. Experiments on grading tasks demonstrate significantly higher accuracy than baseline models. Empirical evidence and theoretical analysis verify the framework's balancing capability in addressing the 'impossible triangle'. This research provides comprehensive support--from theory to practice--for building a trustworthy online produce ecosystem, establishing a critical pathway from algorithmic decision-making to consumer trust.

[136] 4DGS-Craft: Consistent and Interactive 4D Gaussian Splatting Editing

Lei Liu,Can Wang,Zhenghao Chen,Dong Xu

Main category: cs.CV

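TL;DR: Proposes 4DGS-Craft, a consistent and interactive 4D Gaussian Splatting editing framework that uses a 4D-aware InstructPix2Pix model, a multi-view grid module, and a Gaussian selection mechanism to keep views, time, and non-edited regions consistent, and uses an LLM to interpret and decompose complex user instructions.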

Details

Motivation: Existing 4D Gaussian Splatting editing methods struggle with view, temporal, and non-edited region consistency, and with complex text instructions. Method: Introduces a 4D-aware InstructPix2Pix model that incorporates 4D VGGT geometry features; a multi-view grid module that iteratively refines multi-view images together with the 4D scene; a Gaussian selection mechanism that optimizes only the Gaussians inside edited regions; and an LLM-based user intent module that decomposes complex instructions into a sequence of atomic operations. Result: Compared with existing methods, the approach achieves more consistent and controllable 4D scene editing, handles complex text instructions, and keeps non-edited regions stable. Conclusion: 4DGS-Craft markedly improves the consistency, interactivity, and instruction handling of 4D Gaussian Splatting editing and advances 4D content creation. Abstract: Recent advances in 4D Gaussian Splatting (4DGS) editing still face challenges with view, temporal, and non-editing region consistency, as well as with handling complex text instructions. To address these issues, we propose 4DGS-Craft, a consistent and interactive 4DGS editing framework. We first introduce a 4D-aware InstructPix2Pix model to ensure both view and temporal consistency. This model incorporates 4D VGGT geometry features extracted from the initial scene, enabling it to capture underlying 4D geometric structures during editing. We further enhance this model with a multi-view grid module that enforces consistency by iteratively refining multi-view input images while jointly optimizing the underlying 4D scene. Furthermore, we preserve the consistency of non-edited regions through a novel Gaussian selection mechanism, which identifies and optimizes only the Gaussians within the edited regions. Beyond consistency, facilitating user interaction is also crucial for effective 4DGS editing. Therefore, we design an LLM-based module for user intent understanding. This module employs a user instruction template to define atomic editing operations and leverages an LLM for reasoning. As a result, our framework can interpret user intent and decompose complex instructions into a logical sequence of atomic operations, enabling it to handle intricate user commands and further enhance editing performance. Compared to related works, our approach enables more consistent and controllable 4D scene editing. Our code will be made available upon acceptance.

[137] Pure-Pass: Fine-Grained, Adaptive Masking for Dynamic Token-Mixing Routing in Lightweight Image Super-Resolution

Junyu Wu,Jie Tang,Jie Liu,Gangshan Wu

Main category: cs.CV

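TL;DR: Proposes Pure-Pass (PP), a pixel-level masking mechanism that reduces computation in image super-resolution by identifying 'pure pixels' and skipping their expensive computation, giving fine-grained, spatially flexible masking while maintaining high performance.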

Details

Motivation: Existing lightweight image super-resolution methods such as CAMixer suffer from poor adaptability, coarse-grained masking, and limited spatial flexibility, so a more efficient and finer-grained way to allocate computation is needed. Method: Pure-Pass (PP) classifies pixels using fixed color center points, identifies pure pixels that can be skipped, applies masking at the pixel level, and is integrated into the ATD-light model. Result: With similar computational savings, PP-ATD-light outperforms CAMixer-ATD-light in reconstruction quality and parameter efficiency. Conclusion: Pure-Pass improves the performance and flexibility of image super-resolution models while keeping computational overhead low, and is practical to deploy. Abstract: Image Super-Resolution (SR) aims to reconstruct high-resolution images from low-resolution counterparts, but the computational complexity of deep learning-based methods often hinders practical deployment. CAMixer is the pioneering work to integrate the advantages of existing lightweight SR methods and proposes a content-aware mixer to route token mixers of varied complexities according to the difficulty of content recovery. However, several limitations remain, such as poor adaptability, coarse-grained masking and spatial inflexibility, among others. We propose Pure-Pass (PP), a pixel-level masking mechanism that identifies pure pixels and exempts them from expensive computations. PP utilizes fixed color center points to classify pixels into distinct categories, enabling fine-grained, spatially flexible masking while maintaining adaptive flexibility. Integrated into the state-of-the-art ATD-light model, PP-ATD-light achieves superior SR performance with minimal overhead, outperforming CAMixer-ATD-light in reconstruction quality and parameter efficiency when saving a similar amount of computation.
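
A rough sketch of pixel-level masking with fixed color centers follows. The abstract states only that pixels are classified against fixed color center points; the purity rule used here (a pixel is pure when it and its 4-neighbours fall in the same color category), the squared-distance metric, and the function names are assumptions made for illustration and may differ from the paper's definition.

```python
def nearest_center(pixel, centers):
    """Index of the color center closest to the pixel (squared distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(pixel, centers[k])),
    )


def pure_mask(image, centers):
    """Boolean mask over a 2D list of RGB tuples.

    Assumed criterion: a pixel is pure when every in-bounds 4-neighbour
    falls into the same color category as the pixel itself.
    """
    height, width = len(image), len(image[0])
    labels = [[nearest_center(px, centers) for px in row] for row in image]
    mask = [[True] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for dy, dx in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < height and 0 <= nx < width
                if inside and labels[ny][nx] != labels[y][x]:
                    mask[y][x] = False
    return mask
```

Pixels marked `True` would bypass the expensive token mixer, while pixels near category boundaries, where detail is likely, would still receive full computation.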

[138] Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using GPT-4o: Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework

Nanaka Hosokawa,Ryo Takahashi,Tomoya Kitano,Yukihiro Iida,Chisako Muramatsu,Tatsuro Hayashi,Yuta Seino,Xiangrong Zhou,Takeshi Hara,Akitoshi Katsumata,Hiroshi Fujita

Main category: cs.CV

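TL;DR: Proposes a method based on a Self-correction Loop with Structured Output (SLSO) framework that uses the multimodal capabilities of GPT-4o to generate findings for jaw cysts on dental panoramic radiographs; compared with the conventional chain-of-thought method it improves accuracy on several evaluation items, most notably tooth number, tooth movement, and root resorption.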

Details

Motivation: To improve the accuracy and reliability of AI-generated medical imaging reports, reduce hallucinations, and identify the key features of jaw cysts more precisely. Method: A 10-step SLSO framework built on GPT-4o's multimodal analysis generates findings automatically through image input, structured data generation, tooth number extraction with consistency checking, and iterative regeneration when inconsistencies are detected; it is compared with the conventional Chain-of-Thought (CoT) method. Result: The SLSO framework achieved improvement rates of 66.9%, 33.3%, and 28.6% for tooth number, tooth movement, and root resorption, respectively. A consistent structured output was obtained after at most five regenerations, hallucinations were suppressed, and negative findings were described more completely, but identification of extensive lesions spanning multiple teeth remains limited. Conclusion: The SLSO framework improves the accuracy and structural consistency of automatically generated jaw cyst findings. Statistical significance was not reached because of the small sample, but the results show its potential; further refinement is needed before clinical use. Abstract: In this study, we utilized the multimodal capabilities of OpenAI GPT-4o to automatically generate jaw cyst findings on dental panoramic radiographs. To improve accuracy, we constructed a Self-correction Loop with Structured Output (SLSO) framework and verified its effectiveness. A 10-step process was implemented for 22 cases of jaw cysts, including image input and analysis, structured data generation, tooth number extraction and consistency checking, iterative regeneration when inconsistencies were detected, and finding generation with subsequent restructuring and consistency verification. A comparative experiment was conducted using the conventional Chain-of-Thought (CoT) method across seven evaluation items: transparency, internal structure, borders, root resorption, tooth movement, relationships with other structures, and tooth number. The results showed that the proposed SLSO framework improved output accuracy for many items, with 66.9%, 33.3%, and 28.6% improvement rates for tooth number, tooth movement, and root resorption, respectively. In the successful cases, a consistently structured output was achieved after up to five regenerations. Although statistical significance was not reached because of the small size of the dataset, the overall SLSO framework enforced negative finding descriptions, suppressed hallucinations, and improved tooth number identification accuracy. However, the accurate identification of extensive lesions spanning multiple teeth remains limited, and further refinement is required to enhance overall performance and move toward a practical finding generation system.
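
The regenerate-until-consistent step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names (`tooth_numbers`, `teeth_cited_in_findings`), the consistency rule, and the `fake_model` stand-in for the GPT-4o call are all hypothetical; only the cap of five regenerations comes from the abstract.

```python
from typing import Callable, Dict, List, Optional, Tuple

MAX_REGENERATIONS = 5  # the abstract reports consistency after up to five regenerations

Structured = Dict[str, List[int]]


def tooth_numbers_consistent(structured: Structured) -> bool:
    """Hypothetical rule: every tooth cited in the findings must be in the lesion's tooth list."""
    lesion_teeth = set(structured.get("tooth_numbers", []))
    cited = set(structured.get("teeth_cited_in_findings", []))
    return bool(lesion_teeth) and cited.issubset(lesion_teeth)


def self_correction_loop(
    generate: Callable[[int], Structured],
) -> Tuple[Optional[Structured], int]:
    """Call generate(attempt) until the output passes the check or the budget is spent."""
    for attempt in range(MAX_REGENERATIONS + 1):  # initial generation plus up to five retries
        structured = generate(attempt)
        if tooth_numbers_consistent(structured):
            return structured, attempt
    return None, MAX_REGENERATIONS


def fake_model(attempt: int) -> Structured:
    """Stand-in for the multimodal model call; becomes consistent on the third generation."""
    cited = [36, 37] if attempt >= 2 else [36, 48]
    return {"tooth_numbers": [36, 37], "teeth_cited_in_findings": cited}


result, regenerations = self_correction_loop(fake_model)
print(result, regenerations)
```

Returning `None` when the budget runs out keeps unsuccessful cases visible instead of silently passing an inconsistent output downstream.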

[139] LiLa-Net: Lightweight Latent LiDAR Autoencoder for 3D Point Cloud Reconstruction

Mario Resino,Borja Pérez,Jaime Godoy,Abdulla Al-Kaff,Fernando García

Main category: cs.CV

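TL;DR: Proposes LiLa-Net, a 3D autoencoder architecture that efficiently encodes features of real traffic environments from LiDAR point clouds alone, achieving high-quality reconstruction and good generalization with simplified skip connections and fewer encoder layers.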

Details

Motivation: To extract LiDAR point cloud features from real traffic environments and reconstruct them with high quality without relying on extensive resources. Method: A lightweight 3D autoencoder, LiLa-Net, that simplifies the skip connections and reduces the number of encoder layers while balancing the information carried by the skip connections against the latent encoding. Result: The model reconstructs the original point cloud accurately while remaining efficient, and it generalizes well to objects unrelated to the traffic environment. Conclusion: LiLa-Net achieves high-quality point cloud reconstruction and strong generalization under resource constraints, supporting the value of a simplified architecture in real traffic scenes. Abstract: This work proposes a 3D autoencoder architecture, named LiLa-Net, which encodes efficient features from real traffic environments, employing only the LiDAR's point clouds. For this purpose, we use a real semi-autonomous vehicle equipped with a Velodyne LiDAR. The system leverages the concept of skip connections to improve performance without the extensive resources required by state-of-the-art architectures. Key changes include reducing the number of encoder layers and simplifying the skip connections, while still producing an efficient and representative latent space that allows accurate reconstruction of the original point cloud. Furthermore, an effective balance has been achieved between the information carried by the skip connections and the latent encoding, leading to improved reconstruction quality without compromising performance. Finally, the model demonstrates strong generalization capabilities, successfully reconstructing objects unrelated to the original traffic environment.

[140] kabr-tools: Automated Framework for Multi-Species Behavioral Monitoring

Jenna Kline,Maksim Kholiavchenko,Samuel Stevens,Nina van Tiel,Alison Zhong,Namrata Banerji,Alec Sheets,Sowbaranika Balasubramaniam,Isla Duporge,Matthew Thompson,Elizabeth Campolongo,Jackson Miliko,Neil Rosser,Tanya Berger-Wolf,Charles V. Stewart,Daniel I. Rubenstein

Main category: cs.CV

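TL;DR: Presents kabr-tools, an open-source toolkit that combines drone video with machine learning to automate multi-species behavioral monitoring of wildlife, substantially improving the granularity and scale of behavioral observation.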

Details

Motivation: Traditional field observation is limited in scope, time, and labor and struggles to capture complex behavioral patterns, so a scalable approach is needed to advance wildlife behavioral ecology. Method: kabr-tools integrates drone video with machine learning systems for object detection, tracking, and behavioral classification to extract key metrics: time budgets, behavioral transitions, social interactions, habitat associations, and group composition dynamics. Result: Compared with ground-based observation, the drone-based approach reduced visibility loss by 15% and captured more behavioral transitions with higher accuracy and continuity. Three case studies covering 969 behavioral sequences validated the toolkit. Vigilance in Grevy's zebras, as in Plains zebras, decreases with herd size, but the effect of habitat differs; both species show strong behavioral inertia, and spatial segregation was observed in mixed-species herds. Conclusion: kabr-tools enables automated behavioral monitoring at scale and offers a powerful new tool for ecosystem-wide studies, biodiversity conservation, and ecological monitoring. Abstract: A comprehensive understanding of animal behavior ecology depends on scalable approaches to quantify and interpret complex, multidimensional behavioral patterns. Traditional field observations are often limited in scope, time-consuming, and labor-intensive, hindering the assessment of behavioral responses across landscapes. To address this, we present kabr-tools (Kenyan Animal Behavior Recognition Tools), an open-source package for automated multi-species behavioral monitoring. This framework integrates drone-based video with machine learning systems to extract behavioral, social, and spatial metrics from wildlife footage. Our pipeline leverages object detection, tracking, and behavioral classification systems to generate key metrics, including time budgets, behavioral transitions, social interactions, habitat associations, and group composition dynamics. Compared to ground-based methods, drone-based observations significantly improved behavioral granularity, reducing visibility loss by 15% and capturing more transitions with higher accuracy and continuity. We validate kabr-tools through three case studies, analyzing 969 behavioral sequences, surpassing the capacity of traditional methods for data capture and annotation. We found that, like Plains zebras, vigilance in Grevy's zebras decreases with herd size, but, unlike Plains zebras, habitat has a negligible impact. Plains and Grevy's zebras exhibit strong behavioral inertia, with rare transitions to alert behaviors and observed spatial segregation between Grevy's zebras, Plains zebras, and giraffes in mixed-species herds. By enabling automated behavioral monitoring at scale, kabr-tools offers a powerful tool for ecosystem-wide studies, advancing conservation, biodiversity research, and ecological monitoring.
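
Two of the metrics named above, time budgets and behavioral transitions, are simple to compute once each frame of a tracked animal carries a behavior label. The sketch below is illustrative only and does not use the kabr-tools API; the function names and the example label sequence are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def time_budget(labels: List[str]) -> Dict[str, float]:
    """Fraction of observed frames spent in each behavior."""
    counts = Counter(labels)
    total = len(labels)
    return {behavior: n / total for behavior, n in counts.items()}


def transition_counts(labels: List[str]) -> Dict[Tuple[str, str], int]:
    """Count changes of behavior between consecutive frames (self-transitions are ignored)."""
    pairs = zip(labels, labels[1:])
    return dict(Counter((a, b) for a, b in pairs if a != b))


# Hypothetical per-frame labels for one tracked zebra.
frames = ["graze", "graze", "graze", "walk", "walk", "graze", "alert", "graze"]
budget = time_budget(frames)
transitions = transition_counts(frames)
print(budget)
print(transitions)
```

A sequence with few entries in `transitions` relative to its length corresponds to the "behavioral inertia" the abstract describes.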

[141] GaussianMorphing: Mesh-Guided 3D Gaussians for Semantic-Aware Object Morphing

Mengtian Li,Yunshu Bai,Yimin Chu,Yijun Shen,Zhongmei Li,Weifeng Ge,Zhifeng Xie,Chaofeng Chen

Main category: cs.CV

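TL;DR: Proposes GaussianMorphing, a new framework for semantic-aware 3D shape and texture morphing from multi-view images. Mesh-guided 3D Gaussian Splatting provides high-fidelity modeling and preserves geometric consistency and texture fidelity without labeled data.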

Details

Motivation: Existing methods rely on point clouds or require pre-defined homeomorphic mappings for untextured data, and they lack semantic consistency, so a method is needed that preserves both geometric and texture quality while establishing semantic correspondence automatically. Method: Mesh-guided 3D Gaussian Splatting (3DGS) models geometry and appearance; 3D Gaussians are anchored to mesh patches and combined with topology-aware constraints and physically plausible point trajectories to obtain a unified deformation strategy and unsupervised semantic correspondence. Result: On the proposed TexMorph benchmark, the method substantially outperforms prior 2D/3D methods, reducing color consistency error ($\Delta E$) by 22.2% and EI by 26.2%. Conclusion: By combining a mesh-guided 3D Gaussian representation with a topology-aware deformation strategy, GaussianMorphing achieves high-quality, semantically consistent 3D morphing without labeled data and with strong geometric and texture fidelity. Abstract: We introduce GaussianMorphing, a novel framework for semantic-aware 3D shape and texture morphing from multi-view images. Previous approaches usually rely on point clouds or require pre-defined homeomorphic mappings for untextured data. Our method overcomes these limitations by leveraging mesh-guided 3D Gaussian Splatting (3DGS) for high-fidelity geometry and appearance modeling. The core of our framework is a unified deformation strategy that anchors 3D Gaussians to reconstructed mesh patches, ensuring geometrically consistent transformations while preserving texture fidelity through topology-aware constraints. In parallel, our framework establishes unsupervised semantic correspondence by using the mesh topology as a geometric prior and maintains structural integrity via physically plausible point trajectories. This integrated approach preserves both local detail and global semantic coherence throughout the morphing process without requiring labeled data. On our proposed TexMorph benchmark, GaussianMorphing substantially outperforms prior 2D/3D methods, reducing color consistency error ($\Delta E$) by 22.2% and EI by 26.2%. Project page: https://baiyunshu.github.io/GAUSSIANMORPHING.github.io/
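
To make the reported color consistency figure concrete, here is a small sketch of how a $\Delta E$ value and a relative reduction could be computed. The abstract does not say which $\Delta E$ formula is used; the CIE76 definition (Euclidean distance in CIELAB space) is assumed here, and the baseline and method errors are made-up numbers chosen only to reproduce a 22.2% reduction.

```python
import math
from typing import Sequence


def delta_e_cie76(lab1: Sequence[float], lab2: Sequence[float]) -> float:
    """CIE76 color difference: Euclidean distance between two (L*, a*, b*) triples."""
    return math.dist(lab1, lab2)


def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction by which an error metric drops relative to a baseline."""
    return (baseline - ours) / baseline


# Two hypothetical colors that differ only in the a* and b* channels.
print(delta_e_cie76((50.0, 0.0, 0.0), (50.0, 3.0, 4.0)))

# Hypothetical mean errors for a baseline and for the evaluated method.
print(relative_reduction(10.0, 7.78))
```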

[142] Zero-shot Human Pose Estimation using Diffusion-based Inverse solvers

Sahil Bhandary Karnoor,Romit Roy Choudhury

Main category: cs.CV

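TL;DR: Proposes InPose, a new method that achieves zero-shot generalization in human pose estimation by conditioning a pre-trained diffusion model on rotational measurements alone and guiding it with a likelihood term derived from measured locations.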

Details

Motivation: Location measurements vary widely with the user's body size, so existing pose estimation algorithms based on conditional diffusion models generalize poorly across users. Method: Pose estimation is formulated as an inverse problem: a pre-trained diffusion model is conditioned on rotational measurements alone, and its prior is guided by a likelihood term derived from the location measurements. Result: Without user-specific training, InPose estimates the pose sequence that best explains the sparse on-body sensor measurements. Conclusion: The method achieves good zero-shot generalization across users and offers an effective solution for pose estimation when the number of sensors is limited in practice. Abstract: Pose estimation refers to tracking a human's full body posture, including their head, torso, arms, and legs. The problem is challenging in practical settings where the number of body sensors is limited. Past work has shown promising results using conditional diffusion models, where the pose prediction is conditioned on both location and rotation measurements from the sensors. Unfortunately, nearly all these approaches generalize poorly across users, primarily because location measurements are highly influenced by the body size of the user. In this paper, we formulate pose estimation as an inverse problem and design an algorithm capable of zero-shot generalization. Our idea utilizes a pre-trained diffusion model and conditions it on rotational measurements alone; the priors from this model are then guided by a likelihood term, derived from the measured locations. Thus, given any user, our proposed InPose method generatively estimates the highly likely sequence of poses that best explains the sparse on-body measurements.
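
The inverse-problem formulation can be written out with Bayes' rule. The following is a sketch under assumed notation, none of which appears in the abstract: $x_t$ is the noisy pose at diffusion step $t$, $r$ the rotational measurements, $\ell$ the measured locations, $\hat{x}_0(x_t)$ a denoised pose estimate, $f_\beta$ a forward-kinematics map for a user with body shape $\beta$, and $\sigma$ a noise scale. The Gaussian form of the likelihood is an assumption; the abstract does not state the one InPose uses.

$$
\nabla_{x_t} \log p(x_t \mid r, \ell)
= \underbrace{\nabla_{x_t} \log p(x_t \mid r)}_{\text{pre-trained diffusion prior}}
+ \underbrace{\nabla_{x_t} \log p(\ell \mid x_t, r)}_{\text{likelihood guidance}},
\qquad
\log p(\ell \mid x_t, r) \approx -\frac{1}{2\sigma^2} \bigl\lVert \ell - f_\beta\bigl(\hat{x}_0(x_t)\bigr) \bigr\rVert^2 + \text{const}.
$$

The first identity is exact, since $p(x_t \mid r, \ell) \propto p(x_t \mid r)\, p(\ell \mid x_t, r)$ and the normalizer does not depend on $x_t$. Under this reading, body size enters only through $f_\beta$ in the guidance term, which is why the rotation-conditioned prior would not need retraining for a new user.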

[143] VGDM: Vision-Guided Diffusion Model for Brain Tumor Detection and Segmentation

Arman Behnam

Main category: cs.CV

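TL;DR: Proposes VGDM, a vision-transformer-based diffusion model for brain tumor detection and segmentation that combines global contextual reasoning with iterative denoising to improve segmentation accuracy and robustness.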

Details

Motivation: Conventional U-Nets have limited capacity to capture long-range dependencies and struggle to segment complex tumor structures accurately, so a stronger model is needed. Method: A vision transformer is embedded at the core of the diffusion model; the transformer models spatial relationships across entire MRI volumes, and the diffusion process progressively refines the segmentation to recover fine boundaries. Result: Experiments on MRI brain tumor datasets show consistent gains in Dice similarity and Hausdorff distance. Conclusion: By combining a transformer with a diffusion model, VGDM improves the accuracy and robustness of brain tumor detection and segmentation and points to a new direction for medical image segmentation. Abstract: Accurate detection and segmentation of brain tumors from magnetic resonance imaging (MRI) are essential for diagnosis, treatment planning, and clinical monitoring. While convolutional architectures such as U-Net have long been the backbone of medical image segmentation, their limited capacity to capture long-range dependencies constrains performance on complex tumor structures. Recent advances in diffusion models have demonstrated strong potential for generating high-fidelity medical images and refining segmentation boundaries. In this work, we propose VGDM (Vision-Guided Diffusion Model), a transformer-driven diffusion framework for brain tumor detection and segmentation. By embedding a vision transformer at the core of the diffusion process, the model leverages global contextual reasoning together with iterative denoising to enhance both volumetric accuracy and boundary precision. The transformer backbone enables more effective modeling of spatial relationships across entire MRI volumes, while diffusion refinement mitigates voxel-level errors and recovers fine-grained tumor details. This hybrid design provides a pathway toward improved robustness and scalability in neuro-oncology, moving beyond conventional U-Net baselines. Experimental validation on MRI brain tumor datasets demonstrates consistent gains in Dice similarity and Hausdorff distance, underscoring the potential of transformer-guided diffusion models to advance the state of the art in tumor segmentation.
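
The two evaluation metrics named in this abstract have standard definitions, sketched below in plain Python on sets of voxel coordinates. This is a brute-force illustration on toy 2D masks, not the evaluation code used in the paper; practical pipelines typically use optimized library routines and often a percentile variant of the Hausdorff distance.

```python
import math
from typing import Set, Tuple

Voxel = Tuple[int, ...]


def dice(pred: Set[Voxel], truth: Set[Voxel]) -> float:
    """Dice similarity: 2|A ∩ B| / (|A| + |B|); defined as 1.0 when both masks are empty."""
    if not pred and not truth:
        return 1.0
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def hausdorff(a: Set[Voxel], b: Set[Voxel]) -> float:
    """Symmetric Hausdorff distance between two non-empty point sets (brute force)."""
    def directed(src: Set[Voxel], dst: Set[Voxel]) -> float:
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))


# Toy 2D masks: the prediction is shifted by one pixel relative to the ground truth.
pred_mask = {(0, 0), (0, 1), (1, 1)}
true_mask = {(0, 1), (1, 1), (1, 2)}
dice_score = dice(pred_mask, true_mask)
hd = hausdorff(pred_mask, true_mask)
print(dice_score, hd)
```

Dice rewards volumetric overlap and is higher when better, while the Hausdorff distance penalizes the worst boundary error and is lower when better, so a "gain" means movement in opposite directions for the two.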

[144] Mapping Historic Urban Footprints in France: Balancing Quality, Scalability and AI Techniques

Walid Rabehi,Marion Le Texier,Rémi Lemoy

Main category: cs.CV

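TL;DR: This study presents a scalable deep learning pipeline that uses a dual-pass U-Net to extract urban areas across France from the Scan Histo historical maps (1925-1950), producing the first open-access, national-scale urban footprint dataset for this period.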

Details

Motivation: Quantitative analysis of historical urban sprawl in France before the 1970s is limited by the lack of nationwide digital urban footprint data. Method: A dual-pass U-Net approach: the first pass produces a preliminary map and identifies areas of confusion (such as text and roads) to guide data augmentation; the second pass uses the refined dataset and the binarized output of the first pass to reduce radiometric noise and false positives. The method processes 941 high-resolution tiles on a high-performance computing cluster. Result: The resulting urban footprint mosaic covers all of metropolitan France with an overall accuracy of 73%, capturing diverse urban patterns while suppressing common artifacts such as labels and contour lines. Conclusion: The study fills a gap in data on early urbanization, and the publicly released code, training datasets, and nationwide urban raster support research on long-term urbanization dynamics. Abstract: Quantitative analysis of historical urban sprawl in France before the 1970s is hindered by the lack of nationwide digital urban footprint data. This study bridges this gap by developing a scalable deep learning pipeline to extract urban areas from the Scan Histo historical map series (1925-1950), which produces the first open-access, national-scale urban footprint dataset for this pivotal period. Our key innovation is a dual-pass U-Net approach designed to handle the high radiometric and stylistic complexity of historical maps. The first pass, trained on an initial dataset, generates a preliminary map that identifies areas of confusion, such as text and roads, to guide targeted data augmentation. The second pass uses a refined dataset and the binarized output of the first model to minimize radiometric noise, which significantly reduces false positives. Deployed on a high-performance computing cluster, our method processes 941 high-resolution tiles covering the entirety of metropolitan France. The final mosaic achieves an overall accuracy of 73%, effectively capturing diverse urban patterns while overcoming common artifacts like labels and contour lines. We openly release the code, training datasets, and the resulting nationwide urban raster to support future research in long-term urbanization dynamics.

[145] When Tracking Fails: Analyzing Failure Modes of SAM2 for Point-Based Tracking in Surgical Videos

Woowon Jang,Jiwon Im,Juseung Choi,Niki Rashidian,Wesley De Neve,Utku Ozbulak

Main category: cs.CV

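TL;DR: This paper systematically analyzes the failure modes of point-based tracking in laparoscopic cholecystectomy videos, compares its performance with segmentation mask initialization, and offers practical recommendations for improving tracking performance.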

Details

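Motivation: Point-based tracking is efficient and inexpensive for surgical videos, but its reliability and failure cases in complex surgical environments are not well understood, so a systematic analysis is needed. Method: The study focuses on three surgical targets (the gallbladder, the grasper, and the L-hook electrocautery), uses VOS models such as SAM2 in a zero-shot setting to compare point-based tracking with segmentation mask initialization, and identifies key influencing factors through qualitative analysis. Result: Point-based tracking performs well on surgical tools but poorly on anatomical structures such as the gallbladder, where tissue similarity and ambiguous boundaries cause failures; the study summarizes the key factors behind tracking failure. Conclusion: Point-based tracking is suitable for surgical tools but limited for anatomical targets, so tracking points must be selected and placed with care; the paper gives several practical recommendations for improving surgical video analysis. Abstract: Video object segmentation (VOS) models such as SAM2 offer promising zero-shot tracking capabilities for surgical videos using minimal user input. Among the available input types, point-based tracking offers an efficient and low-cost alternative, yet its reliability and failure cases in complex surgical environments are not well understood. In this work, we systematically analyze the failure modes of point-based tracking in laparoscopic cholecystectomy videos. Focusing on three surgical targets, the gallbladder, grasper, and L-hook electrocautery, we compare the performance of point-based tracking with segmentation mask initialization. Our results show that point-based tracking is competitive for surgical tools but consistently underperforms for anatomical targets, where tissue similarity and ambiguous boundaries lead to failure. Through qualitative analysis, we reveal key factors influencing tracking outcomes and provide several actionable recommendations for selecting and placing tracking points to improve performance in surgical video analysis.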

[146] FRIEREN: Federated Learning with Vision-Language Regularization for Segmentation

Ding-Ruei Shen

Main category: cs.CV

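TL;DR: This paper introduces a new task, FFREEDG, which targets semantic segmentation in a federated learning setting using only unlabeled client data, and proposes FRIEREN, a framework that combines a vision-language model with a consistency learning strategy to adapt across domains without access to the source data while achieving competitive performance.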

Details

Motivation: Most existing federated learning methods assume labeled client data or fail to exploit modern vision foundation models, which makes domain shift under unlabeled data hard to handle. Method: The FRIEREN framework uses CLIP text embeddings to guide a vision-language decoder, improving semantic disambiguation, and applies a weak-to-strong consistency learning strategy for robust local training on pseudo-labels. Result: On synthetic-to-real and clear-to-adverse-weather benchmarks, FRIEREN is competitive with established domain generalization and adaptation methods and sets a strong baseline for the FFREEDG task. Conclusion: FRIEREN addresses the cross-domain challenge of unlabeled federated semantic segmentation, demonstrates the potential of vision foundation models in federated learning, and opens a direction for future research. Abstract: Federated Learning (FL) offers a privacy-preserving solution for Semantic Segmentation (SS) tasks to adapt to new domains, but faces significant challenges from these domain shifts, particularly when client data is unlabeled. However, most existing FL methods unrealistically assume access to labeled data on remote clients or fail to leverage the power of modern Vision Foundation Models (VFMs). Here, we propose a novel and challenging task, FFREEDG, in which a model is pretrained on a server's labeled source dataset and subsequently trained across clients using only their unlabeled data, without ever re-accessing the source. To solve FFREEDG, we propose FRIEREN, a framework that leverages the knowledge of a VFM by integrating vision and language modalities. Our approach employs a Vision-Language decoder guided by CLIP-based text embeddings to improve semantic disambiguation and uses a weak-to-strong consistency learning strategy for robust local training on pseudo-labels. Our experiments on synthetic-to-real and clear-to-adverse-weather benchmarks demonstrate that our framework effectively tackles this new task, achieving competitive performance against established domain generalization and adaptation methods and setting a strong baseline for future research.
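
Weak-to-strong consistency learning can be illustrated per pixel: the prediction on a weakly augmented view supplies a pseudo-label when it is confident enough, and the prediction on a strongly augmented view of the same pixel is trained to match it. The sketch below is a generic version of that idea in plain Python, not FRIEREN's training code; the 0.9 confidence threshold and the probability values are hypothetical.

```python
import math
from typing import List, Optional

Probs = List[float]  # class probabilities for one pixel


def pseudo_label(weak_probs: Probs, threshold: float = 0.9) -> Optional[int]:
    """Return the argmax class of the weak view, or None if it is not confident enough."""
    confidence = max(weak_probs)
    return weak_probs.index(confidence) if confidence >= threshold else None


def consistency_loss(weak: List[Probs], strong: List[Probs], threshold: float = 0.9) -> float:
    """Mean cross-entropy of strong-view predictions against confident weak-view pseudo-labels."""
    losses = []
    for weak_probs, strong_probs in zip(weak, strong):
        label = pseudo_label(weak_probs, threshold)
        if label is not None:  # low-confidence pixels contribute nothing
            losses.append(-math.log(strong_probs[label]))
    return sum(losses) / len(losses) if losses else 0.0


# Two pixels: the first weak prediction is confident, the second is not and is skipped.
weak_view = [[0.95, 0.03, 0.02], [0.5, 0.3, 0.2]]
strong_view = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]
loss = consistency_loss(weak_view, strong_view)
print(loss)
```

Masking out low-confidence pixels is what makes local training on unlabeled client data tolerable: noisy pseudo-labels are dropped rather than reinforced.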

[147] Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting

Shu Zou,Xinyu Tian,Lukas Wesemann,Fabian Waschkowski,Zhaoyuan Yang,Jing Zhang

Main category: cs.CV

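TL;DR: Proposes ASK-Hint, a structured prompting framework based on action-centric knowledge that improves the performance of frozen vision-language models on video anomaly detection.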

Details

Motivation: Existing prompts are too abstract and overlook the fine-grained human-object interactions and action semantics that define complex anomalies. Method: Prompts are organized into semantically coherent groups, and fine-grained guiding questions align model predictions with discriminative visual cues. Result: AUC improves consistently on UCF-Crime and XD-Violence, reaching state-of-the-art performance, with good generalization across datasets and VLM backbones. Conclusion: Prompt granularity is critical, and ASK-Hint is a new training-free, generalizable, and interpretable solution for video anomaly detection. Abstract: Prompting has emerged as a practical way to adapt frozen vision-language models (VLMs) for video anomaly detection (VAD). Yet, existing prompts are often overly abstract, overlooking the fine-grained human-object interactions or action semantics that define complex anomalies in surveillance videos. We propose ASK-Hint, a structured prompting framework that leverages action-centric knowledge to elicit more accurate and interpretable reasoning from frozen VLMs. Our approach organizes prompts into semantically coherent groups (e.g. violence, property crimes, public safety) and formulates fine-grained guiding questions that align model predictions with discriminative visual cues. Extensive experiments on UCF-Crime and XD-Violence show that ASK-Hint consistently improves AUC over prior baselines, achieving state-of-the-art performance compared to both fine-tuned and training-free methods. Beyond accuracy, our framework provides interpretable reasoning traces for its anomaly decisions and demonstrates strong generalization across datasets and VLM backbones. These results highlight the critical role of prompt granularity and establish ASK-Hint as a new training-free and generalizable solution for explainable video anomaly detection.
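
A grouped prompt of the kind described can be assembled from a small mapping of group names to guiding questions. In the sketch below, only the three group names come from the abstract; the questions, the instruction wording, and the `build_prompt` helper are hypothetical illustrations rather than the prompts used by ASK-Hint.

```python
from typing import Dict, List

# Group names follow the abstract; the questions are made-up examples.
PROMPT_GROUPS: Dict[str, List[str]] = {
    "violence": ["Is anyone striking, kicking, or pushing another person?"],
    "property crimes": ["Is anyone forcing open a door, window, or vehicle?"],
    "public safety": ["Is there fire, smoke, or an explosion in the scene?"],
}


def build_prompt(groups: Dict[str, List[str]]) -> str:
    """Render grouped guiding questions as one text prompt for a frozen VLM."""
    lines = ["Answer each question about the video clip with yes or no and a brief reason."]
    for group, questions in groups.items():
        lines.append(f"[{group}]")
        lines.extend(f"- {question}" for question in questions)
    lines.append("Finally, state whether the clip is normal or anomalous.")
    return "\n".join(lines)


prompt = build_prompt(PROMPT_GROUPS)
print(prompt)
```

Because each question names an observable action, the per-question answers double as a reasoning trace for the final decision.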

[148] GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation

Weijia Dou,Xu Zhang,Yi Bin,Jian Liu,Bo Peng,Guoqing Wang,Yang Yang,Heng Tao Shen

Main category: cs.CV

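TL;DR: Proposes GeoPurify, which uses geometric priors to purify 3D point features generated by 2D VLMs, achieving data-efficient 3D semantic segmentation with very little training data.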

Details

Motivation: Existing 2D-to-3D feature transfer methods trade off noise against training cost and struggle to reconcile semantic and geometric consistency. Method: A Student Affinity Network purifies features using geometric priors distilled from a 3D self-supervised teacher model, and a Geometry-Guided Pooling module denoises further at inference time. Result: The method matches or surpasses state-of-the-art performance on major 3D benchmarks while using only about 1.5% of the training data. Conclusion: GeoPurify mitigates the tension between performance and efficiency in 2D-to-3D transfer and achieves highly data-efficient 3D semantic segmentation. Abstract: Recent attempts to transfer features from 2D Vision-Language Models (VLMs) to 3D semantic segmentation expose a persistent trade-off. Directly projecting 2D features into 3D yields noisy and fragmented predictions, whereas enforcing geometric coherence necessitates costly training pipelines and large-scale annotated 3D data. We argue that this limitation stems from the dominant segmentation-and-matching paradigm, which fails to reconcile 2D semantics with 3D geometric structure. The geometric cues are not eliminated during the 2D-to-3D transfer but remain latent within the noisy and view-aggregated features. To exploit this property, we propose GeoPurify, which applies a small Student Affinity Network to purify 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model. During inference, we devise a Geometry-Guided Pooling module to further denoise the point cloud and ensure semantic and structural consistency. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency. Extensive experiments on major 3D benchmarks demonstrate that GeoPurify achieves or surpasses state-of-the-art performance while utilizing only about 1.5% of the training data. Our code and checkpoints are available at https://github.com/tj12323/GeoPurify.

[149] Cross-Breed Pig Identification Using Auricular Vein Pattern Recognition: A Machine Learning Approach for Small-Scale Farming Applications

Emmanuel Nsengiyumva,Leonard Niyitegeka,Eric Umuhoza

Main category: cs.CV

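TL;DR: Proposes a noninvasive biometric method for identifying pigs from auricular (ear) vein patterns. Images are captured with a smartphone, and a computer vision and machine learning pipeline reaches 98.12% identification accuracy on mixed-breed pigs at low cost and with processing fast enough for on-farm use.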

Details

Motivation: Conventional pig identification methods such as ear tags and microchips are unreliable, costly, and impractical for small-scale farmers, and they are especially hard to apply to mixed breeds, so a low-cost, noninvasive, and reliable alternative is needed. Method: 800 ear images were collected from 20 mixed-breed pigs using a smartphone and simple back lighting; a multistage computer vision pipeline enhances vein visibility and extracts structural and spatial features, which are classified with machine learning models including Support Vector Machines (SVM). Result: The SVM model identified pigs with 98.12% precision across mixed-breed populations, with an average processing time of 8.3 seconds, supporting the feasibility of real-time deployment on farms. Conclusion: Auricular vein biometrics is a feasible, efficient, and economical way to identify pigs that can replace conventional physical identifiers and support precision farming in resource-constrained agricultural communities. Abstract: Accurate livestock identification is a cornerstone of modern farming: it supports health monitoring, breeding programs, and productivity tracking. However, common pig identification methods, such as ear tags and microchips, are often unreliable, costly, target pure breeds, and thus impractical for small-scale farmers. To address this gap, we propose a noninvasive biometric identification approach that leverages the uniqueness of auricular vein patterns. To this end, we have collected 800 ear images from 20 mixed-breed pigs (Landrace cross Pietrain and Duroc cross Pietrain), captured using a standard smartphone and simple back lighting. A multistage computer vision pipeline was developed to enhance vein visibility, extract structural and spatial features, and generate biometric signatures. These features were then classified using machine learning models. Support Vector Machines (SVM) achieved the highest accuracy, correctly identifying pigs with 98.12% precision across mixed-breed populations. The entire process from image processing to classification was completed in an average of 8.3 seconds, demonstrating feasibility for real-time farm deployment. We believe that by replacing fragile physical identifiers with permanent biological markers, this system provides farmers with a cost-effective and stress-free method of animal identification. More broadly, the findings confirm the practicality of auricular vein biometrics for digitizing livestock management, reinforcing its potential to extend the benefits of precision farming to resource-constrained agricultural communities.

[150] MMDEW: Multipurpose Multiclass Density Estimation in the Wild

Villanelle O'Reilly,Jonathan Cox,Georgios Leontidis,Marc Hanheide,Petra Bosilj,James Brown

Main category: cs.CV

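TL;DR: Proposes a multi-class density map estimation method built on a Twins pyramid vision transformer. With multiscale decoding and a Category Focus Module, it substantially outperforms existing methods in dense scenes and shows potential for ecological monitoring.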

Details

Motivation: Conventional detection-based methods struggle to count accurately in dense, occluded scenes, so a more effective multi-class density map estimation method is needed. Method: A Twins pyramid vision transformer serves as the backbone, paired with a specialised multi-class counting head built on a state-of-the-art multiscale decoding approach; a segmentation-based Category Focus Module suppresses inter-category cross-talk at training time. Result: The method clearly outperforms prior approaches on the VisDrone and iSAID benchmarks (33%, 43%, and 64% reductions in MAE), and a comparison with YOLOv11 confirms its advantage in dense scenes; an application to a biodiversity monitoring dataset demonstrates its cross-domain potential. Conclusion: The method improves multi-class crowd counting and extends it to new domains such as conservation, with the prospect of supporting large-scale ecological research. Abstract: Density map estimation can be used to estimate object counts in dense and occluded scenes where discrete counting-by-detection methods fail. We propose a multicategory counting framework that leverages a Twins pyramid vision-transformer backbone and a specialised multi-class counting head built on a state-of-the-art multiscale decoding approach. A two-task design adds a segmentation-based Category Focus Module, suppressing inter-category cross-talk at training time. Training and evaluation on the VisDrone and iSAID benchmarks demonstrate superior performance versus prior multicategory crowd-counting approaches (33%, 43% and 64% reduction in MAE), and the comparison with YOLOv11 underscores the necessity of crowd counting methods in dense scenes. The method's regional loss opens up multi-class crowd counting to new domains, demonstrated through the application to a biodiversity monitoring dataset, highlighting its capacity to inform conservation efforts and enable scalable ecological insights.
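
In density-map counting, the predicted count for a class is the sum of that class's density map, and MAE compares those counts with the ground truth. The short sketch below shows that relationship on toy data; the maps, class names, and helper functions are illustrative and unrelated to the paper's code.

```python
from typing import Dict, List

DensityMap = List[List[float]]


def counts_from_density(maps: Dict[str, DensityMap]) -> Dict[str, float]:
    """The estimated count for each class is the integral (sum) of its density map."""
    return {cls: sum(sum(row) for row in density) for cls, density in maps.items()}


def mean_absolute_error(pred: Dict[str, float], truth: Dict[str, float]) -> float:
    """MAE between predicted and true counts, averaged over classes."""
    return sum(abs(pred[cls] - truth[cls]) for cls in truth) / len(truth)


# Toy 2x2 density maps for two classes in one image.
density_maps = {
    "car": [[0.5, 0.5], [1.0, 0.0]],
    "person": [[0.25, 0.25], [0.25, 0.0]],
}
predicted = counts_from_density(density_maps)
mae = mean_absolute_error(predicted, {"car": 2.0, "person": 1.0})
print(predicted, mae)
```

Because counts are sums of fractional density values, objects that overlap or are partly occluded still contribute to the total, which is where detection-based counting tends to fail.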

[151] TempoControl: Temporal Attention Guidance for Text-to-Video Models

Shira Schiber,Ofir Lindenbaum,Idan Schwartz

Main category: cs.CV

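TL;DR: TempoControl aligns visual concepts in time during text-to-video generation without retraining or additional supervision, steering cross-attention maps with three principles: correlation, energy, and entropy.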

Details

Motivation: Existing generative video models lack fine-grained control over when visual elements appear, so they cannot satisfy user requests for specific timing. Method: TempoControl uses the cross-attention maps of text-to-video diffusion models and controls the timing of visual concepts by optimizing the temporal shape of attention (correlation), amplifying it where visibility is needed (energy), and maintaining spatial focus (entropy). Result: TempoControl is shown to be effective across several video generation tasks, including temporal reordering of single and multiple objects and action- and audio-aligned generation, achieving precise temporal control while preserving video quality and diversity. Conclusion: TempoControl gives existing text-to-video models a flexible and efficient temporal control mechanism without changes to the model architecture or retraining, and it has broad application potential. Abstract: Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal shape with a control signal (via correlation), amplifying it where visibility is needed (via energy), and maintaining spatial focus (via entropy). TempoControl allows precise control over timing while ensuring high video quality and diversity. We demonstrate its effectiveness across various video generation applications, including temporal reordering for single and multiple objects, as well as action and audio-aligned generation.
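
The three principles can be made concrete as three scalar quantities computed from one token's cross-attention maps. The sketch below is one plausible reading, assuming Pearson correlation for temporal shape, control-weighted mean attention for energy, and Shannon entropy of the normalized spatial map for focus; the abstract does not give the exact loss terms, so these definitions and the toy attention values are assumptions.

```python
import math
from typing import List, Tuple

Frame = List[float]  # flattened spatial attention for one token in one frame


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spatial_entropy(frame: Frame) -> float:
    """Shannon entropy of the attention map after normalizing it to sum to one."""
    total = sum(frame)
    return -sum((v / total) * math.log(v / total) for v in frame if v > 0)


def tempo_terms(attention: List[Frame], control: List[float]) -> Tuple[float, float, float]:
    """Return (correlation, energy, mean entropy) for one token's attention over time."""
    temporal = [sum(frame) / len(frame) for frame in attention]  # attention strength per frame
    correlation = pearson(temporal, control)
    energy = sum(t * c for t, c in zip(temporal, control))  # attention where visibility is wanted
    entropy = sum(spatial_entropy(frame) for frame in attention) / len(attention)
    return correlation, energy, entropy


# Four frames, four spatial positions; the concept should appear in the last two frames.
attention_maps = [[0.1, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1],
                  [0.9, 0.1, 0.1, 0.1], [0.9, 0.1, 0.1, 0.1]]
control_signal = [0.0, 0.0, 1.0, 1.0]
correlation, energy, entropy = tempo_terms(attention_maps, control_signal)
print(correlation, energy, entropy)
```

An inference-time optimizer in this spirit would push correlation and energy up and entropy down, nudging the latent until the token's attention appears at the requested time and stays spatially concentrated.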

[152] RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning

Sicheng Feng,Kaiwen Tuo,Song Wang,Lingdong Kong,Jianke Zhu,Huan Wang

Main category: cs.CV

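TL;DR: This paper proposes RewardMap, a multi-stage reinforcement learning framework that improves multimodal large language models on fine-grained visual reasoning, particularly spatial reasoning in complex settings such as transit maps.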

Details

Motivation: Existing MLLMs perform poorly on fine-grained visual reasoning such as spatial understanding of transit maps, and standard reinforcement learning is held back by sparse rewards and unstable optimization. Method: The authors build ReasonMap-Plus, a dataset with dense reward signals, and propose the RewardMap framework, which combines a difficulty-aware reward design with a multi-stage reinforcement learning scheme that progresses from perception to complex reasoning. Result: Experiments on ReasonMap and ReasonMap-Plus confirm that each component is effective and that their combination works best; models improve by an average of 3.47% across 6 benchmarks, showing stronger visual understanding and reasoning. Conclusion: Through dense rewards and progressive training, RewardMap substantially improves MLLM performance on fine-grained and spatial reasoning tasks and has broad application potential. Abstract: Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured and information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse rewards while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.

[153] DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing

Zihan Zhou,Shilin Lu,Shuli Leng,Shaocong Zhang,Zhuming Lian,Xinlei Yu,Adams Wai-Kin Kong

Main category: cs.CV

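TL;DR: This paper proposes DragFlow, the first drag-based image editing framework to make effective use of FLUX's strong generative prior, improving editing quality through a region-based editing paradigm, pretrained adapters, and multimodal large language models.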

Details

Motivation: Existing drag-based editing methods built on UNet base models are limited by weak generative priors, while newer DiT models offer stronger priors but have not yet been applied effectively to drag editing. Method: A region-based drag editing paradigm uses affine transformations to provide more consistent feature supervision; a pretrained IP-Adapter enhances subject consistency while gradient masks preserve background fidelity; multimodal large language models resolve task ambiguities. Result: Experiments on DragBench-DR and the newly built ReD Bench show that DragFlow outperforms existing point-based and region-based baselines and reaches state-of-the-art performance. Conclusion: DragFlow brings the strong generative prior of DiT architectures to drag-based editing, validates region-based supervision and the combined multi-module design, and advances drag-based editing. Abstract: Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models such as Stable Diffusion are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX's rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.

[154] From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding

Guangyu Sun,Archit Singhal,Burak Uzkent,Mubarak Shah,Chen Chen,Garin Kessler

Main category: cs.CV

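TL;DR: This paper proposes F2C, a training-free method for video large language models that selects key clips rather than isolated frames and pairs this with an adaptive resolution strategy, improving long-form video understanding under a fixed computational budget.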

Details

Motivation: Existing methods select only sparse frames and lose temporal dynamics, which weakens reasoning about motion and event continuity; at the same time, the large number of visual tokens produced by video exhausts the context window. Method: Frame selection is extended to temporally coherent key clips, and an adaptive resolution strategy dynamically balances spatial resolution against clip length to keep the token count per video constant. Result: On the Video-MME, LongVideoBench, and MLVU long-form video benchmarks, F2C outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3%, respectively. Conclusion: Preserving temporal coherence is essential for video understanding, and F2C offers a practical path for scaling video large language models to real-world applications. Abstract: Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision language tasks, yet their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window. Existing solutions alleviate this issue by selecting a sparse set of frames, thereby reducing token count, but such frame-wise selection discards essential temporal dynamics, leading to suboptimal reasoning about motion and event continuity. In this work we systematically explore the impact of temporal information and demonstrate that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we propose an adaptive resolution strategy that dynamically balances spatial resolution and clip length, ensuring a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on Video-MME, LongVideoBench and MLVU benchmarks, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling Video LLMs to real world video understanding applications. Project webpage is available at https://guangyusun.com/f2c .
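
The trade-off between clip length and spatial resolution under a constant token count is simple arithmetic, sketched below. The assumptions are mine, not the paper's: square frames, one visual token per 14x14 patch, and a budget equal to 32 frames at 448x448. F2C's actual selection and resizing policy is not described in the abstract.

```python
import math


def tokens_per_frame(height: int, width: int, patch: int = 14) -> int:
    """Visual tokens for one frame, assuming one token per patch x patch tile."""
    return (height // patch) * (width // patch)


def side_for_budget(budget_tokens: int, num_clips: int, clip_len: int, patch: int = 14) -> int:
    """Largest square frame side (a multiple of patch) that keeps all frames within the budget."""
    per_frame = budget_tokens // (num_clips * clip_len)
    return math.isqrt(per_frame) * patch


# Assumed baseline budget: 32 isolated frames at 448x448.
budget = 32 * tokens_per_frame(448, 448)

# Same budget spent on 8 clips: short clips keep full resolution, longer clips shrink frames.
print(side_for_budget(budget, num_clips=8, clip_len=4))
print(side_for_budget(budget, num_clips=8, clip_len=16))
```

Quadrupling the clip length at a fixed budget halves the frame side, since tokens per frame scale with the square of the side; this is the balance an adaptive resolution strategy has to strike.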

[155] Paving the Way Towards Kinematic Assessment Using Monocular Video: A Preclinical Benchmark of State-of-the-Art Deep-Learning-Based 3D Human Pose Estimators Against Inertial Sensors in Daily Living Activities

Mario Medrano-Paredes,Carmen Fernández-González,Francisco-Javier Díaz-Pernas,Hichem Saoudi,Javier González-Alonso,Mario Martínez-Zarzuela

Main category: cs.CV

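TL;DR: This study compares monocular video-based 3D human pose estimation models with inertial measurement units (IMUs) for kinematic assessment in real-world conditions, using the VIDIMU dataset of 13 daily activities. MotionAGFormer performs best, and both technologies prove viable, with trade-offs in cost, accessibility, and precision.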

Details

Motivation: 为了实现非实验室环境下对人体运动的准确评估，推动远程医疗、运动科学和康复应用的发展，需要比较新兴的视频姿态估计算法与传统IMU方法的性能。 Method: 利用VIDIMU数据集，采集健康受试者执行13种日常活动时的单目视频和五节点IMU数据，采用OpenSim逆向动力学将IMU数据计算为关节角度，并与基于深度学习的视频姿态估计模型（MotionAGFormer、MotionBERT、MMPose 2D-to-3D、NVIDIA BodyTrack）输出的3D姿态进行对比，评估指标包括RMSE、MAE、Pearson相关系数和R²，关键点遵循Human3.6M的17点格式。 Result: MotionAGFormer表现最优，整体RMSE为9.27°±4.80°，MAE为7.86°±4.18°，Pearson相关系数达0.86±0.15，R²为0.67±0.28；视频与IMU两种技术均可用于非实验室环境下的运动学分析，但在成本、可访问性和精度方面存在权衡。 Conclusion: 现成的视频姿态估计模型（尤其是MotionAGFormer）在健康成人中已具备临床潜力，可作为IMU的替代方案，但需根据具体应用场景权衡技术选择，本研究为远程健康监测系统的设计提供了实用指南。 Abstract: Advances in machine learning and wearable sensors offer new opportunities for capturing and analyzing human movement outside specialized laboratories. Accurate assessment of human movement under real-world conditions is essential for telemedicine, sports science, and rehabilitation. This preclinical benchmark compares monocular video-based 3D human pose estimation models with inertial measurement units (IMUs), leveraging the VIDIMU dataset containing a total of 13 clinically relevant daily activities which were captured using both commodity video cameras and five IMUs. During this initial study only healthy subjects were recorded, so results cannot be generalized to pathological cohorts. Joint angles derived from state-of-the-art deep learning frameworks (MotionAGFormer, MotionBERT, MMPose 2D-to-3D pose lifting, and NVIDIA BodyTrack) were evaluated against joint angles computed from IMU data using OpenSim inverse kinematics following the Human3.6M dataset format with 17 keypoints. Among them, MotionAGFormer demonstrated superior performance, achieving the lowest overall RMSE ($9.27\deg \pm 4.80\deg$) and MAE ($7.86\deg \pm 4.18\deg$), as well as the highest Pearson correlation ($0.86 \pm 0.15$) and the highest coefficient of determination $R^{2}$ ($0.67 \pm 0.28$). The results reveal that both technologies are viable for out-of-the-lab kinematic assessment. 
However, they also highlight key trade-offs between video- and sensor-based approaches including costs, accessibility, and precision. This study clarifies where off-the-shelf video models already provide clinically promising kinematics in healthy adults and where they lag behind IMU-based estimates while establishing valuable guidelines for researchers and clinicians seeking to develop robust, cost-effective, and user-friendly solutions for telehealth and remote patient monitoring.
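
As a rough illustration of the four agreement metrics named above (RMSE, MAE, Pearson correlation, and $R^{2}$), the following minimal sketch compares two joint-angle series. It assumes the IMU-derived angles act as the reference and uses the common definition $R^{2} = 1 - SS_{res}/SS_{tot}$; the abstract does not state how the authors computed $R^{2}$, and the function name and sample values are hypothetical.

```python
import math

def kinematic_agreement(reference, estimate):
    """Return RMSE, MAE, Pearson r, and R^2 for two equal-length angle series (degrees)."""
    if len(reference) != len(estimate) or len(reference) < 2:
        raise ValueError("series must have equal length >= 2")
    n = len(reference)
    errors = [e - r for r, e in zip(reference, estimate)]
    ss_res = sum(d * d for d in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(d) for d in errors) / n
    mean_ref = sum(reference) / n
    mean_est = sum(estimate) / n
    cov = sum((r - mean_ref) * (e - mean_est) for r, e in zip(reference, estimate))
    var_ref = sum((r - mean_ref) ** 2 for r in reference)
    var_est = sum((e - mean_est) ** 2 for e in estimate)
    pearson = cov / math.sqrt(var_ref * var_est)
    # Assumption: the reference series plays the role of ground truth for R^2.
    r2 = 1.0 - ss_res / var_ref
    return {"rmse": rmse, "mae": mae, "pearson": pearson, "r2": r2}

# Hypothetical knee-flexion angles (degrees), not data from the paper.
imu_knee = [10.0, 20.0, 30.0, 40.0]
video_knee = [12.0, 18.0, 33.0, 41.0]
metrics = kinematic_agreement(imu_knee, video_knee)
```

A high Pearson correlation can coexist with a lower $R^{2}$, since correlation ignores constant offsets and scale errors while $R^{2}$ as defined here penalizes them.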

[156] NeuroSwift: A Lightweight Cross-Subject Framework for fMRI Visual Reconstruction of Complex Scenes

Shiyi Zhang,Dong Liang,Yihang Zhou

Main category: cs.CV

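TL;DR: This paper proposes NeuroSwift, which combines complementary adapters (AutoKL and CLIP) through diffusion to reconstruct visual stimuli from fMRI data across subjects. It achieves state-of-the-art performance while fine-tuning only 17% of parameters and trains quickly on lightweight GPUs.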

Details

Motivation: Because neural representations vary between subjects and the brain encodes complex visual inputs as abstract semantics, existing generative models still struggle to reconstruct visual stimuli accurately across subjects, and they are computationally expensive. Method: NeuroSwift introduces two adapters: AutoKL for low-level features and CLIP for semantics. The CLIP adapter is trained on Stable Diffusion-generated images paired with COCO captions to emulate higher visual cortex encoding. Cross-subject generalization comes from pretraining on one subject and then fine-tuning only the fully connected layers (17% of parameters). Result: With one hour of training per new subject on three RTX 4090 GPUs, NeuroSwift reaches state-of-the-art cross-subject visual reconstruction and outperforms existing methods. Conclusion: Through parameter-efficient fine-tuning and a semantics-aware adapter design, NeuroSwift improves both the accuracy and the practicality of cross-subject visual reconstruction, offering an efficient route to fMRI-based visual decoding. Abstract: Reconstructing visual information from brain activity via computer vision technology provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and the brain's abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. NeuroSwift's CLIP Adapter is trained on Stable Diffusion-generated images paired with COCO captions to emulate higher visual cortex encoding. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17 percent of parameters (fully connected layers) for new subjects, while freezing other components. This enables state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090), and it outperforms existing methods.

[157] microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification

Sathira Silva,Eman Ali,Chetan Arora,Muhammad Haris Khan

Main category: cs.CV

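TL;DR: This paper proposes microCLIP, a self-training framework for fine-grained image classification. It introduces Saliency-Oriented Attention Pooling (SOAP) and a TokenFusion module to combine global and local features, together with a two-headed LLM-derived classifier and a Dynamic Knowledge Aggregation strategy, and improves average accuracy by 2.90% across 13 fine-grained benchmarks.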

Details

Motivation: CLIP transfers well zero-shot, but its reliance on coarse global features makes it hard to capture the microscopic local cues in fine-grained images, which limits its performance on fine-grained classification. An unsupervised adaptation method that makes CLIP more sensitive to local detail is therefore needed. Method: The microCLIP framework is built around Saliency-Oriented Attention Pooling (SOAP) inside a TokenFusion module, which generates a saliency-guided [FG] token and fuses it with the global [CLS] token for coarse-fine alignment. It uses a two-headed LLM-derived classifier, with one frozen head providing a stable pseudo-labeling prior and one learnable head that is fine-tuned, and introduces Dynamic Knowledge Aggregation to refine pseudo-labels iteratively. Result: The method yields an average accuracy gain of 2.90% across 13 fine-grained image classification benchmarks while requiring only light adaptation. Conclusion: By jointly refining visual and textual representations, microCLIP uncovers latent fine-grained signals in CLIP and improves its unsupervised adaptation on fine-grained classification, offering an efficient and stable approach for follow-up work. Abstract: Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP $\texttt{[CLS]}$ token; however, this approach overlooks spatial precision. We propose $\textbf{microCLIP}$, a self-training framework that jointly refines CLIP's visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided $\texttt{[FG]}$ token from patch embeddings and fuses it with the global $\texttt{[CLS]}$ token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion's evolving logits to iteratively refine pseudo-labels. Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent $2.90\%$ average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation.
Our code is available at https://github.com/sathiiii/microCLIP.
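
The convex combination behind Dynamic Knowledge Aggregation can be sketched in a few lines. This is only an illustration of the general idea: the mixing weight `alpha`, the way it changes over training, and the toy logits are assumptions, and the function names are hypothetical rather than taken from the microCLIP code.

```python
def aggregate_logits(prior_logits, evolving_logits, alpha):
    """Convex combination: alpha * prior + (1 - alpha) * evolving, with 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * p + (1.0 - alpha) * e for p, e in zip(prior_logits, evolving_logits)]

def pseudo_label(logits):
    """Index of the largest logit, used as the pseudo-label."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy three-class example: the fixed prior favors class 0, the adapted head favors class 1.
prior = [2.0, 1.0, 0.5]
evolving = [0.5, 3.0, 0.2]
early = pseudo_label(aggregate_logits(prior, evolving, alpha=0.9))  # prior dominates
late = pseudo_label(aggregate_logits(prior, evolving, alpha=0.3))   # adapted head dominates
```

Keeping the weights non-negative and summing to one means the combined logits always lie between the two sources, so a noisy adapted head cannot move the pseudo-label arbitrarily far from the frozen prior.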

[158] VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL

Kyoungjun Park,Yifan Yang,Juheon Yi,Shicheng Zheng,Yifei Shen,Dongqi Han,Caihua Shan,Muhammad Muaz,Lili Qiu

Main category: cs.CV

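TL;DR: VidGuard-R1 is the first video authenticity detector built on a multi-modal large language model fine-tuned with group relative policy optimization (GRPO). It distinguishes real from AI-generated videos with high accuracy and produces interpretable explanations.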

Details

Motivation: With AI-generated video advancing rapidly, effective detection tools are urgently needed to counter misinformation and other societal risks, and detectors must be interpretable to provide transparency for regulators and end users. Method: VidGuard-R1 fine-tunes Qwen-VL with GRPO on a dataset of 140k real and AI-generated videos, using two specialized reward models that target temporal artifacts and generation complexity. Result: The model achieves state-of-the-art zero-shot detection on existing benchmarks, exceeds 95% accuracy after additional training, and produces precise, interpretable rationales for its judgments. Conclusion: VidGuard-R1 reaches leading performance in video authenticity detection while remaining interpretable, providing an effective tool against the societal risks of AI-generated content. Abstract: With the rapid advancement of AI-generated videos, there is an urgent need for effective detection tools to mitigate societal risks such as misinformation and reputational harm. In addition to accurate classification, it is essential that detection models provide interpretable explanations to ensure transparency for regulators and end users. To address these challenges, we introduce VidGuard-R1, the first video authenticity detector that fine-tunes a multi-modal large language model (MLLM) using group relative policy optimization (GRPO). Our model delivers both highly accurate judgments and insightful reasoning. We curate a challenging dataset of 140k real and AI-generated videos produced by state-of-the-art generation models, carefully designing the generation process to maximize discrimination difficulty. We then fine-tune Qwen-VL using GRPO with two specialized reward models that target temporal artifacts and generation complexity. Extensive experiments demonstrate that VidGuard-R1 achieves state-of-the-art zero-shot performance on existing benchmarks, with additional training pushing accuracy above 95%. Case studies further show that VidGuard-R1 produces precise and interpretable rationales behind its predictions. The code is publicly available at https://VidGuard-R1.github.io.
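
For readers unfamiliar with GRPO, its central step is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows that group-relative normalization only. The reward values are hypothetical, the use of the population standard deviation and the `eps` term are assumptions, and nothing here reproduces VidGuard-R1's reward models or training loop.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four responses to one video (1.0 = correct real/fake verdict).
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive positive advantages and are reinforced, while below-average responses are pushed down; when every response in a group earns the same reward, all advantages are zero and that prompt contributes no gradient.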

[159] Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Justin Cui,Jie Wu,Ming Li,Tao Yang,Xiaojie Li,Rui Wang,Andrew Bai,Yuanhao Ban,Cho-Jui Hsieh

Main category: cs.CV

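TL;DR: This paper proposes a simple yet effective way to mitigate quality degradation in long-horizon video generation without relying on long-video teachers or retraining on long-video datasets. The student model is guided by the teacher's knowledge on segments sampled from its own self-generated long videos, which preserves temporal consistency while scaling video length by up to 20x beyond the teacher's capability and enables videos of up to 4 minutes and 15 seconds.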

Details

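Motivation: Diffusion models have made major progress in image and video generation, but their transformer architectures are computationally expensive, especially for long videos. Existing methods typically distill from short-horizon bidirectional teachers; because the teachers cannot generate long videos, student models degrade sharply once they extrapolate beyond the training horizon. A new approach is needed to address error accumulation and quality degradation in long video generation. Method: The method exploits the teacher's knowledge to guide the student on segments sampled from self-generated long videos, improving the student's long-horizon generation. It requires neither long-video teacher supervision nor retraining on long-video data, and unlike earlier methods it does not recompute overlapping frames. Result: Experiments on standard benchmarks and a proposed improved benchmark show that the method substantially outperforms baselines in both fidelity and temporal consistency. Generated videos reach up to 4 minutes and 15 seconds, equal to 99.9% of the maximum span supported by the base model's position embedding and more than 50x longer than the baseline model. Conclusion: Self-Forcing++ mitigates quality degradation in long-horizon video generation and increases both the length and the quality of generated videos without long-video supervision, showing potential for large-scale use. Abstract: Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond the teacher's capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods.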
When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon video demo can be found at https://self-forcing-plus-plus.github.io/

[160] Learning to Generate Object Interactions with Physics-Guided Video Diffusion

David Romero,Ariana Bermudez,Hao Li,Fabio Pizzati,Ivan Laptev

Main category: cs.CV

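TL;DR: This paper proposes KineMask, a physics-guided video generation method that combines low-level motion control with high-level text conditioning to produce more realistic rigid-body interactions and motion prediction in real scenes.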

Details

Motivation: Existing video generation models still fall short on physically plausible object interactions and physics-grounded control mechanisms, which limits their use in robotics and embodied decision making. Method: A two-stage training strategy gradually removes future motion supervision. Video diffusion models are trained on synthetic scenes of simple interactions and, conditioned on object masks and velocity, generate videos with inferred motion; low-level motion control is combined with high-level textual descriptions. Result: Object interactions in real scenes become markedly more plausible and complex dynamical phenomena can be synthesized. The method outperforms recent models of comparable size, and ablations confirm that low- and high-level conditioning are complementary. Conclusion: KineMask improves the physical realism and controllability of video generation and points to a new direction for using video generation models as world simulators. Abstract: Recent models for video generation have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements of object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, leading to effective support for synthesis of complex dynamical phenomena. Extensive experiments show that KineMask achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs. Our code, model, and data will be made publicly available.

[161] MultiModal Action Conditioned Video Generation

Yichen Li,Antonio Torralba

Main category: cs.CV

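TL;DR: Proposes fine-grained action modeling with multimodal senses, including proprioception, kinesthesia, force haptics, and muscle activation. A feature learning paradigm and a regularization scheme improve simulation accuracy and causality, and experiments confirm reduced temporal drift and better interaction simulation.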

Details

Motivation: Current video models lack fine-grained control and so cannot meet the needs of general-purpose household robots, which require real-time fine motor control for delicate tasks and urgent situations. Method: The paper introduces a fine-grained multimodal action representation that combines proprioception, kinesthesia, force haptics, and muscle activation. A feature learning paradigm aligns the modalities while preserving the information unique to each, and a regularization scheme strengthens the causality of the action trajectory features. Result: Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift, with ablation studies and downstream applications supporting its effectiveness. Conclusion: The proposed fine-grained multimodal action modeling improves the simulation of complex interaction dynamics for robots and has practical potential. Abstract: Current video models fail as world models because they lack fine-grained control. General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine-grained multimodal actions to capture such precise control. We consider senses of proprioception, kinesthesia, force haptics, and muscle activation. Such multimodal senses naturally enable fine-grained interactions that are difficult to simulate with text-conditioned generative models. To effectively simulate fine-grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further propose a regularization scheme to enhance causality of the action trajectory features in representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.

[162] VideoNSA: Native Sparse Attention Scales Video Understanding

Enxin Song,Wenhao Chai,Shusheng Yang,Ethan Armand,Xiaojun Shan,Haiyang Xu,Jianwen Xie,Zhuowen Tu

Main category: cs.CV

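TL;DR: VideoNSA brings Native Sparse Attention (NSA) to multimodal language models to improve long-video understanding. It scales to contexts of 128K tokens and outperforms baselines on long-video understanding, temporal reasoning, and spatial tasks.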

Details

Motivation: Existing video-language models are limited by context length, so they miss key transition frames and struggle to stay coherent over long time scales. Method: VideoNSA uses a hardware-aware hybrid attention design, with dense attention for text and Native Sparse Attention (NSA) for video, and trains Qwen2.5-VL end to end on a 216K video instruction dataset. Result: VideoNSA outperforms token-compression and training-free sparse baselines on long-video understanding, temporal reasoning, and spatial benchmarks. It scales reliably to 128K tokens, and ablations identify an optimal global-local attention allocation, task-dependent branch usage patterns, and dynamic attention sinks induced by the learnable sparse attention. Conclusion: VideoNSA addresses the context-length limit in long-video understanding, achieving efficient and scalable video-language modeling through its hybrid attention architecture. Abstract: Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text, while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.

[163] NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

Ruozhen He,Moayed Haji-Ali,Ziyan Yang,Vicente Ordonez

Main category: cs.CV

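TL;DR: This paper proposes NoiseShift, a training-free method that recalibrates the denoiser's noise level according to image resolution, addressing the inconsistent generation quality of text-to-image diffusion models across resolutions.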

Details

Motivation: High-resolution text-to-image models perform poorly when generating at low resolution, mainly because noise schedulers have unequal perceptual effects across resolutions, which creates a train-test mismatch. Method: NoiseShift recalibrates the denoiser by adjusting the noise level to the target resolution, with no changes to the model architecture or the sampling schedule. Result: Applying NoiseShift to Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev significantly improves low-resolution generation quality. FID improves on both LAION-COCO and CelebA; for example, SD3.5 improves by 15.89% on average on LAION-COCO. Conclusion: NoiseShift mitigates resolution-dependent artifacts and improves low-resolution image generation, and it is compatible with existing models without additional training. Abstract: Text-to-image diffusion models trained on a fixed set of resolutions often fail to generalize, even when asked to generate images at lower resolutions than those seen during training. High-resolution text-to-image generators are currently unable to easily offer an out-of-the-box budget-efficient alternative to their users who might not need high-resolution images. We identify a key technical insight in diffusion models that when addressed can help tackle this limitation: Noise schedulers have unequal perceptual effects across resolutions. The same level of noise removes disproportionately more signal from lower-resolution images than from high-resolution images, leading to a train-test mismatch. We propose NoiseShift, a training-free method that recalibrates the noise level of the denoiser conditioned on resolution size. NoiseShift requires no changes to model architecture or sampling schedule and is compatible with existing models. When applied to Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, quality at low resolutions is significantly improved. On LAION-COCO, NoiseShift improves SD3.5 by 15.89%, SD3 by 8.56%, and Flux-Dev by 2.44% in FID on average. On CelebA, NoiseShift improves SD3.5 by 10.36%, SD3 by 5.19%, and Flux-Dev by 3.02% in FID on average. These results demonstrate the effectiveness of NoiseShift in mitigating resolution-dependent artifacts and enhancing the quality of low-resolution image generation.

[164] Inferring Dynamic Physical Properties from Video Foundation Models

Guanqi Zhan,Xianzheng Ma,Weidi Xie,Andrew Zisserman

Main category: cs.CV

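TL;DR: This paper studies the prediction of dynamic physical properties (elasticity, viscosity, and dynamic friction) from video. It introduces new synthetic and real datasets and compares three inference approaches: an oracle based on visual cues, a readout from pre-trained video models, and prompting of multi-modal large language models. Generative and self-supervised video foundation models perform well, while multi-modal large language models still have room to improve.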

Details

Motivation: Many physical properties can only be inferred with temporal information, and extracting such dynamic properties from video remains difficult for existing methods, so more effective modeling approaches are worth exploring. Method: New video datasets are built for elasticity, viscosity, and dynamic friction. Three inference approaches are compared: an oracle method based on classical computer vision, a simple readout mechanism that applies cross-attention with a trainable prompt vector to pre-trained video models, and prompting strategies for multi-modal large language models. Result: Generative and self-supervised video foundation models perform similarly to one another but behind the oracle, while multi-modal large language models currently perform worse, although suitable prompting improves them. Conclusion: Video foundation models show promise for inferring dynamic physical properties, and multi-modal large language models, though behind, can be improved through better prompting. Abstract: We study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, consisting of synthetic training and testing splits, as well as a real split for real world evaluation. (ii) We explore three ways to infer the physical property from videos: (a) an oracle method where we supply the visual cues that intrinsically reflect the property using classical computer vision techniques; (b) a simple readout mechanism using a visual prompt and trainable prompt vector for cross-attention on pre-trained video generative and self-supervised models; and (c) prompt strategies for Multi-modal Large Language Models (MLLMs). (iii) We show that video foundation models trained in a generative or self-supervised manner achieve a similar performance, though behind that of the oracle, and MLLMs are currently inferior to the other models, though their performance can be improved through suitable prompting.

[165] Clink! Chop! Thud! -- Learning Object Sounds from Real-World Interactions

Mengyu Yang,Yiming Chen,Haozheng Pei,Siddhant Agarwal,Arun Balajee Vasudevan,James Hays

Main category: cs.CV

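TL;DR: Proposes the sounding object detection task and a multimodal object-aware framework with a slot attention visual encoder that learns from in-the-wild egocentric videos, achieving state-of-the-art performance.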

Details

Motivation: The goal is for a model to link the sounds produced by everyday object interactions to the objects directly involved, mirroring human perception. Method: An automatic pipeline computes segmentation masks of the objects involved, and a slot attention visual encoder strengthens object-centric learning. Result: The framework achieves state-of-the-art performance on the newly proposed sounding object detection task and on existing multimodal action understanding tasks. Conclusion: The framework improves a model's understanding of the relationship between sounds and objects and is suited to complex real-world environments. Abstract: Can a model distinguish between the sound of a spoon hitting a hardwood floor versus a carpeted one? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model's ability to link these sounds to the objects directly involved. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. To encourage an object-centric approach, we first develop an automatic pipeline to compute segmentation masks of the objects involved to guide the model's focus during training towards the most informative regions of the interaction. A slot attention visual encoder is used to further enforce an object prior. We demonstrate state-of-the-art performance on our new task along with existing multimodal action understanding tasks.

[166] StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions

Bo-Hsu Ke,You-Zhe Xie,Yu-Lun Liu,Wei-Chen Chiu

Main category: cs.CV

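TL;DR: Proposes a density-guided poisoning method for 3D Gaussian Splatting (3DGS) that injects Gaussian points into low-density regions and adds an adaptive noise strategy, yielding a stealthy and effective image-level poisoning attack.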

Details

Motivation: As 3D scene representations such as NeRF and 3DGS become widely used, their security matters more, and their robustness to image-level poisoning attacks in particular needs study. Method: Kernel Density Estimation (KDE) identifies low-density regions, where Gaussian points carrying viewpoint-dependent illusory objects are strategically injected; adaptive noise is added to disrupt multi-view consistency and strengthen the attack. Result: Experiments show that the method produces clearly visible illusory objects in the poisoned views while minimally affecting innocent views, and that it outperforms state-of-the-art techniques. A KDE-based evaluation protocol is also proposed to measure attack difficulty systematically. Conclusion: The density-guided poisoning method makes stealthy attacks on 3DGS considerably more effective, exposes the security weaknesses of 3D scene reconstruction models, and provides a quantifiable benchmark for future defense research. Abstract: 3D scene representation methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have significantly advanced novel view synthesis. As these methods become prevalent, addressing their vulnerabilities becomes critical. We analyze 3DGS robustness against image-level poisoning attacks and propose a novel density-guided poisoning method. Our method strategically injects Gaussian points into low-density regions identified via Kernel Density Estimation (KDE), embedding viewpoint-dependent illusory objects clearly visible from poisoned views while minimally affecting innocent views. Additionally, we introduce an adaptive noise strategy to disrupt multi-view consistency, further enhancing attack effectiveness. We propose a KDE-based evaluation protocol to assess attack difficulty systematically, enabling objective benchmarking for future research. Extensive experiments demonstrate our method's superior performance compared to state-of-the-art techniques. Project page: https://hentci.github.io/stealthattack/
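
The low-density search via Kernel Density Estimation can be illustrated with a small, self-contained sketch. It assumes an isotropic Gaussian kernel, a fixed hypothetical bandwidth, and a handful of made-up 3D points; the paper's actual kernel, bandwidth choice, and candidate sampling are not described in the abstract.

```python
import math

def gaussian_kde_density(query, points, bandwidth):
    """Isotropic Gaussian kernel density estimate at `query` given sample `points`."""
    dim = len(query)
    norm = len(points) * (bandwidth * math.sqrt(2.0 * math.pi)) ** dim
    total = 0.0
    for p in points:
        sq_dist = sum((q - x) ** 2 for q, x in zip(query, p))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / norm

def lowest_density_candidate(candidates, points, bandwidth=0.5):
    """Pick the candidate location where the estimated point density is smallest."""
    return min(candidates, key=lambda c: gaussian_kde_density(c, points, bandwidth))

# Hypothetical scene: existing Gaussian centers cluster near the origin.
scene_points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.1, 0.1, 0.1)]
candidates = [(0.05, 0.05, 0.0), (1.0, 1.0, 1.0), (3.0, 3.0, 3.0)]
target = lowest_density_candidate(candidates, scene_points)
```

The bandwidth controls how far each existing point's influence reaches: a small value marks almost everywhere as empty, while a large value blurs the scene so that no region stands out as sparse.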

[167] Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity

Eric Tillmann Bill,Enis Simsar,Thomas Hofmann

Main category: cs.CV

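TL;DR: This paper presents the first theoretical framework for improving multi-subject fidelity in text-to-image models. It optimizes sampling dynamics through stochastic optimal control and derives two algorithms, a training-free test-time controller and a lightweight fine-tuning rule called Adjoint Matching, and it highlights FOCUS, which achieves state-of-the-art performance.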

Details

Motivation: Existing text-to-image models suffer from attribute leakage, identity entanglement, and subject omission on multi-subject prompts, and the problem has lacked principled theory and an optimizable objective. Method: Flow matching (FM) is combined with stochastic optimal control (SOC), casting subject disentanglement as control over a trained FM sampler. This yields a test-time velocity perturbation strategy and the Adjoint Matching fine-tuning rule, and it unifies and extends earlier attention heuristics. Result: On Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, the methods consistently improve multi-subject alignment while preserving base-model style. Test-time control is efficient, and fine-tuned controllers generalize to unseen prompts. Conclusion: The work provides an optimizable theoretical framework and practical algorithms for multi-subject text-to-image generation, enabling high-fidelity, disentangled subjects. Abstract: Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions, often showing attribute leakage, identity entanglement, and subject omissions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity. Viewing flow matching (FM) through stochastic optimal control (SOC), we formulate subject disentanglement as control over a trained FM sampler. This yields two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. The same formulation unifies prior attention heuristics, extends to diffusion models via a flow-diffusion correspondence, and provides the first fine-tuning route explicitly designed for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and fine-tuned controllers trained on limited prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.

Table of Contents

cs.CL [Back]

[1] Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset

[2] Towards Open-Ended Discovery for Low-Resource NLP

[3] Discourse vs emissions: Analysis of corporate narratives, symbolic practices, and mimicry through LLMs

[4] Context Matters: Comparison of commercial large language tools in veterinary medicine

[5] ClaimCheck: Real-Time Fact-Checking with Small Language Models

[6] EEFSUVA: A New Mathematical Olympiad Benchmark

[7] Who is In Charge? Dissecting Role Conflicts in Instruction Following

[8] Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision

[9] Geometric Structures and Patterns of Meaning: A PHATE Manifold Analysis of Chinese Character Embeddings

[10] Trustworthy Summarization via Uncertainty Quantification and Risk Awareness in Large Language Models

[11] Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks

[12] Computational Social Linguistics for Telugu Cultural Preservation: Novel Algorithms for Chandassu Metrical Pattern Recognition

[13] LLMRank: Understanding LLM Strengths for Model Routing

[14] GRPO++: Enhancing Dermatological Reasoning under Low Resource Settings

[15] Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation

[16] Silent Tokens, Loud Effects: Padding in LLMs

[17] CIFLEX: Contextual Instruction Flow for Sub-task Execution in Multi-Turn Interactions with a Single On-Device LLM

[18] SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation

[19] Redundancy-as-Masking: Formalizing the Artificial Age Score (AAS) to Model Memory Aging in Generative AI

[20] Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing

[21] Feasibility of Structuring Stress Documentation Using an Ontology-Guided Large Language Model

[22] SeMob: Semantic Synthesis for Dynamic Urban Mobility Prediction

[23] A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering

[24] Let's Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models' Understanding of Sports

[25] SSTAG: Structure-Aware Self-Supervised Learning Method for Text-Attributed Graphs

[26] LOCA: Logical Chain Augmentation for Scientific Corpus Cleaning

[27] GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages

[28] Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data

[29] GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models

[30] Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs

[31] Longitudinal Monitoring of LLM Content Moderation of Social Issues

[32] RJE: A Retrieval-Judgment-Exploration Framework for Efficient Knowledge Graph Question Answering with LLMs

[33] Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse

[34] In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b

[35] OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language

[36] AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees

[37] Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection

[38] TraceDet: Hallucination Detection from the Decoding Trace of Diffusion Large Language Models

[39] LLM Based Sentiment Classification From Bangladesh E-Commerce Reviews

[40] TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture

[41] Evaluation Sheet for Deep Research: A Use Case for Academic Survey Writing

[42] HiSpec: Hierarchical Speculative Decoding for LLMs

[43] TAG-EQA: Text-And-Graph for Event Question Answering via Structured Prompting Strategies

[44] A-VERT: Agnostic Verification with Embedding Ranking Targets

[45] One More Question is Enough, Expert Question Decomposition (EQD) Model for Domain Quantitative Reasoning

[46] ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning

[47] CLUE: Non-parametric Verification from Experience via Hidden-State Clustering

[48] A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation

[49] RAG-BioQA Retrieval-Augmented Generation for Long-Form Biomedical Question Answering

[50] Efficient Training of Robust Traditional Chinese LLaMA-1B on a Single Consumer GPU: Continual Pre-training, SFT, and DPO

[51] AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System

[52] NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT

[53] Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention

[54] SoK: Measuring What Matters for Closed-Loop Security Agents

[55] MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization

[56] FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol

[57] How Do Language Models Compose Functions?

[58] Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation

[59] What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?

[60] Machine-interpretable Engineering Design Standards for Valve Specification

[61] Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

[62] Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction

[63] Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network

[64] Syntactic Blind Spots: How Misalignment Leads to LLMs Mathematical Errors

[65] SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning

[66] Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models

[67] REPAIR: Robust Editing via Progressive Adaptive Intervention and Reintegration

[68] Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey

[69] Inverse Language Modeling towards Robust and Grounded LLMs

[70] Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning

[71] Taking a SEAT: Predicting Value Interpretations from Sentiment, Emotion, Argument, and Topic Annotations

[72] Exploring Database Normalization Effects on SQL Generation

[73] LLM-Based Multi-Task Bangla Hate Speech Detection: Type, Severity, and Target

[74] Style Over Story: A Process-Oriented Study of Authorial Creativity in Large Language Models

[75] Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage

[76] From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens

[77] Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems

[78] The Disparate Impacts of Speculative Decoding