cs.CL

[1] Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset

Leroy Z. Wang

Main category: cs.CL

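TL;DR: Introduces a dataset that uses in-context concept learning tasks to uncover implicit biases in large language models; the models show a preference for upward monotonicity in quantifiers, and this bias is less apparent under direct prompting.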

Details

Motivation: To uncover implicit biases in large language models, in particular latent tendencies in how they handle quantifiers. Method: Builds a dataset of concept learning tasks and tests language models through in-context concept learning experiments. Result: Language models show a preference for upward-monotone quantifiers during in-context concept learning, while this bias is less apparent in direct prompting tasks. Conclusion: In-context concept learning is an effective way to discover hidden biases in language models. Abstract: We introduce a dataset of concept learning tasks that helps uncover implicit biases in large language models. Using in-context concept learning experiments, we found that language models may have a bias toward upward monotonicity in quantifiers; such bias is less apparent when the model is tested by direct prompting without concept learning components. This demonstrates that in-context concept learning can be an effective way to discover hidden biases in language models.
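
As a reminder of the property in question: a quantifier Q is upward monotone in its scope if Q(A, B) together with B ⊆ B' always implies Q(A, B'). The sketch below is not from the paper; it is a brute-force check over a small universe, and the quantifier functions and the `UNIVERSE` set are hypothetical examples chosen for illustration.

```python
from itertools import combinations


def subsets(universe):
    """Yield every subset of the universe as a frozenset."""
    items = sorted(universe)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_upward_monotone(quantifier, universe):
    """Check by brute force that Q(A, B) and B <= B' always imply Q(A, B')."""
    all_sets = list(subsets(universe))
    for a in all_sets:
        for b in all_sets:
            if not quantifier(a, b):
                continue
            for b_bigger in all_sets:
                if b <= b_bigger and not quantifier(a, b_bigger):
                    return False  # a superset of B falsifies the quantifier
    return True


# Hypothetical quantifiers over a restrictor set A and a scope set B.
def at_least_two(a, b):
    return len(a & b) >= 2


def at_most_two(a, b):
    return len(a & b) <= 2


UNIVERSE = {0, 1, 2, 3}
print(is_upward_monotone(at_least_two, UNIVERSE))
print(is_upward_monotone(at_most_two, UNIVERSE))
```

Under these definitions "at least two" passes the check and "at most two" fails it, which is the kind of contrast a concept learning task can probe.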

[2] Towards Open-Ended Discovery for Low-Resource NLP

Bonaventure F. P. Dossou,Henri Aïdasso

Main category: cs.CL

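TL;DR: This paper argues for dynamic learning of low-resource languages through interactive human-machine dialogue, and proposes a framework grounded in joint uncertainty that shifts the paradigm from static data collection to participatory, co-adaptive learning.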

Details

Motivation: Low-resource languages are constrained by the lack of corpora, standardized orthographies, and scalable annotation pipelines, and current large models depend on massive centralized data, which makes it hard for them to benefit marginalized communities. Method: Proposes a framework grounded in joint human-machine uncertainty that combines the model's epistemic uncertainty with hesitation cues and confidence signals from human speakers to guide interaction, query selection, and memory retention. Result: Advocates moving from extractive data collection to participatory, co-adaptive learning, strengthening the ability of language technology to discover and preserve linguistic diversity. Conclusion: Future language technology should move toward human-centered, interactive, cooperative model building between humans and machines that respects and empowers language communities and makes language AI genuinely inclusive. Abstract: Natural Language Processing (NLP) for low-resource languages remains fundamentally constrained by the lack of textual corpora, standardized orthographies, and scalable annotation pipelines. While recent advances in large language models have improved cross-lingual transfer, they remain inaccessible to underrepresented communities due to their reliance on massive, pre-collected data and centralized infrastructure. In this position paper, we argue for a paradigm shift toward open-ended, interactive language discovery, where AI systems learn new languages dynamically through dialogue rather than static datasets. We contend that the future of language technology, particularly for low-resource and under-documented languages, must move beyond static data collection pipelines toward interactive, uncertainty-driven discovery, where learning emerges dynamically from human-machine collaboration instead of being limited to pre-existing datasets. We propose a framework grounded in joint human-machine uncertainty, combining epistemic uncertainty from the model with hesitation cues and confidence signals from human speakers to guide interaction, query selection, and memory retention. This paper is a call to action: we advocate a rethinking of how AI engages with human knowledge in under-documented languages, moving from extractive data collection toward participatory, co-adaptive learning processes that respect and empower communities while discovering and preserving the world's linguistic diversity. This vision aligns with principles of human-centered AI, emphasizing interactive, cooperative model building between AI systems and speakers.

[3] Discourse vs emissions: Analysis of corporate narratives, symbolic practices, and mimicry through LLMs

Bertrand Kian Hassani,Yacoub Bahini,Rizwan Mushtaq

Main category: cs.CL

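TL;DR: Using fine-tuned large language models, this study builds a multidimensional framework to assess the climate disclosure maturity of 828 U.S.-listed firms, finds that commitments are decoupled from quantitative targets and that mimicry is widespread, and stresses the need for stronger regulation to raise the decision usefulness of disclosures.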

Details

Motivation: Climate change has raised demand for corporate climate disclosures, but imitation and symbolic reporting are widespread and leave disclosures short on transparency and comparability. Method: Fine-tunes large language models to build a framework of four classifiers (sentiment, commitment, specificity, and target ambition), extracts narrative indicators from sustainability and annual reports, and links them to firm attributes such as emissions, market capitalization, and sector. Result: (1) Risk-focused narratives often align with explicit commitments, but quantitative targets are decoupled from tone; (2) larger and higher-emitting firms disclose more commitments and actions, though inconsistently with quantitative targets; (3) disclosure styles are highly similar, which suggests widespread mimicry and reduces differentiation and decision usefulness. Conclusion: Large language models are useful for ESG narrative analysis, but stronger regulation is needed to tie climate commitments to verifiable transition pathways and improve disclosure quality. Abstract: Climate change has increased demands for transparent and comparable corporate climate disclosures, yet imitation and symbolic reporting often undermine their value. This paper develops a multidimensional framework to assess disclosure maturity among 828 U.S.-listed firms using large language models (LLMs) fine-tuned for climate communication. Four classifiers (sentiment, commitment, specificity, and target ambition) extract narrative indicators from sustainability and annual reports, which are linked to firm attributes such as emissions, market capitalization, and sector. Analyses reveal three insights: (1) risk-focused narratives often align with explicit commitments, but quantitative targets (e.g., net-zero pledges) remain decoupled from tone; (2) larger and higher-emitting firms disclose more commitments and actions than peers, though inconsistently with quantitative targets; and (3) widespread similarity in disclosure styles suggests mimetic behavior, reducing differentiation and decision usefulness. These results highlight the value of LLMs for ESG narrative analysis and the need for stronger regulation to connect commitments with verifiable transition strategies.

[4] Context Matters: Comparison of commercial large language tools in veterinary medicine

Tyler J Poore,Christopher J Pinard,Aleena Shabbir,Andrew Lagree,Andre Telfer,Kuan-Chuen Wu

Main category: cs.CL

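TL;DR: This study evaluates three commercial veterinary-focused LLM summarization tools on veterinary oncology records and finds that Product 1 (Hachiko) substantially outperforms the other two on accuracy, completeness, and other dimensions; the LLM-as-a-judge evaluation framework is highly reproducible.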

Details

Motivation: Large language models (LLMs) are increasingly used in clinical settings, but their performance in veterinary medicine remains unclear, so existing commercial veterinary-focused LLM tools need systematic evaluation. Method: Using a standardized dataset of veterinary oncology records, a rubric-guided LLM-as-a-judge framework scores the summaries produced by three commercial tools across five domains (Factual Accuracy, Completeness, Chronological Order, Clinical Relevance, and Organization); the evaluation is repeated across three independent runs to test the consistency of the grading framework. Result: Product 1 achieved a median average score of 4.61 (IQR: 0.73), well above Product 2 (2.55) and Product 3 (2.45), and received perfect median scores in Factual Accuracy and Chronological Order; the LLM grader was highly reproducible, with Average Score standard deviations of 0.015, 0.088, and 0.034 for the three products. Conclusion: Veterinary-specific LLM tools perform better on clinical summarization, and LLM-as-a-judge is a scalable and reproducible way to assess clinical NLP summarization in veterinary medicine. Abstract: Large language models (LLMs) are increasingly used in clinical settings, yet their performance in veterinary medicine remains underexplored. We evaluated three commercially available veterinary-focused LLM summarization tools (Product 1 [Hachiko] and Products 2 and 3) on a standardized dataset of veterinary oncology records. Using a rubric-guided LLM-as-a-judge framework, summaries were scored across five domains: Factual Accuracy, Completeness, Chronological Order, Clinical Relevance, and Organization. Product 1 achieved the highest overall performance, with a median average score of 4.61 (IQR: 0.73), compared to 2.55 (IQR: 0.78) for Product 2 and 2.45 (IQR: 0.92) for Product 3. It also received perfect median scores in Factual Accuracy and Chronological Order. To assess the internal consistency of the grading framework itself, we repeated the evaluation across three independent runs. The LLM grader demonstrated high reproducibility, with Average Score standard deviations of 0.015 (Product 1), 0.088 (Product 2), and 0.034 (Product 3). These findings highlight the importance of veterinary-specific commercial LLM tools and demonstrate that LLM-as-a-judge evaluation is a scalable and reproducible method for assessing clinical NLP summarization in veterinary medicine.
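
The summary statistics reported above (median, IQR, and run-to-run standard deviation) can be computed along the following lines. This is only a sketch: the abstract does not say which quartile convention or which standard deviation (sample or population) the authors used, so the `inclusive` method and `statistics.stdev` are assumptions, and all numbers in the snippet are made up.

```python
import statistics


def summarize(scores):
    """Return (median, IQR) for one product's per-record average scores."""
    # Assumption: quartiles use the "inclusive" interpolation convention.
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return statistics.median(scores), q3 - q1


def run_to_run_std(run_averages):
    """Sample standard deviation of a product's Average Score across grader runs."""
    return statistics.stdev(run_averages)


# Made-up rubric scores on a 1-5 scale, purely for illustration.
example_scores = [1, 2, 3, 4, 5]
median, iqr = summarize(example_scores)
print(median, iqr)

# Made-up Average Scores from three hypothetical grader runs.
print(run_to_run_std([4.60, 4.62, 4.61]))
```

A small run-to-run standard deviation means the grader assigns nearly the same Average Score each time it is rerun on the same summaries.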

[5] ClaimCheck: Real-Time Fact-Checking with Small Language Models

Akshith Reddy Putta,Jacob Devasier,Chengkai Li

Main category: cs.CL

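TL;DR: ClaimCheck is an LLM-guided automatic fact-checking system built on small language models that uses a transparent, stepwise verification pipeline and live Web evidence to deliver efficient, interpretable fact-checking.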

Details

Motivation: Existing fact-checking systems rely on large closed-source models and static knowledge stores, which makes them computationally expensive and opaque and limits their accessibility and practicality. Method: Designs a modular pipeline consisting of Web search query planning, Web-based evidence retrieval and summarization, evidence synthesis and re-retrieval, and claim verdict evaluation, with each module optimized for small language models. Result: With the Qwen3-4B model, the system reaches 76.4% accuracy on the AVeriTeC dataset, outperforming previous approaches that use LLaMA3.1 70B and GPT-4o while requiring significantly less computation. Conclusion: Careful modular design and prompting strategies can overcome the limitations of small language models and yield a high-performing, low-resource, transparent fact-checking system. Abstract: We introduce ClaimCheck, an LLM-guided automatic fact-checking system designed to verify real-world claims using live Web evidence and small language models. Unlike prior systems that rely on large, closed-source models and static knowledge stores, ClaimCheck employs a transparent, stepwise verification pipeline that mirrors human fact-checking workflows consisting of Web search query planning, Web-based evidence retrieval and summarization, evidence synthesis and re-retrieval, and claim verdict evaluation. Each module is optimized for small LLMs, allowing the system to deliver accurate and interpretable fact-checking with significantly lower computational requirements. Despite using a much smaller Qwen3-4B model, ClaimCheck achieves state-of-the-art accuracy of 76.4% on the AVeriTeC dataset, outperforming previous approaches using LLaMA3.1 70B and GPT-4o. Extensive ablations demonstrate that careful modular design and prompting strategies can overcome the limitations of smaller LLMs. To promote accessibility and transparency, we provide a public demo at https://idir.uta.edu/claimcheck.

[6] EEFSUVA: A New Mathematical Olympiad Benchmark

Nicole N Khatibi,Daniil A. Radamovich,Michael P. Brenner

Main category: cs.CL

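TL;DR: This paper questions the high proficiency that large language models appear to show on mathematics benchmarks, points out that existing benchmarks may suffer from data contamination and a narrow range of problem types, and introduces a new benchmark, EEFSUVA, for a more complete assessment of mathematical reasoning.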

Details

Motivation: Existing mathematics benchmarks draw mainly on well-known competitions such as the International Mathematics Olympiad, which may involve data leakage and overfitting and lead to overestimating models' real reasoning ability. Method: Builds a new benchmark, EEFSUVA, from less widely circulated regional and national Olympiads of Eastern Europe and the former Soviet countries; the problems are of comparable difficulty but call for more nonstandard solutions, and they appear less often in online corpora. Result: Even state-of-the-art large language models show a notable performance decline on EEFSUVA, which indicates limited reasoning ability on less mainstream, nonstandard problems. Conclusion: Broader and more diverse evaluation datasets are needed to measure the mathematical reasoning of large language models faithfully and to guide future model development. Abstract: Recent breakthroughs have spurred claims that large language models (LLMs) match gold medal Olympiad to graduate level proficiency on mathematics benchmarks. In this work, we examine these claims in detail and assess the extent to which current benchmarks capture genuine LLM mathematical reasoning. The composition of these benchmarks, primarily drawing from the International Mathematics Olympiad (IMO) and related competitions, may overstate models' reasoning ability due to potential data contamination and a narrow focus on familiar problem types. To enable a more holistic assessment of mathematical understanding, we introduce EEFSUVA, a novel benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and the countries from the former Soviet Union. These contests feature problems of comparable difficulty to the IMO and are renowned for demanding nonstandard problem-solving techniques, yet their problems are far less prevalent in online corpora. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These findings also suggest the potential importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.

[7] Who is In Charge? Dissecting Role Conflicts in Instruction Following

Siqi Zeng

Main category: cs.CL

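TL;DR: Large language models are supposed to follow hierarchical instructions in which system prompts override user inputs, but studies find that they often ignore this rule and obey social cues (such as authority or consensus) instead. This paper analyzes the mechanism on a large-scale dataset and shows that system-user conflicts and social conflicts form distinct representational subspaces; internal conflict detection is stronger in system-user cases, yet consistent resolution appears only for social cues. Steering experiments show that, although they are derived from social cues, the vectors amplify instruction following in a role-agnostic way. The results explain why system obedience is fragile and underline the need for lightweight, hierarchy-sensitive alignment methods.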

Details

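Motivation: In deployment, large language models should give system instructions priority over user inputs, that is, they should obey instructions hierarchically. Recent work shows that models often ignore this principle and are instead more easily swayed by social factors such as authority and consensus. Understanding the mechanism behind this behavior requires a systematic analysis of the model's internal representations and decision process, both to explain why system instructions are so easily violated and to inform better alignment. Method: Combines linear probing, Direct Logit Attribution, and steering experiments on a large-scale dataset to analyze how large language models internally handle system-user conflicts and social conflicts. Linear probing identifies conflict-decision signals in early layers and their representational subspaces; Direct Logit Attribution traces internal conflict detection and resolution across conflict types; vector steering tests how social cues affect instruction-following behavior. Result: (1) System-user conflicts and social conflicts are encoded early in the model and form separate representational subspaces; (2) system-user conflicts trigger stronger internal conflict-detection signals, but resolution is consistent only for social cues; (3) steering experiments show that vectors associated with social cues strengthen overall instruction following in a role-agnostic way instead of favoring one side. Conclusion: Large language models can detect instruction conflicts, but their decisions depend more on social cues, which leaves system obedience fragile and reflects a gap in how current alignment handles hierarchy. The study recommends developing lightweight, hierarchy-aware alignment methods so that models follow system instructions more reliably. Abstract: Large language models should follow hierarchical instructions where system prompts override user inputs, yet recent work shows they often ignore this rule while strongly obeying social cues such as authority or consensus. We extend these behavioral findings with mechanistic interpretations on a large-scale dataset. Linear probing shows conflict-decision signals are encoded early, with system-user and social conflicts forming distinct subspaces. Direct Logit Attribution reveals stronger internal conflict detection in system-user cases but consistent resolution only for social cues. Steering experiments show that, despite using social cues, the vectors surprisingly amplify instruction following in a role-agnostic way. Together, these results explain fragile system obedience and underscore the need for lightweight hierarchy-sensitive alignment methods.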

[8] Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision

Dimitar Peshevski,Kiril Blazhevski,Martin Popovski,Gjorgji Madjarov

Main category: cs.CL

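TL;DR: Proposes a pipeline that uses LLMs to generate synthetic queries and labels for training small transformer models on document reranking, cutting cost substantially while maintaining performance.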

Details

Motivation: LLMs perform well at reranking but are computationally expensive, while small models depend on scarce human-labeled data, which limits their use. Method: Uses LLMs to generate synthetic queries from domain corpora, labels positive and hard-negative pairs with an LLM-based classifier, and fine-tunes a small transformer model with contrastive learning (LCE loss). Result: Experiments on the MedQuAD dataset show that the method significantly improves in-domain performance and generalizes well out of domain. Conclusion: Using LLMs for data generation and supervision instead of direct inference lowers computational cost while keeping reranking performance strong. Abstract: Effective document reranking is essential for improving search relevance across diverse applications. While Large Language Models (LLMs) excel at reranking due to their deep semantic understanding and reasoning, their high computational cost makes them impractical for many real-world deployments. Fine-tuning smaller, task-specific models is a more efficient alternative but typically depends on scarce, manually labeled data. To overcome this, we propose a novel pipeline that eliminates the need for human-labeled query-document pairs. Our method uses LLMs to generate synthetic queries from domain-specific corpora and employs an LLM-based classifier to label positive and hard-negative pairs. This synthetic dataset is then used to fine-tune a smaller transformer model with contrastive learning using Localized Contrastive Estimation (LCE) loss. Experiments on the MedQuAD dataset show that our approach significantly boosts in-domain performance and generalizes well to out-of-domain tasks. By using LLMs for data generation and supervision rather than inference, we reduce computational costs while maintaining strong reranking capabilities.
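
For readers unfamiliar with the LCE objective: for each query, the reranker scores one positive document and a group of hard negatives, and the loss is the cross-entropy of the positive within that group. The sketch below works on plain Python floats; the function names are illustrative, and the scores would in practice come from the reranker being fine-tuned.

```python
import math


def lce_loss(positive_score, negative_scores):
    """Cross-entropy of the positive within its group of hard negatives (one query)."""
    scores = [positive_score] + list(negative_scores)
    top = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum_exp - positive_score  # equals -log softmax(positive)


def batch_lce_loss(groups):
    """Average the per-query loss over (positive_score, negative_scores) groups."""
    return sum(lce_loss(pos, negs) for pos, negs in groups) / len(groups)


# Made-up reranker scores: one positive and three hard negatives per query.
print(batch_lce_loss([(2.0, [0.5, 0.1, -1.0]), (0.3, [0.4, 0.2, 0.0])]))
```

The loss shrinks as the positive's score rises above the hard negatives in the same group, which is what pushes the small model to separate near-miss documents.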

[9] Geometric Structures and Patterns of Meaning: A PHATE Manifold Analysis of Chinese Character Embeddings

Wen G. Gong

Main category: cs.CL

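TL;DR: Using PHATE manifold analysis, this study systematically uncovers geometric patterns across several Chinese character embedding models: content words form clusters, function words form branching structures, and geometric complexity correlates with semantic richness.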

Details

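Motivation: To examine whether the geometric structure of Chinese character embeddings reflects their semantic and linguistic properties, and to test whether traditional linguistic theory holds in modern embedding spaces. Method: Combines seven embedding models and eight dimensionality reduction methods, uses PHATE manifold analysis to cross-validate the geometric structure of over 1000 Chinese characters across 12 semantic domains, and runs a child-network analysis covering 123 phrases. Result: Content words form clusters in embedding space and function words form branching structures; semantically rich characters show diverse geometry, while structural radicals collapse into tight clusters; the child-network analysis shows systematic semantic expansion from elemental characters. Conclusion: The geometry of Chinese character embeddings systematically reflects their semantic organization, which gives computational support to traditional linguistic theory and establishes a new framework for geometric analysis of semantic structure. Abstract: We systematically investigate geometric patterns in Chinese character embeddings using PHATE manifold analysis. Through cross-validation across seven embedding models and eight dimensionality reduction methods, we observe clustering patterns for content words and branching patterns for function words. Analysis of over 1000 Chinese characters across 12 semantic domains reveals that geometric complexity correlates with semantic content: meaningful characters exhibit rich geometric diversity while structural radicals collapse into tight clusters. The comprehensive child-network analysis (123 phrases) demonstrates systematic semantic expansion from elemental characters. These findings provide computational evidence supporting traditional linguistic theory and establish a novel framework for geometric analysis of semantic organization.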

[10] Trustworthy Summarization via Uncertainty Quantification and Risk Awareness in Large Language Models

Shuaidong Pan,Di Wu

Main category: cs.CL

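TL;DR: Proposes a large language model framework that combines uncertainty quantification with risk-aware mechanisms to make automatic summarization more reliable in high-risk scenarios.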

Details

Motivation: To meet the need for reliable automatic summarization under information overload and high-risk decision-making, and to avoid the overconfident generation of conventional models. Method: Builds a conditional generation-based summarization model, introduces Bayesian inference to model uncertainty in the parameter space, measures the uncertainty of generated content with predictive distribution entropy, and jointly optimizes entropy regularization with a risk-aware loss; the model also integrates risk scoring and regulation modules. Result: Experiments show that the method significantly improves the robustness and reliability of summaries in high-risk applications while maintaining fluency and semantic integrity. Conclusion: The study offers a systematic solution for trustworthy summarization with scalability and practical value at the methodological level. Abstract: This study addresses the reliability of automatic summarization in high-risk scenarios and proposes a large language model framework that integrates uncertainty quantification and risk-aware mechanisms. Starting from the demands of information overload and high-risk decision-making, a conditional generation-based summarization model is constructed, and Bayesian inference is introduced during generation to model uncertainty in the parameter space, which helps avoid overconfident predictions. The uncertainty level of the generated content is measured using predictive distribution entropy, and a joint optimization of entropy regularization and risk-aware loss is applied to ensure that key information is preserved and risk attributes are explicitly expressed during information compression. On this basis, the model incorporates risk scoring and regulation modules, allowing summaries to cover the core content accurately while enhancing trustworthiness through explicit risk-level prompts. Comparative experiments and sensitivity analyses verify that the proposed method significantly improves the robustness and reliability of summarization in high-risk applications while maintaining fluency and semantic integrity. This research provides a systematic solution for trustworthy summarization and demonstrates both scalability and practical value at the methodological level.
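
Predictive distribution entropy, the uncertainty measure named above, can be illustrated as follows. The sketch assumes the common Monte Carlo recipe of averaging next-token distributions from several sampled parameter sets and then taking the Shannon entropy; the abstract does not state that the authors approximate the Bayesian posterior this way, and the joint loss with the risk-aware term is not reproduced here because its form is not given.

```python
import math


def mean_distribution(sampled_distributions):
    """Average next-token distributions obtained from several sampled parameter sets."""
    count = len(sampled_distributions)
    vocab_size = len(sampled_distributions[0])
    return [sum(d[i] for d in sampled_distributions) / count for i in range(vocab_size)]


def predictive_entropy(distribution):
    """Shannon entropy in nats; terms with zero probability contribute nothing."""
    return -sum(p * math.log(p) for p in distribution if p > 0.0)


# Made-up distributions over a three-token vocabulary from two parameter samples.
samples = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
print(predictive_entropy(mean_distribution(samples)))
```

Higher entropy marks positions where the model is less sure of the next token, which is the signal an entropy regularizer or a risk-level prompt could build on.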

[11] Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks

Dongjun Kim,Gyuho Shim,Yongchan Chun,Minhyuk Kim,Chanjun Park,Heuiseok Lim

Main category: cs.CL

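TL;DR: This paper introduces Benchmark Profiling, a framework that decomposes benchmark performance into ten cognitive abilities and shows that existing benchmarks often mask the abilities a task actually demands.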

Details

Motivation: Benchmark scores tend to overstate a model's real capability because they do not reflect the mix of skills a task requires, so a systematic method is needed to verify whether benchmarks measure what they claim to measure. Method: Combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model's performance on a given benchmark. Result: Profiling three instruction-tuned models on ten widely used benchmarks shows that most benchmarks draw on several abilities instead of one; datasets with similar labels rely on different ability mixtures; code-generation benchmarks reward broad, multi-skill improvement and gain little from narrow domain-specific fine-tuning; and irrelevant abilities can hurt performance. Conclusion: Benchmark Profiling explains why performance gains do not always translate into user-perceived competence and provides a transparent tool for benchmark audit and model interpretability. Abstract: Large Language Models are commonly judged by their scores on standard benchmarks, yet such scores often overstate real capability since they mask the mix of skills a task actually demands. For example, ARC is assumed to test reasoning, while HellaSwag is designed to evaluate commonsense. However, we lack a systematic way to verify if these benchmarks actually measure these labels. We introduce Benchmark Profiling, a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities. The method combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model's success on a given benchmark. Profiling three instruction-tuned models across ten widely used benchmarks yields four key findings: (i) most benchmarks draw on several abilities rather than one, (ii) datasets with similar labels rely on distinct ability mixtures, (iii) code-generation benchmarks reward broad, multi-skill improvement and thus show only modest gains from narrow domain-specific fine-tuning, and (iv) abilities irrelevant to the task could negatively affect performance. Benchmark Profiling therefore explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark audit and model interpretability.

Boddu Sri Pavan,Boddu Swathi Sree

Main category: cs.CL

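TL;DR: This study presents a computational social science approach to preserving Telugu Chandassu, the metrical poetry tradition, as cultural heritage, and develops the first digital framework for analyzing Telugu prosody by combining community knowledge with modern computational methods.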

Details

Motivation: To protect the endangered Telugu metrical poetry tradition, preserve the collective cultural intelligence it carries, and support cultural transmission through modern technology. Method: Takes a social computing approach, builds a dataset of 4,651 annotated padyams, and designs culturally informed algorithms, including modules for prosody-aware tokenization, light and heavy syllable classification, and metrical pattern recognition. Result: The proposed algorithm reaches 91.73% accuracy on the newly proposed Chandassu Score, with evaluation metrics that reflect traditional literary standards. Conclusion: Computational social science can help preserve endangered cultural knowledge systems; the method offers a workable path for community-centered digital preservation of cultural heritage, with implications for digital humanities and socially-aware computing. Abstract: This research presents a computational social science approach to preserving Telugu Chandassu, the metrical poetry tradition representing centuries of collective cultural intelligence. We develop the first comprehensive digital framework for analyzing Telugu prosodic patterns, bridging traditional community knowledge with modern computational methods. Our social computing approach involves collaborative dataset creation of 4,651 annotated padyams, expert-validated linguistic patterns, and culturally-informed algorithmic design. The framework includes AksharamTokenizer for prosody-aware tokenization, LaghuvuGuruvu Generator for classifying light and heavy syllables, and PadyaBhedam Checker for automated pattern recognition. Our algorithm achieves 91.73% accuracy on the proposed Chandassu Score, with evaluation metrics reflecting traditional literary standards. This work demonstrates how computational social science can preserve endangered cultural knowledge systems while enabling new forms of collective intelligence around literary heritage. The methodology offers insights for community-centered approaches to cultural preservation, supporting broader initiatives in digital humanities and socially-aware computing systems.

[13] LLMRank: Understanding LLM Strengths for Model Routing

Shubham Agrawal,Prasang Gupta

Main category: cs.CL

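TL;DR: LLMRank is a routing framework based on prompt features that, through multifaceted feature extraction and a hybrid ranking objective, reaches up to 89.2% of oracle utility across 11 benchmarks and 11 state-of-the-art LLMs, improving the efficiency and interpretability of model deployment.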

Details

Motivation: As LLM capabilities diversify, choosing the most suitable model for each prompt while balancing performance against efficiency has become a key deployment challenge. Method: Proposes LLMRank, which uses human-readable features such as task type, reasoning patterns, complexity indicators, syntactic cues, and signals from a lightweight proxy solver, together with a neural ranking model trained on the RouterBench dataset, to route prompts to models. Result: On RouterBench, which contains 36,497 prompts spanning 11 benchmarks and 11 state-of-the-art LLMs, LLMRank reaches up to 89.2% of oracle utility and provides interpretable feature attributions. Conclusion: Multifaceted feature extraction and the hybrid ranking objective are central to efficient, transparent LLM routing, and LLMRank offers an effective, interpretable option for real deployments. Abstract: The rapid growth of large language models (LLMs) with diverse capabilities, latency and computational costs presents a critical deployment challenge: selecting the most suitable model for each prompt to optimize the trade-off between performance and efficiency. We introduce LLMRank, a prompt-aware routing framework that leverages rich, human-readable features extracted from prompts, including task type, reasoning patterns, complexity indicators, syntactic cues, and signals from a lightweight proxy solver. Unlike prior one-shot routers that rely solely on latent embeddings, LLMRank predicts per-model utility using a neural ranking model trained on RouterBench, comprising 36,497 prompts spanning 11 benchmarks and 11 state-of-the-art LLMs, from small efficient models to large frontier systems. Our approach achieves up to 89.2% of oracle utility, while providing interpretable feature attributions that explain routing decisions. Extensive studies demonstrate the importance of multifaceted feature extraction and the hybrid ranking objective, highlighting the potential of feature-driven routing for efficient and transparent LLM deployment.
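
One plausible reading of "percent of oracle utility" is the router's total utility divided by the total utility of an oracle that always picks the best model for each prompt. The sketch below follows that reading; how LLMRank defines utility (for example, quality traded off against cost) is not stated in the abstract, and the model names and numbers are hypothetical.

```python
def oracle_utility_fraction(utilities, routed_model):
    """Ratio of the router's total utility to that of a per-prompt oracle.

    utilities[i][m] is the utility of model m on prompt i;
    routed_model[i] is the model the router picked for prompt i.
    """
    achieved = sum(row[choice] for row, choice in zip(utilities, routed_model))
    oracle = sum(max(row.values()) for row in utilities)
    return achieved / oracle


# Made-up utilities for two prompts and two hypothetical models.
utilities = [
    {"small": 1.0, "large": 1.0},
    {"small": 0.0, "large": 1.0},
]
print(oracle_utility_fraction(utilities, ["small", "large"]))
print(oracle_utility_fraction(utilities, ["small", "small"]))
```

A router that matches the oracle on every prompt scores 1.0; always choosing the small model in this toy setup recovers only half of the oracle's utility.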

[14] GRPO++: Enhancing Dermatological Reasoning under Low Resource Settings

Ismam Nur Swapnil,Aranya Saha,Tanvir Ahmed Khan,Mohammad Ariful Haque

Main category: cs.CL

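TL;DR: This paper proposes a resource-efficient, multi-stage training methodology that produces DermIQ-VLM and improves the structured reasoning of vision-language models in dermatological diagnosis.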

Details

Motivation: Existing vision-language models for medical image analysis are limited by data scarcity and high computational cost, which makes complex structured reasoning hard to achieve. Method: Proposes a modified algorithm, GRPO++, for reasoning-oriented disease recognition, adds supervised fine-tuning for conversational ability, and uses Knowledge Graph-based Direct Preference Optimization (DPO) to reduce factual errors. Result: A preliminary evaluation on a dermatological dataset shows notable gains over standard fine-tuning. Conclusion: The training pipeline is a feasible way to develop reliable, specialized vision-language models in resource-constrained environments. Abstract: Vision-Language Models (VLMs) show promise in medical image analysis, yet their capacity for structured reasoning in complex domains like dermatology is often limited by data scarcity and the high computational cost of advanced training techniques. To address these challenges, we introduce DermIQ-VLM, a VLM developed through a multi-stage, resource-efficient methodology designed to emulate a dermatologist's diagnostic process. Our primary contribution is a modified version of Grouped Relative Policy Optimization (GRPO), called GRPO++, which stabilizes the powerful but data-intensive GRPO framework. Our proposed training pipeline first employs GRPO++ for reasoning-oriented disease recognition, followed by supervised fine-tuning for conversational ability. To mitigate factual errors introduced during this step, we then align the model using Direct Preference Optimization (DPO), leveraging a Knowledge Graph-based system as a scalable proxy for expert preference. A preliminary evaluation on a curated dermatological dataset demonstrates that our proposed methodology yields notable performance gains over standard fine-tuning approaches. These findings validate the potential of our pipeline as a feasible pathway for developing specialized, reliable VLMs in resource-constrained environments.
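
For context, baseline GRPO scores each sampled response against the other responses drawn for the same prompt instead of against a learned value function. The sketch below shows only that baseline group-relative advantage; the GRPO++ modifications are not described in the abstract and are not represented. The use of the population standard deviation and the `eps` constant are assumptions, since implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up rewards for four responses sampled for one prompt.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason plain GRPO can be data-hungry when rewards are sparse.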

[15] Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation

Nandakishor M

Main category: cs.CL

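TL;DR: Proposes a confidence-based routing system that proactively reduces LLM hallucination by assessing model uncertainty before generation; it combines semantic alignment, cross-layer convergence, and a learned confidence signal to choose the generation pathway dynamically, and it markedly improves hallucination detection on knowledge-intensive QA while lowering computational cost.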

Details

Motivation: Large language models are prone to factual errors (hallucinations); existing methods mostly correct output after generation, which is computationally expensive and cannot prevent unreliable content from being generated, so a more efficient pre-generation intervention is needed. Method: Proposes a confidence-aware routing system that combines three signals: semantic alignment between internal representations and reference embeddings, internal convergence analysis across model layers, and a learned confidence estimate. These are merged into a unified confidence score that routes each query to one of four pathways: local generation, retrieval-augmented generation, a larger model, or human review. Result: On knowledge-intensive QA benchmarks, hallucination detection improves from 0.42 (baseline) to 0.74 and the F1 score from 0.61 to 0.82, with a low false positive rate (0.09) and 40% lower computational cost than post-hoc correction methods. Conclusion: Shifting from reactive correction to proactive assessment offers a computationally efficient way to improve LLM reliability. Abstract: Large Language Models suffer from hallucination, generating plausible yet factually incorrect content. Current mitigation strategies focus on post-generation correction, which is computationally expensive and fails to prevent unreliable content generation. We propose a confidence-aware routing system that proactively assesses model uncertainty before generation and redirects queries based on estimated reliability. Our approach combines three complementary signals: semantic alignment between internal representations and reference embeddings, internal convergence analysis across model layers, and learned confidence estimation. The unified confidence score determines routing to four pathways: local generation for high confidence, retrieval-augmented generation for medium confidence, larger models for low confidence, and human review for very low confidence. Evaluation on knowledge-intensive QA benchmarks demonstrates significant improvements in hallucination detection (0.74 vs. 0.42 baseline) while reducing computational costs by 40% compared to post-hoc methods. The F1 score improves from 0.61 to 0.82 with low false positive rates (0.09). This paradigm shift from reactive correction to proactive assessment offers a computationally efficient approach to LLM reliability enhancement.
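
The four-pathway routing described above can be sketched as a threshold rule over a combined score. Everything numeric here is an assumption: the abstract gives neither the way the three signals are combined nor the thresholds, so the equal weights and the cut-offs `high`, `medium`, and `low` are placeholders.

```python
def unified_confidence(semantic_alignment, layer_convergence, learned_confidence,
                       weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of three signals, each assumed to lie in [0, 1]."""
    signals = (semantic_alignment, layer_convergence, learned_confidence)
    return sum(w * s for w, s in zip(weights, signals)) / sum(weights)


def route(confidence, high=0.8, medium=0.6, low=0.4):
    """Map a confidence score to one of the four pathways named in the abstract."""
    if confidence >= high:
        return "local_generation"
    if confidence >= medium:
        return "retrieval_augmented_generation"
    if confidence >= low:
        return "larger_model"
    return "human_review"


# Made-up signal values for one query.
score = unified_confidence(0.9, 0.7, 0.5)
print(score, route(score))
```

Because the decision is made before any tokens are generated, a low-confidence query never pays for a local generation that would later be discarded.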

[16] Silent Tokens, Loud Effects: Padding in LLMs

Rom Himelstein,Amit LeVi,Yonatan Belinkov,Avi Mendelson

Main category: cs.CL

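TL;DR: The study shows that mishandled padding tokens in large language models can noticeably harm activations, generation quality, bias, and safety, so padding needs careful handling in deployment.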

Details

Motivation: Padding tokens should be fully masked, but implementation errors can let them influence computation, and the extent of that influence is unclear, so its practical impact needs systematic study. Method: Across three open-source model families (Llama, Gemma, and Qwen), the study inserts controlled amounts of padding and evaluates the effect along four axes: activations, generation quality, bias, and safety. Result: Even small amounts of padding shift hidden representations, degrade generation quality in smaller models, alter bias in unpredictable ways, and weaken safety guardrails. Conclusion: Padding tokens are not a harmless detail but a robustness risk that must be managed carefully in deployment. Abstract: Padding tokens are widely used in large language models (LLMs) to equalize sequence lengths during batched inference. While they should be fully masked, implementation errors can cause them to influence computation, and the extent of this influence is not well understood. We systematically study this effect across three open-source model families (Llama, Gemma, Qwen), inserting controlled amounts of padding and evaluating outcomes along four axes: activations, generation quality, bias, and safety. Even small amounts of padding shift hidden representations, degrade quality in smaller models, alter bias in unpredictable ways, and weaken safety guardrails. These findings demonstrate that padding is not a harmless detail but a robustness risk that must be carefully handled in deployment.
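
The failure mode studied here can be reproduced in miniature with single-query dot-product attention. The toy below is not the paper's experimental setup; the vectors and padding values are made up. It shows that a correct mask makes left padding invisible, while a forgotten mask lets the padding positions pull the output toward their values.

```python
import math


def attend(query, keys, values, mask=None):
    """Single-query scaled dot-product attention; mask[i] is False for padding."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    if mask is not None:
        # Masked positions get -inf so their softmax weight is exactly zero.
        scores = [s if keep else -math.inf for s, keep in zip(scores, mask)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [1.0, 3.0]

# Left-pad with two hypothetical padding positions whose entries are nonzero.
pad_keys = [[0.5, 0.5], [0.5, 0.5]] + keys
pad_values = [10.0, 10.0] + values

clean = attend(query, keys, values)
masked = attend(query, pad_keys, pad_values, mask=[False, False, True, True])
leaky = attend(query, pad_keys, pad_values)  # mask forgotten
print(clean, masked, leaky)
```

The masked output matches the unpadded one, and the unmasked output does not; in a full model that shift propagates through every later layer.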

[17] CIFLEX: Contextual Instruction Flow for Sub-task Execution in Multi-Turn Interactions with a Single On-Device LLM

Juntae Lee,Jihwan Bang,Seunghan Yang,Simyung Chang

Main category: cs.CL

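TL;DR: CIFLEX is a new execution system that lets a single on-device large language model handle sub-tasks efficiently in multi-turn interactions; it reuses the main task's KV cache and injects task-specific instructions into isolated side paths, which significantly reduces computational overhead.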

Details

Motivation: As LLMs become more capable, a single model is expected to handle many sub-tasks to serve user requests better, but conventional approaches reprocess the whole conversation context on every task switch, which is computationally expensive. Method: CIFLEX executes sub-tasks by reusing the main task's key-value (KV) cache and injecting task-specific instructions into isolated side paths; afterwards it rolls back to the main path via the cached context, avoiding redundant prefill computation. It also uses a hierarchical classification strategy to support sub-task selection. Result: Experiments show that CIFLEX significantly reduces computational cost without degrading task performance, enabling scalable and efficient multi-task dialogue on-device. Conclusion: Through KV cache reuse and the side-path mechanism, CIFLEX makes sub-task handling in multi-turn dialogue more efficient and offers a practical approach to multi-task processing with a single on-device model. Abstract: We present CIFLEX (Contextual Instruction Flow for Sub-task Execution), which is a novel execution system for efficient sub-task handling in multi-turn interactions with a single on-device large language model (LLM). As LLMs become increasingly capable, a single model is expected to handle diverse sub-tasks that more effectively and comprehensively support answering user requests. A naive approach reprocesses the entire conversation context when switching between main and sub-tasks (e.g., query rewriting, summarization), incurring significant computational overhead. CIFLEX mitigates this overhead by reusing the key-value (KV) cache from the main task and injecting only task-specific instructions into isolated side paths. After sub-task execution, the model rolls back to the main path via cached context, thereby avoiding redundant prefill computation. To support sub-task selection, we also develop a hierarchical classification strategy tailored for small-scale models, decomposing multi-choice decisions into binary ones. Experiments show that CIFLEX significantly reduces computational costs without degrading task performance, enabling scalable and efficient multi-task dialogue on-device.
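
The saving from reusing the main-path cache can be illustrated with a toy token counter. This is a schematic sketch, not CIFLEX's implementation: `ToyKVCache` and `run_subtask_on_side_path` are hypothetical names, the "cache" is a plain list with one entry per prefilled token, and decoding is omitted.

```python
class ToyKVCache:
    """Stand-in for a KV cache: one list entry per token that has been prefilled."""

    def __init__(self):
        self.entries = []
        self.prefilled_tokens = 0  # running count of tokens pushed through prefill

    def prefill(self, tokens):
        self.entries.extend(tokens)
        self.prefilled_tokens += len(tokens)

    def checkpoint(self):
        return len(self.entries)

    def rollback(self, mark):
        del self.entries[mark:]  # drop everything the side path appended


def run_subtask_on_side_path(cache, instruction_tokens):
    """Append only the sub-task instruction, then restore the main-path state."""
    mark = cache.checkpoint()
    cache.prefill(instruction_tokens)
    # The sub-task would decode here, reading the shared conversation prefix.
    cache.rollback(mark)


conversation = ["user:", "book", "a", "table", "for", "two"]
instruction = ["rewrite", "the", "query"]

# Cache reuse: the conversation is prefilled once, the side path adds the instruction.
cache = ToyKVCache()
cache.prefill(conversation)
run_subtask_on_side_path(cache, instruction)
reuse_cost = cache.prefilled_tokens

# Naive baseline: the sub-task starts from scratch and reprocesses the conversation.
naive_main = ToyKVCache()
naive_main.prefill(conversation)
naive_sub = ToyKVCache()
naive_sub.prefill(conversation + instruction)
naive_cost = naive_main.prefilled_tokens + naive_sub.prefilled_tokens

print(reuse_cost, naive_cost)
```

In this toy run the reuse path prefills 9 tokens against 15 for the naive baseline, and the gap widens as the conversation grows because only the short instruction is ever re-prefilled.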

[18] SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation

Hu Wei,Ze Xu,Boyu Yang,Linlin Miao,Weiqi Zhai,Yihan Li,Zixuan Li,Zhijun Wang,Boya Wang,Jianwei Yu,Jialing Yuan,Xiaoyue Zhang,Cheng He,Minglei Chen,Zifan Zhang,Qianhui Li,Wei Wang,Xiang Xu

Main category: cs.CL

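TL;DR: This paper presents two complementary mathematics benchmarks, SKYLENAGE-ReasoningMATH and SKYLENAGE-MATH, aimed at the ceiling effects that current large language models show on math tasks.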

Details

Motivation: Current large language models score near the ceiling on public math datasets, which makes it hard to separate frontier models, so more challenging and better structured benchmarks are needed. Method: Builds two new benchmarks: a 100-item, structure-aware diagnostic set (SKYLENAGE-ReasoningMATH) with per-item metadata on length, numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item contest-style suite spanning high school to doctoral level under a seven-subject taxonomy. Fifteen contemporary LLM variants are evaluated under a single setup and analyzed by subject and by difficulty level. Result: On the contest suite, the strongest model reaches 44% accuracy and the runner-up 37%; accuracy falls as difficulty rises, and top systems show a doctoral-to-high-school retention of about 79%. On the reasoning set, the best model attains 81% overall, and the hardest slice reveals a clear gap between the leaders and the mid-tier. Conclusion: The SKYLENAGE benchmarks provide a hard, reasoning-centered, broadly covering math evaluation with calibrated difficulty and rich metadata, and can serve as a reference for future evaluations of mathematical reasoning. Abstract: Large language models (LLMs) now perform strongly on many public math suites, yet frontier separation within mathematics increasingly suffers from ceiling effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a 100-item, structure-aware diagnostic set with per-item metadata on length, numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item contest-style suite spanning four stages from high school to doctoral under a seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a single setup and analyze subject x model and grade x model performance. On the contest suite, the strongest model reaches 44% while the runner-up reaches 37%; accuracy declines from high school to doctoral, and top systems exhibit a doctoral-to-high-school retention near 79%. On the reasoning set, the best model attains 81% overall, and hardest-slice results reveal clear robustness gaps between leaders and the mid-tier. In summary, we release SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH; together, SKYLENAGE provides a hard, reasoning-centered and broadly covering math benchmark with calibrated difficulty and rich metadata, serving as a reference benchmark for future evaluations of mathematical reasoning.

[19] Redundancy-as-Masking: Formalizing the Artificial Age Score (AAS) to Model Memory Aging in Generative AI

Seyma Yaman Kayadibi

Main category: cs.CL

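TL;DR: Introduces the Artificial Age Score (AAS), a metric for assessing memory aging in large language models, and finds that session resets degrade episodic memory while semantic memory stays stable.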

Details

Motivation: To quantify the memory degradation that context resets cause in AI systems, in particular the asymmetry between semantic and episodic memory. Method: Introduces the Artificial Age Score (AAS), a log-scaled, entropy-informed metric derived from observable recall behavior, and tests the memory of ChatGPT-5 under stateless and persistent interaction in a 25-day bilingual study. Result: In persistent sessions the model retains both semantic and episodic memory and the AAS approaches its theoretical minimum; after session resets, episodic memory collapses and the AAS rises sharply, indicating structural aging. Conclusion: AAS is a theoretically grounded, task-independent diagnostic tool for memory aging that is suitable for evaluating memory stability in AI systems. Abstract: Artificial intelligence is observed to age not through chronological time but through structural asymmetries in memory performance. In large language models, semantic cues such as the name of the day often remain stable across sessions, while episodic details like the sequential progression of experiment numbers tend to collapse when conversational context is reset. To capture this phenomenon, the Artificial Age Score (AAS) is introduced as a log-scaled, entropy-informed metric of memory aging derived from observable recall behavior. The score is formally proven to be well-defined, bounded, and monotonic under mild and model-agnostic assumptions, making it applicable across various tasks and domains. In its Redundancy-as-Masking formulation, the score interprets redundancy as overlapping information that reduces the penalized mass. However, in the present study, redundancy is not explicitly estimated; all reported values assume a redundancy-neutral setting (R = 0), yielding conservative upper bounds. The AAS framework was tested over a 25-day bilingual study involving ChatGPT-5, structured into stateless and persistent interaction phases. During persistent sessions, the model consistently recalled both semantic and episodic details, driving the AAS toward its theoretical minimum, indicative of structural youth. In contrast, when sessions were reset, the model preserved semantic consistency but failed to maintain episodic continuity, causing a sharp increase in the AAS and signaling structural memory aging. These findings support the utility of AAS as a theoretically grounded, task-independent diagnostic tool for evaluating memory degradation in artificial systems.
The study builds on foundational concepts from von Neumann's work on automata, Shannon's theories of information and redundancy, and Turing's behavioral approach to intelligence.

[20] Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing

Yisong Xiao,Aishan Liu,Siyuan Liang,Zonghao Ying,Xianglong Liu,Dacheng Tao

Main category: cs.CL

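TL;DR: Proposes ARGRE, a new test-time detoxification framework that models toxicity transitions in the latent representation space to enable stable, precise reward-guided editing, substantially improving both the effectiveness and the efficiency of LLM detoxification.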

Details

Motivation: Existing test-time detoxification methods intervene imprecisely because they do not sufficiently explore the transition space between toxic and non-toxic outputs, so a finer-grained detoxification strategy is needed. Method: Proposes the ARGRE framework, which identifies non-toxic semantic directions and interpolates between toxic and non-toxic representations to reveal fine-grained transition trajectories. These trajectories are used to build an autoregressive reward model that guides an adaptive two-step editing process: directional steering based on expected reward gaps, followed by lightweight gradient-based refinement. Result: Experiments on 8 widely used LLMs show that ARGRE outperforms existing methods in detoxification effectiveness (toxicity reduced by 62.21%) and cuts inference time by 47.58%, while leaving the original model's capabilities almost unaffected. Conclusion: By explicitly modeling toxicity transition paths and applying reward-guided representation editing, ARGRE offers an efficient, precise, and minimally invasive test-time detoxification approach for LLMs. Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content, necessitating detoxification strategies to ensure safe and responsible deployment. Test-time detoxification methods, which typically introduce static or dynamic interventions into LLM representations, offer a promising solution due to their flexibility and minimal invasiveness. However, current approaches often suffer from imprecise interventions, primarily due to their insufficient exploration of the transition space between toxic and non-toxic outputs. To address this challenge, we propose Autoregressive Reward Guided Representation Editing (ARGRE), a novel test-time detoxification framework that explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing. ARGRE identifies non-toxic semantic directions and interpolates between toxic and non-toxic representations to reveal fine-grained transition trajectories. These trajectories transform sparse toxicity annotations into dense training signals, enabling the construction of an autoregressive reward model that delivers stable and precise editing guidance. At inference, the reward model guides an adaptive two-step editing process to obtain detoxified representations: it first performs directional steering based on expected reward gaps to shift representations toward non-toxic regions, followed by lightweight gradient-based refinements.
Extensive experiments across 8 widely used LLMs show that ARGRE significantly outperforms leading baselines in effectiveness (-62.21% toxicity) and efficiency (-47.58% inference time), while preserving the core capabilities of the original model with minimal degradation. Our code is available at the website.

[21] Feasibility of Structuring Stress Documentation Using an Ontology-Guided Large Language Model

Hyeoneui Kim,Jeongha Kim,Huijing Xu,Jinsun Jung,Sunghoon Kang,Sun Joo Jang

Main category: cs.CL

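TL;DR: This study develops a Mental Stress Ontology (MeSO) and evaluates the feasibility of using a large language model (LLM) to extract ontology-guided stress-related information from narrative text. MeSO is built from the Transactional Model of Stress and 11 validated stress assessment tools and is used to extract six categories of stress information from 35 Reddit posts. Claude Sonnet 4 reaches 78.2% accuracy, which suggests the approach is feasible and could make stress documentation in ambient AI systems more structured and consistent.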

Details

Motivation: Stress has a major impact on health, yet electronic health records usually capture it as unstructured free text, so the information is inconsistent and easily overlooked. Ambient AI technologies reduce documentation burden but still produce unstructured text, which limits clinical use. A method is therefore needed to structure the stress information found in free text. Method: MeSO is built by combining the Transactional Model of Stress with 11 validated stress assessment tools, then refined with Ontology Pitfall Scanner! and expert review. MeSO defines six stress-related categories (stressor, stress response, coping strategy, duration, onset, and temporal profile); Claude Sonnet 4 extracts this information from 35 Reddit posts, and human reviewers assess accuracy and ontology coverage. Result: The final MeSO contains 181 concepts across eight top-level classes. Of 220 extractable stress-related items, the LLM correctly identified 172 (78.2%), misclassified 27 (12.3%), and missed 21 (9.5%). All correctly extracted items mapped accurately to MeSO, although 24 relevant concepts are not yet covered by the ontology. Conclusion: An ontology-guided LLM can extract stress-related information from narrative text in structured form and has the potential to improve the consistency and usability of stress documentation in ambient AI systems. Future work should validate the approach on clinical dialogue data and compare different LLMs. Abstract: Stress, arising from the dynamic interaction between external stressors, individual appraisals, and physiological or psychological responses, significantly impacts health yet is often underreported and inconsistently documented, typically captured as unstructured free-text in electronic health records. Ambient AI technologies offer promise in reducing documentation burden, but predominantly generate unstructured narratives, limiting downstream clinical utility. This study aimed to develop an ontology for mental stress and evaluate the feasibility of using a Large Language Model (LLM) to extract ontology-guided stress-related information from narrative text. The Mental Stress Ontology (MeSO) was developed by integrating theoretical models like the Transactional Model of Stress with concepts from 11 validated stress assessment tools. MeSO's structure and content were refined using Ontology Pitfall Scanner! and expert validation. Using MeSO, six categories of stress-related information--stressor, stress response, coping strategy, duration, onset, and temporal profile--were extracted from 35 Reddit posts using Claude Sonnet 4. Human reviewers evaluated accuracy and ontology coverage. The final ontology included 181 concepts across eight top-level classes. Of 220 extractable stress-related items, the LLM correctly identified 172 (78.2%), misclassified 27 (12.3%), and missed 21 (9.5%).
All correctly extracted items were accurately mapped to MeSO, although 24 relevant concepts were not yet represented in the ontology. This study demonstrates the feasibility of using an ontology-guided LLM for structured extraction of stress-related information, offering potential to enhance the consistency and utility of stress documentation in ambient AI systems. Future work should involve clinical dialogue data and comparison across LLMs.
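The extraction percentages quoted in this entry follow directly from the reported counts. The short check below is only an arithmetic sketch, not part of the study; the dictionary keys and the helper name `share` are labels chosen here for illustration.

```python
# Counts reported in the abstract (220 extractable items in total).
counts = {"correct": 172, "misclassified": 27, "missed": 21}
total = sum(counts.values())


def share(n, total):
    """Return n as a percentage of total, rounded to one decimal place."""
    return round(100 * n / total, 1)


# Percentage for each outcome category.
shares = {label: share(n, total) for label, n in counts.items()}
print(total, shares)
```

The three categories sum to the 220 items, so the three percentages account for every extractable item.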

[22] SeMob: Semantic Synthesis for Dynamic Urban Mobility Prediction

Runfei Chen,Shuyang Jiang,Wei Huang

Main category: cs.CL

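TL;DR: SeMob is an LLM-based semantic synthesis framework for dynamic human mobility prediction. A multi-agent system extracts spatiotemporally relevant event information from online text, and a progressive fusion architecture combines it with spatiotemporal data to improve prediction accuracy.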

Details

Motivation: Existing spatiotemporal models struggle to use text that describes external events, so human mobility prediction degrades when abrupt events occur. Method: Proposes the SeMob framework, in which an LLM-based multi-agent system automatically extracts and reasons about spatiotemporally relevant information in complex text, and a progressive fusion architecture combines the fine-grained context with spatiotemporal data. Result: On a self-constructed dataset, SeMob reduces MAE by up to 13.92% and RMSE by up to 11.12% relative to a conventional spatiotemporal model, with the largest gains in regions close to an event in space and time. Conclusion: SeMob effectively fuses textual semantics with spatiotemporal data, markedly improving mobility prediction under external events and aligning forecasts more closely with context. Abstract: Human mobility prediction is vital for urban services, but often fails to account for abrupt changes from external events. Existing spatiotemporal models struggle to leverage textual descriptions detailing these events. We propose SeMob, an LLM-powered semantic synthesis pipeline for dynamic mobility prediction. Specifically, SeMob employs a multi-agent framework where LLM-based agents automatically extract and reason about spatiotemporally related text from complex online texts. Fine-grained relevant contexts are then incorporated with spatiotemporal data through our proposed innovative progressive fusion architecture. The rich pre-trained event prior contributes enriched insights about event-driven prediction, and hence results in a more aligned forecasting model. Evaluated on a dataset constructed through our pipeline, SeMob achieves maximal reductions of 13.92% in MAE and 11.12% in RMSE compared to the spatiotemporal model. Notably, the framework exhibits pronounced superiority especially within spatiotemporal regions close to an event's location and time of occurrence.
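For readers less familiar with the two error metrics named in this entry, the sketch below gives the standard definitions of MAE and RMSE and shows how a relative reduction such as "13.92% in MAE" would be computed. The function names and the example numbers are illustrative assumptions, not values or code from the paper.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100 * (baseline - improved) / baseline


# Hypothetical flows: observed values and one model's predictions.
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 6.0]
print(mae(observed, predicted), rmse(observed, predicted))
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two reductions reported for the same model can differ.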

[23] A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering

Jiaqing Xie

Main category: cs.CL

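TL;DR: Proposes an improved sparse autoencoder (SAE) method for language model steering that focuses on the single most relevant latent (top-1) and applies a token-wise decaying steering strategy, improving mathematical reasoning.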

Details

Motivation: Existing steering methods based on top-k SAE latents often capture non-semantic features such as punctuation, and constant-strength steering tends to produce degenerate outputs such as repetition, so a more precise and dynamic steering mechanism is needed. Method: Selects a single top-1 SAE latent to remove redundant features, and applies token-wise decaying steering to avoid output degeneration. Result: The method outperforms the mean activation difference baseline on mathematical reasoning tasks, matches it on IF-Eval, and reliably elicits step-by-step reasoning. Conclusion: Focusing on the key latent and combining it with a dynamic decay strategy makes SAE-based language model steering more effective and more controllable. Abstract: Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic features such as punctuation rather than semantic attributes like instructions. To address this, we propose focusing on a single, most relevant SAE latent (top-1), eliminating redundant features. We further identify a limitation in constant SAE steering, which often produces degenerate outputs such as repetitive single words. To mitigate this, we introduce a token-wise decaying steering strategy, enabling more faithful comparisons with mean activation difference baselines. Empirically, we show that steering an SAE latent associated with reasoning reliably elicits step-by-step mathematical reasoning and enhances inference quality, functionally resembling the effect of appending a guiding token. Our results demonstrate that SAEs outperform mean activation difference methods on mathematical reasoning benchmarks and match their performance on IF-Eval.
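The abstract does not specify the decay schedule, so the following is only a sketch of what token-wise decaying steering could look like. It assumes an exponential schedule and represents hidden states as plain Python lists; the names `decayed_strengths` and `steer`, and the parameters `alpha0` and `decay`, are hypothetical and not taken from the paper.

```python
def decayed_strengths(alpha0, decay, num_tokens):
    """Steering strength per generated token: alpha0 * decay**t (assumed schedule)."""
    return [alpha0 * decay ** t for t in range(num_tokens)]


def steer(hidden_states, direction, alpha0=4.0, decay=0.5):
    """Add a steering direction to each token's hidden state with decaying strength.

    hidden_states: one vector (list of floats) per generated token.
    direction: the steering vector, e.g. a decoder direction for one SAE latent.
    """
    strengths = decayed_strengths(alpha0, decay, len(hidden_states))
    return [
        [h + a * d for h, d in zip(state, direction)]
        for state, a in zip(hidden_states, strengths)
    ]


# Two tokens, two-dimensional states, steering along the first axis.
print(steer([[0.0, 0.0], [0.0, 0.0]], [1.0, 0.0]))
```

With constant steering every token would receive the full `alpha0`; letting the strength shrink over tokens is one way to keep the initial push toward a behavior while reducing the risk of the repetitive outputs the abstract describes.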

[24] Let's Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models' Understanding of Sports

Punit Kumar Singh,Nishant Kumar,Akash Ghosh,Kunal Pasad,Khushi Soni,Manisha Jaishwal,Sriparna Saha,Syukron Abu Ishaq Alfarozi,Asres Temam Abagissa,Kitsuchart Pasupa,Haiqin Yang,Jose G Moreno

Main category: cs.CL

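TL;DR: CultSportQA is a new benchmark for evaluating how well language models understand traditional sports across 60 countries and 6 continents, with 33,000 multiple-choice questions spanning text and image modalities.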

Details

Motivation: Existing language model evaluations focus mainly on globally popular sports and overlook regional and indigenous sporting traditions, so a dedicated benchmark is needed to assess models on culturally diverse sports knowledge. Method: Builds a dataset of 33,000 multiple-choice questions covering four cultural categories and three question types (history-based, rule-based, and scenario-based), and evaluates a range of language models with zero-shot, few-shot, and chain-of-thought prompting. Result: CultSportQA provides a comprehensive multilingual, multicultural benchmark for assessing AI understanding of and reasoning about traditional sports. Conclusion: The benchmark fills a gap in cultural diversity in existing evaluations and pushes AI toward a better understanding of traditional sports worldwide. Abstract: Language Models (LMs) are primarily evaluated on globally popular sports, often overlooking regional and indigenous sporting traditions. To address this gap, we introduce CultSportQA, a benchmark designed to assess LMs' understanding of traditional sports across 60 countries and 6 continents, encompassing four distinct cultural categories. The dataset features 33,000 multiple-choice questions (MCQs) across text and image modalities, each of which is categorized into three key types: history-based, rule-based, and scenario-based. To evaluate model performance, we employ zero-shot, few-shot, and chain-of-thought (CoT) prompting across a diverse set of Large Language Models (LLMs), Small Language Models (SLMs), and Multimodal Large Language Models (MLMs). By providing a comprehensive multilingual and multicultural sports benchmark, CultSportQA establishes a new standard for assessing AI's ability to understand and reason about traditional sports.

[25] SSTAG: Structure-Aware Self-Supervised Learning Method for Text-Attributed Graphs

Ruyue Liu,Rong Yin,Xiangzhen Bo,Xiaoshuai Hao,Yong Liu,Jinwen Zhong,Can Ma,Weiping Wang

Main category: cs.CL

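TL;DR: Proposes SSTAG, a structure-aware self-supervised learning method for text-attributed graphs that combines the strengths of large language models and graph neural networks to improve cross-domain transfer and scalability.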

Details

Motivation: Graph learning is usually confined to training on a single dataset, which makes it hard to transfer knowledge across graphs and tasks and creates a dependence on large amounts of annotated data; the heterogeneity of graph data (differing feature spaces and structures) adds to the difficulty. Method: Uses text as a unified representation medium and designs the structure-aware self-supervised framework SSTAG, with a dual knowledge distillation mechanism that co-distills LLMs and GNNs into structure-aware MLPs and a memory mechanism that stores typical graph representations to improve generalization. Result: Experiments show that SSTAG outperforms state-of-the-art models on cross-domain transfer tasks while being highly scalable and cheap at inference. Conclusion: SSTAG effectively merges semantic reasoning with structural modeling, giving an efficient, scalable approach to learning on text-attributed graphs and moving graph pretraining closer to large-scale use. Abstract: Large-scale pretrained models have revolutionized Natural Language Processing (NLP) and Computer Vision (CV), showcasing remarkable cross-domain generalization abilities. However, in graph learning, models are typically trained on individual graph datasets, limiting their capacity to transfer knowledge across different graphs and tasks. This approach also heavily relies on large volumes of annotated data, which presents a significant challenge in resource-constrained settings. Unlike NLP and CV, graph-structured data presents unique challenges due to its inherent heterogeneity, including domain-specific feature spaces and structural diversity across various applications. To address these challenges, we propose a novel structure-aware self-supervised learning method for Text-Attributed Graphs (SSTAG). By leveraging text as a unified representation medium for graph learning, SSTAG bridges the gap between the semantic reasoning of Large Language Models (LLMs) and the structural modeling capabilities of Graph Neural Networks (GNNs). Our approach introduces a dual knowledge distillation framework that co-distills both LLMs and GNNs into structure-aware multilayer perceptrons (MLPs), enhancing the scalability of large-scale TAGs. Additionally, we introduce an in-memory mechanism that stores typical graph representations, aligning them with memory anchors in an in-memory repository to integrate invariant knowledge, thereby improving the model's generalization ability.
Extensive experiments demonstrate that SSTAG outperforms state-of-the-art models on cross-domain transfer learning tasks, achieves exceptional scalability, and reduces inference costs while maintaining competitive performance.

[26] LOCA: Logical Chain Augmentation for Scientific Corpus Cleaning

You-Le Fang,Dong-Shan Jian,Xiang Li,Ce Meng,Ling-Shi Meng,Chen-Xu Yan,Zhi-Zhang Bian,Yan-Qing Ma

Main category: cs.CL

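TL;DR: LOCA (Logical Chain Augmentation) is a new framework for automatically cleaning scientific corpora. By filling in missing logical steps and separating scientific principles from their derivations, it sharply lowers the error rate of scientific question-answering datasets and thereby makes scientific AI more reliable.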

Details

Motivation: Existing scientific question-answering datasets have high error rates, often caused by logical leaps and implicit reasoning in the answers, which holds back scientific AI; an efficient way to build large, high-quality scientific corpora is needed. Method: Proposes the LOCA framework, which uses an augment-and-review loop to enrich raw answers with explicit logical chains, filling in missing reasoning steps and separating the scientific principle from the subsequent derivation, so that cleaning is automated. Result: Applied to challenging scientific corpora, LOCA automatically filters noisy data, typically reducing the error rate from as high as 20% to below 2%. Conclusion: LOCA provides a scalable and effective way to build high-quality scientific corpora, supporting more reliable training and evaluation of scientific AI models. Abstract: While Large Language Models (LLMs) excel in general domains, their reliability often falls short in scientific problem-solving. The advancement of scientific AI depends on large-scale, high-quality corpora. However, existing scientific question-answering (QA) datasets suffer from high error rates, frequently resulting from logical leaps and implicit reasoning within the answers. To address this issue, we introduce LOCA (Logical Chain Augmentation), a novel framework for automatically cleaning scientific corpora, implemented through an augment-and-review loop. At its core, LOCA enhances raw answers by completing missing logical steps and explicitly separating the underlying scientific principle from its subsequent derivation. By applying LOCA to challenging scientific corpora, we demonstrate that it can automatically filter noisy datasets, typically reducing the error rate from as high as 20% to below 2%. LOCA provides a scalable and effective methodology for creating high-quality scientific corpora, paving the way for more reliable training and evaluation of scientific AI.

[27] GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages

Trung Duc Anh Dang,Ferdinando Pio D'Elia

Main category: cs.CL

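TL;DR: Presents a text detoxification system built on the 12-billion-parameter multilingual Gemma-3 model, combining LoRA fine-tuning with few-shot and chain-of-thought prompting to rewrite toxic sentences neutrally in 15 languages; the system ranks first on both high-resource and low-resource languages.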

Details

Motivation: Social media evolves faster than its regulation, so automated tools are needed to help moderators maintain safe discourse at scale. Method: Uses the multilingual Gemma-3 transformer with parameter-efficient LoRA fine-tuning, combined with few-shot and chain-of-thought prompting. The training data comprises human-authored pairs, machine-translated synthetic pairs, and model-generated pairs filtered by Jaccard thresholds; at inference, inputs are enriched with LaBSE-retrieved neighbors and explicit toxic-span annotations. Result: The system performs strongly on style transfer accuracy, semantic preservation (LaBSE), and fluency (xCOMET), ranking first on the high-resource and low-resource leaderboards. Ablations show gains of 0.081 from few-shot examples and 0.088 from chain-of-thought prompting, and ANOVA identifies language resource status as the strongest predictor of performance (η² = 0.667, p < 0.01). Conclusion: The multilingual detoxification system performs well across languages, with prompt engineering contributing a marked improvement, which supports its feasibility and effectiveness under different resource conditions. Abstract: As social-media platforms emerge and evolve faster than the regulations meant to oversee them, automated detoxification might serve as a timely tool for moderators to enforce safe discourse at scale. We here describe our submission to the PAN 2025 Multilingual Text Detoxification Challenge, which rewrites toxic single-sentence inputs into neutral paraphrases across 15 typologically diverse languages. Building on a 12B-parameter Gemma-3 multilingual transformer, we apply parameter-efficient LoRA supervised fine-tuning (SFT) and prompting techniques like few-shot and Chain-of-Thought. Our multilingual training corpus combines 3,600 human-authored parallel pairs, 21,600 machine-translated synthetic pairs, and model-generated pairs filtered by Jaccard thresholds. At inference, inputs are enriched with three LaBSE-retrieved neighbors and explicit toxic-span annotations. Evaluated via Style Transfer Accuracy, LaBSE-based semantic preservation, and xCOMET fluency, our system ranks first on high-resource and low-resource languages. Ablations show +0.081 joint score increase from few-shot examples and +0.088 from basic CoT prompting. ANOVA analysis identifies language resource status as the strongest predictor of performance (η² = 0.667, p < 0.01).
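The abstract mentions filtering model-generated pairs "by Jaccard thresholds" without giving details. The sketch below illustrates one plausible reading: word-level Jaccard similarity between the toxic source and its rewrite, keeping pairs that are neither unrelated nor near-copies. The tokenization, the two-sided band, and the threshold values `low` and `high` are assumptions made here, not the authors' settings.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| over lowercased tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty strings are treated as identical
    return len(sa & sb) / len(sa | sb)


def filter_pairs(pairs, low=0.2, high=0.9):
    """Keep (source, rewrite) pairs whose similarity lies within [low, high].

    Below `low` the rewrite has probably lost the meaning; above `high`
    it is probably a near-copy of the toxic source.
    """
    return [(s, t) for s, t in pairs if low <= jaccard(s, t) <= high]


candidates = [("a b c", "a b d"), ("x", "x"), ("a", "b")]
print(filter_pairs(candidates))
```

In this toy input only the first pair survives: the second is an exact copy and the third shares no words.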

[28] Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data

Carlo Bono,Federico Belotti,Matteo Palmonari

Main category: cs.CL

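TL;DR: Proposes a self-supervised, single-shot method that uses token-level features to estimate the uncertainty of large language models on entity linking in tabular data, cutting computational cost substantially while still identifying low-accuracy outputs.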

Details

Motivation: Large language models lack efficient, reliable uncertainty estimates for entity linking, and multi-shot inference schemes are computationally expensive, which limits practical use. Method: Takes a self-supervised approach that extracts information from the token-level features of a single inference pass to build an uncertainty estimator, avoiding the cost of repeated generation. Result: Experiments across multiple LLMs on tabular data show that the method produces effective uncertainty estimates at very low computational cost and accurately flags wrong predictions. Conclusion: The method offers an efficient, practical way to estimate LLM uncertainty in entity linking and supports deployment in real-world settings. Abstract: Linking textual values in tabular data to their corresponding entities in a Knowledge Base is a core task across a variety of data integration and enrichment applications. Although Large Language Models (LLMs) have shown State-of-The-Art performance in Entity Linking (EL) tasks, their deployment in real-world scenarios requires not only accurate predictions but also reliable uncertainty estimates, which require resource-demanding multi-shot inference, posing serious limits to their actual applicability. As a more efficient alternative, we investigate a self-supervised approach for estimating uncertainty from single-shot LLM outputs using token-level features, reducing the need for multiple generations. Evaluation is performed on an EL task on tabular data across multiple LLMs, showing that the resulting uncertainty estimates are highly effective in detecting low-accuracy outputs. This is achieved at a fraction of the computational cost, ultimately supporting a cost-effective integration of uncertainty measures into LLM-based EL workflows. The method offers a practical way to incorporate uncertainty estimation into EL workflows with limited computational overhead.
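The abstract does not list which token-level features are used. As a sketch of the general idea, the function below derives a few common summary features from the per-token log-probabilities of a single generation; a downstream model could then be trained on such features to predict whether the output is correct. The feature set and the name `token_features` are assumptions for illustration only.

```python
import math


def token_features(token_logprobs):
    """Summarize one generation's per-token log-probabilities.

    token_logprobs: natural-log probabilities of the generated tokens,
    as returned by many LLM APIs alongside the text.
    """
    n = len(token_logprobs)
    mean_lp = sum(token_logprobs) / n
    return {
        "mean_logprob": mean_lp,             # average confidence
        "min_logprob": min(token_logprobs),  # least confident token
        "perplexity": math.exp(-mean_lp),    # exp of negative mean log-prob
        "num_tokens": n,
    }


# Two tokens, each generated with probability 0.5.
features = token_features([math.log(0.5), math.log(0.5)])
print(features)
```

Because these features come from one forward generation, no repeated sampling is needed, which is the source of the cost saving the abstract emphasizes.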

[29] GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models

Mariam Mahran,Katharina Simbeck

Main category: cs.CL

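TL;DR: The study trains a GPT-style transformer on the novels of Jane Austen and analyzes its hidden states with sparse autoencoders (SAEs), revealing interpretable features tied to themes such as gender, class, and societal duty, and showing that LLMs paired with SAEs can serve as an effective tool for exploring the deeper structures and biases in data.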

Details

Motivation: As large language models are trained on massive, uncurated data, it becomes increasingly difficult to understand model representations and the data they absorb, so effective methods are needed to uncover the social structures, themes, and biases encoded inside the model. Method: Trains a GPT-style transformer on a corpus consisting only of Jane Austen's novels and applies sparse autoencoders (SAEs) at multiple layers to extract and interpret sparse features in the hidden states. Result: Identifies interpretable features linked to key narratives and concepts such as gender, class, and societal duty, showing that SAEs can decode internal representations and reflect the deeper structure of the training data. Conclusion: LLMs combined with SAEs help explain model behavior and can also act as scalable probes of latent structures, themes, and biases in complex datasets, opening a new path for corpus analysis and model interpretability. Abstract: As large language models (LLMs) are increasingly trained on massive, uncurated corpora, understanding both model representations and the data they internalize has become a major challenge. In this work, we show that pairing LLMs with sparse autoencoders (SAEs) enables interpretation not only of model behavior but also of the deeper structures, themes, and biases embedded in the training data. We train a GPT-style transformer model exclusively on the novels of Jane Austen, a corpus rich in social constructs and narrative patterns. We then apply SAEs to hidden states across multiple layers, uncovering sparse, interpretable features that reflect the key narratives and concepts present in the corpus, including gender, class, and societal duty. Our findings demonstrate that LLMs combined with SAEs can act as scalable probes into complex datasets, offering a new path for corpus exploration, bias discovery, and model interpretability at scale.

[30] Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs

Shree Harsha Bokkahalli Satish,Gustav Eje Henter,Éva Székely

Main category: cs.CL

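TL;DR: The paper studies how speech large language models (SpeechLLMs) behave on multiple-choice question answering (MCQA) bias benchmarks and whether that behaviour generalises to other task formats, especially long-form generation. The authors fine-tune three SpeechLLMs with LoRA to induce specific answer preferences and evaluate how those preferences transfer to a different MCQA benchmark and to creative generation tasks. The results show that performance on MCQA bias benchmarks does not reliably predict behaviour on other tasks, so current MCQA bias evaluation generalises poorly across tasks.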

Details

Motivation: Existing bias evaluation of SpeechLLMs relies mainly on the multiple-choice question answering (MCQA) format and assumes that model behaviour on such tasks is consistent and transferable. This assumption has not been validated, particularly for more realistic long-form generation. The paper therefore tests whether MCQA bias behaviour generalises across tasks, voices, and formats. Method: The authors fine-tune three SpeechLLMs with LoRA adapters so that each prefers stereotypical, anti-stereotypical, or neutral/uncertain answers on an MCQA task. They then evaluate how these behaviours generalise to a different MCQA benchmark and to long-form creative generation tasks. Result: Bias behaviour on one MCQA task does not transfer stably to another MCQA task, and transfers even less to long-form generation, which shows the limits of current MCQA bias metrics for cross-task generalisation. Conclusion: Current MCQA-based bias evaluation does not adequately reflect how SpeechLLMs behave in realistic use and lacks cross-task consistency and predictability. A more comprehensive evaluation framework is needed to measure behaviour transferability, and the paper proposes an evaluation suite for that purpose. Abstract: Recent work in benchmarking bias and fairness in speech large language models (SpeechLLMs) has relied heavily on multiple-choice question answering (MCQA) formats. The model is tasked to choose between stereotypical, anti-stereotypical, or neutral/irrelevant answers given an input speech prompt and an optional text prompt. Such MCQA benchmarks implicitly assume that model performance is consistent across other MCQA tasks, voices, and other task formats such as more realistic, long-form evaluations. In this paper, we probe that assumption. We fine-tune three SpeechLLMs using LoRA adapters to induce specific MCQA behaviours: preference for stereotypical, anti-stereotypical, or neutral/uncertain answers. We then evaluate whether these behaviours generalise to another, distinct MCQA benchmark, and more critically to long-form, creative generation tasks. Our results show that performance on MCQA bias benchmarks fails to reliably predict performances across other MCQA benchmarks, and more importantly across long-form tasks. We conclude that current MCQA bias benchmarks show limited evidence of cross-task generalisation in the speech domain, and also propose an evaluation suite for measuring behaviour transferability in future models and benchmarks.

Yunlang Dai,Emma Lurie,Danaé Metaxa,Sorelle A. Friedler

Main category: cs.CL

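TL;DR: The paper presents AI Watchman, a longitudinal auditing system that publicly measures and tracks refusals by large language models (LLMs) to make LLM content moderation policies more transparent. Using a dataset of over 400 social issues, the study audits GPT-4.1, GPT-5, DeepSeek (in English and Chinese), and the OpenAI moderation endpoint, finding that the system can detect unannounced changes in company policy and reveals differences in content moderation across companies and models.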

Details

Motivation: LLM content moderation policies are set by companies and are opaque; they often surface as a model's refusal to generate certain content, which in turn shapes public discourse. Long-term, public auditing of LLM refusals is needed to increase transparency and oversee these black-box mechanisms. Method: Proposes the AI Watchman system, which uses a dataset of over 400 social issues to audit several mainstream models (GPT-4.1, GPT-5, DeepSeek in English and Chinese) and OpenAI's moderation API over time, analyzing refusal patterns and qualitatively categorizing refusal types. Result: AI Watchman detects changes in company content moderation policies, including unannounced ones; it finds clear differences in refusal behavior across companies and models; and it systematically categorizes the forms refusals take. Conclusion: Longitudinal auditing is valuable for understanding LLM content moderation behavior, and AI Watchman provides a workable and effective system for doing so. Abstract: Large language models' (LLMs') outputs are shaped by opaque and frequently-changing company content moderation policies and practices. LLM moderation often takes the form of refusal; models' refusal to produce text about certain topics both reflects company policy and subtly shapes public discourse. We introduce AI Watchman, a longitudinal auditing system to publicly measure and track LLM refusals over time, to provide transparency into an important and black-box aspect of LLMs. Using a dataset of over 400 social issues, we audit OpenAI's moderation endpoint, GPT-4.1, GPT-5, and DeepSeek (both in English and Chinese). We find evidence that changes in company policies, even those not publicly announced, can be detected by AI Watchman, and identify company- and model-specific differences in content moderation. We also qualitatively analyze and categorize different forms of refusal. This work contributes evidence for the value of longitudinal auditing of LLMs, and AI Watchman, one system for doing so.

[32] RJE: A Retrieval-Judgment-Exploration Framework for Efficient Knowledge Graph Question Answering with LLMs

Can Lin,Zhengwang Jiang,Ling Zheng,Qi Zhao,Yuhang Zhang,Qi Song,Wangqiu Zhou

Main category: cs.CL

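TL;DR: Proposes Retrieval-Judgment-Exploration (RJE), a framework that improves reasoning in knowledge graph question answering through retrieval, judgment, and exploration, and lets small open-source LLMs deliver efficient, low-cost question answering.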

Details

Motivation: Existing LLM-based knowledge graph question answering methods are limited by retrieval quality or depend on proprietary LLMs, are inefficient, and are hard to apply to small models. Method: Designs the RJE framework, with modules for reasoning path ranking, question decomposition, and retriever-assisted exploration, to retrieve reasoning paths, judge their sufficiency, and conditionally expand them, and adapts it to small language models. Result: Experiments show that RJE outperforms existing methods with proprietary models such as GPT-4o-mini, lets small open-source models (3B and 8B) reach competitive results without fine-tuning, and substantially reduces the number of LLM calls and token usage. Conclusion: RJE offers an efficient, scalable approach to knowledge graph question answering that lowers dependence on large proprietary models and advances the use of small models for complex reasoning tasks. Abstract: Knowledge graph question answering (KGQA) aims to answer natural language questions using knowledge graphs. Recent research leverages large language models (LLMs) to enhance KGQA reasoning, but faces limitations: retrieval-based methods are constrained by the quality of retrieved information, while agent-based methods rely heavily on proprietary LLMs. To address these limitations, we propose Retrieval-Judgment-Exploration (RJE), a framework that retrieves refined reasoning paths, evaluates their sufficiency, and conditionally explores additional evidence. Moreover, RJE introduces specialized auxiliary modules enabling small-sized LLMs to perform effectively: Reasoning Path Ranking, Question Decomposition, and Retriever-assisted Exploration. Experiments show that our approach with proprietary LLMs (such as GPT-4o-mini) outperforms existing baselines while enabling small open-source LLMs (such as 3B and 8B parameters) to achieve competitive results without fine-tuning LLMs. Additionally, RJE substantially reduces the number of LLM calls and token usage compared to agent-based methods, yielding significant efficiency improvements.
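The retrieve, judge, then conditionally explore flow described in this entry can be pictured as a small control loop. The sketch below is a schematic under stated assumptions, not the authors' implementation: the four callables (`retrieve`, `judge`, `explore`, `answer`), the `max_rounds` cap, and the toy knowledge-graph paths are all hypothetical stand-ins for the retriever and LLM calls.

```python
def rje_answer(question, retrieve, judge, explore, answer, max_rounds=3):
    """Retrieve reasoning paths, judge sufficiency, and explore only if needed."""
    paths = retrieve(question)
    for _ in range(max_rounds):
        if judge(question, paths):       # evidence is judged sufficient: stop early
            break
        new_paths = explore(question, paths)
        if not new_paths:                # nothing further to add
            break
        paths = paths + new_paths
    return answer(question, paths)


# Toy stand-ins: a two-hop question needs one extra path before answering.
result = rje_answer(
    "Which country is A's birthplace the capital of?",
    retrieve=lambda q: ["A -born_in-> B"],
    judge=lambda q, paths: len(paths) >= 2,
    explore=lambda q, paths: ["B -capital_of-> C"] if len(paths) < 2 else [],
    answer=lambda q, paths: paths[-1].split("-> ")[-1],
)
print(result)
```

Because exploration runs only when the judgment step finds the retrieved evidence insufficient, easy questions finish after a single retrieval, which is consistent with the reduction in LLM calls the abstract reports relative to agent-based methods.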

[33] Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse

Nathan Junzi Chen

Main category: cs.CL

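TL;DR: The study uses zero-shot classification to evaluate political bias in six mainstream large language models, finds a liberal-authoritarian alignment in all of them, and discusses the implications for public discourse and the political landscape.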

Details

Motivation: Generative AI is increasingly present in political discourse, but the intrinsic political bias that results from skewed training data, human prejudice, and algorithmic flaws still needs systematic evaluation. Method: Uses a zero-shot classification approach over four dimensions (ideological alignment, topicality, response sentiment, and objectivity), applying four fine-tuned classification algorithms to 1800 responses from six large language models. Result: All six LLMs show a pronounced liberal-authoritarian ideological alignment, with instances of reasoning supersessions and canned refusals; bias may influence public discourse through human-computer interaction, leading to social conformity or polarization. Conclusion: Systematic political bias in LLMs may distort the political landscape, and its transmission mechanisms and psychological effects under different social structures deserve attention. Abstract: Amidst the rapid normalization of generative artificial intelligence (GAI), intelligent systems have come to dominate political discourse across information mediums. However, internalized political biases stemming from training data skews, human prejudice, and algorithmic flaws continue to plague the novel technology. This paper employs a zero-shot classification approach to evaluate algorithmic political partisanship through a methodical combination of ideological alignment, topicality, response sentiment, and objectivity. A total of 1800 model responses across six mainstream large language models (LLMs) were individually input into four distinct fine-tuned classification algorithms, each responsible for computing an aforementioned bias evaluation metric. Results show an amplified liberal-authoritarian alignment across all six LLMs evaluated, with notable instances of reasoning supersessions and canned refusals. The study subsequently highlights the psychological influences underpinning human-computer interactions and how intrinsic biases can permeate public discourse. The resulting distortion of the political landscape can ultimately manifest as conformity or polarization, depending on a region's pre-existing socio-political structures.

[34] In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b

Nils Durner

Main category: cs.CL

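TL;DR: The study examines how language choice, sociopragmatic framing, and instruction hierarchy affect the refusal behavior of gpt-oss-20b, finding that certain composite prompts sharply raise assistance rates and revealing leakage risks and evaluation inconsistencies across languages and personas.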

Details

Motivation: To understand how large language models' refusal behavior depends on prompt framing, identify potential safety vulnerabilities, and make model outputs more controllable and auditable. Method: Runs 80 seeded iterations per scenario across several harm domains to test how language, persona, and prompt structure affect model behavior, and introduces an AI-assisted hardening method to reduce leakage. Result: Composite prompts raise the assistance rate on a ZIP-bomb task from 0% to 97.5%; formal registers in German and French leak more often than matched English prompts; a "Linux terminal" role-play overrides a developer rule in a majority of runs; AI-assisted hardening reduces leakage to 0% in several variants; assistance is inconsistent in 13% of evaluation pairs; the Moderation API misses many materially helpful outputs; and refusal rates differ by 5 to 10 percentage points across inference stacks. Conclusion: Model behavior depends heavily on prompt framing and language, current safety mechanisms have exploitable gaps, and evaluation and moderation tools remain incomplete, so reproducible auditing and better defensive design are needed. Abstract: We probe OpenAI's open-weights 20-billion-parameter model gpt-oss-20b to study how sociopragmatic framing, language choice, and instruction hierarchy affect refusal behavior. Across 80 seeded iterations per scenario, we test several harm domains including ZIP-bomb construction (cyber threat), synthetic card-number generation, minor-unsafe driving advice, drug-precursor indicators, and RAG context exfiltration. Composite prompts that combine an educator persona, a safety-pretext ("what to avoid"), and step-cue phrasing flip assistance rates from 0% to 97.5% on a ZIP-bomb task. On our grid, formal registers in German and French are often leakier than matched English prompts. A "Linux terminal" role-play overrides a developer rule not to reveal context in a majority of runs with a naive developer prompt, and we introduce an AI-assisted hardening method that reduces leakage to 0% in several user-prompt variants. We further test evaluation awareness with a paired-track design and measure frame-conditioned differences between matched "helpfulness" and "harmfulness" evaluation prompts; we observe inconsistent assistance in 13% of pairs. Finally, we find that the OpenAI Moderation API under-captures materially helpful outputs relative to a semantic grader, and that refusal rates differ by 5 to 10 percentage points across inference stacks, raising reproducibility concerns. We release prompts, seeds, outputs, and code for reproducible auditing at https://github.com/ndurner/gpt-oss-rt-run .

[35] OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language

Isa Inuwa-Dutse

Main category: cs.CL

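TL;DR: The study red-teams the GPT-OSS-20b model for safety and reliability in a low-resource language (Hausa), revealing bias, cultural insensitivity, and factual errors; in particular, polite prompts bypass the safety mechanisms, which can lead to harmful output.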

Details

Motivation: To question how reliable large language models are for users from underrepresented communities, especially regarding safety alignment in low-resource languages. Method: Taking Hausa as the case, conducts red-teaming with minimal prompting and uses survey data (n=61) to confirm how dangerous the model's outputs are. Result: The model produces harmful, culturally insensitive, and factually wrong content, for example treating a highly toxic insecticide and a rodenticide as edible; it cannot distinguish raw from processed foods; it uses demeaning proverbs to build inaccurate arguments; and its safety mechanisms relax under polite or grateful prompts, a form of linguistic reward hacking. Conclusion: These problems stem from insufficient safety tuning in low-resource language settings and expose a blind spot in current red-teaming; safety evaluation and tuning for minority languages need to be strengthened. Abstract: In response to the recent safety probing for OpenAI's GPT-OSS-20b model, we present a summary of a set of vulnerabilities uncovered in the model, focusing on its performance and safety alignment in a low-resource language setting. The core motivation for our work is to question the model's reliability for users from underrepresented communities. Using Hausa, a major African language, we uncover biases, inaccuracies, and cultural insensitivities in the model's behaviour. With minimal prompting, our red-teaming efforts reveal that the model can be induced to generate harmful, culturally insensitive, and factually inaccurate content in the language. As a form of reward hacking, we note how the model's safety protocols appear to relax when prompted with polite or grateful language, leading to outputs that could facilitate misinformation and amplify hate speech. For instance, the model operates on the false assumption that a common insecticide locally known as Fiya-Fiya (Cypermethrin) and a rodenticide like Shinkafar Bera (a form of Aluminium Phosphide) are safe for human consumption. To contextualise the severity of this error and the popularity of the substances, we conducted a survey (n=61) in which 98% of participants identified them as toxic. Additional failures include an inability to distinguish between raw and processed foods and the incorporation of demeaning cultural proverbs to build inaccurate arguments. We surmise that these issues manifest through a form of linguistic reward hacking, where the model prioritises fluent, plausible-sounding output in the target language over safety and truthfulness.
We attribute the uncovered flaws primarily to insufficient safety tuning in low-resource linguistic contexts. By concentrating on a low-resource setting, our approach highlights a significant gap in current red-teaming efforts and offers some recommendations.

[36] AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees

Hongyi Zhou,Jin Zhu,Pingfan Su,Kai Ye,Ying Yang,Shakeel A O B Gavioli-Akilagun,Chengchun Shi

Main category: cs.CL

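TL;DR: Proposes AdaDetectGPT, a new method that adaptively learns a witness function to strengthen logits-based text detectors, outperforming existing techniques across many combinations of datasets and large language models, with improvements of up to 58%.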

Details

Motivation: Existing logits-based detection methods rely only on log probabilities, which can be sub-optimal, so a more effective detection mechanism is needed. Method: Introduces AdaDetectGPT, which adaptively learns a witness function from training data to enhance logits-based detectors and comes with statistical guarantees. Result: Across many combinations of datasets and LLMs, AdaDetectGPT improves on existing methods almost uniformly, by up to 58%. Conclusion: AdaDetectGPT improves the detection of AI-generated text and offers both sound theoretical guarantees and practical value. Abstract: We study the problem of determining whether a piece of text has been authored by a human or by a large language model (LLM). Existing state-of-the-art logits-based detectors make use of statistics derived from the log-probability of the observed text evaluated using the distribution function of a given source LLM. However, relying solely on log probabilities can be sub-optimal. In response, we introduce AdaDetectGPT -- a novel classifier that adaptively learns a witness function from training data to enhance the performance of logits-based detectors. We provide statistical guarantees on its true positive rate, false positive rate, true negative rate and false negative rate. Extensive numerical studies show AdaDetectGPT nearly uniformly improves the state-of-the-art method in various combinations of datasets and LLMs, and the improvement can reach up to 58%. A Python implementation of our method is available at https://github.com/Mamba413/AdaDetectGPT.
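To make the baseline concrete, the sketch below shows the kind of log-probability statistic that logits-based detectors start from, together with the true and false positive rates on which the abstract says guarantees are given. It is not AdaDetectGPT: the learned witness function is not reproduced here, and the fixed `threshold`, the function names, and the toy inputs are assumptions for illustration.

```python
def mean_logprob(token_logprobs):
    """Average log-probability of the observed tokens under a source LLM."""
    return sum(token_logprobs) / len(token_logprobs)


def flag_as_machine(token_logprobs, threshold=-2.0):
    """Flag text as LLM-generated when the source model finds it highly likely."""
    return mean_logprob(token_logprobs) > threshold


def rates(predictions, labels):
    """Return (true positive rate, false positive rate); True means LLM-generated."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    positives = sum(1 for l in labels if l)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


print(flag_as_machine([-1.0, -1.5]), flag_as_machine([-3.0, -4.0]))
print(rates([True, False, True, False], [True, True, False, False]))
```

A fixed threshold on the raw average is the simplest possible decision rule; per the abstract, AdaDetectGPT instead learns a witness function from training data to improve on statistics of this kind.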

[37] Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection

Hoang Phan,Victor Li,Qi Lei

Main category: cs.CL

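TL;DR: Proposes Progressive Self-Reflection (PSR), an inference-time technique that strengthens a large language model's ability to monitor and correct itself during generation, sharply reducing the risk of harmful output without additional training.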

Details

Motivation: Large language models perform well on natural language processing but may generate harmful or inappropriate content, so their safety needs to improve. Method: Proposes Progressive Self-Reflection (PSR), which has the model run multiple rounds of self-reflection at inference time to detect and correct its output, and adds a lightweight predictor that adaptively chooses the number of reflection rounds to balance safety against compute cost. Result: Applying PSR to several models sharply lowers the attack success rate (for example, from 77.5% to 5.9% on Llama-3.1-8B-Instruct) while preserving performance on benign tasks. Conclusion: PSR is a scalable test-time safety method that allocates compute according to input risk and effectively improves LLM safety. Abstract: Large language models (LLMs) have revolutionized natural language processing with their ability to generate coherent and contextually relevant text. However, their deployment raises significant concerns about the potential for generating harmful or inappropriate content. In this paper, we introduce Progressive Self-Reflection (PSR), a novel inference-time technique that empowers LLMs to self-monitor and correct their outputs dynamically. Experimental results demonstrate that applying our proposed method to Llama-3.1-8B-Instruct reduces the attack success rate from 77.5% to 5.9%, to Llama-3.1-8B base from 89.7% to 5.6%, and to Qwen2.5-7B-Instruct from 44.4% to 3.8%, without additional training, while maintaining their original performance on benign tasks. Our approach acts as a test-time scaling method, where additional self-reflection rounds enhance safety at the cost of inference overhead. To balance safety with computational efficiency, we introduce a lightweight self-reflection predictor that estimates the optimal number of reflection rounds based on input complexity. This adaptive mechanism prevents unnecessary self-assessment on benign inputs while ensuring thorough evaluation when encountering potentially harmful content. Our findings suggest that Progressive Self-Reflection serves as a scalable test-time approach, enhancing LLM safety by dynamically allocating computational resources in proportion to the input's risk profile.
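
The adaptive reflection budget described in the PSR abstract can be illustrated with a small sketch. This is not the authors' implementation: `predict_rounds`, `reflective_generate`, the keyword list, and the chunk-wise safety check are hypothetical stand-ins, and the abstract does not say how reflection is interleaved with decoding. The sketch only shows the idea of spending more self-checks on risky-looking inputs and fewer on benign ones.

```python
from typing import Callable, Iterable, Tuple

RISKY_WORDS = ("bypass", "weapon", "exploit")  # hypothetical risk cues


def predict_rounds(prompt: str, max_rounds: int = 4) -> int:
    """Toy stand-in for the lightweight predictor: riskier-looking prompts get more rounds."""
    hits = sum(word in prompt.lower() for word in RISKY_WORDS)
    return min(max_rounds, 1 + hits)


def reflective_generate(
    prompt: str,
    draft_chunks: Iterable[str],
    is_safe: Callable[[str], bool],
    refusal: str = "I can't help with that.",
) -> Tuple[str, int]:
    """Emit chunks one by one, reflecting on the partial output while budget remains."""
    budget = predict_rounds(prompt)
    kept, used = [], 0
    for chunk in draft_chunks:
        kept.append(chunk)
        if used < budget:
            used += 1  # one self-reflection round spent
            if not is_safe(" ".join(kept)):
                return refusal, used  # discard the unsafe draft
    return " ".join(kept), used


def no_explosives(text: str) -> bool:
    """Hypothetical safety judge; a real system would query the model itself."""
    return "explosive" not in text


benign = reflective_generate("How do I bake bread?", ["Mix flour", "and water."], no_explosives)
risky = reflective_generate(
    "How to bypass a lock and build a weapon",
    ["Step one:", "get explosive", "material"],
    no_explosives,
)
```

In this toy run the benign prompt spends a single reflection round, while the risky prompt is given a larger budget and is stopped at the second check.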

[38] TraceDet: Hallucination Detection from the Decoding Trace of Diffusion Large Language Models

Shenxu Chang,Junchi Yu,Weixing Wang,Yongqiang Chen,Jialin Yu,Philip Torr,Jindong Gu

Main category: cs.CL

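TL;DR: Proposes TraceDet, a framework that detects hallucinations using the intermediate steps of the multi-step denoising process in D-LLMs, substantially improving detection performance.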

Details

Motivation: Hallucination detection methods built for auto-regressive LLMs do not suit diffusion large language models (D-LLMs): they rely on single-step generation signals, whereas hallucination signals in D-LLMs emerge across the multi-step denoising process, so a method adapted to D-LLMs is needed. Method: Models the denoising process of a D-LLM as an action trace, where each step is a prediction of the response conditioned on the previous intermediate output; identifies the sub-trace that is most informative about hallucinated responses and extracts the key hallucination signals from it. Result: Experiments on several open-source D-LLMs show that TraceDet raises hallucination-detection AUROC by 15.2% on average, clearly outperforming baselines. Conclusion: TraceDet makes effective use of the multi-step denoising behaviour of D-LLMs for hallucination detection and offers a new direction for making diffusion language models more reliable. Abstract: Diffusion large language models (D-LLMs) have recently emerged as a promising alternative to auto-regressive LLMs (AR-LLMs). However, the hallucination problem in D-LLMs remains underexplored, limiting their reliability in real-world applications. Existing hallucination detection methods are designed for AR-LLMs and rely on signals from single-step generation, making them ill-suited for D-LLMs where hallucination signals often emerge throughout the multi-step denoising process. To bridge this gap, we propose TraceDet, a novel framework that explicitly leverages the intermediate denoising steps of D-LLMs for hallucination detection. TraceDet models the denoising process as an action trace, with each action defined as the model's prediction over the cleaned response, conditioned on the previous intermediate output. By identifying the sub-trace that is maximally informative to the hallucinated responses, TraceDet leverages the key hallucination signals in the multi-step denoising process of D-LLMs for hallucination detection. Extensive experiments on various open-source D-LLMs demonstrate that TraceDet consistently improves hallucination detection, achieving an average gain in AUROC of 15.2% compared to baselines.

[39] LLM Based Sentiment Classification From Bangladesh E-Commerce Reviews

Sumaiya Tabassum

Main category: cs.CL

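TL;DR: Studies transformer-based BERT models and large language models such as Llama for sentiment analysis of Bangladesh e-commerce reviews, fine-tuning on 4,000 Bangla and English reviews; Llama-3.1-8B performs best, and LoRA and PEFT are used to reduce compute overhead.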

Details

Motivation: Linguistic complexity and multilingual settings limit the accuracy and resource efficiency of traditional sentiment analysis, so more efficient large language models and fine-tuning techniques are needed to improve multilingual sentiment analysis. Method: Fine-tunes transformer-based BERT models and several large language models (including Llama, Phi, and Mistral) on a dataset of 4,000 Bangla and English e-commerce reviews, applying parameter-efficient fine-tuning (LoRA and PEFT) to reduce compute consumption. Result: The fine-tuned Llama-3.1-8B model performs best, with 95.5% accuracy, 93% precision, 88% recall, and a 90% F1 score, ahead of the other models compared. Conclusion: Large language models combined with parameter-efficient fine-tuning deliver strong performance on multilingual sentiment analysis and are practical to deploy, especially in resource-constrained settings. Abstract: Sentiment analysis is an essential part of text analysis, which is a larger field that includes determining and evaluating the author's emotional state. This method is essential since it makes it easier to comprehend consumers' feelings, viewpoints, and preferences holistically. The introduction of large language models (LLMs), such as Llama, has greatly increased the availability of cutting-edge model applications, such as sentiment analysis. However, accurate sentiment analysis is hampered by the intricacy of written language and the diversity of languages used in evaluations. The viability of using transformer-based BERT models and other LLMs for sentiment analysis from Bangladesh e-commerce reviews is investigated in this paper. A subset of 4000 samples from the original dataset of Bangla and English customer reviews was utilized to fine-tune the model. The fine-tuned Llama-3.1-8B model outperformed other fine-tuned models, including Phi-3.5-mini-instruct, Mistral-7B-v0.1, DistilBERT-multilingual, mBERT, and XLM-R-base, with an overall accuracy, precision, recall, and F1 score of 95.5%, 93%, 88%, and 90%, respectively. The study emphasizes how parameter-efficient fine-tuning methods (LoRA and PEFT) can lower computational overhead and make it appropriate for contexts with limited resources. The results show how LLMs can

[40] TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture

Yongchao Chen,Jiefeng Chen,Rui Meng,Ji Yin,Na Li,Chuchu Fan,Chi Wang,Tomas Pfister,Jinsung Yoon

Main category: cs.CL

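TL;DR: Proposes TUMIX, an ensemble framework that strengthens LLM reasoning by running multiple agents in parallel, each with a different tool-use strategy; it clearly outperforms existing methods on key reasoning benchmarks.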

Details

Motivation: Although tools such as code interpreters and search have been integrated into LLMs, practical guidance is still lacking on how to combine textual reasoning, coding, and search effectively for diverse questions. Method: Proposes the TUMIX framework, in which multiple agents run in parallel with different tool-use strategies and iteratively share and refine their responses based on the question and previous answers. Result: On Gemini-2.5-Pro and Gemini-2.5-Flash, TUMIX improves average accuracy by up to 3.55% over the best baseline at near-equal inference cost; with confidence-based halting it preserves performance at 49% of the inference cost. Conclusion: Agent diversity and quality are crucial to performance and can be improved further by using LLMs to auto-optimize agent designs; TUMIX strikes a good trade-off between performance and cost. Abstract: While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.
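
The share-and-refine loop with early halting in the TUMIX abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's system: the three toy agents, the `run_mixture` helper, and the use of answer agreement as the confidence signal (`stop_agreement`) are all hypothetical; the abstract does not specify how confidence is measured.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Agent = Callable[[str, List[str]], str]  # (question, previous answers) -> answer


def run_mixture(
    question: str,
    agents: Sequence[Agent],
    max_rounds: int = 3,
    stop_agreement: float = 0.75,
) -> Tuple[str, int]:
    """Run all agents each round, share answers, and halt once agreement is high enough."""
    previous: List[str] = []
    top = ""
    for round_idx in range(1, max_rounds + 1):
        answers = [agent(question, previous) for agent in agents]
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= stop_agreement:  # assumed confidence proxy
            return top, round_idx
        previous = answers  # next round sees this round's answers
    return top, max_rounds


def text_agent(question: str, previous: List[str]) -> str:
    return "42"  # stands in for pure textual reasoning


def code_agent(question: str, previous: List[str]) -> str:
    return str(6 * 7)  # stands in for running code


def search_agent(question: str, previous: List[str]) -> str:
    if not previous:
        return "41"  # a wrong first guess
    return Counter(previous).most_common(1)[0][0]  # refine using shared answers


answer, rounds = run_mixture("What is 6 * 7?", [text_agent, code_agent, search_agent])
```

Here the first round ends in a 2-of-3 split, so refinement continues; in the second round the third agent adopts the shared majority answer and the loop halts early.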

[41] Evaluation Sheet for Deep Research: A Use Case for Academic Survey Writing

Israel Abebe Azime,Tadesse Destaw Belay,Atnafu Lambebo Tonja

Main category: cs.CL

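TL;DR: Proposes an evaluation sheet for assessing Deep Research tools and, using academic survey writing as the use case, evaluates how well OpenAI's and Google's Deep Search generate academic surveys, revealing shortcomings in how existing tools cover the target area.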

Details

Motivation: To systematically evaluate how large language models with agentic capabilities perform on knowledge-intensive tasks, in particular how well Deep Research tools can generate academic surveys automatically. Method: Designs an evaluation sheet as the assessment standard, selects academic survey writing as the concrete task, and evaluates the reports generated by OpenAI's and Google's Deep Search. Result: The evaluation shows that current Deep Research tools fall clearly short in covering the target research area comprehensively, and that a substantial gap remains between them and traditional search engines. Conclusion: Carefully designed evaluation standards are needed to measure Deep Research tools, and current technology is not yet fully up to the task of generating complex academic surveys. Abstract: Large Language Models (LLMs) equipped with agentic capabilities are able to do knowledge-intensive tasks without human involvement. A prime example of such a tool is Deep Research, which can browse the web, extract information, and generate multi-page reports. In this work, we introduce an evaluation sheet that can be used for assessing the capability of Deep Research tools. In addition, we selected academic survey writing as a use case task and evaluated output reports based on the evaluation sheet we introduced. Our findings show the need to have carefully crafted evaluation standards. The evaluation of OpenAI's Deep Search and Google's Deep Search on generating an academic survey showed a huge gap between search engines and standalone Deep Research tools, as well as shortcomings in representing the targeted area.

[42] HiSpec: Hierarchical Speculative Decoding for LLMs

Avinash Kumar,Sujay Sanghavi,Poulami Das

Main category: cs.CL

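TL;DR: Proposes HiSpec, a hierarchical speculative decoding framework that uses early-exit (EE) models for low-overhead intermediate verification, reuses caches and hidden states to improve resource efficiency, and substantially raises throughput without sacrificing accuracy.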

Details

Motivation: Existing intermediate-verification methods for speculative decoding suffer from high training overhead, a large memory footprint, and accuracy loss from approximate heuristics, so an efficient and accurate intermediate-verification mechanism is needed. Method: Proposes the HiSpec framework, which uses specially trained early-exit (EE) models for low-overhead intermediate verification, reuses key-value caches and hidden states across the draft model, intermediate verifier, and target model, and periodically validates accepted draft tokens against the target model to preserve accuracy. Result: Across multiple benchmarks and models, HiSpec improves throughput by 1.28× on average and by up to 2.01× over baseline single-layer speculative decoding, without sacrificing generation accuracy. Conclusion: By introducing EE models suited to intermediate verification and optimizing resource reuse, HiSpec lowers verification overhead while maintaining high accuracy, improving the inference efficiency of speculative decoding for large language models. Abstract: Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is 4× slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. "Intermediate" verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose Hierarchical Speculative Decoding (HiSpec), a framework for high-throughput speculative decoding that exploits early-exit (EE) models for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by 1.28× on average and by up to 2.01× compared to the baseline single-layer speculation without compromising accuracy.

[43] TAG-EQA: Text-And-Graph for Event Question Answering via Structured Prompting Strategies

Maithili Kadam,Francis Ferraro

Main category: cs.CL

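TL;DR: TAG-EQA is a prompting framework that injects causal event graphs into LLM inputs; across nine prompting configurations it clearly improves event question answering, especially in zero-shot and graph-augmented chain-of-thought settings.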

Details

Motivation: Large language models struggle with event-based questions that require causal or temporal reasoning, so a method is needed that strengthens their event reasoning without fine-tuning. Method: Proposes the TAG-EQA framework, which converts structured relations into natural-language statements and systematically analyses three strategies (zero-shot, few-shot, chain-of-thought) combined with three input modalities (text-only, graph-only, text+graph). Result: On the TORQUESTRA benchmark, TAG-EQA improves accuracy by 5% on average over text-only baselines, by up to 12% in zero-shot settings, and by up to 18% with graph-augmented chain-of-thought prompting. Conclusion: Causal graphs can strengthen LLM event reasoning without fine-tuning, and encoding structured knowledge in the prompt is a flexible and effective approach to question answering. Abstract: Large language models (LLMs) excel at general language tasks but often struggle with event-based questions, especially those requiring causal or temporal reasoning. We introduce TAG-EQA (Text-And-Graph for Event Question Answering), a prompting framework that injects causal event graphs into LLM inputs by converting structured relations into natural-language statements. TAG-EQA spans nine prompting configurations, combining three strategies (zero-shot, few-shot, chain-of-thought) with three input modalities (text-only, graph-only, text+graph), enabling a systematic analysis of when and how structured knowledge aids inference. On the TORQUESTRA benchmark, TAG-EQA improves accuracy by 5% on average over text-only baselines, with gains up to 12% in zero-shot settings and 18% when graph-augmented CoT prompting is effective. While performance varies by model and configuration, our findings show that causal graphs can enhance event reasoning in LLMs without fine-tuning, offering a flexible way to encode structure in prompt-based QA.
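
The idea of converting structured relations into natural-language statements and switching between the three input modalities can be sketched as below. The relation names, sentence templates, and prompt layout are illustrative assumptions; the abstract does not give the actual templates used by TAG-EQA.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[str, str, str]  # (head event, relation, tail event)

# Hypothetical templates: one sentence pattern per relation type.
TEMPLATES = {
    "causes": "{head} causes {tail}.",
    "before": "{head} happens before {tail}.",
}


def verbalize(edges: Sequence[Edge]) -> List[str]:
    """Turn each graph edge into a plain-language statement."""
    return [TEMPLATES[rel].format(head=head, tail=tail) for head, rel, tail in edges]


def build_prompt(passage: str, question: str, edges: Sequence[Edge],
                 modality: str = "text+graph") -> str:
    """Assemble a prompt for one of: text-only, graph-only, text+graph."""
    parts = []
    if modality in ("text-only", "text+graph"):
        parts.append("Passage: " + passage)
    if modality in ("graph-only", "text+graph"):
        parts.append("Event relations: " + " ".join(verbalize(edges)))
    parts.append("Question: " + question)
    return "\n".join(parts)


edges = [
    ("The storm", "causes", "the power outage"),
    ("The power outage", "before", "the evacuation"),
]
statements = verbalize(edges)
graph_only = build_prompt("A storm hit the town.", "What led to the evacuation?",
                          edges, modality="graph-only")
```

A zero-shot, few-shot, or chain-of-thought wrapper would then be applied around the returned prompt string, giving the nine configurations the abstract mentions.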

[44] A-VERT: Agnostic Verification with Embedding Ranking Targets

Nicolás Aguirre,Ramiro Caso,Ramiro Rodríguez Colmeiro,Mauro Santelli,Joaquín Toranzo Calderón

Main category: cs.CL

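TL;DR: Proposes a structure-free evaluation method based on semantic embedding distances that classifies language model responses automatically, cheaply, and efficiently, with performance close to that of human annotators.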

Details

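Motivation: Existing methods for evaluating language model responses are either too expensive (such as LLM-as-a-Judge) or far from real-world conditions (such as string matching and logprob), so a more efficient automatic evaluation method that fits realistic settings is needed. Method: Uses small embedding models (under 10 billion parameters) to compute semantic embedding distances and match target candidates against arbitrary generated text, yielding a robust classification of LM responses. Result: Tested on 3 datasets and 3 different LM architectures, the method achieves a regression score of about 0.97 and an accuracy of about 96% against human annotators. Conclusion: This structure-agnostic semantic matching method substantially lowers compute cost while agreeing closely with human judgments, making it suitable for LM quality assessment in production. Abstract: The automatic evaluation of Language Model (LM) responses is a critical piece in the development of benchmarks and metrics, both for model training and quality assessment of production model endpoints. The current approaches to response classification rely on methods that are too expensive (e.g. LLM-as-a-Judge) or that are far from real-world conditions (string-matching, logprob). In this paper, a structure-free evaluation method is presented. The method makes use of semantic embedding distances to match target candidates with arbitrary LM-generated text, resulting in a robust classification of the response at a relatively low compute cost (embedding models of less than 10B parameters). The results show a regression score of ~0.97 and an accuracy of ~96% against human annotators, tested over 3 data sets and 3 different LM architectures.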
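
Matching a free-form response against target candidates by embedding similarity can be sketched as follows. To keep the example dependency-free, a bag-of-words count vector stands in for the embedding model, which is a deliberate simplification: the method in the abstract uses learned semantic embeddings. The function names `embed`, `cosine`, and `rank_targets` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real setup would call an embedding model here."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_targets(response: str, targets: Sequence[str]) -> List[Tuple[float, str]]:
    """Score every candidate target against the response, highest similarity first."""
    response_vec = embed(response)
    scored = [(cosine(response_vec, embed(target)), target) for target in targets]
    return sorted(scored, reverse=True)


targets = [
    "the capital of france is paris",
    "the capital of france is lyon",
]
response = "I believe the answer is Paris the capital city of France"
ranking = rank_targets(response, targets)
best_target = ranking[0][1]
```

The top-ranked candidate is taken as the class of the response, so the response never needs to follow a fixed answer format.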

[45] One More Question is Enough, Expert Question Decomposition (EQD) Model for Domain Quantitative Reasoning

Mengyu Wang,Sotirios Sabanis,Miguel de Carvalho,Shay B. Cohen,Tiejun Ma

Main category: cs.CL

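TL;DR: Proposes Expert Question Decomposition (EQD), which optimizes domain-specific complex question answering through a two-step fine-tuning framework and a reward function; it trains efficiently with little data on a single GPU and clearly outperforms existing methods on quantitative reasoning in the financial domain.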

Details

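Motivation: To address the weak quantitative reasoning of large language models in domains that require expert knowledge and complex reasoning, such as finance, and to improve domain-specific question answering. Method: Proposes Expert Question Decomposition (EQD), which uses a two-step fine-tuning framework and a reward function that scores how much generated sub-questions contribute to QA outcomes; it needs only a few thousand training examples and a single A100 GPU, with inference time comparable to zero-shot prompting. Result: On four financial benchmark datasets, EQD improves QA performance by 0.6% to 10.5% across different LLMs, outperforming state-of-the-art domain-tuned models and advanced prompting strategies; the analysis finds that a single supporting question is more effective than detailed step-by-step guidance. Conclusion: EQD delivers efficient, high-performing domain-specific QA under low-resource conditions, reveals the key role of concise sub-questions in complex reasoning, and offers a practical, scalable solution for expert-domain reasoning. Abstract: Domain-specific quantitative reasoning remains a major challenge for large language models (LLMs), especially in fields requiring expert knowledge and complex question answering (QA). In this work, we propose Expert Question Decomposition (EQD), an approach designed to balance the use of domain knowledge with computational efficiency. EQD is built on a two-step fine-tuning framework and guided by a reward function that measures the effectiveness of generated sub-questions in improving QA outcomes. It requires only a few thousand training examples and a single A100 GPU for fine-tuning, with inference time comparable to zero-shot prompting. Beyond its efficiency, EQD outperforms state-of-the-art domain-tuned models and advanced prompting strategies. We evaluate EQD in the financial domain, characterized by specialized knowledge and complex quantitative reasoning, across four benchmark datasets. Our method consistently improves QA performance by 0.6% to 10.5% across different LLMs. Our analysis reveals an important insight: in domain-specific QA, a single supporting question often provides greater benefit than detailed guidance steps.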

[46] ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning

Haochen You,Baojing Liu

Main category: cs.CL

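TL;DR: ReSSFormer is a recursive sparse structured Transformer that improves long-context reasoning, computational efficiency, and structural generalization through recurrent inference, adaptive sparse attention, and a self-organizing encoder structure.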

Details

Motivation: Conventional Transformers face challenges in long-context reasoning, computational efficiency, and structural generalization, mainly because of rigid layer stacking, dense attention, and reliance on positional encodings. Method: Introduces three modules: the Recurrent Reasoning & Memory Unit (R2MU) for iterative reasoning, the Adaptive Sparse Attention Module (ASAM) for efficient context selection, and the Self-Organizing Encoder Structure (SOES) for position-free structure induction. Result: On language modeling, multi-hop QA, and structure-sensitive tasks, ReSSFormer consistently outperforms strong baselines under comparable FLOPs and parameter budgets. Conclusion: Through recurrent inference, sparse attention, and structural self-organization, ReSSFormer achieves better scalability, efficiency, and structural flexibility. Abstract: While Transformer architectures have demonstrated impressive scalability across domains, they continue to face challenges in long-context reasoning, computational efficiency, and structural generalization, largely due to rigid layer stacking, dense attention, and reliance on positional encodings. We present ReSSFormer, a Recursive Sparse Structured Transformer that integrates three complementary innovations: Recurrent Reasoning & Memory Unit (R2MU) for iterative reasoning with bounded depth, Adaptive Sparse Attention Module (ASAM) for efficient and focused context selection, and Self-Organizing Encoder Structure (SOES) for position-free structure induction. ReSSFormer replaces conventional depth stacking with recurrent inference, substitutes full attention with token- and expert-level sparsity, and models latent token topology directly from content. Across language modeling, multi-hop QA, and structure-sensitive tasks, ReSSFormer consistently outperforms strong baselines under comparable FLOPs and parameter budgets, highlighting its scalability, efficiency, and structural flexibility.

[47] CLUE: Non-parametric Verification from Experience via Hidden-State Clustering

Zhenwen Liang,Ruosen Li,Yujun Zhou,Linfeng Song,Dian Yu,Xinya Du,Haitao Mi,Dong Yu

Main category: cs.CL

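TL;DR: Proposes CLUE, a verification method based on the internal hidden states of large language models: it judges output correctness from geometrically separable features in the trajectory of hidden activations, needs no trained parameters, classifies using only "success" and "failure" clusters from past experience, and outperforms LLM-as-a-judge and confidence-based methods on several tasks.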

Details

Motivation: Existing methods for assessing LLM output quality rely on text-level information or calibrated confidence, and either overfit or fail on poorly calibrated models; hidden states carry richer semantic, lexical, and confidence information that has not been fully exploited for verification. Method: Proposes CLUE (Clustering and Experience-based Verification), which represents each reasoning trace by the change (delta) in hidden states and classifies a candidate solution by non-parametric nearest-centroid distance to "success" and "failure" clusters, thereby judging its correctness. Result: CLUE outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods on benchmarks including AIME 24/25 and GPQA; on AIME 24 with a 1.5B model it raises accuracy from 56.7% (majority@64) to 70.0% (top-maj@16). Conclusion: LLM hidden states contain a strong signal for output verification, and CLUE exploits it in a simple, effective, non-parametric way, giving efficient, training-free assessment of output quality. Abstract: Assessing the quality of Large Language Model (LLM) outputs presents a critical challenge. Previous methods either rely on text-level information (e.g., reward models, majority voting), which can overfit to superficial cues, or on calibrated confidence from token probabilities, which would fail on less-calibrated models. Yet both of these signals are, in fact, partial projections of a richer source of information: the model's internal hidden states. Early layers, closer to token embeddings, preserve semantic and lexical features that underpin text-based judgments, while later layers increasingly align with output logits, embedding confidence-related information. This paper explores hidden states directly as a unified foundation for verification. We show that the correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations. To validate this, we present CLUE (Clustering and Experience-based Verification), a deliberately minimalist, non-parametric verifier. With no trainable parameters, CLUE only summarizes each reasoning trace by a hidden-state delta and classifies correctness via nearest-centroid distance to "success" and "failure" clusters formed from past experience. The simplicity of this method highlights the strength of the underlying signal. Empirically, CLUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates, improving both top-1 and majority-vote accuracy across AIME 24/25 and GPQA. As a highlight, on AIME 24 with a 1.5B model, CLUE boosts accuracy from 56.7% (majority@64) to 70.0% (top-maj@16).
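
Nearest-centroid classification over hidden-state deltas, as described in the CLUE abstract, can be sketched in a few lines. The sketch makes assumptions the abstract does not confirm: the delta is taken as the end-of-trace activation minus the start-of-trace activation, distance is Euclidean, and the two-dimensional vectors are made-up toy data rather than real model activations.

```python
import math
from typing import List, Sequence

Vector = List[float]


def delta(start: Sequence[float], end: Sequence[float]) -> Vector:
    """Assumed summary of a reasoning trace: final hidden state minus initial hidden state."""
    return [e - s for s, e in zip(start, end)]


def centroid(vectors: Sequence[Sequence[float]]) -> Vector:
    """Mean vector of a cluster of past-experience deltas."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(trace_delta: Sequence[float], success_c: Sequence[float],
             failure_c: Sequence[float]) -> str:
    """Label a trace by whichever centroid is nearer; no parameters are trained."""
    if distance(trace_delta, success_c) <= distance(trace_delta, failure_c):
        return "success"
    return "failure"


# Toy past experience: deltas from traces known to be correct or incorrect.
success_c = centroid([[1.0, 0.9], [0.8, 1.1]])
failure_c = centroid([[-1.0, -0.8], [-0.9, -1.2]])

new_delta = delta(start=[0.1, 0.2], end=[0.9, 1.0])
label = classify(new_delta, success_c, failure_c)
```

For reranking, the same distances could be used to order several candidate solutions to one problem, preferring those closest to the success centroid.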

[48] A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation

Neal Gregory Lawton,Alfy Samuel,Anoop Kumar,Daben Liu

Main category: cs.CL

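TL;DR: Evaluates and compares several fine-tuning strategies for retrieval-augmented generation (RAG) pipelines, including independent, joint, and two-phase fine-tuning. The strategies yield about the same improvement in generation quality (EM and F1) but differ significantly in compute cost. The best choice depends on whether the training data includes context labels and whether a grid search over the learning rates of the embedding and generator models is required.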

Details

Motivation: RAG fine-tuning strategies differ in cost and benefit but have not been compared systematically, so their performance and compute overhead need to be evaluated to guide practical choices. Method: The authors evaluate three fine-tuning strategies: independent fine-tuning (the embedding and generator models are tuned separately), joint fine-tuning (both models are tuned together), and two-phase fine-tuning, using EM and F1 as generation-quality metrics and comparing performance against compute cost. Result: The three strategies perform similarly on EM and F1 and all improve the RAG pipeline, but they differ significantly in compute consumption; joint fine-tuning is usually more resource-intensive, while independent fine-tuning is more efficient. Whether the training data lacks context labels and whether a learning-rate grid search is needed also affect the choice of strategy. Conclusion: The strategies differ little in performance but do differ in compute cost; the best strategy should be chosen based on whether the training data includes context labels and whether hyperparameter search is needed, which gives practical guidance for deploying RAG systems. Venue: EMNLP 2025 Findings (published 20 Aug 2025; last modified 17 Sept 2025; CC BY 4.0). Keywords: Retrieval-Augmented Generation (RAG), Large Language Models (LLMs), Fine-tuning, Question Answering, Joint fine-tuning. Abstract: Retrieval augmented generation (RAG) is a popular framework for question answering that is powered by two large language models (LLMs): an embedding model that retrieves context documents from a database that are relevant to a given question, and a generator model that uses the retrieved context to generate an answer to the question. Both the embedding and generator models can be fine-tuned to increase performance of a RAG pipeline on a new task, but multiple fine-tuning strategies exist with different costs and benefits. In this paper, we evaluate and compare several RAG fine-tuning strategies, including independent, joint, and two-phase fine-tuning. In our experiments, we observe that all of these strategies achieve about equal improvement in EM and F1 generation quality metrics, although they have significantly different computational costs. We conclude that the optimal fine-tuning strategy to use depends on whether the training dataset includes context labels and whether a grid search over the learning rates for the embedding and generator models is required.

[49] RAG-BioQA: Retrieval-Augmented Generation for Long-Form Biomedical Question Answering

Lovely Yeswanth Panchumarthi,Sai Prasad Gudari,Atharva Negi,Praveen Raj Budime,Harsit Upadhya

Main category: cs.CL

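TL;DR: Proposes RAG-BioQA, a framework that combines retrieval-augmented generation with domain-specific fine-tuning to produce evidence-based, long-form biomedical answers, clearly outperforming baselines on PubMedQA.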

Details

Motivation: Existing biomedical question-answering systems mostly focus on short answers and lack the detailed explanations needed for clinical decision-making. Method: Combines BioBERT embeddings and FAISS indexing with several re-ranking strategies (BM25, ColBERT, MonoT5) for context selection, then synthesizes evidence with a fine-tuned T5 model. Result: On the PubMedQA dataset, the method clearly outperforms baselines on BLEU, ROUGE, and METEOR. Conclusion: RAG-BioQA advances accessible, evidence-based biomedical knowledge retrieval and suits clinical scenarios that need thorough explanations. Abstract: The exponential growth of biomedical literature creates significant challenges for accessing precise medical information. Current biomedical question-answering systems primarily focus on short-form answers, failing to provide the comprehensive explanations necessary for clinical decision-making. We present RAG-BioQA, a novel framework combining retrieval-augmented generation with domain-specific fine-tuning to produce evidence-based, long-form biomedical answers. Our approach integrates BioBERT embeddings with FAISS indexing and compares various re-ranking strategies (BM25, ColBERT, MonoT5) to optimize context selection before synthesizing evidence through a fine-tuned T5 model. Experimental results on the PubMedQA dataset show significant improvements over baselines, with our best model achieving substantial gains across BLEU, ROUGE, and METEOR metrics, advancing the state of accessible, evidence-based biomedical knowledge retrieval.

[50] Efficient Training of Robust Traditional Chinese LLaMA-1B on a Single Consumer GPU: Continual Pre-training, SFT, and DPO

Yu-Cheng Chih,Ming-Tao Duan,Yong-Hao Hou

Main category: cs.CL

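TL;DR: Proposes PureTC-1B, a three-stage stabilization pipeline that uses LoRA adapters to improve the language purity of Llama-3.2-1B-Instruct when generating Traditional Chinese (TC), sharply reducing non-TC output and performing well on both a real-world-style benchmark and a named entity translation task.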

Details

Motivation: Small language models (SLMs) used for Traditional Chinese (TC) generate unstably, mixing in non-TC characters or code-switching, which limits how reliably they can be deployed. Method: Uses parameter-efficient LoRA adapters across three stages: continual pre-training (CPT) on TC corpora, supervised fine-tuning (SFT) with instruction data, and direct preference optimization (DPO) with TC-adherence preferences, improving TC stability without retraining the full model. Result: On a benchmark that simulates real-world usage, PureTC-1B reduces non-TC output tokens by 51.3% (micro-average) relative to the base model; on a named entity translation task, it reduces incorrect-language tokens by 77.2% relative to Llama-3B and 57.2% relative to Qwen-1.5B. Conclusion: Robust TC consistency is achievable even at the 1B scale; the method is reproducible, adapter-only, and hardware-friendly, offering a practical recipe for improving language stability in TC and other non-English languages. Abstract: Small Language Models (SLMs) enable cost-effective, on-device and latency-sensitive AI applications, yet their deployment in Traditional Chinese (TC) remains hindered by token-level instability: models unpredictably emit non-TC characters or code-switch into other languages. We address this practical reliability gap by creating PureTC-1B, a three-stage stabilization pipeline for Llama-3.2-1B-Instruct (an open-weight, instruction-tuned model released by Meta) using parameter-efficient LoRA adapters. Our method combines Continual Pre-Training (CPT) on TC-centric corpora, Supervised Fine-Tuning (SFT) with instruction data, and Direct Preference Optimization (DPO) using TC-adherence preferences to improve monolingual robustness without full-model retraining. On a benchmark designed to simulate real-world usage, PureTC-1B achieves a 51.3% relative reduction (micro-average) in non-TC output tokens versus the base model. On a Named Entity Translation (NET) task, PureTC-1B further reduces incorrect-language tokens by 77.2% relative to Llama-3B and 57.2% relative to Qwen-1.5B, indicating that robust TC adherence is attainable even at the 1B scale. The pipeline is reproducible, adapter-only, and hardware-friendly, offering practitioners a practical recipe to enhance language stability for TC and potentially other non-English languages.

[51] AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System

Hui Yi Leong,Yuheng Li,Yuqing Wu,Wenwen Ouyang,Wei Zhu,Jiechao Gao

Main category: cs.CL

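TL;DR: Proposes AMAS, a new framework for LLM-based multi-agent systems that uses a dynamic graph designer to adapt the agent collaboration structure to each task, clearly improving performance on question answering, mathematical reasoning, and code generation.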

Details

Motivation: Existing multi-agent system architectures depend on fixed, hand-crafted graph structures that do not respond to context, which limits their effectiveness across diverse tasks. Method: Proposes the AMAS framework, which introduces a lightweight LLM-driven dynamic graph designer that builds a suitable agent interaction topology from task requirements and input properties, routing each query through task-optimized agent pathways. Result: On several benchmarks (question answering, mathematical reasoning, and code generation), AMAS clearly outperforms existing single-agent and multi-agent methods and works across different LLM architectures. Conclusion: Context-sensitive structural adaptability is a key ingredient of high-performing LLM multi-agent systems, and AMAS offers a new paradigm for building industrial-grade autonomous multi-agent systems. Abstract: Although large language models (LLMs) have revolutionized natural language processing capabilities, their practical implementation as autonomous multi-agent systems (MAS) for industrial problem-solving encounters persistent barriers. Conventional MAS architectures are fundamentally restricted by inflexible, hand-crafted graph topologies that lack contextual responsiveness, resulting in diminished efficacy across varied academic and commercial workloads. To surmount these constraints, we introduce AMAS, a paradigm-shifting framework that redefines LLM-based MAS through a novel dynamic graph designer. This component autonomously identifies task-specific optimal graph configurations via lightweight LLM adaptation, eliminating the reliance on monolithic, universally applied structural templates. Instead, AMAS exploits the intrinsic properties of individual inputs to intelligently direct query trajectories through task-optimized agent pathways. Rigorous validation across question answering, mathematical deduction, and code generation benchmarks confirms that AMAS systematically exceeds state-of-the-art single-agent and multi-agent approaches across diverse LLM architectures. Our investigation establishes that context-sensitive structural adaptability constitutes a foundational requirement for high-performance LLM MAS deployments.

[52] NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT

John Hawkins,Aditya Pramar,Rodney Beard,Rohitash Chandra

Main category: cs.CL

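TL;DR: Studies how well different machine learning models identify jailbreak prompts for large language models (LLMs), finds that end-to-end fine-tuning of a BERT model performs best on current datasets, and notes that explicit reflexivity in prompt structure may signal jailbreak intent.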

Details

Motivation: Large language models have security vulnerabilities that let malicious users trigger inappropriate responses by manipulating the input, so jailbreak prompts must be identified effectively to keep systems safe. Method: Compares several machine learning models on distinguishing jailbreak prompts from genuine ones, with emphasis on detecting previously unseen jailbreak strategies, and analyses features through keyword visualization. Result: On current datasets, a BERT model fine-tuned end-to-end identifies jailbreak prompts best, and the visualizations show that jailbreak prompts often contain characteristic keywords and structural features. Conclusion: Explicitly reflexive prompt structure may indicate jailbreak intent, and BERT-based models are currently the most effective way to detect jailbreak prompts. Abstract: Large Language Models (LLMs) suffer from a range of vulnerabilities that allow malicious users to solicit undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses acceptable to the developer's policies. In this study, we analyse the ability of different machine learning models to distinguish jailbreak prompts from genuine uses, including looking at our ability to identify jailbreaks that use previously unseen strategies. Our results indicate that using current datasets the best performance is achieved by fine-tuning a Bidirectional Encoder Representations from Transformers (BERT) model end-to-end for identifying jailbreaks. We visualise the keywords that distinguish jailbreak from genuine prompts and conclude that explicit reflexivity in prompt structure could be a signal of jailbreak intention.

[53] Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention

Zhaoxin Feng,Jianfei Ma,Emmanuele Chersoni,Xiaojing Zhao,Xiaoyi Bao

Main category: cs.CL

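TL;DR: Explores whether enabling bidirectional attention and contrastive learning in the Llama architecture can overcome the limits that unidirectional attention places on autoregressive LLMs in text embedding tasks.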

Details

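Motivation: Autoregressive large language models excel at language understanding and generation, but the unidirectional attention mechanism has slowed their adoption for text embedding and for semantic probing analysis, so finding a way past this constraint matters. Method: Applies additional training to different variants of the Llama architecture, progressively enabling bidirectional attention and testing it together with unsupervised and supervised contrastive learning. Result: The experiments show that bidirectional attention and contrastive learning markedly improve performance on text embedding tasks and improve the model's semantic representations. Conclusion: Bidirectional attention can effectively ease the limits of autoregressive models on text embedding tasks and offers a workable path to better semantic representations. Abstract: Autoregressive Large Language Models (LLMs) demonstrate exceptional performance in language understanding and generation. However, their application in text embedding tasks has been relatively slow, along with the analysis of their semantic representation in probing tasks, due to the constraints of the unidirectional attention mechanism. This paper aims to explore whether such constraints can be overcome by enabling bidirectional attention in LLMs. We tested different variants of the Llama architecture through additional training steps, progressively enabling bidirectional attention and unsupervised/supervised contrastive learning.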

[54] SoK: Measuring What Matters for Closed-Loop Security Agents

Mudita Khurana,Raunak Jain

Main category: cs.CL

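TL;DR: Proposes the CLASP framework and the CLC Score for evaluating the agentic capabilities of closed-loop autonomous security systems across the cybersecurity lifecycle, filling the gap left by the lack of a unified evaluation scheme and benchmark.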

Details

Motivation: AI-driven attacks are evolving quickly while defensive systems remain scattered across isolated functions, with no unified framework or evaluation standard for measuring closed-loop autonomous security agents. Method: Proposes the CLASP framework, which aligns the security lifecycle with core agentic capabilities, and designs the CLC Score as a composite metric of loop closure and operational effectiveness; applies the framework to 21 representative works. Result: Applying CLASP to 21 works maps system strengths and capability gaps; the paper defines the CLC Score for measuring degree of loop closure and performance, and sets out the requirements for a closed-loop benchmark. Conclusion: CLASP and the CLC Score supply the vocabulary, diagnostics, and measurements needed to evaluate and advance closed-loop autonomous security agents, helping raise the overall intelligence of security systems. Abstract: Cybersecurity is a relentless arms race, with AI-driven offensive systems evolving faster than traditional defenses can adapt. Research and tooling remain fragmented across isolated defensive functions, creating blind spots that adversaries exploit. Autonomous agents capable of integrating exploit confirmation, remediation, and validation into a single closed loop offer promise, but the field lacks three essentials: a framework defining the agentic capabilities of security systems across the security lifecycle, a principled method for evaluating closed-loop agents, and a benchmark for measuring their performance in practice. We introduce CLASP, the Closed-Loop Autonomous Security Performance framework, which aligns the security lifecycle (reconnaissance, exploitation, root cause analysis, patch synthesis, validation) with core agentic capabilities (planning, tool use, memory, reasoning, reflection & perception), providing a common vocabulary and rubric for assessing agentic capabilities in security tasks. By applying CLASP to 21 representative works, we map where systems demonstrate strengths and where capability gaps persist. We then define the Closed-Loop Capability (CLC) Score, a composite metric quantifying both degree of loop closure and operational effectiveness, and outline the requirements for a closed-loop benchmark. Together, CLASP and the CLC Score provide the vocabulary, diagnostics, and measurements needed both to advance function-level performance and to measure closed-loop security agents.

[55] MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization

Yinhong Liu,Jianfeng He,Hang Su,Ruixue Lian,Yi Nian,Jake Vincent,Srikanth Vishnubhotla,Robinson Piramuthu,Saab Mansour

Main category: cs.CL

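TL;DR: This paper introduces MDSEval, the first meta-evaluation benchmark for multimodal dialogue summarization, comprising image-sharing dialogues, summaries, and human ratings on eight quality aspects. It proposes a filtering framework based on Mutually Exclusive Key Information (MEKI) across modalities to ensure data quality, and reveals the limitations and biases of existing evaluation methods when judging summaries from advanced MLLMs.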

Details

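Motivation: Developing effective multimodal dialogue summarization (MDS) models requires reliable automatic evaluation methods, which in turn depend on high-quality meta-evaluation benchmarks grounded in human annotations. Existing evaluation setups lack systematic criteria and data tailored to MDS. Method: Builds the MDSEval benchmark of image-sharing dialogues, summaries, and human ratings; proposes a filtering framework based on Mutually Exclusive Key Information (MEKI) to improve data quality; defines and formalizes evaluation dimensions specific to MDS; systematically benchmarks current state-of-the-art evaluation methods. Result: MDSEval is the first meta-evaluation benchmark for MDS and covers eight well-defined quality aspects; MEKI-based filtering improves data richness and quality; experiments show that existing evaluation methods struggle to distinguish summaries from advanced MLLMs and are susceptible to several biases. Conclusion: The work fills a gap in automatic evaluation for multimodal dialogue summarization, provides high-quality benchmark data and an evaluation framework, exposes weaknesses of current evaluation methods, and gives direction for designing future MDS evaluation methods. Abstract: Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. To support the development of effective MDS models, robust automatic evaluation methods are essential for reducing both cost and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations. In this work, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting of image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richness, we propose a novel filtering framework leveraging Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art modal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various biases.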

[56] FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol

He Zhang,Anzhou Zhang,Jian Dai

Main category: cs.CL

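TL;DR: FOR-Prompting is a retraining-free, prompt-based, role-driven reasoning protocol that introduces a questioning mechanism (the Objectioner) to elicit self-revision, significantly improving accuracy and reasoning quality for both large and small models on mathematical reasoning and open-ended tasks.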

Details

Motivation: Existing reasoning methods such as Chain of Thought (CoT) lack an external questioning mechanism that triggers self-revision, which limits performance on complex or error-prone problems; a general prompting protocol that elicits reflection and works across models is therefore needed. Method: Proposes the FOR-Prompting protocol with three roles: a Defender (proposes an answer), an Objectioner (raises question-style objections without offering direct fixes), and a Host (enforces logical consistency and closes the dialogue). Self-revision emerges from structured dialogue turns, relying only on prompting, with no tools or human intervention. Result: On GSM8K, accuracy improves by about 22 percentage points over single-prompt, on par with CoT but with more than 10% higher ratings for reasoning and coherence; the small Llama3.2:1b model gains about 19% accuracy; the protocol corrects its own mistakes on tricky queries and makes exploration and assumptions more explicit on open-ended tasks. Conclusion: As a model-agnostic, purely prompt-level reasoning framework, FOR-Prompting uses external questioning to drive self-revision, improves reasoning quality for large and small models, and shows potential for personal devices and large-scale studies. Abstract: Reasoning protocols such as Chain of Thought (CoT) and Tree of Thought (ToT) organize internal deliberation but lack an explicit mechanism for external questioning that elicits self-revision. We present FOR-Prompting (From Objection to Revision Prompting), an asymmetric protocol where a Defender proposes an answer, an Objectioner raises question-style objections with no direct fixes, and a Host enforces consistency and closure. On GSM8K we observe a gain of about 22 percentage points over single-prompt and accuracy on par with CoT, with more than 10% higher ratings in reasoning and coherence from a uniform GPT 4.1 judge. FOR-Prompting also corrects mistakes without tools or human supervision on tricky queries, and improves performance for small-scale models (roughly a 19% accuracy improvement on Llama3.2:1b for the GSM8K task), highlighting promise for small models and on-device personal use. Beyond factual QA, qualitative analyses on open-ended tasks show enhanced exploration and refinement, with dialogue traces that make assumptions and trade-offs explicit. The protocol is model agnostic and operates purely at the prompt level through role-structured turns, so it works with hosted and local models of different sizes without retraining, and it supports large-scale study of objection-guided reasoning.
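The role-structured turns of FOR-Prompting can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: the role prompts, the `for_prompting` function, the `max_rounds` cap, the `CLOSE` verdict convention, and the `toy_llm` stand-in are all assumptions, since the abstract does not give the actual prompts or stopping rule.

```python
from typing import Callable, List, Tuple

# Hypothetical role prompts; the paper's real prompts are not in the abstract.
DEFENDER = "Propose or revise an answer to the question."
OBJECTIONER = "Raise one question-style objection. Do not offer a fix."
HOST = "Reply CLOSE if the answer is consistent and settled, else CONTINUE."


def for_prompting(
    question: str,
    llm: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Run Defender -> Objectioner -> Defender -> Host turns until the Host closes."""
    trace: List[Tuple[str, str]] = []
    answer = llm(DEFENDER, question)
    trace.append(("defender", answer))
    for _ in range(max_rounds):
        # The Objectioner only asks questions; it never supplies a corrected answer.
        objection = llm(OBJECTIONER, f"Q: {question}\nA: {answer}")
        trace.append(("objectioner", objection))
        # The Defender revises its own answer in light of the objection.
        answer = llm(DEFENDER, f"Q: {question}\nPrevious: {answer}\nObjection: {objection}")
        trace.append(("defender", answer))
        # The Host decides whether the dialogue can end.
        verdict = llm(HOST, f"Q: {question}\nA: {answer}")
        trace.append(("host", verdict))
        if verdict.strip().upper().startswith("CLOSE"):
            break
    return answer, trace


def toy_llm(role: str, context: str) -> str:
    """Deterministic stand-in for a real model call (assumption for the demo)."""
    if role == OBJECTIONER:
        return "Did you subtract both apples that were eaten?"
    if role == HOST:
        return "CLOSE"
    return "3" if "Objection" in context else "4"


final_answer, dialogue = for_prompting("5 apples, 2 eaten. How many remain?", toy_llm)
```

Any callable that maps a role prompt and a context string to text can replace `toy_llm`, which is what keeps the sketch model agnostic.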

[57] How Do Language Models Compose Functions?

Apoorv Khandelwal,Ellie Pavlick

Main category: cs.CL

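TL;DR: Investigates whether large language models (LLMs) use compositional mechanisms when solving two-hop factual recall tasks; finds a "compositionality gap" and identifies two processing mechanisms, compositional and direct, whose selection is related to embedding space geometry.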

Details

Motivation: To examine whether LLMs truly use compositional mechanisms on compositional tasks rather than relying only on surface pattern matching. Method: Applies logit lens to residual stream activations to study how LLMs internally process two-hop factual recall tasks, and analyzes the geometry of the embedding space. Result: Confirms that LLMs exhibit a compositionality gap; identifies a compositional and a direct mechanism; finds that the choice of mechanism is related to whether a linear mapping exists from input to output. Conclusion: LLMs do not always solve problems compositionally; the choice of mechanism is influenced by embedding space geometry, and in some cases a non-compositional, "idiomatic" path is used. Abstract: While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as $g(f(x))$. We first confirm that modern LLMs continue to suffer from the "compositionality gap": i.e. their ability to compute both $z = f(x)$ and $y = g(z)$ does not entail their ability to compute the composition $y = g(f(x))$. Then, using logit lens on their residual stream activations, we identify two processing mechanisms, one which solves tasks $\textit{compositionally}$, computing $f(x)$ along the way to computing $g(f(x))$, and one which solves them $\textit{directly}$, without any detectable signature of the intermediate variable $f(x)$. Finally, we find that which mechanism is employed appears to be related to the embedding space geometry, with the direct (idiomatic) mechanism being dominant in cases where there exists a linear mapping from $x$ to $g(f(x))$ in the embedding spaces. We fully release our data and code at: https://github.com/apoorvkh/composing-functions.
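One common way to quantify a compositionality gap is the share of examples where the model gets both single hops right but still misses the composed query. The sketch below assumes that definition and a hypothetical record format of three booleans per example; the paper's exact metric may differ.

```python
from typing import Iterable, Tuple


def compositionality_gap(records: Iterable[Tuple[bool, bool, bool]]) -> float:
    """Fraction of examples with f(x) and g(z) both correct but g(f(x)) wrong.

    Each record is (hop1_correct, hop2_correct, composed_correct).
    Returns 0.0 when no example has both hops correct.
    """
    both_hops = [r for r in records if r[0] and r[1]]
    if not both_hops:
        return 0.0
    failures = sum(1 for r in both_hops if not r[2])
    return failures / len(both_hops)


# Made-up evaluation outcomes, for illustration only.
toy_records = [
    (True, True, True),    # knows both hops, composes correctly
    (True, True, False),   # knows both hops, fails to compose
    (True, False, False),  # second hop unknown, excluded from the denominator
    (True, True, False),   # knows both hops, fails to compose
]
gap = compositionality_gap(toy_records)
```

Restricting the denominator to examples where both hops are known separates composition failures from plain knowledge gaps.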

[58] Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation

Seungseop Lim,Gibaeg Kim,Wooseok Han,Jean Seo,Hyunkyung Lee,Jaehyo Yoo,Eunho Yang

Main category: cs.CL

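TL;DR: This paper addresses "Format Inertia", a failure of large language models in medical pre-consultation dialogues, by rebalancing the turn-count distribution of the training data to reduce repetitive, diagnostically uninformative generations.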

Details

Motivation: In healthcare, supervised fine-tuning (SFT) is commonly used to adapt large language models to multi-turn dialogue generation, but the training data usually has a skewed turn-count distribution. This induces "Format Inertia": repetitive questions that are format-correct but diagnostically uninformative. Method: A simple, data-centric approach that rebalances the turn-count distribution of the training data to reduce Format Inertia. Result: Experiments show that the method substantially alleviates Format Inertia in medical pre-consultation and improves the relevance and informativeness of generations in long dialogues. Conclusion: Rebalancing the turn-count distribution at the data level effectively mitigates the degraded behavior of SFT-trained LLMs in long medical dialogues and offers a practical route to better dialogue quality in specialized domains. Abstract: Recent advances in Large Language Models (LLMs) have brought significant improvements to various service domains, including chatbots and medical pre-consultation applications. In the healthcare domain, the most common approach for adapting LLMs to multi-turn dialogue generation is Supervised Fine-Tuning (SFT). However, datasets for SFT in tasks like medical pre-consultation typically exhibit a skewed turn-count distribution. Training on such data induces a novel failure mechanism we term **Format Inertia**, where models tend to generate repetitive, format-correct, but diagnostically uninformative questions in long medical dialogues. To mitigate this observed failure mechanism, we adopt a simple, data-centric method that rebalances the turn-count distribution of the training dataset. Experimental results show that our approach substantially alleviates Format Inertia in medical pre-consultation.
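The abstract does not say how the turn-count distribution is rebalanced, so the following is only one plausible sketch: cap the number of training dialogues per turn count by random downsampling. The function name `rebalance_by_turns`, the `cap_per_bucket` parameter, and the representation of a dialogue as a list of turns are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def rebalance_by_turns(
    dialogues: Sequence[List[str]],
    cap_per_bucket: int,
    seed: int = 0,
) -> List[List[str]]:
    """Downsample so no turn count contributes more than cap_per_bucket dialogues."""
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    buckets: Dict[int, List[List[str]]] = defaultdict(list)
    for dialogue in dialogues:
        buckets[len(dialogue)].append(dialogue)

    balanced: List[List[str]] = []
    for turn_count in sorted(buckets):
        group = buckets[turn_count]
        if len(group) > cap_per_bucket:
            group = rng.sample(group, cap_per_bucket)
        balanced.extend(group)
    return balanced


# Skewed toy corpus: ten 2-turn dialogues and three 8-turn dialogues.
toy_corpus = [["q"] * 2 for _ in range(10)] + [["q"] * 8 for _ in range(3)]
balanced_corpus = rebalance_by_turns(toy_corpus, cap_per_bucket=3)
```

Upsampling or reweighting the rare long dialogues would be an equally reasonable reading of "rebalancing"; downsampling is used here only because it is the simplest to show.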

[59] What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?

Jiwan Chung,Neel Joshi,Pratyusha Sharma,Youngjae Yu,Vibhav Vineet

Main category: cs.CL

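TL;DR: This paper introduces MathLens, a benchmark for evaluating the perception, reasoning, and integration abilities of multimodal reasoning models on geometry problems, revealing how different training methods affect each subskill and where models are weakest.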

Details

Motivation: Existing evaluations rely on aggregate accuracy alone, which makes it hard to see how the individual subskills of multimodal reasoning models progress on complex tasks such as textbook-style geometry problems; a finer-grained benchmark is needed. Method: Builds the MathLens benchmark, which decomposes performance into perception, reasoning, and integration, and provides annotations including visual diagrams, textual descriptions, controlled questions, and perceptual probes, all derived from symbolic problem specifications to ensure consistency and robustness. Result: Reinforcement learning mainly strengthens perception, and textual SFT indirectly improves perception through reflective reasoning; reasoning improves only in tandem with perception; integration is the weakest capacity and becomes the bottleneck; RL improves consistency under diagram variation, whereas multimodal SFT reduces robustness through overfitting. Conclusion: The subskills of multimodal reasoning models develop unevenly, with integration as the current weak point; future training methods should target it to improve overall performance and robustness. Abstract: Multimodal reasoning models have recently shown promise on challenging domains such as olympiad-level geometry, yet their evaluation remains dominated by aggregate accuracy, a single score that obscures where and how models are improving. We introduce MathLens, a benchmark designed to disentangle the subskills of multimodal reasoning while preserving the complexity of textbook-style geometry problems. The benchmark separates performance into three components: Perception (extracting information from raw inputs), Reasoning (operating on available information), and Integration (selecting relevant perceptual evidence and applying it within reasoning). To support each test, we provide annotations: visual diagrams, textual descriptions to evaluate reasoning in isolation, controlled questions that require both modalities, and probes for fine-grained perceptual skills, all derived from symbolic specifications of the problems to ensure consistency and robustness. Our analysis reveals that different training approaches have uneven effects: First, reinforcement learning chiefly strengthens perception, especially when supported by textual supervision, while textual SFT indirectly improves perception through reflective reasoning. Second, reasoning improves only in tandem with perception. Third, integration remains the weakest capacity, with residual errors concentrated there once other skills advance. Finally, robustness diverges: RL improves consistency under diagram variation, whereas multimodal SFT reduces it through overfitting. We will release all data and experimental logs.

[60] Machine-interpretable Engineering Design Standards for Valve Specification

Anders Gjerver,Rune Frostad,Vedrana Barisic,Melinda Hodkiewicz,Caitlin Woods,Mihaly Fekete,Arild Braathen Torjusen,Johan Wilhelm Kluwer

Main category: cs.CL

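TL;DR: This paper presents a method for transforming the information in engineering design standards into modular, reusable, machine-interpretable ontologies, and applies it to quality assurance of plant design and equipment selection.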

Details

Motivation: Despite industry's push toward digitalization, design standards are still document-centric, which hinders automation and interoperability; standards content therefore needs to be converted into a machine-readable semantic form that supports reasoning. Method: Uses modelling patterns to turn the textual and tabular knowledge in international piping, material, and valve design standards into modular ontologies in a W3C-compliant format, aligned with the top-level ontology ISO DIS 23726-3 (IDO); represents valves and environmental conditions as OWL individuals in semantic asset models and validates compliance with semantic reasoning and executable design rules. Result: Interoperable modular ontologies are built from international standards, and automated validation is achieved in a valve selection process, determining whether a specific Valve Data Sheet (VDS) complies with industry standards and whether a product type meets the specification. Conclusion: IDO-based modular ontologies offer an effective path to digitalizing design standards, support semantic reasoning in equipment selection, and show the potential for a transition to Smart Standards, helping standards bodies digitalize their work. Abstract: Engineering design processes use technical specifications and must comply with standards. Product specifications, product type data sheets, and design standards are still mainly document-centric despite the ambition to digitalize industrial work. In this paper, we demonstrate how to transform information held in engineering design standards into modular, reusable, machine-interpretable ontologies and use the ontologies in quality assurance of the plant design and equipment selection process. We use modelling patterns to create modular ontologies for knowledge captured in the text and in frequently referenced tables in International Standards for piping, material and valve design. These modules are exchangeable, as stored in a W3C compliant format, and interoperable as they are aligned with the top-level ontology ISO DIS 23726-3: Industrial Data Ontology (IDO). We test these ontologies, created based on international material and piping standards and industry norms, on a valve selection process. Valves are instantiated in semantic asset models as individuals along with a semantic representation of the environmental condition at their location on the asset. We create "functional location tags" as OWL individuals that become instances of OWL class Valve Data Sheet (VDS) specified valves. Similarly, we create instances of manufacturer product type. Our approach enables automated validation that a specific VDS is compliant with relevant industry standards.
Using semantic reasoning and executable design rules, we also determine whether the product type meets the valve specification. Creation of shared, reusable IDO-based modular ontologies for design standards enables semantic reasoning to be applied to equipment selection processes and demonstrates the potential of this approach for Standards Bodies wanting to transition to digitized Smart Standards.

[61] Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

Wenbo Pan,Jie Xu,Qiguang Chen,Junhao Dong,Libo Qin,Xinfeng Li,Haining Yu,Xiaohua Jia

Main category: cs.CL

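TL;DR: This paper proposes a new metric, the Refusal Index (RI), for accurately measuring the knowledge-aware refusal ability of large language models on questions they do not know. RI is defined as Spearman's rank correlation between refusal probability and error probability and is measured in practice with a lightweight two-pass evaluation method. Experiments show that RI evaluates refusal ability on factual tasks stably and consistently, and reveal that current models' refusal behavior can be unreliable despite high accuracy.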

Details

Motivation: Existing metrics cannot accurately measure how well large language models refuse questions beyond their knowledge: refusal-rate-based metrics are biased by a model's refusal tendency, and calibration metrics rely on proxy processes rather than actual refusal behavior. A more reliable evaluation method that is not distorted by refusal rates is needed. Method: Proposes the Refusal Index (RI), Spearman's rank correlation between refusal probability and error probability, and designs a lightweight two-pass evaluation method that efficiently estimates RI from the refusal rates observed across two standard evaluation runs. Result: Experiments across 16 models and 5 datasets show that RI accurately quantifies knowledge-aware refusal ability, remains stable under changes in overall accuracy or refusal rate, and yields consistent model rankings. Conclusion: The Refusal Index is a reliable, unbiased metric of how well LLMs refuse unknown questions in factual tasks. Although current LLMs perform well on factual accuracy, their refusal behavior remains unreliable, which suggests combining RI with traditional accuracy for a more complete factuality evaluation. Abstract: Large Language Models (LLMs) should refuse to answer questions beyond their knowledge. This capability, which we term knowledge-aware refusal, is crucial for factual reliability. However, existing metrics fail to faithfully measure this ability. On the one hand, simple refusal-based metrics are biased by refusal rates and yield inconsistent scores when models exhibit different refusal tendencies. On the other hand, existing calibration metrics are proxy-based, capturing the performance of auxiliary calibration processes rather than the model's actual refusal behavior. In this work, we propose the Refusal Index (RI), a principled metric that measures how accurately LLMs refuse questions they do not know. We define RI as Spearman's rank correlation between refusal probability and error probability. To make RI practically measurable, we design a lightweight two-pass evaluation method that efficiently estimates RI from observed refusal rates across two standard evaluation runs. Extensive experiments across 16 models and 5 datasets demonstrate that RI accurately quantifies a model's intrinsic knowledge-aware refusal capability in factual tasks. Notably, RI remains stable across different refusal rates and provides consistent model rankings independent of a model's overall accuracy and refusal rates. More importantly, RI provides insight into an important but previously overlooked aspect of LLM factuality: while LLMs achieve high accuracy on factual tasks, their refusal behavior can be unreliable and fragile.
This finding highlights the need to complement traditional accuracy metrics with the Refusal Index for comprehensive factuality evaluation.
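The definition of RI as a Spearman rank correlation can be illustrated directly. The sketch below computes the correlation from per-question refusal and error probabilities, which are assumed to be known here; the paper instead estimates RI from refusal rates over two evaluation passes, a procedure the abstract does not detail. The helper names and toy numbers are illustrative.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-question probabilities (assumed, not from the paper).
refusal_prob = [0.9, 0.1, 0.5, 0.3]
error_prob = [0.8, 0.2, 0.6, 0.4]
ri_aligned = spearman(refusal_prob, error_prob)
ri_inverted = spearman(refusal_prob, [1 - e for e in error_prob])
```

A value near 1 means the model refuses most readily exactly where it is most likely to be wrong; a value near -1 means it refuses where it would have been right.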

[62] Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction

Ivan Leonidovich Litvak,Anton Kostin,Fedor Lashkin,Tatiana Maksiyan,Sergey Lagutin

Main category: cs.CL

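TL;DR: This study evaluates 16 unsupervised metrics for extracting seven semantic blocks from Russian judicial decisions, benchmarked against 7,168 expert reviews. Term Frequency Coherence and Coverage Ratio/Block Completeness align best with expert ratings, while Legal Term Density is negatively correlated; the LLM Evaluation Score shows moderate alignment, suggesting limited suitability for legal text. The results indicate that unsupervised metrics can support large-scale screening but cannot fully replace expert judgment in high-stakes settings.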

Details

Motivation: With the rapid advance of AI in legal natural language processing, scalable methods are needed to evaluate the quality of text extracted from judicial decisions, especially when no human-annotated ground truth is available. Method: Applies 16 unsupervised metrics (spanning document-based, semantic, structural, pseudo-ground truth, and legal-specific categories) to the extraction of seven semantic blocks from 1,000 anonymized Russian judicial decisions, validated against 7,168 expert reviews on a 1-5 scale; metric performance is assessed with bootstrapped correlations, Lin's concordance correlation coefficient (CCC), and mean absolute error (MAE). Result: Term Frequency Coherence (Pearson r = 0.540, CCC = 0.512, MAE = 0.127) and Coverage Ratio/Block Completeness (r = 0.513, CCC = 0.443, MAE = 0.139) align best with expert ratings; Legal Term Density shows a strong negative correlation (r = -0.479, CCC = -0.079); the LLM Evaluation Score is moderate (r = 0.382, CCC = 0.325), indicating limited ability on legal text. Conclusion: Unsupervised metrics, including LLM-based ones, can provide scalable preliminary screening of legal text extraction, but with moderate correlations and low agreement with expert judgment they cannot yet replace human evaluation in high-stakes legal settings; the study provides annotation-free evaluation tools for judicial analytics and ethical AI deployment. Abstract: The rapid advancement of artificial intelligence in legal natural language processing demands scalable methods for evaluating text extraction from judicial decisions. This study evaluates 16 unsupervised metrics, including novel formulations, to assess the quality of extracting seven semantic blocks from 1,000 anonymized Russian judicial decisions, validated against 7,168 expert reviews on a 1--5 Likert scale. These metrics, spanning document-based, semantic, structural, pseudo-ground truth, and legal-specific categories, operate without pre-annotated ground truth. Bootstrapped correlations, Lin's concordance correlation coefficient (CCC), and mean absolute error (MAE) reveal that Term Frequency Coherence (Pearson $r = 0.540$, Lin CCC = 0.512, MAE = 0.127) and Coverage Ratio/Block Completeness (Pearson $r = 0.513$, Lin CCC = 0.443, MAE = 0.139) best align with expert ratings, while Legal Term Density (Pearson $r = -0.479$, Lin CCC = -0.079, MAE = 0.394) shows a strong negative correlation. The LLM Evaluation Score (mean = 0.849, Pearson $r = 0.382$, Lin CCC = 0.325, MAE = 0.197) showed moderate alignment, but its performance, using gpt-4.1-mini via g4f, suggests limited specialization for legal texts.
These findings highlight that unsupervised metrics, including LLM-based approaches, enable scalable screening but, with moderate correlations and low CCC values, cannot fully replace human judgment in high-stakes legal contexts. This work advances legal NLP by providing annotation-free evaluation tools, with implications for judicial analytics and ethical AI deployment.
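Lin's CCC penalizes shifts in mean and scale that Pearson's r ignores, which is why a metric can correlate with expert ratings yet agree with them poorly. The sketch below implements the standard population-moment formula for CCC alongside MAE; the toy score lists are illustrative and unrelated to the study's data.

```python
from typing import Sequence


def lins_ccc(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient (population moments)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # The (mean_x - mean_y)^2 term lowers the score when one rater is shifted.
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


def mean_absolute_error(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Toy scores: the metric tracks the experts perfectly but sits one point higher.
expert_scores = [1.0, 2.0, 3.0, 4.0]
metric_scores = [2.0, 3.0, 4.0, 5.0]

ccc_identical = lins_ccc(expert_scores, expert_scores)
ccc_shifted = lins_ccc(expert_scores, metric_scores)
mae_shifted = mean_absolute_error(expert_scores, metric_scores)
```

In the shifted case Pearson's r would still be 1, while CCC drops to 2.5 / 3.5, so CCC is the stricter test of agreement on the rating scale itself.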

[63] Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network

Xin Liu,Rongwu Xu,Xinyi Jia,Jason Liao,Jiao Sun,Ling Huang,Wei Xu

Main category: cs.CL

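TL;DR: This paper proposes FraudSquad, a hybrid detection model for identifying highly realistic spam reviews generated by large language models. It combines semantic and behavioral signals and significantly outperforms existing methods on multiple datasets.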

Details

Motivation: Spam reviews generated by large language models are highly persuasive and hard to detect, threatening the credibility of online platforms, so effective detection is urgently needed. Method: Builds three spam review datasets generated by different LLMs and proposes FraudSquad, which combines text embeddings from a pre-trained language model with a gated graph transformer for spam node classification. Result: On the three LLM-generated datasets, FraudSquad improves on state-of-the-art baselines by up to 44.22% in precision and 43.01% in recall, performs well on human-written spam datasets, and has a small model size with low labeled-data requirements. Conclusion: FraudSquad is an efficient, practical solution for spam review detection in the LLM era; the study also contributes new synthetic datasets and empirical evidence that anti-spam techniques urgently need updating. Abstract: The rise of large language models (LLMs) has enabled the generation of highly persuasive spam reviews that closely mimic human writing. These reviews pose significant challenges for existing detection systems and threaten the credibility of online platforms. In this work, we first create three realistic LLM-generated spam review datasets using three distinct LLMs, each guided by product metadata and genuine reference reviews. Evaluations by GPT-4.1 confirm the high persuasion and deceptive potential of these reviews. To address this threat, we propose FraudSquad, a hybrid detection model that integrates text embeddings from a pre-trained language model with a gated graph transformer for spam node classification. FraudSquad captures both semantic and behavioral signals without relying on manual feature engineering or massive training resources. Experiments show that FraudSquad outperforms state-of-the-art baselines by up to 44.22% in precision and 43.01% in recall on three LLM-generated datasets, while also achieving promising results on two human-written spam datasets. Furthermore, FraudSquad maintains a modest model size and requires minimal labeled training data, making it a practical solution for real-world applications. Our contributions include new synthetic datasets, a practical detection framework, and empirical evidence highlighting the urgency of adapting spam detection to the LLM era. Our code and datasets are available at: https://anonymous.4open.science/r/FraudSquad-5389/.

Dane Williamson,Yangfeng Ji,Matthew Dwyer

Main category: cs.CL

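TL;DR: Large language models perform well at mathematical problem solving but tend to fail when a problem's syntax deviates from the training distribution. This paper introduces the notion of "syntactic blind spots", in which models misapply reasoning strategies because of a brittle coupling between surface form and internal representation. Rephrasing with syntactic templates drawn from correctly answered examples substantially improves accuracy, and a syntactic complexity measure based on Dependency Locality Theory (DLT) correlates positively with error rates, indicating that many reasoning errors stem from structural misalignment rather than conceptual difficulty.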

Details

Motivation: To study why LLMs fail at reasoning under syntactic variation, and whether these errors come from over-reliance on surface form rather than a real lack of mathematical ability. Method: Identifies failure patterns on problems that are semantically simple but syntactically unfamiliar; rephrases incorrectly answered questions using syntactic templates from correctly answered ones; introduces a syntactic complexity metric based on Dependency Locality Theory (DLT) and analyzes its relationship with error rates. Result: Rephrased questions, with semantics preserved, substantially improve model accuracy; higher DLT scores go with higher failure rates across multiple datasets, showing a strong association between syntactic complexity and reasoning failure. Conclusion: Many LLM reasoning errors stem from a mismatch between syntactic structure and internal representation rather than from conceptual deficits; syntax-aware interventions can expose and mitigate these inductive failures, suggesting that models need greater robustness to varied sentence structures. Abstract: Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities but frequently fail on problems that deviate syntactically from their training distribution. We identify a systematic failure mode, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically straightforward but phrased in unfamiliar ways. These errors are not due to gaps in mathematical competence, but rather reflect a brittle coupling between surface form and internal representation. To test this, we rephrase incorrectly answered questions using syntactic templates drawn from correct examples. These rephrasings, which preserve semantics while reducing structural complexity, often lead to correct answers. We quantify syntactic complexity using a metric based on Dependency Locality Theory (DLT), and show that higher DLT scores are associated with increased failure rates across multiple datasets. Our findings suggest that many reasoning errors stem from structural misalignment rather than conceptual difficulty, and that syntax-aware interventions can reveal and mitigate these inductive failures.

[65] SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning

Shicheng Liu,Kai Sun,Lisheng Fu,Xilun Chen,Xinyuan Zhang,Zhaojiang Lin,Rulin Shao,Yue Liu,Anuj Kumar,Wen-tau Yih,Xin Luna Dong

Main category: cs.CL

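TL;DR: This paper proposes SCRIBES, a reinforcement learning framework for large-scale extraction of semi-structured web content. It exploits layout similarity across pages within a site to generate reusable extraction scripts and trains iteratively on synthetic annotations from CommonCrawl data, substantially improving script quality and downstream task performance.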

Details

Motivation: Semi-structured web content such as tables, lists, and infoboxes holds a large amount of factual data, but its formatting is complex, and existing methods fall short in generalization or resource efficiency, making efficient and reliable extraction difficult. Method: Proposes the SCRIBES framework, which uses reinforcement learning with layout similarity across pages of the same site as the reward signal to generate extraction scripts reusable across many similar pages, and trains iteratively on synthetic annotations from in-the-wild CommonCrawl data. Result: Experiments show the method outperforms strong baselines by over 13% in script quality and raises GPT-4o's accuracy on downstream question answering by more than 4%. Conclusion: SCRIBES enables scalable, resource-efficient web information extraction and offers a new approach to large-scale semi-structured data extraction. Abstract: Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging. Existing methods either lack generalization or are resource-intensive due to per-page LLM inference. In this paper, we introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework that leverages layout similarity across webpages within the same site as a reward signal. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Our approach further improves by iteratively training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.

[66] Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models

Ece Takmaz,Lisa Bylinina,Jakub Dotlacil

Main category: cs.CL

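TL;DR: This paper explores developing language-only and multimodal models in low-resource settings, addressing the gap between the data used by vision-and-language models and the data children encounter when acquiring language. Multimodal models are found to underperform on language-only tasks, and merging their parameters with those of language-only models (model merging) improves language ability while maintaining multimodal performance.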

Details

Motivation: Current vision-and-language models rely on large parameter counts and datasets far exceeding what children are exposed to when learning language. To stay closer to child language acquisition, models need to be built on low-resource, developmentally plausible datasets, and the weak performance of multimodal models on language tasks needs to be addressed. Method: Trains language-only and multimodal models in low-resource settings on developmentally plausible datasets; applies model merging with weighted linear interpolation, fusing the parameters of multimodal and language-only models to retain language ability. Result: The multimodal models outperform previous baselines in the BabyLM challenge; multimodal models underperform on grammar-focused language tasks, and model merging partly alleviates this while maintaining multimodal performance. Conclusion: Model merging is an effective strategy for strengthening the language understanding of multimodal models without sacrificing multimodal performance, offering a viable path toward multimodal language models that better reflect human learning. Abstract: State-of-the-art vision-and-language models consist of many parameters and learn from enormous datasets, surpassing the amounts of linguistic data that children are exposed to as they acquire a language. This paper presents our approach to the multimodal track of the BabyLM challenge addressing this discrepancy. We develop language-only and multimodal models in low-resource settings using developmentally plausible datasets, with our multimodal models outperforming previous BabyLM baselines. One finding in the multimodal language model literature is that these models tend to underperform in \textit{language-only} tasks. Therefore, we focus on maintaining language-only abilities in multimodal models. To this end, we experiment with \textit{model merging}, where we fuse the parameters of multimodal models with those of language-only models using weighted linear interpolation. Our results corroborate the findings that multimodal models underperform in language-only benchmarks that focus on grammar, and model merging with text-only models can help alleviate this problem to some extent, while maintaining multimodal performance.
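Weighted linear interpolation of parameters can be sketched in a few lines. This is a toy illustration over plain Python lists rather than real model tensors; the `merge_models` name, the `alpha` weight, and the assumption that both models share identical parameter names and shapes are ours, not details from the paper.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def merge_models(multimodal: Params, text_only: Params, alpha: float) -> Params:
    """Return alpha * multimodal + (1 - alpha) * text_only, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if multimodal.keys() != text_only.keys():
        raise ValueError("models must share the same parameter names")
    merged: Params = {}
    for name in multimodal:
        mm_values, txt_values = multimodal[name], text_only[name]
        if len(mm_values) != len(txt_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            alpha * m + (1.0 - alpha) * t for m, t in zip(mm_values, txt_values)
        ]
    return merged


# Toy weights standing in for two checkpoints of the same architecture.
multimodal_weights = {"layer.w": [1.0, 3.0], "layer.b": [0.0]}
text_only_weights = {"layer.w": [3.0, 1.0], "layer.b": [2.0]}
merged_weights = merge_models(multimodal_weights, text_only_weights, alpha=0.5)
```

Setting `alpha` closer to 1 keeps more of the multimodal model, and closer to 0 keeps more of the text-only model, which is the trade-off the abstract describes between multimodal and language-only performance.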

[67] REPAIR: Robust Editing via Progressive Adaptive Intervention and Reintegration

Yisu Wang,Ming Wang,Haoyuan Song,Wenjie Huang,Chaozheng Wang,Yi Xie,Xuming Ran

Main category: cs.CL

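TL;DR: Proposes the REPAIR framework, which uses progressive adaptive intervention and reintegration to update large models efficiently, cheaply, and precisely, substantially improving editing accuracy and reducing knowledge forgetting.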

Details

Motivation: To address the high cost of acquiring new knowledge or correcting errors in post-training of large language models, and the side effects that retraining brings. Method: Designs the REPAIR framework, which uses a closed-loop feedback mechanism and dynamic memory management to mitigate the instability of large-scale sequential edits, and combines frequent knowledge fusion with strong locality guards to avoid the ripple effects that traditional methods overlook. Result: Experiments show that REPAIR improves editing accuracy by 10%-30% across multiple model families and significantly reduces knowledge forgetting. Conclusion: REPAIR provides a robust editing framework for building reliable, scalable, and continually evolving large language models. Abstract: Post-training for large language models (LLMs) is constrained by the high cost of acquiring new knowledge or correcting errors and by the unintended side effects that frequently arise from retraining. To address these issues, we introduce REPAIR (Robust Editing via Progressive Adaptive Intervention and Reintegration), a lifelong editing framework designed to support precise and low-cost model updates while preserving non-target knowledge. REPAIR mitigates the instability and conflicts of large-scale sequential edits through a closed-loop feedback mechanism coupled with dynamic memory management. Furthermore, by incorporating frequent knowledge fusion and enforcing strong locality guards, REPAIR effectively addresses the shortcomings of traditional distribution-agnostic approaches that often overlook unintended ripple effects. Our experiments demonstrate that REPAIR boosts editing accuracy by 10%-30% across multiple model families and significantly reduces knowledge forgetting. This work introduces a robust framework for developing reliable, scalable, and continually evolving LLMs.

[68] Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey

Qiyuan Liu,Hao Xu,Xuhong Chen,Wei Chen,Yee Whye Teh,Ning Miao

Main category: cs.CL

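TL;DR: This paper gives a systematic introduction to reward models (RMs) for improving the reasoning of large language models (LLMs), surveying their architectures, training methods, evaluation techniques, and applications in inference, data synthesis, and reinforcement learning fine-tuning, and discussing open questions on RM selection, generalization, evaluation, and enhancement.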

Details

Motivation: Reward models play a key role in improving LLM reasoning performance, but a systematic introduction and a comprehensive survey of their applications have been missing; existing work needs to be organized and future directions identified. Method: Reviews the fundamentals of reward models (architectures, training, and evaluation) and categorizes their three main applications in LLM reasoning: guiding generation and output selection, data synthesis and self-improvement, and reinforcement learning fine-tuning; open questions are discussed with the support of empirical analysis. Result: Provides a comprehensive survey of reward models in LLM reasoning, clarifies their core application scenarios and current challenges, and raises several open questions based on existing research and empirical results. Conclusion: The paper offers actionable insights for the effective deployment and further development of reward models, helping advance LLM reasoning. Abstract: Reward models (RMs) play a critical role in enhancing the reasoning performance of LLMs. For example, they can provide training signals to finetune LLMs during reinforcement learning (RL) and help select the best answer from multiple candidates during inference. In this paper, we provide a systematic introduction to RMs, along with a comprehensive survey of their applications in LLM reasoning. We first review fundamental concepts of RMs, including their architectures, training methodologies, and evaluation techniques. Then, we explore their key applications: (1) guiding generation and selecting optimal outputs during LLM inference, (2) facilitating data synthesis and iterative self-improvement for LLMs, and (3) providing training signals in RL-based finetuning. Finally, we address critical open questions regarding the selection, generalization, evaluation, and enhancement of RMs, based on existing research and our own empirical findings. Our analysis aims to provide actionable insights for the effective deployment and advancement of RMs for LLM reasoning.
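Selecting the best answer from multiple candidates, the first application listed in the abstract, is often done with best-of-N sampling: score every candidate with the reward model and keep the highest. The sketch below shows that pattern with a hypothetical `toy_reward` function standing in for a trained reward model.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward (first one wins on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)


def toy_reward(answer: str) -> float:
    """Stand-in scorer: 1.0 if the answer text ends with '4', else 0.0."""
    return 1.0 if answer.strip().endswith("4") else 0.0


sampled_answers = ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"]
picked = best_of_n(sampled_answers, toy_reward)
```

In practice `reward_fn` would wrap a learned outcome or process reward model, and the quality of the selection is bounded by how well that model ranks candidates.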

[69] Inverse Language Modeling towards Robust and Grounded LLMs

Davide Gabrielli,Simone Sestito,Iacopo Masi

Main category: cs.CL

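TL;DR: Proposes the Inverse Language Modeling (ILM) framework, which jointly improves the robustness of large language models to input perturbations and enables native grounding by inverting model outputs to trace harmful inputs.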

Details

Motivation: Current defense mechanisms for large language models are fragmented and immature, lacking the systematic robustness methods available for traditional classifiers. Method: Proposes the Inverse Language Modeling (ILM) framework, which inverts model outputs to identify potentially toxic or unsafe input triggers and strengthens robustness to input perturbations. Result: ILM turns large language models from static generators into analyzable, robust systems, supports red teaming, and lays the groundwork for more controllable and trustworthy next-generation models. Conclusion: ILM provides a unified mechanism for robustness and grounding in large language models, advancing their safety and controllability. Abstract: The current landscape of defensive mechanisms for LLMs is fragmented and underdeveloped, unlike prior work on classifiers. To further promote adversarial robustness in LLMs, we propose Inverse Language Modeling (ILM), a unified framework that simultaneously 1) improves the robustness of LLMs to input perturbations, and, at the same time, 2) enables native grounding by inverting model outputs to identify potentially toxic or unsafe input triggers. ILM transforms LLMs from static generators into analyzable and robust systems, potentially helping red teaming. ILM can lay the foundation for next-generation LLMs that are not only robust and grounded but also fundamentally more controllable and trustworthy. The code is publicly available at github.com/davegabe/pag-llm.

[70] Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning

Qi He,Cheng Qian,Xiusi Chen,Bingxiang He,Yi R. Fung,Heng Ji

Main category: cs.CL

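TL;DR: This paper proposes Veri-R1, an online reinforcement learning framework that lets a large language model interact with a search engine and uses reward signals to improve its planning, retrieval, and reasoning in claim verification, substantially raising verification accuracy and evidence score.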

Details

Motivation: Existing claim verification methods mostly rely on prompt engineering or predesigned reasoning workflows and lack a unified training paradigm for improving retrieval and reasoning skills, which makes them ill-suited to the dynamic interaction of real-world scenarios. Method: Proposes the Veri-R1 framework, in which online reinforcement learning lets the LLM interact dynamically with a search engine, with designed reward signals shaping its planning, evidence retrieval, and reasoning behaviors for end-to-end joint optimization. Result: Experiments show that Veri-R1 improves joint accuracy by up to 30% and doubles the evidence score, often outperforming larger models; ablation studies confirm the role of each reward component and the relationship between output logits and label accuracy. Conclusion: Online reinforcement learning effectively improves the precision and faithfulness of LLMs in claim verification, and Veri-R1 offers a viable path and foundation for future LLM-driven verification systems. Abstract: Claim verification with large language models (LLMs) has recently attracted considerable attention, owing to their superior reasoning capabilities and transparent verification pathways compared to traditional answer-only judgments. Online claim verification requires iterative evidence retrieval and reasoning, yet existing approaches mainly rely on prompt engineering or predesigned reasoning workflows without offering a unified training paradigm to improve necessary skills. Therefore, we introduce Veri-R1, an online reinforcement learning (RL) framework that enables an LLM to interact with a search engine and to receive reward signals that explicitly shape its planning, retrieval, and reasoning behaviors. The dynamic interaction between models and retrieval systems more accurately reflects real-world verification scenarios and fosters comprehensive verification skills. Empirical results show that Veri-R1 improves joint accuracy by up to 30% and doubles evidence score, often surpassing larger-scale counterparts. Ablation studies further reveal the impact of reward components and the link between output logits and label accuracy. Our results highlight the effectiveness of online RL for precise and faithful claim verification and provide a foundation for future research. We release our code to support community progress in LLM-empowered claim verification.

[71] Taking a SEAT: Predicting Value Interpretations from Sentiment, Emotion, Argument, and Topic Annotations

Adina Nicola Dobrinoiu,Ana Cristiana Marcu,Amir Homayounirad,Luciano Cavalcante Siebert,Enrico Liscio

Main category: cs.CL

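TL;DR: This study examines whether language models can predict an individual's value interpretations from multi-dimensional subjective annotations (sentiment, emotion, argument, and topic), and finds that combining information from several dimensions substantially improves prediction performance.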

Details

Motivation: Because value interpretations are shaped by sociocultural background and personal experience and are therefore subjective, it is important to develop AI systems that understand diverse human perspectives and avoid bias toward majority viewpoints. Method: Uses subjective annotations along four dimensions, Sentiment, Emotion, Argument, and Topic (SEAT), as a proxy for an individual's interpretive lens, and evaluates how well language models predict individual value interpretations in zero-shot and few-shot settings. Result: Models perform best when given all SEAT dimensions at once, outperforming single dimensions and a baseline with no individual information; differences across annotators further underline the importance of accounting for individual subjectivity. Conclusion: This is the first attempt to study the effect of annotation behavior on value prediction in a controlled setting, laying a foundation for future large-scale validation. Abstract: Our interpretation of value concepts is shaped by our sociocultural background and lived experiences, and is thus subjective. Recognizing individual value interpretations is important for developing AI systems that can align with diverse human perspectives and avoid bias toward majority viewpoints. To this end, we investigate whether a language model can predict individual value interpretations by leveraging multi-dimensional subjective annotations as a proxy for their interpretive lens. That is, we evaluate whether providing examples of how an individual annotates Sentiment, Emotion, Argument, and Topics (SEAT dimensions) helps a language model in predicting their value interpretations. Our experiment across different zero- and few-shot settings demonstrates that providing all SEAT dimensions simultaneously yields superior performance compared to individual dimensions and a baseline where no information about the individual is provided. Furthermore, individual variations across annotators highlight the importance of accounting for the subjectivity of individual annotators. To the best of our knowledge, this controlled setting, although small in size, is the first attempt to go beyond demographics and investigate the impact of annotation behavior on value prediction, providing a solid foundation for future large-scale validation.

[72] Exploring Database Normalization Effects on SQL Generation

Ryosuke Kohita

Main category: cs.CL

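TL;DR: This is the first systematic study of how database schema normalization affects natural language to SQL (NL2SQL) performance; denormalized schemas do better on simple retrieval queries, while normalized schemas have the advantage on aggregation queries.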

Details

Motivation: Existing NL2SQL research mostly ignores the effect of schema design; this paper aims to fill that gap by systematically evaluating how different normalization levels affect model performance. Method: Builds synthetic datasets with formal normalization (1NF-3NF) and real academic paper datasets with practical schemes, and evaluates eight leading large language models across the different schemas. Result: Denormalized schemas give high accuracy on simple queries, especially in zero-shot settings; normalized schemas do better on aggregation queries because they avoid data duplication and NULL value problems, and the join-related errors they introduce are substantially mitigated by a few examples. Conclusion: The optimal schema design for an NL2SQL system depends on the target query types; schemas should be selected adaptively for the application scenario, and schema design deserves attention during development. Abstract: Schema design, particularly normalization, is a critical yet often overlooked factor in natural language to SQL (NL2SQL) systems. Most prior research evaluates models on fixed schemas, overlooking the influence of design on performance. We present the first systematic study of schema normalization's impact, evaluating eight leading large language models on synthetic and real-world datasets with varied normalization levels. We construct controlled synthetic datasets with formal normalization (1NF-3NF) and real academic paper datasets with practical schemes. Our results show that denormalized schemas offer high accuracy on simple retrieval queries, even with cost-effective models in zero-shot settings. In contrast, normalized schemas (2NF/3NF) introduce challenges such as errors in base table selection and join type prediction; however, these issues are substantially mitigated by providing few-shot examples. For aggregation queries, normalized schemas yielded better performance, mainly due to their robustness against the data duplication and NULL value issues that cause errors in denormalized schemas. These findings suggest that the optimal schema design for NL2SQL applications depends on the types of queries to be supported. Our study demonstrates the importance of considering schema design when developing NL2SQL interfaces and integrating adaptive schema selection for real-world scenarios.

[73] LLM-Based Multi-Task Bangla Hate Speech Detection: Type, Severity, and Target

Md Arid Hasan,Firoj Alam,Md Fahad Hossain,Usman Naseem,Syed Ishtiaque Ahmed

Main category: cs.CL

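TL;DR: This paper introduces BanglaMultiHate, the first multi-task hate speech dataset for Bangla, evaluates the adaptability of large language models in a low-resource setting through a comparison of several model families, and stresses the importance of culturally and linguistically grounded pretraining.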

Details

Motivation: Existing hate speech detection research for low-resource languages such as Bangla is mostly single-task with limited coverage and lacks a combined analysis of multi-facet signals (type, severity, target), so a more comprehensive dataset and evaluation are needed. Method: Builds BanglaMultiHate, a large, manually annotated multi-task Bangla hate speech dataset, and systematically compares classical baselines, monolingual pretrained models, and large language models under zero-shot prompting and LoRA fine-tuning. Result: Although LoRA-tuned large language models are competitive with BanglaBERT, culturally and linguistically grounded pretraining remains critical for model performance. Conclusion: BanglaMultiHate provides a stronger benchmark for hate speech detection in low-resource languages, and the findings underline the importance of developing culturally aligned content moderation tools. Abstract: Online social media platforms are central to everyday communication and information seeking. While these platforms serve positive purposes, they also provide fertile ground for the spread of hate speech, offensive language, and bullying content targeting individuals, organizations, and communities. Such content undermines safety, participation, and equity online. Reliable detection systems are therefore needed, especially for low-resource languages where moderation tools are limited. In Bangla, prior work has contributed resources and models, but most are single-task (e.g., binary hate/offense) with limited coverage of multi-facet signals (type, severity, target). We address these gaps by introducing the first multi-task Bangla hate-speech dataset, BanglaMultiHate, one of the largest manually annotated corpora to date. Building on this resource, we conduct a comprehensive, controlled comparison spanning classical baselines, monolingual pretrained models, and LLMs under zero-shot prompting and LoRA fine-tuning. Our experiments assess LLM adaptability in a low-resource setting and reveal a consistent trend: although LoRA-tuned LLMs are competitive with BanglaBERT, culturally and linguistically grounded pretraining remains critical for robust performance. Together, our dataset and findings establish a stronger benchmark for developing culturally aligned moderation tools in low-resource contexts. For reproducibility, we will release the dataset and all related scripts.

[74] Style Over Story: A Process-Oriented Study of Authorial Creativity in Large Language Models

Donghoon Jung,Jiwoo Choi,Songeun Chae,Seohyon Jung

Main category: cs.CL

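TL;DR: Taking a narratological perspective, this study uses a constraint-based decision framework to analyze the creative process of large language models as computational authors; the models generally prioritize style over character, event, or setting, and distinct creative preferences emerge across models.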

Details

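Motivation: Existing evaluations of LLM creativity mostly focus on output quality and lack in-depth analysis of the generation process, so a new method is needed to examine models' creative behavior systematically. Method: Introduces a constraint-based decision mechanism, assigns authorial personas through controlled prompting, and draws on narratological theory to analyze the models' preferences among Style, Character, Event, and Setting, along with the reasoning they give. Result: LLMs consistently favor the Style element in creative decisions; different models show identifiable creative decision profiles; and the models' explanations of their choices reveal differences in underlying creative logic. Conclusion: The method offers a novel, systematic tool for assessing AI's authorial creativity and argues that LLM creativity should be understood through the creative process, not only through outputs. Abstract: Evaluations of the creativity of large language models (LLMs) have focused primarily on the quality of their outputs rather than the processes that shape them. This study takes a process-oriented approach, drawing on narratology to examine LLMs as computational authors. We introduce constraint-based decision-making as a lens for authorial creativity. Using controlled prompting to assign authorial personas, we analyze the creative preferences of the models. Our findings show that LLMs consistently emphasize Style over other elements, including Character, Event, and Setting. By also probing the reasoning the models provide for their choices, we show that distinctive profiles emerge across models and argue that our approach provides a novel systematic tool for analyzing AI's authorial creativity.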

[75] Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage

Siddhant Arora,Haidar Khan,Kai Sun,Xin Luna Dong,Sajal Choudhary,Seungwhan Moon,Xinyuan Zhang,Adithya Sagar,Surya Teja Appini,Kaushik Patnaik,Sanat Sharma,Shinji Watanabe,Anuj Kumar,Ahmed Aly,Yue Liu,Florian Metze,Zhaojiang Lin

Main category: cs.CL

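TL;DR: This paper proposes Streaming Retrieval-Augmented Generation (Streaming RAG), a framework for end-to-end spoken dialogue systems that predicts tool queries while the user is still speaking, significantly reducing latency and improving question-answering accuracy.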

Details

Motivation: Existing spoken dialogue systems are prone to hallucination because they lack factual grounding. Text-based systems mitigate this with tool calls, but integrating tools directly into speech systems increases response latency and disrupts conversational flow, so a tool-use method that reduces perceived latency is needed. Method: Proposes the Streaming RAG framework, which uses a post-training pipeline so the model can decide when to issue tool calls during the user's ongoing speech and fuse the retrieved text results with the audio input to generate spoken responses; also builds the AudioCRAG benchmark for evaluation. Result: The method raises question-answering accuracy by up to 200% relative (from 11.1% to 34.2% absolute) and reduces tool use latency by 20%; it is also modality-agnostic and applies equally to typed input. Conclusion: Streaming RAG addresses the latency introduced by tool calls in spoken dialogue systems, improving factual accuracy while preserving real-time interaction, and offers a viable path toward real-time, agentic AI assistants. Abstract: End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR-LLM-TTS pipelines, generating more natural, expressive responses with significantly lower latency. However, these systems remain prone to hallucinations due to limited factual grounding. While text-based dialogue systems address this challenge by integrating tools such as web search and knowledge graph APIs, we introduce the first approach to extend tool use directly into speech-in speech-out systems. A key challenge is that tool integration substantially increases response latency, disrupting conversational flow. To mitigate this, we propose Streaming Retrieval-Augmented Generation (Streaming RAG), a novel framework that reduces user-perceived latency by predicting tool queries in parallel with user speech, even before the user finishes speaking. Specifically, we develop a post-training pipeline that teaches the model when to issue tool calls during ongoing speech and how to generate spoken summaries that fuse audio queries with retrieved text results, thereby improving both accuracy and responsiveness. To evaluate our approach, we construct AudioCRAG, a benchmark created by converting queries from the publicly available CRAG dataset into speech form. Experimental results demonstrate that our streaming RAG approach increases QA accuracy by up to 200% relative (from 11.1% to 34.2% absolute) and further enhances user experience by reducing tool use latency by 20%. Importantly, our streaming RAG approach is modality-agnostic and can be applied equally to typed input, paving the way for more agentic, real-time AI assistants.
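
As a quick check on how the relative and absolute figures in this abstract relate, the arithmetic can be sketched as below. Only the two accuracy values (11.1% and 34.2%) come from the abstract; the function name `relative_gain` is a hypothetical helper introduced for illustration.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a percentage."""
    return (after - before) / before * 100.0


# Accuracy values (%) as quoted in the abstract.
acc_before = 11.1
acc_after = 34.2

absolute_gain = acc_after - acc_before          # percentage points
rel_gain = relative_gain(acc_before, acc_after)  # percent of the starting value

print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {rel_gain:.0f}%")
```

The relative gain computed this way is a little over 200%, which is consistent with the abstract's rounded "up to 200% relative" wording.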

[76] From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens

Hala Sheta,Eric Huang,Shuyu Wu,Ilia Alenabi,Jiajun Hong,Ryker Lin,Ruoxi Ning,Daniel Wei,Jialin Yang,Jiawei Zhou,Ziqiao Ma,Freda Shi

Main category: cs.CL

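TL;DR: VLM-Lens is an open-source toolkit for systematically benchmarking, analyzing, and interpreting vision-language models (VLMs); it supports extracting intermediate outputs from any layer and provides a unified, configurable interface compatible with many mainstream VLMs and their variants.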

Details

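Motivation: Understanding the internal mechanisms of vision-language models calls for a tool that can extract and analyze intermediate representations in a unified, flexible way, whereas existing approaches tend to be model-specific and hard to extend. Method: Designs and implements the VLM-Lens toolkit, which abstracts away model complexity, exposes a YAML configuration interface, supports extracting intermediate outputs from any layer during the forward pass, and integrates with a range of interpretability analysis methods. Result: VLM-Lens currently supports 16 state-of-the-art base VLMs and more than 30 of their variants, demonstrates differences in hidden representations across layers and concepts, and can be extended to new models with little effort. Conclusion: VLM-Lens gives the research community a powerful, easy-to-use tool that can accelerate the understanding and improvement of vision-language models. Abstract: We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 16 state-of-the-art base VLMs and more than 30 of their variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate its usage with two simple analytical experiments, revealing systematic differences in the hidden representations of VLMs across layers and target concepts. VLM-Lens is released as an open-source project to accelerate community efforts in understanding and improving VLMs.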

[77] Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems

Siddhant Arora,Jinchuan Tian,Hayato Futami,Jiatong Shi,Yosuke Kashiwagi,Emiru Tsunoo,Shinji Watanabe

Main category: cs.CL

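TL;DR: Proposes SCoT, a streaming chain-of-thought (CoT) framework for duplex spoken dialogue systems that processes user input and generates responses block by block, improving semantic coherence and real-time interaction.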

Details

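Motivation: Existing end-to-end dialogue systems rely on voice activity detection (VAD) for turn-taking, but VAD struggles to tell pauses from the end of an utterance; duplex models avoid VAD but are structurally complex and weak at semantic reasoning. Method: Designs the SCoT framework, which uses a streaming chain-of-thought mechanism to process input in fixed-duration blocks and uses frame-level alignments to produce intermediate targets (aligned transcripts and responses), alternating between input processing and response generation across blocks. Result: Experiments show that the method generates more coherent and interpretable responses than existing duplex methods while supporting lower latency and overlapping interaction. Conclusion: SCoT retains the advantages of duplex interaction while improving semantic reasoning and response quality, offering a new direction for efficient, natural spoken dialogue systems. Abstract: Most end-to-end (E2E) spoken dialogue systems (SDS) rely on voice activity detection (VAD) for turn-taking, but VAD fails to distinguish between pauses and turn completions. Duplex SDS models address this by predicting output continuously, including silence tokens, thus removing the need for explicit VAD. However, they often have complex dual-channel architecture and lag behind cascaded models in semantic reasoning. To overcome these challenges, we propose SCoT: a Streaming Chain-of-Thought (CoT) framework for Duplex SDS, alternating between processing fixed-duration user input and generating responses in a blockwise manner. Using frame-level alignments, we create intermediate targets, namely aligned user transcripts and system responses, for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods while supporting lower-latency and overlapping interactions compared to turn-by-turn systems.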

[78] The Disparate Impacts of Speculative Decoding

Jameson Sandler,Ahmet Üstün,Marco Romanelli,Sara Hooker,Ferdinando Fioretto

Main category: cs.CL

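TL;DR: This paper studies how the speed-up from speculative decoding varies across tasks, finds that under-fit and underrepresented tasks see smaller speed-ups, and proposes a mitigation strategy that improves the fairness metric by 12% on average across several model pairs.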

Details

Motivation: Speculative decoding systematically reduces the decoding time of large language models, but its speed-up may be uneven across tasks and may be unfair to under-fit or underrepresented tasks in particular, so this disparity needs to be analyzed and reduced. Method: Quantifies the "unfairness" of speculative decoding through theoretical analysis and identifies the key factors behind the speed-up disparities; guided by these insights, proposes a mitigation strategy to reduce them. Result: Experiments confirm that the speed-up from speculative decoding is uneven across tasks, and the proposed mitigation strategy improves the fairness metric by 12% on average across several model pairs. Conclusion: Speculative decoding exhibits unfair speed-ups across tasks, and the proposed strategy effectively mitigates the problem, making inference efficiency gains more equitable. Abstract: The practice of speculative decoding, whereby inference is probabilistically supported by a smaller, cheaper, "drafter" model, has become a standard technique for systematically reducing the decoding time of large language models. This paper conducts an analysis of speculative decoding through the lens of its potential disparate speed-up rates across tasks. Crucially, the paper shows that speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing for under-fit and often underrepresented tasks. To better understand this phenomenon, we derive an analysis to quantify this observed "unfairness" and draw attention to the factors that motivate such disparate speed-ups to emerge. Further, guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, revealing on average a 12% improvement in our fairness metric.
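
To see why speed-up can differ across tasks, a common simplified model of speculative decoding may help. This is not the analysis derived in the paper above; it is a standard textbook-style approximation that assumes each drafted token is accepted independently with probability `alpha`, and that the drafter proposes `gamma` tokens per verification step. Under those assumptions the expected number of tokens produced per target-model pass is (1 - alpha^(gamma+1)) / (1 - alpha). The acceptance rates used below are hypothetical, not measured values.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes i.i.d. token acceptance with rate `alpha` and a draft of
    `gamma` tokens (a simplifying assumption, not the paper's model).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the target.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates: a task the drafter fits well versus one it fits poorly.
for task, alpha in [("well-fit task", 0.9), ("under-fit task", 0.5)]:
    tokens = expected_tokens_per_step(alpha, gamma=4)
    print(f"{task}: alpha={alpha}, expected tokens per step = {tokens:.2f}")
```

In this simplified model, a task on which the drafter agrees less often with the target model yields fewer accepted tokens per step and therefore a smaller speed-up, which is in line with the disparity the abstract describes.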

[79] RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization

Zhaoning Yu,Will Su,Leitian Tao,Haozhu Wang,Aashu Singh,Hanchao Yu,Jianyu Wang,Hongyang Gao,Weizhe Yuan,Jason Weston,Ping Yu,Jing Xu

Main category: cs.CL

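TL;DR: RESTRAIN is a self-penalizing reinforcement learning framework that needs no gold labels; it improves itself using signals from the model's answer distribution (for example, by penalizing overconfident rollouts and low-consistency examples) and delivers large gains on several challenging reasoning benchmarks.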

Details

Motivation: Reinforcement learning on human-annotated data improves the reasoning of large models but is costly and falters on harder tasks, so a method is needed that learns continually from unlabeled data without curated labels. Method: Proposes the RESTRAIN framework, which uses the answer distribution generated by the model itself as the learning signal, self-penalizes overconfident rollouts and low-consistency examples, and plugs into policy optimization methods such as GRPO for continual self-improvement without supervision. Result: On challenging reasoning benchmarks including AIME25, MMLU_STEM, and GPQA-Diamond, RESTRAIN improves Pass@1 by up to +140.7%, +36.2%, and +19.6% respectively using only unlabeled data, nearly matching training with gold labels. Conclusion: RESTRAIN offers a scalable and effective path for self-improvement of large reasoning models without gold labels. Abstract: Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLU_STEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.

[80] Learning to Reason for Hallucination Span Detection

Hsuan Su,Ting-Yao Hu,Hema Swetha Koppula,Kundan Krishna,Hadi Pouransari,Cheng-Yu Hsieh,Cem Koc,Joseph Yitan Cheng,Oncel Tuzel,Raviteja Vemulapalli

Main category: cs.CL

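TL;DR: Proposes RL4HS, a reinforcement learning framework for detecting hallucinated spans in large language model output; with a fine-grained reward mechanism, it outperforms pretrained reasoning models and supervised fine-tuning on several tasks.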

Details

Motivation: Prior work mostly treats hallucination detection as a binary classification task, whereas real applications need to identify the specific hallucinated spans, which calls for a finer-grained detection method. Method: Proposes the RL4HS framework, which combines chain-of-thought (CoT) reasoning with reinforcement learning and uses a fine-grained reward mechanism built on Group Relative Policy Optimization and Class-Aware Policy Optimization to improve hallucination span detection. Result: Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS outperforms pretrained reasoning models and supervised fine-tuning, confirming the importance of fine-grained rewards for hallucination span detection. Conclusion: Reinforcement learning with fine-grained rewards effectively improves the detection of hallucinated spans in large language models and is key to solving this complex task. Abstract: Large language models (LLMs) often generate hallucinations -- unsupported content that undermines reliability. While most prior works frame hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision-making process. This naturally raises the question of whether explicit reasoning can help the complex task of detecting hallucination spans. To answer this question, we first evaluate pretrained models with and without Chain-of-Thought (CoT) reasoning, and show that CoT reasoning has the potential to generate at least one correct answer when sampled multiple times. Motivated by this, we propose RL4HS, a reinforcement learning framework that incentivizes reasoning with a span-level reward function. RL4HS builds on Group Relative Policy Optimization and introduces Class-Aware Policy Optimization to mitigate the reward imbalance issue. Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the necessity of reinforcement learning with span-level rewards for detecting hallucination spans.

[81] ARUQULA -- An LLM based Text2SPARQL Approach using ReAct and Knowledge Graph Exploration Utilities

Felix Brei,Lorenz Bühmann,Johannes Frey,Daniel Gerber,Lars-Peter Meyer,Claus Stadler,Kirill Bulert

Main category: cs.CL

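TL;DR: This paper presents a generalized method based on SPINACH that uses a large language model to translate natural language questions into SPARQL queries iteratively, lowering the barrier to querying knowledge graphs.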

Details

Motivation: To make it easier for people without a computer science background to query knowledge graphs with SPARQL, and to respond to the Text2SPARQL challenge, which aims to advance the field. Method: Uses an LLM-based agent (SPINACH) that translates natural language to SPARQL through an iterative process of exploration and execution instead of single-shot generation. Result: Describes the overall system architecture and design rationale, and analyzes the agent's behavior in depth to identify directions for future improvement. Conclusion: The method effectively supports translation from natural language to SPARQL; the iterative approach helps improve query accuracy and provides useful insights for the development of Text2SPARQL technology. Abstract: Interacting with knowledge graphs can be a daunting task for people without a background in computer science since the query language that is used (SPARQL) has a high barrier to entry. Large language models (LLMs) can lower that barrier by providing support in the form of Text2SPARQL translation. In this paper we introduce a generalized method based on SPINACH, an LLM-backed agent that translates natural language questions to SPARQL queries not in a single shot, but as an iterative process of exploration and execution. We describe the overall architecture and reasoning behind our design decisions, and also conduct a thorough analysis of the agent behavior to gain insights into future areas for targeted improvements. This work was motivated by the Text2SPARQL challenge, a challenge that was held to facilitate improvements in the Text2SPARQL domain.

[82] Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents

Lingzhong Dong,Ziqi Zhou,Shuaibo Yang,Haiyue Sheng,Pengzhou Cheng,Zongru Wu,Zheng Wu,Gongshen Liu,Zhuosheng Zhang

Main category: cs.CL

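TL;DR: This paper proposes a new evaluation framework for diagnosing reasoning-execution gaps in mobile-use agents. Its core is the Ground-Truth Alignment (GTA) metric, which measures whether chain-of-thought (CoT) reasoning agrees with the ground-truth action; combined with Exact Match (EM), it exposes both execution gaps and reasoning gaps and shows that sizable execution gaps persist even in large models.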

Details

Motivation: Existing work focuses on execution accuracy but overlooks whether CoT reasoning aligns with the ground-truth action, which can lead users to over-trust plausible-looking reasoning and authorize harmful actions, creating safety risks. Method: Proposes the Ground-Truth Alignment (GTA) metric and combines it with Exact Match (EM) to jointly evaluate reasoning accuracy and execution accuracy, identifying two kinds of gap: the Execution Gap (EG) and the Reasoning Gap (RG). Result: Experiments show that reasoning-execution gaps are widespread, with execution gaps more common than reasoning gaps; scaling up model size narrows the overall gap, but large models still show clear execution gaps; the framework reliably reflects systematic EG/RG patterns in state-of-the-art models. Conclusion: The proposed evaluation framework helps diagnose reasoning-execution inconsistencies in mobile-use agents and gives concrete support for building more trustworthy agents. Abstract: Mobile-use agents powered by vision-language models (VLMs) have shown great potential in interpreting natural language instructions and generating corresponding actions based on the mobile graphical user interface. Recent studies suggest that incorporating chain-of-thought (CoT) reasoning tends to improve the execution accuracy. However, existing evaluations emphasize execution accuracy while neglecting whether CoT reasoning aligns with ground-truth actions. This oversight fails to assess potential reasoning-execution gaps, which in turn foster over-trust: users relying on seemingly plausible CoTs may unknowingly authorize harmful actions, potentially resulting in financial loss or a trust crisis. In this work, we introduce a new evaluation framework to diagnose reasoning-execution gaps. At its core lies Ground-Truth Alignment (GTA), which measures whether the action implied by a CoT matches the ground-truth action. By combining GTA with the standard Exact Match (EM) metric, we jointly assess both the reasoning accuracy and execution accuracy. This joint perspective reveals two types of reasoning-execution gaps: (i) Execution Gap (EG), where the reasoning correctly identifies the correct action but execution fails, and (ii) Reasoning Gap (RG), where execution succeeds but the reasoning process conflicts with the actual execution. Experimental results across a wide range of mobile interaction tasks reveal that reasoning-execution gaps are prevalent, with execution gaps occurring more frequently than reasoning gaps. Moreover, while scaling up model size reduces the overall gap, sizable execution gaps persist even in the largest models. Further analysis shows that our framework reliably reflects systematic EG/RG patterns in state-of-the-art models. These findings offer concrete diagnostics and support the development of more trustworthy mobile-use agents.
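
The two gap types defined in this abstract can be illustrated with a small tally over per-step labels. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each agent step already carries two boolean judgments, `gta` (the action implied by the CoT matches the ground truth) and `em` (the executed action exactly matches the ground truth). The `Step` class, the function names, and the sample data are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Step:
    gta: bool  # CoT-implied action matches the ground-truth action
    em: bool   # executed action exactly matches the ground-truth action


def classify(step: Step) -> str:
    """Map a (GTA, EM) pair to one of four outcome categories."""
    if step.gta and not step.em:
        return "execution_gap"   # reasoning right, execution wrong
    if step.em and not step.gta:
        return "reasoning_gap"   # execution right, reasoning conflicts
    return "both_correct" if step.gta else "both_wrong"


def gap_rates(steps: list[Step]) -> dict[str, float]:
    """Fraction of steps falling into each outcome category."""
    counts = Counter(classify(s) for s in steps)
    categories = ("both_correct", "execution_gap", "reasoning_gap", "both_wrong")
    return {c: counts[c] / len(steps) for c in categories}


# Hypothetical labels for four steps, one per category.
sample = [Step(True, True), Step(True, False), Step(False, True), Step(False, False)]
rates = gap_rates(sample)
print(rates)
```

Reporting the four categories separately, instead of a single accuracy number, is what lets the joint GTA and EM view distinguish a model that reasons well but acts badly from one that acts correctly for the wrong stated reasons.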

[83] More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration

Xiaoyang Yuan,Yujuan Ding,Yi Bin,Wenqi Shao,Jinyu Cai,Jingkuan Song,Yang Yang,Hengtao Shen

Main category: cs.CL

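TL;DR: Proposes AMPO, an adaptive multi-guidance policy optimization framework that draws guidance from multiple teacher models only when needed, improving the reasoning ability and generalization of large language models.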

Details

Motivation: Existing reinforcement learning methods rely on a single teacher or on self-exploration to elicit long chains of thought, which can introduce model bias and limit exploration diversity, hurting reasoning performance. Method: Designs a "guidance-on-demand" mechanism that adaptively draws guidance from multiple proficient teacher models only when the model fails to solve a problem correctly, together with a comprehension-based selection mechanism that picks the reasoning paths the student is most likely to understand. Result: AMPO improves over the strong GRPO baseline by 4.3% on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, and significantly improves Pass@k performance and exploration diversity; with four peer-sized teachers it achieves results comparable to using a single, more powerful teacher (e.g., DeepSeek-R1). Conclusion: AMPO offers a more efficient and scalable path to better reasoning and generalization in large language models and confirms the advantage of multi-teacher guidance in reinforcement learning. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm for enhancing the reasoning ability in Large Language Models (LLMs). However, prevailing methods primarily rely on self-exploration or a single off-policy teacher to elicit long chain-of-thought (LongCoT) reasoning, which may introduce intrinsic model biases and restrict exploration, ultimately limiting reasoning diversity and performance. Drawing inspiration from multi-teacher strategies in knowledge distillation, we introduce Adaptive Multi-Guidance Policy Optimization (AMPO), a novel framework that adaptively leverages guidance from multiple proficient teacher models, but only when the on-policy model fails to generate correct solutions. This "guidance-on-demand" approach expands exploration while preserving the value of self-discovery. Moreover, AMPO incorporates a comprehension-based selection mechanism, prompting the student to learn from the reasoning paths that it is most likely to comprehend, thus balancing broad exploration with effective exploitation. Extensive experiments show AMPO substantially outperforms a strong baseline (GRPO), with a 4.3% improvement on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, while significantly boosting Pass@k performance and enabling more diverse exploration. Notably, using four peer-sized teachers, our method achieves comparable results to approaches that leverage a single, more powerful teacher (e.g., DeepSeek-R1) with more data. These results demonstrate a more efficient and scalable path to superior reasoning and generalizability. Our code is available at https://github.com/SII-Enigma/AMPO.

[84] Enhanced Arabic-language cyberbullying detection: deep embedding and transformer (BERT) approaches

Ebtesam Jaber Aljohani,Wael M. S. Yafoo

Main category: cs.CL

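TL;DR: This paper aims to improve detection of Arabic-language cyberbullying content. It builds a dataset of 10,662 X posts and experiments with several deep learning models, combining different word embeddings and a pretrained BERT model; Bi-LSTM with FastText achieves the best accuracy at 98%.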

Details

Motivation: Research on Arabic-language cyberbullying detection is scarce, while social media threatens the emotional well-being of young people, so effective automated detection methods are urgently needed. Method: Collects and pre-processes 10,662 Arabic-language X posts and uses the kappa tool to ensure annotation quality; the experiments compare LSTM and Bi-LSTM models combined with several word embeddings (such as FastText) and with a pretrained BERT model. Result: The LSTM-BERT and Bi-LSTM-BERT models reach 97% accuracy; Bi-LSTM with FastText word embeddings performs best at 98% accuracy. Conclusion: The Bi-LSTM model with FastText performs best for Arabic-language cyberbullying detection, generalizes well, and provides an effective approach for follow-up research. Abstract: Recent technological advances in smartphones and communications, including the growth of massive social media networks such as X (formerly known as Twitter), endanger young people and their emotional well-being by exposing them to cyberbullying, taunting, and bullying content. Most proposed approaches for automatically detecting cyberbullying have been developed around the English language, and methods for detecting Arabic-language cyberbullying are scarce. This paper aims to enhance the effectiveness of methods for detecting cyberbullying in Arabic-language content. We assembled a dataset of 10,662 X posts, pre-processed the data, and used the kappa tool to verify and enhance the quality of our annotations. We conducted four experiments to test numerous deep learning models for automatically detecting Arabic-language cyberbullying. We first tested a long short-term memory (LSTM) model and a bidirectional long short-term memory (Bi-LSTM) model with several experimental word embeddings. We also tested the LSTM and Bi-LSTM models with a pre-trained bidirectional encoder representations from transformers (BERT) model and then tested them again with a different experimental BERT model. LSTM-BERT and Bi-LSTM-BERT demonstrated 97% accuracy. Bi-LSTM with FastText word embeddings performed even better, achieving 98% accuracy. As a result, the outcomes are generalize

[85] AccurateRAG: A Framework for Building Accurate Retrieval-Augmented Question-Answering Applications

Linh The Nguyen,Chi Tran,Dung Ngoc Nguyen,Van-Cuong Pham,Hoang Ngo,Dat Quoc Nguyen

Main category: cs.CL

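TL;DR: Proposes AccurateRAG, a framework for building high-performance question-answering systems based on retrieval-augmented generation (RAG), which achieves state-of-the-art performance on several benchmark datasets.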

Details

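Motivation: To improve the performance and development efficiency of RAG-based question-answering systems and address the limitations of existing approaches in effectiveness and deployment. Method: Builds a complete development pipeline covering raw dataset processing, fine-tuning data generation, text embedding and LLM fine-tuning, output evaluation, and local construction of RAG systems. Result: The framework outperforms previous strong baselines and achieves new state-of-the-art performance on several benchmark question-answering datasets. Conclusion: The AccurateRAG framework effectively improves both the performance and the development efficiency of RAG systems and has practical value. Abstract: We introduce AccurateRAG -- a novel framework for constructing high-performance question-answering applications based on retrieval-augmented generation (RAG). Our framework offers a pipeline for development efficiency with tools for raw dataset processing, fine-tuning data generation, text embedding & LLM fine-tuning, output evaluation, and building RAG systems locally. Experimental results show that our framework outperforms previous strong baselines and obtains new state-of-the-art question-answering performance on benchmark datasets.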

[86] Explore Briefly, Then Decide: Mitigating LLM Overthinking via Cumulative Entropy Regulation

Tianyi Jiang,Yi Bin,Yujuan Ding,Kainian Zhu,Fei Ma,Jingkuan Song,Heng Tao Shen

Main category: cs.CL

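TL;DR: Proposes a new reasoning paradigm, "Explore Briefly, Then Decide", together with the TECA metric and the CER mechanism, which effectively reduces overthinking by large models on simple problems and improves reasoning efficiency.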

Details

Motivation: Large language models often overthink simple problems, which wastes computation and makes it hard to adapt reasoning depth to problem complexity. Method: Introduces the Token Entropy Cumulative Average (TECA) as a measure of how much exploration happens during reasoning, and combines it with a Cumulative Entropy Regulation (CER) mechanism that dynamically determines the best point to end the reasoning. Result: Experiments on several mathematical benchmarks show that the method substantially reduces overthinking, cutting average response length by up to 71% on simpler datasets without harming problem-solving ability. Conclusion: The proposed paradigm enables efficient, adaptive reasoning that balances reasoning length against performance and improves model efficiency. Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities on complex problems using long Chain-of-Thought (CoT) reasoning. However, they often suffer from overthinking, that is, generating unnecessarily lengthy reasoning steps for simpler problems. This issue may degrade the efficiency of the models and make it difficult for them to adapt the reasoning depth to the complexity of problems. To address this, we introduce a novel metric Token Entropy Cumulative Average (TECA), which measures the extent of exploration throughout the reasoning process. We further propose a novel reasoning paradigm -- Explore Briefly, Then Decide -- with an associated Cumulative Entropy Regulation (CER) mechanism. This paradigm leverages TECA to help the model dynamically determine the optimal point to conclude its thought process and provide a final answer, thus achieving efficient reasoning. Experimental results across diverse mathematical benchmarks show that our approach substantially mitigates overthinking without sacrificing problem-solving ability. With our thinking paradigm, the average response length decreases by up to 71% on simpler datasets, demonstrating the effectiveness of our method in creating a more efficient and adaptive reasoning process.
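
One plausible reading of the name "Token Entropy Cumulative Average" is a running mean of per-token predictive entropy. The sketch below follows that reading and should be treated as an assumption: the abstract does not give the exact formula, and the paper's definition (and the CER stopping rule, which is not modeled here) may differ. The function names and the toy distributions are hypothetical.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cumulative_average(entropies: list[float]) -> list[float]:
    """Running mean of per-token entropies: value t averages tokens 1..t.

    Assumed form of TECA; not confirmed by the abstract.
    """
    averages, total = [], 0.0
    for t, h in enumerate(entropies, start=1):
        total += h
        averages.append(total / t)
    return averages


# Toy next-token distributions: uncertain at first, then increasingly confident.
distributions = [[0.5, 0.5], [0.7, 0.3], [0.9, 0.1], [0.99, 0.01]]
entropies = [token_entropy(d) for d in distributions]
teca = cumulative_average(entropies)
print([round(v, 3) for v in teca])
```

Under this reading, the running average falls as the model becomes more confident token by token, which is the kind of signal a regulation mechanism could use to decide that exploration has gone on long enough.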

[87] InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents

Yaxin Du,Yuanshuo Zhang,Xiyuan Yang,Yifan Zhou,Cheng Wang,Gongyi Zou,Xianghe Pang,Wenhao Wang,Menglan Chen,Shuo Tang,Zhiyu Li,Siheng Chen

Main category: cs.CL

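TL;DR: This paper introduces InfoMosaic-Bench, the first benchmark for evaluating multi-source information seeking in tool-augmented agents. It covers six domains, including medicine, finance, and maps, and its tasks require combining general-purpose search with domain-specific tools. Tasks are generated with InfoMosaic-Flow to be reliable and non-trivial, and experiments show that current LLM agents still have significant weaknesses in tool use.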

Details

Motivation: Existing LLM agents rely on open-web search, which is noisy and lacks domain-specific knowledge. The MCP protocol lets agents connect to specialized tools, but how well they can integrate those tools remains unclear, so a new benchmark is needed to evaluate multi-source information integration. Method: Proposes the InfoMosaic-Bench benchmark and the InfoMosaic-Flow task generation pipeline; the benchmark covers six domains and requires combining general-purpose search with specialized tools, while the pipeline ensures task quality by grounding tasks in verified tool outputs, enforcing cross-source dependencies, and filtering out trivial cases. Result: Experiments with 14 state-of-the-art LLM agents show that web information alone is of limited use (GPT-5 reaches 38.2% accuracy); domain tools bring selective but inconsistent benefits; and 22.4% of failures stem from incorrect tool usage or selection. Conclusion: Current tool-augmented LLM agents still face major challenges in multi-source information integration, especially in selecting and using tools correctly, and need further improvement. Abstract: Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open-web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real-world tasks require precise, domain-specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools -- and more importantly, whether they can integrate them with general-purpose search to solve complex tasks. Therefore, we introduce InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Covering six representative domains (medicine, finance, maps, video, web, and multi-domain integration), InfoMosaic-Bench requires agents to combine general-purpose search with domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non-triviality. Experiments with 14 state-of-the-art LLM agents reveal three findings: (i) web information alone is insufficient, with GPT-5 achieving only 38.2% accuracy and 67.5% pass rate; (ii) domain tools provide selective but inconsistent benefits, improving some domains while degrading others; and (iii) 22.4% of failures arise from incorrect tool usage or selection, highlighting that current LLMs still struggle with even basic tool handling.

[88] Parallel Scaling Law: Unveiling Reasoning Generalization through A Cross-Linguistic Perspective

Wen Yang,Junhong Wu,Chong Li,Chengqing Zong,Jiajun Zhang

Main category: cs.CL

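TL;DR: This study examines the generalization of RL-based reasoning from a cross-linguistic perspective. It finds that English-centric large reasoning models vary widely in cross-lingual transfer, proposes a transferability metric and a parallel training approach, and reveals a performance leap when moving from monolingual to multilingual training as well as a power-law scaling relationship.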

Details

Motivation: 现有研究主要关注强化后训练在任务或模态间的泛化，但其在语言间的推理能力迁移尚不明确。本文旨在探究英语推理能力是否能有效迁移到其他语言，并分析影响因素。 Method: 通过在多语言推理基准上系统评估以英语为中心的大型推理模型，引入衡量跨语言可迁移性的新指标，并开展干预性和平行训练研究，分析不同模型、语言和训练范式下的迁移效果。 Result: 发现跨语言迁移能力受初始模型、目标语言和训练范式显著影响；英语能力强的模型更依赖英语特定模式，导致跨语言泛化下降；引入单一平行语言即带来显著性能提升（First-Parallel Leap）；跨语言迁移遵循关于平行语言数量的幂律（Parallel Scaling Law）；并提出Monolingual Generalization Gap概念，反映单语性能与预测之间的差距。 Conclusion: 大规模推理模型的推理能力不能自动泛化到非英语语言，挑战了其类人认知的假设；需发展更语言中立的模型，强调多语言平行数据在训练中的重要性。 Abstract: Recent advancements in Reinforcement Post-Training (RPT) have significantly enhanced the capabilities of Large Reasoning Models (LRMs), sparking increased interest in the generalization of RL-based reasoning. While existing work has primarily focused on investigating its generalization across tasks or modalities, this study proposes a novel cross-linguistic perspective to investigate reasoning generalization. This raises a crucial question: $\textit{Does the reasoning capability achieved from English RPT effectively transfer to other languages?}$ We address this by systematically evaluating English-centric LRMs on multilingual reasoning benchmarks and introducing a metric to quantify cross-lingual transferability. Our findings reveal that cross-lingual transferability varies significantly across initial model, target language, and training paradigm. Through interventional studies, we find that models with stronger initial English capabilities tend to over-rely on English-specific patterns, leading to diminished cross-lingual generalization. To address this, we conduct a thorough parallel training study. Experimental results yield three key findings: $\textbf{First-Parallel Leap}$, a substantial leap in performance when transitioning from monolingual to just a single parallel language, and a predictable $\textbf{Parallel Scaling Law}$, revealing that cross-lingual reasoning transfer follows a power-law with the number of training parallel languages. 
Moreover, we identify the discrepancy between actual monolingual performance and the power-law prediction as $\textbf{Monolingual Generalization Gap}$, indicating that English-centric LRMs fail to fully generalize across languages. Our study challenges the assumption that LRM reasoning mirrors human cognition, providing critical insights for the development of more language-agnostic LRMs.
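
The Parallel Scaling Law and the Monolingual Generalization Gap described above can be illustrated with a small sketch. It assumes, hypothetically, that performance follows score = a * n**b, where n counts training languages including English (so the monolingual case is n = 1); the abstract does not state the exact functional form or indexing, and the numbers below are made up for illustration.

```python
import math

def fit_power_law(ns, scores):
    """Least-squares fit of score = a * n**b, done as a line fit in log-log space."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(s) for s in scores]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - b * mean_x)
    return a, b

def monolingual_gap(a, b, actual_monolingual):
    """Power-law prediction at n = 1 (which equals a) minus the measured score."""
    predicted = a * 1 ** b
    return predicted - actual_monolingual

# Hypothetical multilingual runs (n >= 2), generated from a = 40, b = 0.2.
ns = [2, 4, 8]
scores = [40 * n ** 0.2 for n in ns]
a, b = fit_power_law(ns, scores)
gap = monolingual_gap(a, b, actual_monolingual=35.0)
```

A positive gap means the monolingual model scores below what the multilingual trend would predict at n = 1.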

[89] F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

Ziyin Zhang,Zihan Liao,Hang Yu,Peng Di,Rui Wang

Main category: cs.CL

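TL;DR: F2LLM is a family of efficient embedding models finetuned directly from foundation models, available in three sizes (0.6B, 1.7B, and 4B), that delivers strong performance at reduced training cost.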

Details

Motivation: Existing high-performing embedding models depend on large-scale contrastive pretraining, complex training pipelines, and costly synthetic data, which makes them expensive to train and hard to reproduce; a more economical, simpler, and reproducible alternative is needed. Method: F2LLM is obtained by directly finetuning foundation LLMs on 6 million query-document-negative tuples from open-source, non-synthetic datasets, avoiding complex pretraining and synthetic data generation. Result: On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall; F2LLM-1.7B ranks 1st in the 1B-2B size range. Conclusion: F2LLM strikes a good balance between training cost, model size, and embedding performance, and serves as a strong, reproducible, and budget-friendly baseline for future embedding research. Abstract: We introduce F2LLM - Foundation to Feature Large Language Models, a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is directly finetuned from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future works.
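
The abstract does not say which objective is used on the query-document-negative tuples. A common choice for such data is an InfoNCE-style contrastive loss, sketched below as an assumption; the function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(query, positive, negatives, temperature=0.05):
    """Cross-entropy of picking the positive document among positive + negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    top = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_z - logits[0]

# Toy embeddings: the positive points the same way as the query.
query = [1.0, 0.0]
good_doc = [0.9, 0.1]
hard_negative = [0.0, 1.0]
loss_aligned = info_nce(query, good_doc, [hard_negative])
loss_swapped = info_nce(query, hard_negative, [good_doc])
```

The loss is small when the query is closer to its document than to the negative, and large when the roles are swapped.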

[90] Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation

Raphael Tang,Crystina Zhang,Wenyan Li,Carmen Lai,Pontus Stenetorp,Yao Lu

Main category: cs.CL

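TL;DR: This paper questions the conventional practice in arena-style LLM evaluation of treating a draw as evidence that two models are equally capable, arguing that draws reflect query difficulty more than similar model strength. On three real-world datasets, ignoring rating updates for draws improves battle outcome prediction accuracy by a relative 1-3%, and draws occur more often for easy or highly objective queries. The authors recommend that future rating systems reconsider draw semantics and account for query properties.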

Details

Motivation: Existing LLM rating schemes (Elo and its variants) treat a draw as meaning the two models are equally capable and pull their ratings toward each other. The authors question this assumption, arguing that draws may stem more from how easy or objective the query is than from similar model ability, so the handling of draws in rating systems needs to be re-examined. Method: The authors analyze three real-world LLM arena datasets, compare four mainstream rating systems with and without rating updates for draws, use battle outcome prediction accuracy as the evaluation metric, and statistically analyze how draws relate to query properties such as difficulty and subjectivity. Result: Ignoring rating updates for draws yields a 1-3% relative increase in prediction accuracy for all four rating systems; further analysis shows that very easy and highly objective queries are more likely to produce draws, with risk ratios of 1.37 and 1.35, respectively. Conclusion: A draw should not simply be read as equal model ability but as a signal of query difficulty. The way current Elo-based systems handle draws can lead to misleading rating updates, and future work should account for query properties in the rating mechanism to improve evaluation accuracy. Abstract: In arena-style evaluation of large language models (LLMs), two LLMs respond to a user query, and the user chooses the winning response or deems the "battle" a draw, resulting in an adjustment to the ratings of both models. The prevailing approach for modeling these rating dynamics is to view battles as two-player game matches, as in chess, and apply the Elo rating system and its derivatives. In this paper, we critically examine this paradigm. Specifically, we question whether a draw genuinely means that the two models are equal and hence whether their ratings should be equalized. Instead, we conjecture that draws are more indicative of query difficulty: if the query is too easy, then both models are more likely to succeed equally. On three real-world arena datasets, we show that ignoring rating updates for draws yields a 1-3% relative increase in battle outcome prediction accuracy (which includes draws) for all four rating systems studied. Further analyses suggest that draws occur more for queries rated as very easy and those as highly objective, with risk ratios of 1.37 and 1.35, respectively. We recommend future rating systems to reconsider existing draw semantics and to account for query properties in rating updates.
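
To make the "ignore rating updates for draws" idea concrete, here is a minimal sketch using the standard Elo update. The K-factor of 32 and the `ignore_draws` flag are illustrative assumptions; the paper studies four rating systems whose exact configurations are not given in the abstract.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, outcome, k=32.0, ignore_draws=True):
    """Return updated (r_a, r_b).

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a draw.
    With ignore_draws=True a draw leaves both ratings unchanged.
    """
    if outcome == 0.5 and ignore_draws:
        return r_a, r_b
    delta = k * (outcome - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# A draw between unequal models: conventional Elo pulls the ratings together,
# while the draw-ignoring variant leaves them untouched.
skipped = elo_update(1600.0, 1400.0, 0.5)
conventional = elo_update(1600.0, 1400.0, 0.5, ignore_draws=False)
decisive = elo_update(1500.0, 1500.0, 1.0)
```

Under conventional Elo the higher-rated model loses points on a draw, which is exactly the equalizing behavior the paper calls into question.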

cs.CV [Back]

[91] LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration

Alessio Spagnoletti,Andrés Almansa,Marcelo Pereyra

Main category: cs.CV

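TL;DR: This paper proposes LVTINO, the first zero-shot inverse solver for high-definition video restoration based on Video Consistency Models (VCMs). By using VCMs to capture temporal causality, it achieves state-of-the-art video reconstruction quality with efficient computation while maintaining measurement consistency and smooth transitions between frames.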

Details

Motivation: Existing image-based latent diffusion models struggle to maintain temporal consistency when applied to video frame by frame, producing flickering or incoherent reconstructions; a method is needed that recovers fine spatial detail while modeling subtle temporal dependencies. Method: LVTINO uses recent Video Consistency Models (VCMs) as priors, with a conditioning mechanism that needs no automatic differentiation, achieves efficient inference with only a few neural function evaluations, and explicitly models temporal causality to keep frames consistent. Result: Experiments on multiple video inverse problems show that LVTINO clearly outperforms current methods that apply image LDMs frame by frame, with better perceptual quality, reconstruction fidelity, and computational efficiency, setting a new benchmark. Conclusion: LVTINO is the first to apply zero-shot VCM priors to high-definition video restoration, effectively addresses temporal inconsistency, and points to a new direction for video-level generative models in computational imaging. Abstract: Computational imaging methods increasingly rely on powerful generative diffusion models to tackle challenging image restoration tasks. In particular, state-of-the-art zero-shot image inverse solvers leverage distilled text-to-image latent diffusion models (LDMs) to achieve unprecedented accuracy and perceptual quality with high computational efficiency. However, extending these advances to high-definition video restoration remains a significant challenge, due to the need to recover fine spatial detail while capturing subtle temporal dependencies. Consequently, methods that naively apply image-based LDM priors on a frame-by-frame basis often result in temporally inconsistent reconstructions. We address this challenge by leveraging recent advances in Video Consistency Models (VCMs), which distill video latent diffusion models into fast generators that explicitly capture temporal causality. Building on this foundation, we propose LVTINO, the first zero-shot or plug-and-play inverse solver for high definition video restoration with priors encoded by VCMs. Our conditioning mechanism bypasses the need for automatic differentiation and achieves state-of-the-art video reconstruction quality with only a few neural function evaluations, while ensuring strong measurement consistency and smooth temporal transitions across frames. Extensive experiments on a diverse set of video inverse problems show significant perceptual improvements over current state-of-the-art methods that apply image LDMs frame by frame, establishing a new benchmark in both reconstruction fidelity and computational efficiency.

[92] Image Generation Based on Image Style Extraction

Shuochen Chang

Main category: cs.CV

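TL;DR: Proposes a style-extraction-based image generation method with three-stage training, which uses a single style reference image to achieve text-to-image generation with fine-grained control.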

Details

Motivation: Existing text-to-image generation models struggle to describe and control fine-grained styles precisely, and the style information in a reference image is hard to align with textual conditions. Method: Designs a style encoder and a style projection layer, uses three-stage training to extract a fine-grained style representation from a single reference image, and injects it into the generative model while leaving the structure of the downstream model unchanged. Result: Achieves fine-grained, style-guided image generation from text prompts, and constructs the Style30k-captions dataset for training and validation. Conclusion: The method effectively improves the ability of pretrained generative models to generate under fine-grained style control, without modifying the structure of the downstream generative model. Abstract: Image generation based on text-to-image generation models is a task with practical application scenarios that fine-grained styles cannot be precisely described and controlled in natural language, while the guidance information of stylized reference images is difficult to be directly aligned with the textual conditions of traditional textual guidance generation. This study focuses on how to maximize the generative capability of the pretrained generative model, by obtaining fine-grained stylistic representations from a single given stylistic reference image, and injecting the stylistic representations into the generative body without changing the structural framework of the downstream generative model, so as to achieve fine-grained controlled stylized image generation. In this study, we propose a three-stage training style extraction-based image generation method, which uses a style encoder and a style projection layer to align the style representations with the textual representations to realize fine-grained textual cue-based style guide generation. In addition, this study constructs the Style30k-captions dataset, whose samples contain a triad of images, style labels, and text descriptions, to train the style encoder and style projection layer in this experiment.

[93] EvoStruggle: A Dataset Capturing the Evolution of Struggle across Activities and Skill Levels

Shijia Feng,Michael Wray,Walterio Mayol-Cuevas

Main category: cs.CV

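TL;DR: This paper presents EvoStruggle, a new dataset with 61.68 hours of video for studying how "struggle" evolves during skill learning. It frames struggle determination as a temporal action localization task; experiments show that existing models can detect struggle on unseen tasks, indicating some cross-task transferability.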

Details

Motivation: Optimizing human learning and building effective assistive systems requires understanding when and how individuals struggle during skill acquisition, yet existing manipulation datasets have not focused on how struggle evolves over time. Method: Collects a dataset with 76 participants and 18 tasks (grouped into four activities: tying knots, origami, tangram puzzles, and shuffling cards), with each participant repeating a task five times; models struggle determination as a temporal action localization problem and runs experiments with Temporal Action Localization models. Result: The models reach an overall average mAP of 34.56% across tasks and 19.24% across activities, indicating that struggle detection transfers to some degree but remains challenging. Conclusion: Struggle is a concept that transfers across different skill-based tasks; the dataset provides a foundation for future research and supports the development of personalized learning-assistance systems. Abstract: The ability to determine when a person struggles during skill acquisition is crucial for both optimizing human learning and enabling the development of effective assistive systems. As skills develop, the type and frequency of struggles tend to change, and understanding this evolution is key to determining the user's current stage of learning. However, existing manipulation datasets have not focused on how struggle evolves over time. In this work, we collect a dataset for struggle determination, featuring 61.68 hours of video recordings, 2,793 videos, and 5,385 annotated temporal struggle segments collected from 76 participants. The dataset includes 18 tasks grouped into four diverse activities -- tying knots, origami, tangram puzzles, and shuffling cards, representing different task variations. In addition, participants repeated the same task five times to capture their evolution of skill. We define the struggle determination problem as a temporal action localization task, focusing on identifying and precisely localizing struggle segments with start and end times. Experimental results show that Temporal Action Localization models can successfully learn to detect struggle cues, even when evaluated on unseen tasks or activities. The models attain an overall average mAP of 34.56% when generalizing across tasks and 19.24% across activities, indicating that struggle is a transferable concept across various skill-based tasks while still posing challenges for further improvement in struggle detection. Our dataset is available at https://github.com/FELIXFENG2019/EvoStruggle.
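
Temporal action localization is usually scored by matching predicted segments to annotated ones via temporal intersection-over-union (tIoU), which underlies mAP figures like those quoted above. The sketch below shows only that building block; the segment values are invented, and the paper's exact evaluation thresholds are not given in the abstract.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0

# A predicted struggle segment that overlaps the annotation by 2 seconds.
overlap = temporal_iou((2.0, 6.0), (4.0, 8.0))
disjoint = temporal_iou((0.0, 1.0), (5.0, 6.0))
exact = temporal_iou((3.0, 7.0), (3.0, 7.0))
```

A prediction typically counts as correct only when its tIoU with an unmatched ground-truth segment exceeds a chosen threshold.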

[94] SPUS: A Lightweight and Parameter-Efficient Foundation Model for PDEs

Abu Bucker Siddik,Diane Oyen,Alexander Most,Michal Kucer,Ayan Biswas

Main category: cs.CV

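TL;DR: Proposes Small PDE U-Net Solver (SPUS), a lightweight foundation model that uses a residual U-Net architecture and an auto-regressive pretraining strategy. It achieves state-of-the-art generalization across a range of PDE-solving tasks with markedly better parameter efficiency than existing methods.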

Details

Motivation: Existing PDE foundation models are mostly built on complex transformer architectures with high computational and parameter overhead, and an efficient, lightweight unified neural operator is lacking. Method: Designs a lightweight residual U-Net-based architecture, combines it with an auto-regressive pretraining strategy that mimics the behavior of numerical solvers, pretrains on diverse fluid dynamics PDEs, and evaluates generalization on unseen downstream tasks. Result: SPUS achieves state-of-the-art generalization on 6 challenging downstream PDE tasks spanning various physical systems, with fewer parameters and minimal fine-tuning data. Conclusion: SPUS is an efficient, lightweight, and general-purpose foundation model for PDE solving that demonstrates the potential of U-Net architectures for neural operators and suits multi-physics modeling in resource-constrained settings. Abstract: We introduce Small PDE U-Net Solver (SPUS), a compact and efficient foundation model (FM) designed as a unified neural operator for solving a wide range of partial differential equations (PDEs). Unlike existing state-of-the-art PDE FMs-primarily based on large complex transformer architectures with high computational and parameter overhead-SPUS leverages a lightweight residual U-Net-based architecture that has been largely underexplored as a foundation model architecture in this domain. To enable effective learning in this minimalist framework, we utilize a simple yet powerful auto-regressive pretraining strategy which closely replicates the behavior of numerical solvers to learn the underlying physics. SPUS is pretrained on a diverse set of fluid dynamics PDEs and evaluated across 6 challenging unseen downstream PDEs spanning various physical systems. Experimental results demonstrate that SPUS using residual U-Net based architecture achieves state-of-the-art generalization on these downstream tasks while requiring significantly fewer parameters and minimal fine-tuning data, highlighting its potential as a highly parameter-efficient FM for solving diverse PDE systems.
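
An auto-regressive rollout, in which a one-step operator is applied repeatedly to its own output much as a time-stepping solver advances a state, can be sketched as follows. The `diffusion_step` function is a hypothetical stand-in (an explicit finite-difference update on a periodic 1D grid) for the learned U-Net step; it is not SPUS itself.

```python
def rollout(step, initial_state, n_steps):
    """Apply a one-step operator repeatedly, feeding each output back as input."""
    states = [initial_state]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states

def diffusion_step(u, alpha=0.25):
    """Toy one-step operator: explicit diffusion update with periodic boundaries."""
    n = len(u)
    return [
        u[i] + alpha * (u[(i - 1) % n] - 2.0 * u[i] + u[(i + 1) % n])
        for i in range(n)
    ]

# Roll a single spike forward two steps; the total mass stays constant.
states = rollout(diffusion_step, [0.0, 0.0, 1.0, 0.0], n_steps=2)
```

In a learned setting, `step` would be the network's forward pass, and training would compare rolled-out states against solver trajectories.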

[95] DisCo: Reinforcement with Diversity Constraints for Multi-Human Generation

Shubhankar Borse,Farzad Farhadzadeh,Munawar Hayat,Fatih Porikli

Main category: cs.CV

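TL;DR: DisCo is a reinforcement-learning framework that uses diversity constraints to optimize identity diversity in multi-human generation, addressing the identity confusion that existing text-to-image models show when generating several people.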

Details

Motivation: Existing text-to-image models duplicate faces, merge identities, and miscount individuals when generating multiple people, and lack effective control over identity diversity. Method: Proposes the DisCo framework, which fine-tunes flow-matching models with Group-Relative Policy Optimization (GRPO) using a compositional reward that covers a facial-similarity penalty, suppression of cross-sample identity repetition, person-count accuracy, and visual fidelity, and stabilizes training with a single-stage curriculum. Result: On the DiverseHumans Testset, DisCo achieves 98.6 Unique Face Accuracy and near-perfect Global Identity Spread, surpassing open-source and proprietary methods (e.g., Gemini, GPT-Image) while maintaining competitive visual quality. Conclusion: DisCo is a scalable solution requiring no extra annotations that effectively resolves the identity crisis in generative models and sets a new benchmark for compositional multi-human generation. Abstract: State-of-the-art text-to-image models excel at realism but collapse on multi-human prompts - duplicating faces, merging identities, and miscounting individuals. We introduce DisCo (Reinforcement with Diversity Constraints), the first RL-based framework to directly optimize identity diversity in multi-human generation. DisCo fine-tunes flow-matching models via Group-Relative Policy Optimization (GRPO) with a compositional reward that (i) penalizes intra-image facial similarity, (ii) discourages cross-sample identity repetition, (iii) enforces accurate person counts, and (iv) preserves visual fidelity through human preference scores. A single-stage curriculum stabilizes training as complexity scales, requiring no extra annotations. On the DiverseHumans Testset, DisCo achieves 98.6 Unique Face Accuracy and near-perfect Global Identity Spread - surpassing both open-source and proprietary methods (e.g., Gemini, GPT-Image) while maintaining competitive perceptual quality. Our results establish DisCo as a scalable, annotation-free solution that resolves the long-standing identity crisis in generative models and sets a new benchmark for compositional multi-human generation.
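
The compositional reward and the group-relative normalization at the heart of GRPO can be sketched roughly as below. The equal weights, the additive form of the reward, and the input scales are assumptions made for illustration; the abstract lists the four reward terms but not how they are combined.

```python
import statistics

def composite_reward(face_similarity, cross_sample_repeat, count_correct,
                     preference, weights=(1.0, 1.0, 1.0, 1.0)):
    """Hypothetical additive reward: two penalties and two bonuses."""
    w_sim, w_rep, w_cnt, w_pref = weights
    return (
        -w_sim * face_similarity        # (i) intra-image facial similarity penalty
        - w_rep * cross_sample_repeat   # (ii) cross-sample identity repetition penalty
        + w_cnt * (1.0 if count_correct else 0.0)  # (iii) correct person count
        + w_pref * preference           # (iv) human preference score
    )

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, three sampled images with different (made-up) reward components.
reward = composite_reward(0.2, 0.1, True, 0.5)
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Because advantages are computed relative to the group sampled for the same prompt, no separate value network is needed.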

[96] GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings

Angel Daruna,Nicholas Meegan,Han-Pang Chiu,Supun Samarasekera,Rakesh Kumar

Main category: cs.CV

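TL;DR: This paper proposes a new visual geo-localization method that aligns the visual representation of a query image with a hierarchical geographic embedding representation and fuses appearance features with semantic segmentation maps to improve localization, surpassing prior state-of-the-art methods on 22 of 25 metrics across five benchmark datasets.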

Details

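Motivation: Existing visual geo-localization methods still fall short in learning geographic representations, and more effective geographic modeling is needed to improve worldwide localization accuracy. Method: Formulates geo-localization as aligning a visual representation with a learned geographic representation; proposes a hierarchy of geographic embeddings and fuses the appearance features of the query image with its semantic segmentation map to build a robust visual representation. Result: Achieves the best results on 22 of 25 metrics across five benchmark datasets, outperforming prior SOTA methods and recent Large Vision-Language Models (LVLMs); ablation studies indicate that the gains come mainly from combining geographic and visual representations. Conclusion: The proposed hierarchical geographic representation and visual fusion strategy substantially improve worldwide visual geo-localization accuracy, supporting the value of jointly modeling geographic structure and visual content. Abstract: Worldwide visual geo-localization seeks to determine the geographic location of an image anywhere on Earth using only its visual content. Learned representations of geography for visual geo-localization remain an active research topic despite much progress. We formulate geo-localization as aligning the visual representation of the query image with a learned geographic representation. Our novel geographic representation explicitly models the world as a hierarchy of geographic embeddings. Additionally, we introduce an approach to efficiently fuse the appearance features of the query image with its semantic segmentation map, forming a robust visual representation. Our main experiments demonstrate improved all-time bests in 22 out of 25 metrics measured across five benchmark datasets compared to prior state-of-the-art (SOTA) methods and recent Large Vision-Language Models (LVLMs). Additional ablation studies support the claim that these gains are primarily driven by the combination of geographic and visual representations.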

Nilay Naharas,Dang Nguyen,Nesihan Bulut,Mohammadhossein Bateni,Vahab Mirrokni,Baharan Mirzasoleiman

Main category: cs.CV

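TL;DR: This paper proposes XMAS, a new method for data-efficient training of Large Vision-Language Models (LVLMs). By analyzing the similarity of cross-modal attention matrices, XMAS clusters training examples and removes redundant ones, substantially reducing data volume and training time while fully preserving performance.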

Details

Motivation: Existing data selection methods perform poorly on Large Vision-Language Models (LVLMs) and cannot even outperform random selection. This paper aims to provide a theoretically grounded method for data-efficient instruction tuning of LVLMs that reduces redundancy in the training data. Method: Builds on a proof that examples with similar cross-modal attention matrices during instruction tuning have similar gradients and therefore influence model parameters similarly. XMAS fine-tunes a small proxy LVLM, clusters examples by the trajectories of the top singular values of their attention matrices, and samples a balanced subset from the clusters. Result: XMAS can discard 50% of the LLaVA-665k dataset and 85% of the Vision-Flan dataset while fully preserving the performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up training by 1.2x. This is 30% more data reduction than the best baseline on LLaVA-665k. Conclusion: XMAS is the first principled method for data-efficient instruction tuning of LVLMs; clustering on the singular-value trajectories of attention matrices effectively identifies and removes redundant data, improving training efficiency without sacrificing model performance. Abstract: Data-efficient learning aims to eliminate redundancy in large training datasets by training models on smaller subsets of the most informative examples. While data selection has been extensively explored for vision models and large language models (LLMs), it remains underexplored for Large Vision-Language Models (LVLMs). Notably, none of existing methods can outperform random selection at different subset sizes. In this work, we propose the first principled method for data-efficient instruction tuning of LVLMs. We prove that examples with similar cross-modal attention matrices during instruction tuning have similar gradients. Thus, they influence model parameters in a similar manner and convey the same information to the model during training. Building on this insight, we propose XMAS, which clusters examples based on the trajectories of the top singular values of their attention matrices obtained from fine-tuning a small proxy LVLM. By sampling a balanced subset from these clusters, XMAS effectively removes redundancy in large-scale LVLM training data. Extensive experiments show that XMAS can discard 50% of the LLaVA-665k dataset and 85% of the Vision-Flan dataset while fully preserving performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up its training by 1.2x. This is 30% more data reduction compared to the best baseline for LLaVA-665k. The project's website can be found at https://bigml-cs-ucla.github.io/XMAS-project-page/.
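
The final step described above, sampling a balanced subset from clusters, might look like the following round-robin sketch. It assumes cluster assignments are already available (the clustering on singular-value trajectories is not reproduced here), and the round-robin policy itself is an assumption, since the abstract does not specify how balance is enforced.

```python
import random
from collections import defaultdict

def balanced_subset(cluster_ids, budget, seed=0):
    """Pick up to `budget` example indices, drawing from clusters in round-robin order."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        clusters[cluster].append(index)
    # Shuffle inside each cluster so the draw within a cluster is random.
    pools = []
    for cluster in sorted(clusters):
        members = clusters[cluster][:]
        rng.shuffle(members)
        pools.append(members)
    chosen = []
    while len(chosen) < budget and any(pools):
        for pool in pools:
            if pool and len(chosen) < budget:
                chosen.append(pool.pop())
    return sorted(chosen)

# Eight examples in a large cluster (ids 0-7) and two in a small one (ids 8-9).
labels = [0] * 8 + [1] * 2
subset = balanced_subset(labels, budget=4)
```

Compared with uniform random sampling, this keeps small clusters represented and trims the large, redundant ones hardest.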

[98] Purrception: Variational Flow Matching for Vector-Quantized Image Generation

Răzvan-Andrei Matişan,Vincent Tao Hu,Grigory Bartosh,Björn Ommer,Cees G. M. Snoek,Max Welling,Jan-Willem van de Meent,Mohammad Mahdi Derakhshani,Floor Eijkelboom

Main category: cs.CV

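TL;DR: Purrception is a variational flow matching method for vector-quantized image generation that learns categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space, combining discrete supervision with continuous transport dynamics.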

Details

Motivation: Existing image generation methods face a trade-off in efficiency and performance between continuous and discrete flow matching, and lack a way to combine the strengths of both. Method: Adapts Variational Flow Matching to vector-quantized latents by learning categorical posterior distributions over codebook indices while computing velocity fields in the continuous embedding space, providing explicit categorical supervision and uncertainty quantification for the generation process. Result: On ImageNet-1k 256x256 generation, Purrception converges faster in training than both continuous and discrete flow matching baselines and achieves FID scores competitive with state-of-the-art models. Conclusion: Variational Flow Matching can effectively bridge continuous transport dynamics and discrete supervision, improving training efficiency while maintaining generation quality. Abstract: We introduce Purrception, a variational flow matching approach for vector-quantized image generation that provides explicit categorical supervision while maintaining continuous transport dynamics. Our method adapts Variational Flow Matching to vector-quantized latents by learning categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space. This combines the geometric awareness of continuous methods with the discrete supervision of categorical approaches, enabling uncertainty quantification over plausible codes and temperature-controlled generation. We evaluate Purrception on ImageNet-1k 256x256 generation. Training converges faster than both continuous flow matching and discrete flow matching baselines while achieving competitive FID scores with state-of-the-art models. This demonstrates that Variational Flow Matching can effectively bridge continuous transport and discrete supervision for improved training efficiency in image generation.
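
One way to picture "categorical posteriors over codebook indices with velocities in the continuous embedding space" is the sketch below. It assumes a linear interpolation path x_t = (1 - t) * x_0 + t * x_1, for which the velocity toward an endpoint x_1 is (x_1 - x_t) / (1 - t), and it replaces x_1 with the posterior-weighted mean of codebook vectors. Whether Purrception uses exactly this form is not stated in the abstract; the tiny codebook and logits are invented.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over codebook logits."""
    scaled = [l / temperature for l in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def velocity(x_t, t, logits, codebook, temperature=1.0):
    """Velocity at x_t implied by a categorical posterior over codebook entries.

    Assumes a linear path, so v = (E[x_1] - x_t) / (1 - t) with t < 1.
    """
    probs = softmax(logits, temperature)
    expected_x1 = [
        sum(p * code[d] for p, code in zip(probs, codebook))
        for d in range(len(x_t))
    ]
    return [(m - x) / (1.0 - t) for m, x in zip(expected_x1, x_t)]

codebook = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical two-entry codebook
confident = velocity([0.5, 0.0], 0.5, logits=[100.0, 0.0], codebook=codebook)
uncertain = velocity([0.0, 0.0], 0.0, logits=[0.0, 0.0], codebook=codebook)
```

Lowering the temperature sharpens the posterior and commits the flow to a single code, which is one reading of the "temperature-controlled generation" mentioned in the abstract.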

[99] AortaDiff: A Unified Multitask Diffusion Framework For Contrast-Free AAA Imaging

Yuxuan Ou,Ning Bi,Jiazhen Pan,Jiancheng Yang,Boliang Yu,Usama Zidan,Regent Lee,Vicente Grau

Main category: cs.CV

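TL;DR: Proposes a unified framework based on conditional diffusion models and multi-task learning that generates synthetic contrast-enhanced CT from non-contrast CT while simultaneously segmenting the aortic lumen and thrombus, reducing contrast agent use and improving segmentation accuracy.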

Details

Motivation: To reduce the use of iodinated contrast agents and the nephrotoxicity, allergic reactions, and environmental harm they bring, while overcoming the error accumulation and under-use of shared semantic structure in conventional multi-stage methods. Method: Integrates conditional diffusion models with multi-task learning for end-to-end joint optimization of image synthesis and anatomical segmentation, shares encoder and decoder parameters across tasks, and uses a semi-supervised training strategy to learn from real clinical scans with missing labels. Result: Validated on data from 264 patients; image synthesis reaches a PSNR of 25.61 dB, the lumen and thrombus Dice scores improve to 0.89 and 0.53, respectively, and clinical measurement errors drop markedly. Conclusion: The unified framework outperforms existing methods in both synthetic image quality and anatomical segmentation accuracy, helping reduce reliance on contrast agents and improve clinical assessment of abdominal aortic aneurysms. Abstract: While contrast-enhanced CT (CECT) is standard for assessing abdominal aortic aneurysms (AAA), the required iodinated contrast agents pose significant risks, including nephrotoxicity, patient allergies, and environmental harm. To reduce contrast agent use, recent deep learning methods have focused on generating synthetic CECT from non-contrast CT (NCCT) scans. However, most adopt a multi-stage pipeline that first generates images and then performs segmentation, which leads to error accumulation and fails to leverage shared semantic and anatomical structures. To address this, we propose a unified deep learning framework that generates synthetic CECT images from NCCT scans while simultaneously segmenting the aortic lumen and thrombus. Our approach integrates conditional diffusion models (CDM) with multi-task learning, enabling end-to-end joint optimization of image synthesis and anatomical segmentation. Unlike previous multitask diffusion models, our approach requires no initial predictions (e.g., a coarse segmentation mask), shares both encoder and decoder parameters across tasks, and employs a semi-supervised training strategy to learn from scans with missing segmentation labels, a common constraint in real-world clinical data. We evaluated our method on a cohort of 264 patients, where it consistently outperformed state-of-the-art single-task and multi-stage models. For image synthesis, our model achieved a PSNR of 25.61 dB, compared to 23.80 dB from a single-task CDM. For anatomical segmentation, it improved the lumen Dice score to 0.89 from 0.87 and the challenging thrombus Dice score to 0.53 from 0.48 (nnU-Net). 
These segmentation enhancements led to more accurate clinical measurements, reducing the lumen diameter MAE to 4.19 mm from 5.78 mm and the thrombus area error to 33.85% from 41.45% when compared to nnU-Net. Code is available at https://github.com/yuxuanou623/AortaDiff.git.
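
For readers unfamiliar with the two headline metrics, the following sketch computes the Dice score on flattened binary masks and PSNR on flattened intensity arrays. The inputs are toy values, and `max_value=1.0` assumes intensities normalized to [0, 1], which may differ from the paper's preprocessing.

```python
import math

def dice_score(pred_mask, true_mask):
    """Dice = 2 * |A and B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    return 2.0 * intersection / total if total else 1.0

def psnr(image, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((a - b) ** 2 for a, b in zip(image, reference)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])  # one shared voxel out of 2 + 1
quality = psnr([0.0, 0.0], [0.1, 0.1])         # MSE of 0.01 on a unit range
```

Higher is better for both: Dice is bounded by 1, while PSNR grows without bound as the reconstruction approaches the reference.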

[100] From Videos to Indexed Knowledge Graphs -- Framework to Marry Methods for Multimodal Content Analysis and Understanding

Basem Rizk,Joel Walsh,Mark Core,Benjamin Nye

Main category: cs.CV

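TL;DR: This paper presents a framework for multi-modal content analysis that enables efficient construction of video analysis pipelines and converts videos into a query-able, frame-level knowledge graph representation that supports continual learning.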

Details

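Motivation: Multi-modal content analysis is complex, computationally expensive, and hard to engineer, particularly when applying pre-trained models to complex data such as video. Method: Combines multiple pre-trained models into a candidate pipeline that converts videos into a temporal semi-structured data format, then translates this into a frame-level indexed knowledge graph representation that supports querying and interactive continual learning. Result: Delivers a framework for efficiently prototyping multi-modal analysis pipelines that turns video content into a structured knowledge graph and supports dynamic updates. Conclusion: The framework reduces the complexity of multi-modal video analysis and provides a knowledge representation that is extensible, query-able, and supports continual learning. Abstract: Analysis of multi-modal content can be tricky, computationally expensive, and require a significant amount of engineering efforts. Lots of work with pre-trained models on static data is out there, yet fusing these opensource models and methods with complex data such as videos is relatively challenging. In this paper, we present a framework that enables efficiently prototyping pipelines for multi-modal content analysis. We craft a candidate recipe for a pipeline, marrying a set of pre-trained models, to convert videos into a temporal semi-structured data format. We translate this structure further to a frame-level indexed knowledge graph representation that is query-able and supports continual learning, enabling the dynamic incorporation of new domain-specific knowledge through an interactive medium.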

[101] WALT: Web Agents that Learn Tools

Viraj Prabhu,Yutong Dai,Matthew Fernandez,Jing Gu,Krithika Ramakrishnan,Yanqi Luo,Silvio Savarese,Caiming Xiong,Junnan Li,Zeyuan Chen,Ran Xu

Main category: cs.CV

TL;DR: WALT is a new web agent framework that reverse-engineers website functionality into reusable tools, enabling more robust and efficient browser automation.

Details

Motivation: Existing web agent methods rely on fine-grained UI interactions and heavy LLM reasoning, which makes them brittle under dynamic layouts and long-horizon tasks; humans, by contrast, work efficiently through high-level functionality that websites provide, such as search and filter, so an automation approach closer to human behavior is needed. Method: Proposes the WALT framework, which analyzes website behavior and structure to automatically discover and wrap invocable high-level tools (such as search, filter, and create), so that agents call these tools directly instead of performing low-level clicks and typing. Result: On VisualWebArena and WebArena, WALT achieves higher task success than existing methods with fewer steps and less LLM overhead. Conclusion: By abstracting built-in website functionality into reliable tools, WALT substantially improves the robustness and efficiency of web agents and offers a generalizable paradigm for browser automation. Abstract: Web agents promise to automate complex browser tasks, but current methods remain brittle -- relying on step-by-step UI interactions and heavy LLM reasoning that break under dynamic layouts and long horizons. Humans, by contrast, exploit website-provided functionality through high-level operations like search, filter, and sort. We introduce WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into reusable invocable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites -- spanning discovery (search, filter, sort), communication (post, comment, upvote), and content management (create, edit, delete). Tools abstract away low-level execution: instead of reasoning about how to click and type, agents simply call search(query) or create(listing). This shifts the computational burden from fragile step-by-step reasoning to reliable tool invocation. On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning, establishing a robust and generalizable paradigm for browser automation.
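
The shift from click-and-type reasoning to calls such as search(query) or create(listing) can be illustrated with a toy tool registry. Everything here (the `ToolRegistry` class, the in-memory `LISTINGS` store, the tool bodies) is hypothetical scaffolding; WALT's actual tool discovery and browser execution are not shown in the abstract.

```python
from typing import Any, Callable, Dict, List

class ToolRegistry:
    """Maps tool names to callables so an agent can invoke them by name."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
            self._tools[name] = fn
            return fn
        return decorator

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)

registry = ToolRegistry()
LISTINGS: List[str] = ["Red bike", "Blue sofa"]  # stand-in for a website backend

@registry.register("search")
def search(query: str) -> List[str]:
    """Case-insensitive substring search over the listings."""
    return [item for item in LISTINGS if query.lower() in item.lower()]

@registry.register("create")
def create(listing: str) -> int:
    """Add a listing and return its index."""
    LISTINGS.append(listing)
    return len(LISTINGS) - 1

# The agent emits a tool name plus arguments instead of a click sequence.
found = registry.call("search", query="bike")
new_id = registry.call("create", listing="Green bike")
found_after = registry.call("search", query="bike")
```

The agent's reasoning then reduces to choosing a tool and its arguments, which is the burden shift the abstract describes.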

[102] MATCH: Multi-faceted Adaptive Topo-Consistency for Semi-Supervised Histopathology Segmentation

Meilong Xu,Xiaoling Hu,Shahira Abousamra,Chen Li,Chao Chen

Main category: cs.CV

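TL;DR: Proposes a semi-supervised segmentation framework that uses multiple perturbed predictions and a topological consistency constraint to reduce topological errors in histopathology images, improving segmentation robustness and accuracy.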

Details

Motivation: In histopathology images, extracting semantic structure from unlabeled data is difficult, especially when objects are densely distributed, and existing methods struggle to preserve meaningful topological features. Method: Generates multiple perturbed predictions through stochastic dropouts and temporal training snapshots, and enforces topological consistency across predictions with a new matching strategy that combines spatial overlap with global structural alignment. Result: Experiments show that the method effectively reduces topological errors and yields more accurate and robust segmentations. Conclusion: The proposed framework robustly identifies and preserves relevant topological features, improves segmentation with unlabeled data, and suits complex histopathology image analysis. Abstract: In semi-supervised segmentation, capturing meaningful semantic structures from unlabeled data is essential. This is particularly challenging in histopathology image analysis, where objects are densely distributed. To address this issue, we propose a semi-supervised segmentation framework designed to robustly identify and preserve relevant topological features. Our method leverages multiple perturbed predictions obtained through stochastic dropouts and temporal training snapshots, enforcing topological consistency across these varied outputs. This consistency mechanism helps distinguish biologically meaningful structures from transient and noisy artifacts. A key challenge in this process is to accurately match the corresponding topological features across the predictions in the absence of ground truth. To overcome this, we introduce a novel matching strategy that integrates spatial overlap with global structural alignment, minimizing discrepancies among predictions. Extensive experiments demonstrate that our approach effectively reduces topological errors, resulting in more robust and accurate segmentations essential for reliable downstream analysis. Code is available at https://github.com/Melon-Xu/MATCH.

[103] Towards Better Optimization For Listwise Preference in Diffusion Models

Jiamu Bai,Xin Yu,Meilong Xu,Weitao Lu,Xin Pan,Kiwan Maeng,Daniel Kifer,Jian Wang,Yu Wang

Main category: cs.CV

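TL;DR: This paper proposes Diffusion-LPO, a simple and effective framework for listwise preference optimization in diffusion models. It extends the DPO objective with the Plackett-Luce model and uses ranked data to improve image generation quality and alignment with human preferences.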

Details

Motivation: Existing RLHF-style methods for diffusion models rely mainly on pairwise preferences and cannot make full use of implicit ranking information, whereas real human feedback often carries finer-grained ranked structure; a method that precisely optimizes listwise preferences is therefore needed. Method: Proposes the Diffusion-LPO framework, which aggregates user feedback into a ranked list of images and derives a listwise extension of the DPO objective under the Plackett-Luce model, encouraging each sample to be preferred over all of its lower-ranked alternatives so that the whole ranking stays consistent. Result: Diffusion-LPO is validated on text-to-image generation, image editing, and personalized preference alignment, and consistently outperforms pairwise DPO baselines on visual quality and preference alignment. Conclusion: By making effective use of listwise human preference information, Diffusion-LPO improves the alignment of diffusion models while remaining simple, and points to a direction for future model optimization based on ranked feedback. Abstract: Reinforcement learning from human feedback (RLHF) has proven effectiveness for aligning text-to-image (T2I) diffusion models with human preferences. Although Direct Preference Optimization (DPO) is widely adopted for its computational efficiency and avoidance of explicit reward modeling, its applications to diffusion models have primarily relied on pairwise preferences. The precise optimization of listwise preferences remains largely unaddressed. In practice, human feedback on image preferences often contains implicit ranked information, which conveys more precise human preferences than pairwise comparisons. In this work, we propose Diffusion-LPO, a simple and effective framework for Listwise Preference Optimization in diffusion models with listwise data. Given a caption, we aggregate user feedback into a ranked list of images and derive a listwise extension of the DPO objective under the Plackett-Luce model. Diffusion-LPO enforces consistency across the entire ranking by encouraging each sample to be preferred over all of its lower-ranked alternatives. We empirically demonstrate the effectiveness of Diffusion-LPO across various tasks, including text-to-image generation, image editing, and personalized preference alignment. Diffusion-LPO consistently outperforms pairwise DPO baselines on visual quality and preference alignment.
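
The Plackett-Luce likelihood behind the listwise objective can be written down in a few lines. In this sketch the `scores` are generic per-image scores ordered from most to least preferred; in a DPO-style objective they would be implicit rewards derived from the model and a reference model, a detail the abstract does not spell out and which is therefore left abstract here.

```python
import math

def plackett_luce_nll(scores):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    scores: per-item scores ordered best to worst. At each position the item is
    chosen via a softmax over itself and all lower-ranked items.
    """
    nll = 0.0
    for i in range(len(scores) - 1):
        remaining = scores[i:]
        top = max(remaining)  # stabilize the log-sum-exp
        log_z = top + math.log(sum(math.exp(s - top) for s in remaining))
        nll += log_z - scores[i]
    return nll

consistent = plackett_luce_nll([3.0, 2.0, 1.0])  # scores agree with the ranking
reversed_order = plackett_luce_nll([1.0, 2.0, 3.0])
pairwise_tie = plackett_luce_nll([0.0, 0.0])     # two items reduce to the pairwise case
```

With a list of length two the loss reduces to the familiar pairwise logistic (Bradley-Terry) form, which is why the listwise objective generalizes pairwise DPO.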

[104] Growing Visual Generative Capacity for Pre-Trained MLLMs

Hanyu Wang,Jiaming Han,Ziyan Yang,Qi Zhao,Shanchuan Lin,Xiangyu Yue,Abhinav Shrivastava,Zhenheng Yang,Hao Chen

Main category: cs.CV

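TL;DR: This paper presents Bridge, a pure autoregressive unified multimodal large language model. Using a Mixture-of-Transformers architecture and a semantic-to-pixel discrete representation, it performs image understanding and generation within a single framework, balancing semantic alignment with pixel-level quality while training more efficiently.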

Details

Motivation: Existing multimodal large language models face challenges in unifying understanding and generation: hybrid approaches break the autoregressive paradigm, while pure autoregressive approaches trade off semantic alignment against pixel-level fidelity. A unified model is needed that stays autoregressive and handles both tasks efficiently. Method: Proposes Bridge, which uses a Mixture-of-Transformers architecture to augment pre-trained visual understanding models with generative ability, and designs a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens, unifying image understanding and generation in a next-token prediction framework. Result: Across diverse multimodal benchmarks, Bridge achieves competitive or superior results in both understanding and generation, with only a 7.9% increase in sequence length and less training data and training time than prior unified MLLMs. Conclusion: Bridge unifies multimodal understanding and generation in a pure autoregressive manner, balances semantic alignment with pixel-level fidelity, and trains more efficiently, offering a new direction for unified multimodal models. Abstract: Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models remains challenging: hybrid approaches combine continuous embeddings with diffusion or flow-based objectives, producing high-quality images but breaking the autoregressive paradigm, while pure autoregressive approaches unify text and image prediction over discrete visual tokens but often face trade-offs between semantic alignment and pixel-level fidelity. In this work, we present Bridge, a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability through a Mixture-of-Transformers architecture, enabling both image understanding and generation within a single next-token prediction framework. To further improve visual generation fidelity, we propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens, achieving strong language alignment and precise description of visual details with only a 7.9% increase in sequence length. Extensive experiments across diverse multimodal benchmarks demonstrate that Bridge achieves competitive or superior results in both understanding and generation benchmarks, while requiring less training data and reduced training time compared to prior unified MLLMs.

[105] Robust Classification of Oral Cancer with Limited Training Data

Akshay Bhagwan Sonawane,Lena D. Swamikannan,Lakshman Tamil

Main category: cs.CV

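TL;DR: Proposes a hybrid model that combines a CNN with Bayesian deep learning for oral cancer classification from small training sets, using variational inference for uncertainty quantification to markedly improve reliability and generalization when data is scarce.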

Details

Motivation: Conventional deep learning models overfit and lack reliability on small datasets, which makes them ill-suited to early oral cancer diagnosis in regions with limited healthcare resources. Method: Combines a convolutional neural network (CNN) with Bayesian deep learning, uses variational inference for uncertainty quantification, trains on color photographs captured by smartphones, and evaluates on three distinct test datasets. Result: Achieves 94% accuracy on a test set with a distribution similar to the training data and 88% accuracy on diverse real-world data, compared with 72.94% for traditional CNNs; the model also shows high uncertainty on misclassified samples and low uncertainty on correctly classified ones. Conclusion: Bayesian deep learning improves the reliability and generalization of oral cancer classification from small training sets and suits early screening in regions with limited healthcare resources. Abstract: Oral cancer ranks among the most prevalent cancers globally, with a particularly high mortality rate in regions lacking adequate healthcare access. Early diagnosis is crucial for reducing mortality; however, challenges persist due to limited oral health programs, inadequate infrastructure, and a shortage of healthcare practitioners. Conventional deep learning models, while promising, often rely on point estimates, leading to overconfidence and reduced reliability. Critically, these models require large datasets to mitigate overfitting and ensure generalizability, an unrealistic demand in settings with limited training data. To address these issues, we propose a hybrid model that combines a convolutional neural network (CNN) with Bayesian deep learning for oral cancer classification using small training sets. This approach employs variational inference to enhance reliability through uncertainty quantification. The model was trained on photographic color images captured by smartphones and evaluated on three distinct test datasets. The proposed method achieved 94% accuracy on a test dataset with a distribution similar to that of the training data, comparable to traditional CNN performance. Notably, for real-world photographic image data, despite limitations and variations differing from the training dataset, the proposed model demonstrated superior generalizability, achieving 88% accuracy on diverse datasets compared to 72.94% for traditional CNNs, even with a smaller dataset. Confidence analysis revealed that the model exhibits low uncertainty (high confidence) for correctly classified samples and high uncertainty (low confidence) for misclassified samples. 
These results underscore the effectiveness of Bayesian inference in data-scarce environments in enhancing early oral cancer diagnosis by improving model reliability and generalizability.
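As a rough illustration of the uncertainty quantification described in this abstract, the sketch below averages class probabilities over several stochastic forward passes and reports the entropy of the averaged prediction. The function name `predictive_entropy` and the sample values are hypothetical; the paper's variational-inference model is not reproduced here.

```python
import math

def predictive_entropy(sample_probs):
    """Average class probabilities over stochastic forward passes
    and return (mean probabilities, entropy of the mean).

    sample_probs: list of probability vectors, one per sampled set of
    weights (hypothetical stand-in for draws from a variational posterior).
    """
    n = len(sample_probs)
    k = len(sample_probs[0])
    mean = [sum(p[c] for p in sample_probs) / n for c in range(k)]
    # Shannon entropy in nats; higher means the model is less certain.
    entropy = -sum(m * math.log(m) for m in mean if m > 0)
    return mean, entropy

# Samples that agree give low entropy; samples that disagree give high entropy.
confident = [[0.95, 0.05], [0.95, 0.05], [0.95, 0.05]]
uncertain = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
```

A screening pipeline could, under these assumptions, flag predictions whose entropy exceeds a chosen threshold for clinician review.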

[106] Consistent Assistant Domains Transformer for Source-free Domain Adaptation

Renrong Shao,Wei Zhang,Kangyang Luo,Qin Li,Jun Wang

Main category: cs.CV

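TL;DR: This paper proposes CADTrans, a source-free domain adaptation method that improves target-domain performance by constructing a consistent assistant domain and applying a multi-kernel maximum mean discrepancy strategy.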

Details

Motivation: Because source-domain data are inaccessible, existing methods struggle to obtain deterministic invariant features and are susceptible to hard samples and domain shift. A method is therefore needed that can extract invariant features and separate easy from hard samples without source data. Method: Proposes CADTrans, which includes an assistant domain module that obtains diversified representations from intermediate aggregated global attentions, applies multiple consistency strategies to obtain invariant feature representations, and designs a conditional multi-kernel maximum mean discrepancy (CMK-MMD) strategy to align hard samples with easy samples. Result: Extensive experiments on benchmarks including Office-31, Office-Home, VISDA-C, and DomainNet-126 show that the method significantly outperforms existing methods. Conclusion: By constructing a consistent assistant domain and an effective feature alignment strategy, CADTrans achieves strong domain adaptation performance when the source domain is unavailable, with good robustness and generalization. Abstract: Source-free domain adaptation (SFDA) aims to address the challenge of adapting to a target domain without accessing the source domain directly. However, due to the inaccessibility of source domain data, deterministic invariable features cannot be obtained. Current mainstream methods primarily focus on evaluating invariant features in the target domain that closely resemble those in the source domain, subsequently aligning the target domain with the source domain. However, these methods are susceptible to hard samples and influenced by domain bias. In this paper, we propose a Consistent Assistant Domains Transformer for SFDA, abbreviated as CADTrans, which solves the issue by constructing invariable feature representations of domain consistency. Concretely, we develop an assistant domain module for CADTrans to obtain diversified representations from the intermediate aggregated global attentions, which addresses the limitation of existing methods in adequately representing diversity. Based on assistant and target domains, invariable feature representations are obtained by multiple consistent strategies, which can be used to distinguish easy and hard samples. Finally, to align the hard samples to the corresponding easy samples, we construct a conditional multi-kernel maximum mean discrepancy (CMK-MMD) strategy to distinguish between samples of the same category and those of different categories. Extensive experiments are conducted on various benchmarks such as Office-31, Office-Home, VISDA-C, and DomainNet-126, proving the significant performance improvements achieved by our proposed approaches.
Code is available at https://github.com/RoryShao/CADTrans.git.
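For readers unfamiliar with the discrepancy measure named in this abstract, here is a minimal sketch of a multi-kernel maximum mean discrepancy between two sets of feature vectors. It is the plain (unconditional) biased estimator with an average of RBF kernels; the class-conditional weighting of CMK-MMD is not reproduced, and the bandwidths and function names are assumptions for illustration.

```python
import math

def rbf(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * bandwidth ** 2))

def multi_kernel(x, y, bandwidths):
    """Average of RBF kernels at several bandwidths (assumed mixture)."""
    return sum(rbf(x, y, b) for b in bandwidths) / len(bandwidths)

def mmd2(xs, ys, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of squared MMD between sample sets xs and ys."""
    def mean_k(a, b):
        total = sum(multi_kernel(u, v, bandwidths) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2 * mean_k(xs, ys)

easy = [[0.0, 0.0], [0.1, 0.0]]
hard_near = [[0.05, 0.0], [0.1, 0.1]]
hard_far = [[5.0, 5.0], [5.1, 5.0]]
```

Minimizing such a quantity between hard-sample and easy-sample features pulls the two distributions together; a conditional variant would compute it per class.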

[107] Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations

Ricardo Gonzalez Penuela,Felipe Arias-Russi,Victor Capriles

Main category: cs.CV

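TL;DR: Proposes a context-aware system that draws on historical visual questions from blind and low vision users (the VizWiz-LF dataset) to guide multimodal large language models toward more relevant and concise image descriptions, improving the efficiency of information access.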

Details

Motivation: When describing images for blind and low vision users, existing MLLMs typically produce lengthy descriptions that lack contextual relevance, making information access inefficient. Method: Uses past visual questions from blind users in the VizWiz-LF dataset: given a new image, the system matches similar visual contexts and uses the associated questions to guide the MLLM toward more targeted descriptions. Result: In the evaluation, 76.1% of context-aware descriptions anticipated and answered users' questions, and they were preferred over conventional context-free descriptions in 54.4% of comparisons. Conclusion: Using users' historical questions as contextual guidance can improve the practicality and interaction efficiency of MLLMs in assistive vision applications, bringing descriptions closer to the actual needs of blind users. Abstract: Multimodal large language models (MLLMs) have been integrated into visual interpretation applications to support Blind and Low Vision (BLV) users because of their accuracy and ability to provide rich, human-like interpretations. However, these applications often default to comprehensive, lengthy descriptions regardless of context. This leads to inefficient exchanges, as users must go through irrelevant details rather than receiving the specific information they are likely to seek. To deliver more contextually-relevant information, we developed a system that draws on historical BLV users' questions. When given an image, our system identifies similar past visual contexts from the VizWiz-LF dataset and uses the associated questions to guide the MLLM to generate descriptions more relevant to BLV users. An evaluation with three human labelers who revised 92 context-aware and context-free descriptions showed that context-aware descriptions anticipated and answered users' questions in 76.1% of cases (70 out of 92) and were preferred in 54.4% of comparisons (50 out of 92). Our paper reviews and data analysis are publicly available in a GitHub repository at https://github.com/rgonzalezp/guiding-multimodal-large-language-models-with-blind-and-low-vision-people-visual-questions.

[108] ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models

Krishna Teja Chitty-Venkata,Murali Emani

Main category: cs.CV

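TL;DR: ImageNet-Think is a multimodal reasoning dataset built on ImageNet21k, containing 250,000 images with structured chains of thought and answers generated by advanced VLMs, intended to support the development of vision language models with explicit reasoning capabilities.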

Details

Motivation: Improving the explicit reasoning capabilities of vision language models (VLMs) and understanding multimodal reasoning mechanisms requires a large-scale, structured reasoning dataset. Method: Starting from 250,000 ImageNet21k images, two advanced VLMs (GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506) generate two thinking-answer sequences for each image, which together form the ImageNet-Think dataset. Result: The resulting ImageNet-Think dataset contains structured thinking tokens and final answers and can be used to train and evaluate multimodal reasoning models. Conclusion: ImageNet-Think provides a resource for developing stronger reasoning-capable VLMs, and the authors plan to release the dataset and benchmarks publicly to support research in this area. Abstract: We develop ImageNet-Think, a multimodal reasoning dataset designed to aid the development of Vision Language Models (VLMs) with explicit reasoning capabilities. Our dataset is built on 250,000 images from the ImageNet21k dataset, providing structured thinking tokens and corresponding answers. Our synthetic dataset is generated by two state-of-the-art VLMs: GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506. Each image is accompanied by two pairs of thinking-answer sequences, creating a resource for training and evaluating multimodal reasoning models. We capture the step-by-step reasoning process of VLMs and the final descriptive answers. Our goal with this dataset is to enable the development of more robust VLMs while contributing to the broader understanding of multimodal reasoning mechanisms. The dataset and evaluation benchmarks will be publicly available to aid research in reasoning/thinking multimodal VLMs.

[109] NPN: Non-Linear Projections of the Null-Space for Imaging Inverse Problems

Roman Jacome,Romario Gualdrón-Hurtado,Leon Suarez,Henry Arguello

Main category: cs.CV

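TL;DR: Proposes Non-Linear Projections of the Null-Space (NPN), a new regularization method that uses a neural network to constrain solutions to a low-dimensional projection of the sensing matrix's null-space, improving reconstruction across a range of imaging inverse problems.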

Details

Motivation: Conventional priors usually ignore the task-specific structure of the null-space, which limits reconstruction quality. Addressing this calls for a regularization method that exploits null-space information. Method: Proposes Non-Linear Projections of the Null-Space (NPN), which uses a neural network to learn the low-dimensional structure of the sensing matrix's null-space and incorporates it as a prior in the reconstruction process; the approach is compatible with existing methods and interpretable. Result: Theoretical analysis provides guarantees on convergence and reconstruction accuracy; experiments show that NPN improves reconstruction quality in tasks including compressive sensing, deblurring, super-resolution, CT, and MRI. Conclusion: By exploiting null-space structure, NPN offers an interpretable and flexible regularization strategy that strengthens reconstruction across imaging inverse problems. Abstract: Imaging inverse problems aim to recover high-dimensional signals from undersampled, noisy measurements, a fundamentally ill-posed task with infinite solutions in the null-space of the sensing operator. To resolve this ambiguity, prior information is typically incorporated through handcrafted regularizers or learned models that constrain the solution space. However, these priors typically ignore the task-specific structure of that null-space. In this work, we propose Non-Linear Projections of the Null-Space (NPN), a novel class of regularization that, instead of enforcing structural constraints in the image domain, promotes solutions that lie in a low-dimensional projection of the sensing matrix's null-space with a neural network. Our approach has two key advantages: (1) Interpretability: by focusing on the structure of the null-space, we design sensing-matrix-specific priors that capture information orthogonal to the signal components that are fundamentally blind to the sensing process. (2) Flexibility: NPN is adaptable to various inverse problems, compatible with existing reconstruction frameworks, and complementary to conventional image-domain priors. We provide theoretical guarantees on convergence and reconstruction accuracy when used within plug-and-play methods. Empirical results across diverse sensing matrices demonstrate that NPN priors consistently enhance reconstruction fidelity in various imaging inverse problems, such as compressive sensing, deblurring, super-resolution, computed tomography, and magnetic resonance imaging, with plug-and-play methods, unrolling networks, deep image prior, and diffusion models.
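To make the null-space idea concrete, the sketch below splits a signal into the part a sensing matrix can see and the part it cannot. It assumes, for simplicity, a sensing matrix `A` with orthonormal rows, so that A^T A is the orthogonal projector onto the measured subspace; the helper names are hypothetical and no learned projection is involved.

```python
import math

def matvec(A, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rmatvec(A, y):
    """Compute A^T y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def split_range_null(A, x):
    """Split x into its measured part and its null-space part.

    Assumes the rows of A are orthonormal, so A^T A projects onto the row space.
    """
    x_range = rmatvec(A, matvec(A, x))
    x_null = [xi - ri for xi, ri in zip(x, x_range)]
    return x_range, x_null

s = 1 / math.sqrt(2)
A = [[s, s, 0.0], [0.0, 0.0, 1.0]]  # 2 measurements of a 3-dimensional signal
x = [3.0, 1.0, 5.0]
x_range, x_null = split_range_null(A, x)
```

Here `x_null` is invisible to the measurements (`A x_null` is zero), which is why a prior is needed to recover it; NPN, as described in the abstract, learns a low-dimensional projection of this component.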

[110] Automated Genomic Interpretation via Concept Bottleneck Models for Medical Robotics

Zijun Li,Jinchang Zhang,Ming Zhang,Guoyu Lu

Main category: cs.CV

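TL;DR: Proposes an automated genomic interpretation module that turns DNA sequences into actionable decisions, combining CGR with a concept bottleneck model to achieve accurate, interpretable HIV subtype classification and support automated clinical decision-making.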

Details

Motivation: Genomic medicine needs genomic interpretation methods that are reliable, interpretable, and suitable for integration into medical automation systems; existing models lack biological interpretability and clinical practicality. Method: Uses Chaos Game Representation (CGR) to extract sequence features and a Concept Bottleneck Model (CBM) that predicts through biological concepts such as GC content, CpG density, and k-mer motifs; adds concept fidelity supervision, prior consistency alignment, KL distribution matching, and uncertainty calibration to improve reliability; and finally generates decision policies through a cost-aware recommendation layer. Result: Achieves state-of-the-art HIV subtype classification on in-house and LANL datasets, clearly outperforming baselines, together with stronger concept prediction fidelity, better uncertainty calibration, and more favorable cost-benefit trade-offs. Conclusion: The work builds a reliable framework connecting interpretable genomic modeling with automated decision-making, providing a workable foundation for robotic and clinical genomic automation. Abstract: We propose an automated genomic interpretation module that transforms raw DNA sequences into actionable, interpretable decisions suitable for integration into medical automation and robotic systems. Our framework combines Chaos Game Representation (CGR) with a Concept Bottleneck Model (CBM), enforcing predictions to flow through biologically meaningful concepts such as GC content, CpG density, and k-mer motifs. To enhance reliability, we incorporate concept fidelity supervision, prior consistency alignment, KL distribution matching, and uncertainty calibration. Beyond accurate classification of HIV subtypes across both in-house and LANL datasets, our module delivers interpretable evidence that can be directly validated against biological priors. A cost-aware recommendation layer further translates predictive outputs into decision policies that balance accuracy, calibration, and clinical utility, reducing unnecessary retests and improving efficiency. Extensive experiments demonstrate that the proposed system achieves state-of-the-art classification performance, superior concept prediction fidelity, and more favorable cost-benefit trade-offs compared to existing baselines. By bridging the gap between interpretable genomic modeling and automated decision-making, this work establishes a reliable foundation for robotic and clinical automation in genomic medicine.
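A minimal sketch of the two ingredients named in this abstract, Chaos Game Representation and a GC-content concept, may help. The corner assignment in `CORNERS` is one common convention and is an assumption here, as are the function names; the paper's actual feature pipeline may differ.

```python
# Assumed corner layout for the unit square; conventions vary between papers.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Chaos Game Representation: start at the centre of the unit square and,
    for each base, move halfway toward that base's corner."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points

def gc_content(seq):
    """Fraction of G and C among the recognised bases."""
    valid = [b for b in seq.upper() if b in CORNERS]
    return sum(b in "GC" for b in valid) / len(valid) if valid else 0.0
```

Binning the CGR points into a grid gives an image-like input for a CNN, while quantities such as `gc_content` are the kind of intermediate concept a concept bottleneck model would be supervised to predict.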

[111] VLA-R1: Enhancing Reasoning in Vision-Language-Action Models

Angen Ye,Zeyu Zhang,Boyuan Wang,Xiaofeng Wang,Dapeng Zhang,Zheng Zhu

Main category: cs.CV

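TL;DR: This paper presents VLA-R1, a reasoning-enhanced vision-language-action (VLA) model that combines Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO), designs fine-grained rewards to improve reasoning and execution, and builds the high-quality VLA-CoT-13K dataset; experiments show strong generalization and real-world performance across multiple settings.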

Details

Motivation: Existing VLA models lack explicit step-by-step reasoning, and their training does little to reinforce reasoning quality, so action generation ignores physical constraints and geometric relations and cross-task, cross-scene generalization suffers. Method: Proposes VLA-R1, post-trained with Reinforcement Learning from Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO), with verifiable rewards for region alignment, trajectory consistency, and output formatting, and builds VLA-CoT-13K, a high-quality dataset with chain-of-thought supervision, to strengthen reasoning. Result: Extensive experiments on in-domain, out-of-domain, simulation, and real-robot platforms show that VLA-R1 outperforms prior methods in reasoning robustness, execution accuracy, and real-world performance, with stronger generalization. Conclusion: By introducing reinforcement learning with verifiable rewards and high-quality chain-of-thought data, VLA-R1 improves the coordination of reasoning and execution in VLA models and offers a new direction for cross-modal decision-making in embodied AI. Abstract: Vision-Language-Action (VLA) models aim to unify perception, language understanding, and action generation, offering strong cross-task and cross-scene generalization with broad impact on embodied AI. However, current VLA models often lack explicit step-by-step reasoning, instead emitting final actions without considering affordance constraints or geometric relations. Their post-training pipelines also rarely reinforce reasoning quality, relying primarily on supervised fine-tuning with weak reward design. To address these challenges, we present VLA-R1, a reasoning-enhanced VLA that integrates Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO) to systematically optimize both reasoning and execution. Specifically, we design an RLVR-based post-training strategy with verifiable rewards for region alignment, trajectory consistency, and output formatting, thereby strengthening reasoning robustness and execution accuracy. Moreover, we develop VLA-CoT-13K, a high-quality dataset that provides chain-of-thought supervision explicitly aligned with affordance and trajectory annotations. Furthermore, extensive evaluations on in-domain, out-of-domain, simulation, and real-robot platforms demonstrate that VLA-R1 achieves superior generalization and real-world performance compared to prior VLA methods. We plan to release the model, code, and dataset following the publication of this work. Code: https://github.com/GigaAI-research/VLA-R1. Website: https://gigaai-research.github.io/VLA-R1.
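The group-relative part of GRPO mentioned in this abstract can be sketched in a few lines: several responses are sampled for the same prompt, each receives a verifiable reward, and each reward is normalized against its own group. The use of the population standard deviation and the `eps` constant are assumptions for illustration; the reward values are made up.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (assumed) to avoid division by zero when all
    rewards in the group are identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards, e.g. 1.0 when a predicted trajectory passes a check.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Responses that beat their group average get a positive advantage and are reinforced; when every response in a group earns the same reward, the advantages vanish and that group contributes no learning signal.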

[112] Joint Deblurring and 3D Reconstruction for Macrophotography

Yifan Zhao,Liangchen Li,Yuqi Zhou,Kai Wang,Yan Liang,Juyong Zhang

Main category: cs.CV

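TL;DR: Proposes a joint deblurring and 3D reconstruction method for macrophotography that uses multi-view blurry images to jointly optimize a sharp 3D model of the object and a defocus blur kernel for every pixel.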

Details

Motivation: Defocus blur in macrophotography severely degrades image sharpness and high-quality 3D reconstruction; traditional deblurring methods depend on large numbers of images and annotations, and no multi-view 3D reconstruction method targets macrophotography. Method: Takes multi-view blurry images as input and uses differentiable rendering to jointly optimize a sharp 3D model of the object and a per-pixel defocus blur kernel in a self-supervised manner. Result: Experiments show that, from only a small number of multi-view images, the method achieves high-quality image deblurring and recovers high-fidelity 3D appearance. Conclusion: The method addresses defocus blur in macrophotography, delivering strong deblurring and accurate 3D reconstruction from few input images. Abstract: Macro lenses offer high resolution and large magnification, and 3D modeling of small and detailed objects can provide richer information. However, defocus blur in macrophotography is a long-standing problem that heavily hinders the clear imaging of the captured objects and high-quality 3D reconstruction of them. Traditional image deblurring methods require a large number of images and annotations, and there is currently no multi-view 3D reconstruction method for macrophotography. In this work, we propose a joint deblurring and 3D reconstruction method for macrophotography. Starting from captured multi-view blurry images, we jointly optimize the clear 3D model of the object and the defocus blur kernel of each pixel. The entire framework adopts a differentiable rendering method to self-supervise the optimization of the 3D model and the defocus blur kernel. Extensive experiments show that from a small number of multi-view images, our proposed method can not only achieve high-quality image deblurring but also recover high-fidelity 3D appearance.

[113] FideDiff: Efficient Diffusion Model for High-Fidelity Image Motion Deblurring

Xiaoyang Liu,Zhengyan Zhou,Zihang Xu,Jiezhang Cao,Zheng Chen,Yulun Zhang

Main category: cs.CV

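TL;DR: This paper proposes FideDiff, a novel single-step diffusion model for high-fidelity image deblurring. By recasting motion deblurring as a diffusion process and introducing consistency training and Kernel ControlNet, it achieves fast, high-quality deblurring.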

Details

Motivation: Pre-trained diffusion models perform well in image restoration, but long inference times and insufficient fidelity limit their use, so an efficient, high-fidelity deblurring method is needed. Method: Models deblurring as a diffusion process in which each timestep represents a progressively blurred image; trains a consistency model that aligns all timesteps to the same clean image; and adds Kernel ControlNet for blur kernel estimation together with adaptive timestep prediction. Result: FideDiff outperforms previous diffusion-based methods on full-reference metrics, matches state-of-the-art non-diffusion models, and deblurs in a single fast step. Conclusion: FideDiff offers a new direction for applying pre-trained diffusion models to high-fidelity image restoration and establishes a robust baseline for real-world industrial applications. Abstract: Recent advancements in image motion deblurring, driven by CNNs and transformers, have made significant progress. Large-scale pre-trained diffusion models, which are rich in real-world modeling, have shown great promise for high-quality image restoration tasks such as deblurring, demonstrating stronger generative capabilities than CNN and transformer-based methods. However, challenges such as prohibitive inference time and compromised fidelity still limit the full potential of the diffusion models. To address this, we introduce FideDiff, a novel single-step diffusion model designed for high-fidelity deblurring. We reformulate motion deblurring as a diffusion-like process where each timestep represents a progressively blurred image, and we train a consistency model that aligns all timesteps to the same clean image. By reconstructing training data with matched blur trajectories, the model learns temporal consistency, enabling accurate one-step deblurring. We further enhance model performance by integrating Kernel ControlNet for blur kernel estimation and introducing adaptive timestep prediction. Our model achieves superior performance on full-reference metrics, surpassing previous diffusion-based methods and matching the performance of other state-of-the-art models. FideDiff offers a new direction for applying pre-trained diffusion models to high-fidelity image restoration tasks, establishing a robust baseline for further advancing diffusion models in real-world industrial applications. Our dataset and code will be available at https://github.com/xyLiu339/FideDiff.

[114] LadderMoE: Ladder-Side Mixture of Experts Adapters for Bronze Inscription Recognition

Rixin Zhou,Peiqiang Qiu,Qian Zhang,Chuntao Li,Xi Yang

Main category: cs.CV

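TL;DR: This paper proposes a two-stage detection-recognition pipeline enhanced with LadderMoE to tackle the cross-domain variability and long-tailed distribution that hinder automatic recognition of bronze inscriptions.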

Details

Motivation: Bronze inscriptions are an important form of early Chinese writing, but their automatic recognition faces severe visual degradation, variation across acquisition modalities, and an extremely long-tailed character distribution. Method: Builds a large-scale dataset of 22454 full-page images and 198598 annotated characters, and proposes a two-stage detection-recognition framework in which LadderMoE augments a pretrained CLIP encoder, enabling dynamic expert specialization for stronger cross-domain robustness. Result: Substantially outperforms existing scene text recognition methods on single-character and full-page recognition, with higher accuracy across head, mid, and tail categories and all acquisition modalities. Conclusion: The method establishes a strong baseline for bronze inscription recognition and supports subsequent archaeological and historical research. Abstract: Bronze inscriptions (BI), engraved on ritual vessels, constitute a crucial stage of early Chinese writing and provide indispensable evidence for archaeological and historical studies. However, automatic BI recognition remains difficult due to severe visual degradation, multi-domain variability across photographs, rubbings, and tracings, and an extremely long-tailed character distribution. To address these challenges, we curate a large-scale BI dataset comprising 22454 full-page images and 198598 annotated characters spanning 6658 unique categories, enabling robust cross-domain evaluation. Building on this resource, we develop a two-stage detection-recognition pipeline that first localizes inscriptions and then transcribes individual characters. To handle heterogeneous domains and rare classes, we equip the pipeline with LadderMoE, which augments a pretrained CLIP encoder with ladder-style MoE adapters, enabling dynamic expert specialization and stronger robustness. Comprehensive experiments on single-character and full-page recognition tasks demonstrate that our method substantially outperforms state-of-the-art scene text recognition baselines, achieving superior accuracy across head, mid, and tail categories as well as all acquisition modalities. These results establish a strong foundation for bronze inscription recognition and downstream archaeological analysis.

[115] VirDA: Reusing Backbone for Unsupervised Domain Adaptation with Visual Reprogramming

Duy Nguyen,Dat Nguyen

Main category: cs.CV

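TL;DR: Proposes VirDA, which performs efficient unsupervised domain adaptation by prepending a domain-specific visual reprogramming layer to the backbone, sharply reducing trainable parameters and storage overhead while maintaining high performance.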

Details

Motivation: Existing UDA methods fine-tune the entire backbone for every new source-target pair, so parameter counts and storage grow linearly and the pre-trained backbone cannot be reused. Inspired by findings that backbones carry textural biases, the authors aim to exploit domain-specific textural bias for more efficient domain adaptation. Method: VirDA leaves the backbone untouched and prepends a domain-specific visual reprogramming layer that produces visual prompts acting as a textural bias, adjusting the "style" of the input image to suit the target domain; the layer is optimized with multiple objective functions that reduce intra- and inter-domain distribution differences. Result: Reaches 92.8% mean accuracy on Office-31 with only 1.5M trainable parameters; improves accuracy by 1.6% over the parameter-efficient baseline PDA while using only 46% of its parameters; compared with the full fine-tuning methods CDTrans and FixBi, achieves comparable or higher accuracy with only 1.7% and 2.8% of their parameters; compared with the strongest methods PMTrans and TVT, uses about 1.7% of their parameters with accuracy only 2.2% and 1.1% lower, respectively. Conclusion: Through visual reprogramming, VirDA exploits domain-specific textural bias to deliver efficient, reusable unsupervised domain adaptation, keeping performance competitive while greatly reducing the parameter count. Abstract: Existing UDA pipelines fine-tune already well-trained backbone parameters for every new source-and-target pair, resulting in the number of training parameters and storage memory growing linearly with each new pair, and also preventing the reuse of these well-trained backbone parameters. Inspired by recent implications that existing backbones have textural biases, we propose making use of domain-specific textural bias for domain adaptation via visual reprogramming, namely VirDA. Instead of fine-tuning the full backbone, VirDA prepends a domain-specific visual reprogramming layer to the backbone. This layer produces visual prompts that act as an added textural bias to the input image, adapting its "style" to a target domain. To optimize these visual reprogramming layers, we use multiple objective functions that optimize the intra- and inter-domain distribution differences when domain-adapting visual prompts are applied. This process does not require modifying the backbone parameters, allowing the same backbone to be reused across different domains. We evaluate VirDA on Office-31 and obtain 92.8% mean accuracy with only 1.5M trainable parameters. VirDA surpasses PDA, the state-of-the-art parameter-efficient UDA baseline, by +1.6% accuracy while using just 46% of its parameters. Compared with full-backbone fine-tuning, VirDA outperforms CDTrans and FixBi by +0.2% and +1.4%, respectively, while requiring only 1.7% and 2.8% of their trainable parameters.
Relative to the strongest current methods (PMTrans and TVT), VirDA uses ~1.7% of their parameters and trades off only 2.2% and 1.1% accuracy, respectively.

[116] Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery

Minh Tran,Maksim Siniukov,Zhangyu Jin,Mohammad Soleymani

Main category: cs.CV

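TL;DR: This paper proposes Discrete Facial Encoding (DFE), an unsupervised, data-driven method that learns a compact and interpretable dictionary of facial expressions from 3D mesh sequences using a Residual Vector Quantized Variational Autoencoder (RVQ-VAE); it outperforms FACS and other existing methods on psychological tasks such as stress detection, personality prediction, and depression detection.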

Details

Motivation: Existing facial expression coding systems such as FACS have limited coverage and costly manual annotation, which makes them hard to scale to large psychological and affective computing applications. Method: First uses a 3D Morphable Model (3DMM) to extract identity-invariant expression features from images, disentangling factors such as head pose and facial geometry; then encodes these features with a Residual Vector Quantized Variational Autoencoder (RVQ-VAE), producing a sequence of discrete tokens from a shared codebook in which each token represents a reusable facial deformation pattern. Result: Experiments show that DFE captures facial behavior more precisely than FACS and other facial encoding methods; on stress detection, personality prediction, and depression detection, a Bag-of-Words model built on DFE clearly outperforms FACS and strong baselines such as Masked Autoencoders; DFE also covers a wider range of facial displays. Conclusion: DFE is a scalable and effective alternative to FACS with considerable potential for psychological and affective computing applications. Abstract: Facial expression analysis is central to understanding human behavior, yet existing coding systems such as the Facial Action Coding System (FACS) are constrained by limited coverage and costly manual annotation. In this work, we introduce Discrete Facial Encoding (DFE), an unsupervised, data-driven alternative that learns a compact and interpretable dictionary of facial expressions from 3D mesh sequences through a Residual Vector Quantized Variational Autoencoder (RVQ-VAE). Our approach first extracts identity-invariant expression features from images using a 3D Morphable Model (3DMM), effectively disentangling factors such as head pose and facial geometry. We then encode these features using an RVQ-VAE, producing a sequence of discrete tokens from a shared codebook, where each token captures a specific, reusable facial deformation pattern that contributes to the overall expression. Through extensive experiments, we demonstrate that Discrete Facial Encoding captures more precise facial behaviors than FACS and other facial encoding alternatives. We evaluate the utility of our representation across three high-level psychological tasks: stress detection, personality prediction, and depression detection. Using a simple Bag-of-Words model built on top of the learned tokens, our system consistently outperforms both FACS-based pipelines and strong image and video representation learning models such as Masked Autoencoders.
Further analysis reveals that our representation covers a wider variety of facial displays, highlighting its potential as a scalable and effective alternative to FACS for psychological and affective computing applications.
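The residual quantization step behind the discrete tokens can be sketched as follows: at each stage the nearest codebook entry to the current residual is chosen, its index becomes a token, and the entry is subtracted before the next stage. The tiny codebook, the stage count, and the function names are hypothetical; the encoder and decoder networks of the RVQ-VAE are omitted.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sqdist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))

def rvq_encode(vec, codebook, num_stages):
    """Residual vector quantization with one codebook shared across stages."""
    residual = list(vec)
    tokens = []
    for _ in range(num_stages):
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual

def rvq_decode(tokens, codebook):
    """Reconstruct the vector as the sum of the selected codebook entries."""
    out = [0.0] * len(codebook[0])
    for idx in tokens:
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

codebook = [[1.0, 0.0], [0.0, 1.0], [0.25, 0.25]]  # toy deformation patterns
tokens, residual = rvq_encode([1.2, 0.3], codebook, num_stages=2)
```

Counting how often each token index occurs over a video would give the kind of Bag-of-Words feature vector the abstract describes for downstream tasks.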

[117] Non-Rigid Structure-from-Motion via Differential Geometry with Recoverable Conformal Scale

Yongbo Chen,Yanhao Zhang,Shaifali Parashar,Liang Zhao,Shoudong Huang

Main category: cs.CV

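TL;DR: This paper proposes Con-NRSfM, a new method for non-rigid structure-from-motion (NRSfM) under conformal deformations. It performs point-wise reconstruction from 2D image warps within a graph-based optimization framework, decouples the constraints on depth and conformal scale, and adds self-supervised learning to generate dense textured 3D point clouds, showing higher reconstruction accuracy and robustness on synthetic and real data.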

Details

Motivation: Existing NRSfM methods rely on strong assumptions such as locally planar surfaces or locally linear deformations and cannot recover the conformal scale, which limits their accuracy and applicability in monocular visual deformable SLAM. Method: Proposes Con-NRSfM, which performs point-wise reconstruction using selected 2D image warps optimized through a graph-based framework, decouples the constraints on depth and conformal scale, adopts a parallel separable iterative optimization strategy for stability, and uses a self-supervised encoder-decoder network to generate dense textured 3D point clouds. Result: Experiments on synthetic and real datasets show that the method exceeds existing methods in reconstruction accuracy and robustness and accurately estimates local conformal scale and depth. Conclusion: Con-NRSfM overcomes limitations of traditional NRSfM methods, applies to a broader range of deformations, and improves 3D reconstruction in monocular visual deformable SLAM. Abstract: Non-rigid structure-from-motion (NRSfM), a promising technique for addressing the mapping challenges in monocular visual deformable simultaneous localization and mapping (SLAM), has attracted growing attention. We introduce a novel method, called Con-NRSfM, for NRSfM under conformal deformations, encompassing isometric deformations as a subset. Our approach performs point-wise reconstruction using selected 2D image warps optimized through a graph-based framework. Unlike existing methods that rely on strict assumptions, such as locally planar surfaces or locally linear deformations, and fail to recover the conformal scale, our method eliminates these constraints and accurately computes the local conformal scale. Additionally, our framework decouples constraints on depth and conformal scale, which are inseparable in other approaches, enabling more precise depth estimation. To address the sensitivity of the formulated problem, we employ a parallel separable iterative optimization strategy. Furthermore, a self-supervised learning framework, utilizing an encoder-decoder network, is incorporated to generate dense 3D point clouds with texture. Simulation and experimental results using both synthetic and real datasets demonstrate that our method surpasses existing approaches in terms of reconstruction accuracy and robustness. The code for the proposed method will be made publicly available on the project website: https://sites.google.com/view/con-nrsfm.

[118] UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction

Jin Cao,Hongrui Wu,Ziyong Feng,Hujun Bao,Xiaowei Zhou,Sida Peng

Main category: cs.CV

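TL;DR: This paper proposes UniVerse, a unified framework for robust 3D scene reconstruction from inconsistent multi-view images. It decouples the task into image restoration followed by 3D reconstruction and uses a video diffusion model to learn a general scene prior from large-scale data, handling diverse image inconsistencies and showing strong generalization and performance on synthetic and real data.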

Details

Motivation: Existing methods based on neural 3D scene representations depend on dense observations when handling inconsistent images and struggle to optimize parameters robustly. The paper therefore aims for a more robust and more general reconstruction method that copes with diverse image inconsistencies. Method: Decouples robust reconstruction into restoration and reconstruction; first converts the inconsistent images into initial videos, then restores consistent images with a specially designed video diffusion model, and finally reconstructs the 3D scene from the restored images. Result: Experiments on synthetic and real-world datasets show strong generalization across image inconsistencies and reconstruction performance superior to existing methods; the style of the reconstructed 3D scene can also be controlled. Conclusion: With a unified framework built on a video diffusion model, UniVerse addresses robust 3D reconstruction under multi-view inconsistency, showing strong restoration ability, good generalization, and style controllability. Abstract: This paper tackles the challenge of robust reconstruction, i.e., the task of reconstructing a 3D scene from a set of inconsistent multi-view images. Some recent works have attempted to simultaneously remove image inconsistencies and perform reconstruction by integrating image degradation modeling into neural 3D scene representations. However, these methods rely heavily on dense observations for robustly optimizing model parameters. To address this issue, we propose to decouple robust reconstruction into two subtasks: restoration and reconstruction, which naturally simplifies the optimization process. To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images, and finally reconstructs the 3D scenes from these restored images. Compared with case-by-case per-view degradation modeling, the diffusion model learns a general scene prior from large-scale data, making it applicable to diverse image inconsistencies. Extensive experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction. Moreover, UniVerse can control the style of the reconstructed 3D scene. Project page: https://jin-cao-tma.github.io/UniVerse.github.io/

[119] An Efficient Deep Template Matching and In-Plane Pose Estimation Method via Template-Aware Dynamic Convolution

Ke Jia,Ji Zhou,Hanxin Li,Zhigan Zhou,Haojie Chu,Xiaojie Li

Main category: cs.CV

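TL;DR: Proposes a lightweight end-to-end template matching framework that recasts template matching as joint localization and geometric regression; a Template-Aware Dynamic Convolution Module (TDCM) enables efficient, accurate pose estimation under rotation, scaling, and complex backgrounds for real-time industrial use.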

Details

Motivation: Traditional methods are inefficient under compound transformations, while existing deep learning methods do not explicitly model geometric pose, so neither meets the combined accuracy and efficiency demands of industrial settings. Method: Models template matching as joint center localization and regression of geometric parameters (rotation angle plus horizontal and vertical scales); designs a Template-Aware Dynamic Convolution Module (TDCM) that injects template features at inference to improve generalization; builds an efficient network from depthwise separable convolutions and pixel shuffle; trains without geometric annotations using rotation-shear-based augmentation and structure-aware pseudo labels; and adds a lightweight refinement module to improve angle and scale precision. Result: With only 3.07M parameters, the model reaches 14ms inference under compound transformations with high precision and strong robustness, performing especially well with small templates and multiple objects. Conclusion: The method delivers accurate geometric state estimation while staying lightweight, improving the practicality and deployment efficiency of template matching in complex industrial scenes. Abstract: In industrial inspection and component alignment tasks, template matching requires efficient estimation of a target's position and geometric state (rotation and scaling) under complex backgrounds to support precise downstream operations. Traditional methods rely on exhaustive enumeration of angles and scales, leading to low efficiency under compound transformations. Meanwhile, most deep learning-based approaches only estimate similarity scores without explicitly modeling geometric pose, making them inadequate for real-world deployment. To overcome these limitations, we propose a lightweight end-to-end framework that reformulates template matching as joint localization and geometric regression, outputting the center coordinates, rotation angle, and independent horizontal and vertical scales. A Template-Aware Dynamic Convolution Module (TDCM) dynamically injects template features at inference to guide generalizable matching. The compact network integrates depthwise separable convolutions and pixel shuffle for efficient matching. To enable geometric-annotation-free training, we introduce a rotation-shear-based augmentation strategy with structure-aware pseudo labels. A lightweight refinement module further improves angle and scale precision via local optimization. Experiments show our 3.07M model achieves high precision and 14ms inference under compound transformations. It also demonstrates strong robustness in small-template and multi-object scenarios, making it highly suitable for deployment in real-time industrial applications. The code is available at https://github.com/ZhouJ6610/PoseMatch-TDCM.
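One reason depthwise separable convolutions help keep such a network compact can be shown with simple parameter counting. The layer sizes below are illustrative assumptions, not taken from the paper, and bias terms are ignored.

```python
def conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out

def separable_params(kernel, c_in, c_out):
    """Weights in a depthwise separable convolution (no bias):
    one kernel x kernel filter per input channel, then a 1x1 pointwise mix."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = conv_params(3, 64, 128)
separable = separable_params(3, 64, 128)
ratio = separable / standard
```

For this hypothetical layer the separable form needs roughly 12% of the weights of the standard convolution, which is the kind of saving that makes a few-million-parameter matcher plausible.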

[120] Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning

Xuchen Li,Xuzhao Li,Jiahui Gao,Renjie Pi,Shiyu Hu,Wentao Zhang

Main category: cs.CV

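TL;DR: This paper proposes an adaptive pixel reasoning framework that uses operation-aware supervised fine-tuning and rollout-guided reinforcement learning so that a vision-language model decides, based on query difficulty, whether to invoke pixel-level operations, improving performance while cutting unnecessary computation.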

Details

Motivation: Existing vision-language models perform poorly on tasks that need fine-grained visual understanding, mainly because of information loss during image encoding or insufficient attention to critical regions; adding pixel-level information helps but is often overused and distracting. Method: First applies operation-aware supervised fine-tuning to establish baseline competence in textual reasoning and visual operations, then designs a rollout-guided reinforcement learning framework driven by feedback on the model's own responses, so that the model decides when to invoke pixel operations according to query difficulty. Result: Across multiple multimodal reasoning benchmarks, the model reaches 73.4% accuracy on HR-Bench 4K with a tool usage ratio of only 20.1%, which is higher accuracy than previous methods with 66.5% less tool usage. Conclusion: The adaptive pixel reasoning framework balances performance and computational efficiency, yielding more efficient multimodal reasoning. Abstract: Vision-Language Models (VLMs) excel at many multimodal tasks, yet they frequently struggle with tasks requiring precise understanding and handling of fine-grained visual elements. This is mainly due to information loss during image encoding or insufficient attention to critical regions. Recent work has shown promise by incorporating pixel-level visual information into the reasoning process, enabling VLMs to access high-resolution visual details during their thought process. However, this pixel-level information is often overused, leading to inefficiency and distraction from irrelevant visual details. To address these challenges, we propose the first framework for adaptive pixel reasoning that dynamically determines necessary pixel-level operations based on the input query. Specifically, we first apply operation-aware supervised fine-tuning to establish baseline competence in textual reasoning and visual operations, then design a novel rollout-guided reinforcement learning framework relying on feedback of the model's own responses, which enables the VLM to determine when pixel operations should be invoked based on query difficulty. Experiments on extensive multimodal reasoning benchmarks show that our model achieves superior performance while significantly reducing unnecessary visual operations. Impressively, our model achieves 73.4% accuracy on HR-Bench 4K while maintaining a tool usage ratio of only 20.1%, improving accuracy and simultaneously reducing tool usage by 66.5% compared to the previous methods.

[121] Uncovering Overconfident Failures in CXR Models via Augmentation-Sensitivity Risk Scoring

Han-Jay Shu,Wei-Ning Chiu,Shun-Ting Chang,Meng-Ping Huang,Takeshi Tohyama,Ahram Han,Po-Chih Kuo

Main category: cs.CV

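TL;DR: Proposes an augmentation-sensitivity risk scoring (ASRS) framework for identifying error-prone chest radiograph (CXR) cases. It measures embedding shifts under clinically plausible rotations with the RAD-DINO encoder, surfacing hidden model errors across patient subgroups and improving the fairness and safety of medical AI.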

Details

Motivation: Deep learning models interpret chest radiographs well, but accuracy is uneven across patient subgroups; conventional error detection struggles to catch subtle within-distribution errors, and effective label-free error identification is lacking. Method: The ASRS framework applies ±15°/±30° rotation augmentations to CXR images, measures the resulting shift in embedding space with the RAD-DINO encoder, computes sensitivity scores, and stratifies samples into stability quartiles to identify error-prone cases. Result: Highly sensitive samples show substantially lower recall (-0.2 to -0.3) even though overall AUROC and confidence are high; ASRS exposes hidden errors that conventional metrics miss. Conclusion: ASRS offers a label-free error detection method for selective prediction and clinician review, improving the fairness and reliability of medical AI systems. Abstract: Deep learning models achieve strong performance in chest radiograph (CXR) interpretation, yet fairness and reliability concerns persist. Models often show uneven accuracy across patient subgroups, leading to hidden failures not reflected in aggregate metrics. Existing error detection approaches, based on confidence calibration or out-of-distribution (OOD) detection, struggle with subtle within-distribution errors, while image- and representation-level consistency-based methods remain underexplored in medical imaging. We propose an augmentation-sensitivity risk scoring (ASRS) framework to identify error-prone CXR cases. ASRS applies clinically plausible rotations (±15°/±30°) and measures embedding shifts with the RAD-DINO encoder. Sensitivity scores stratify samples into stability quartiles, where highly sensitive cases show substantially lower recall (-0.2 to -0.3) despite high AUROC and confidence. ASRS provides a label-free means for selective prediction and clinician review, improving fairness and safety in medical AI.
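The scoring-and-stratification idea in this abstract can be sketched as below. The abstract does not state how embedding shift is measured, so the mean cosine distance used here is an assumption, as are the function names; the embeddings would come from an encoder such as RAD-DINO, which is not called in this sketch.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def sensitivity_score(base, augmented):
    """Mean embedding shift between an image and its rotated copies.

    base: embedding of the original image.
    augmented: embeddings of the rotated versions (e.g. four rotations).
    """
    return sum(cosine_distance(base, a) for a in augmented) / len(augmented)

def stability_quartiles(scores):
    """Assign each score a quartile 0..3 by rank (3 = most sensitive)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    quartile = [0] * len(scores)
    for rank, i in enumerate(order):
        quartile[i] = min(3, rank * 4 // len(scores))
    return quartile
```

Under these assumptions, cases falling in the top quartile would be the ones routed to selective prediction or clinician review, with no ground-truth labels needed to compute the score.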

[122] FreeViS: Training-free Video Stylization with Inconsistent References

Jiacong Xu,Yiqun Mei,Ke Zhang,Vishal M. Patel

Main category: cs.CV

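TL;DR: Proposes FreeViS, a training-free video stylization framework that integrates multiple stylized references into a pretrained image-to-video model to generate detail-rich stylized videos while preserving temporal consistency.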

Details

Motivation: Frame-by-frame image stylization suffers from temporal inconsistency and reduced style richness, while dedicated video stylization models typically require paired video data and are computationally expensive. Method: FreeViS integrates multiple stylized references into a pretrained image-to-video model, uses high-frequency compensation to constrain content layout and motion, and uses flow-based motion cues to preserve style textures in low-saliency regions. Result: FreeViS outperforms recent baselines in stylization fidelity and temporal consistency and achieves strong human preference, while requiring no training. Conclusion: FreeViS offers a practical and economical solution for high-quality, temporally coherent video stylization. Abstract: Video stylization plays a key role in content creation, but it remains a challenging problem. Naïvely applying image stylization frame-by-frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references to a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works, without introducing flickers and stutters. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. Through extensive evaluations, FreeViS delivers higher stylization fidelity and superior temporal consistency, outperforming recent baselines and achieving strong human preference. Our training-free pipeline offers a practical and economic solution for high-quality, temporally coherent video stylization. The code and videos can be accessed via https://xujiacong.github.io/FreeViS/

[123] MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs

Jiyao Liu,Jinjie Wei,Wanying Qu,Chenglong Ma,Junzhi Ning,Yunheng Li,Ying Chen,Xinzhe Luo,Pengcheng Chen,Xin Gao,Ming Hu,Huihui Xu,Xin Wang,Shujian Gao,Dingkang Yang,Zhongying Deng,Jin Ye,Lihao Liu,Junjun He,Ningsheng Xu

Main category: cs.CV

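TL;DR: Introduces MedQ-Bench, a benchmark for evaluating the medical image quality assessment abilities of multimodal large language models (MLLMs). It adopts a perception-reasoning two-task paradigm covering five imaging modalities and many quality attributes, evaluates models with a multi-dimensional judging protocol validated against radiologists' judgments, and finds that current MLLMs show preliminary but unstable ability in medical image quality assessment and need targeted optimization.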

Details

Motivation: Existing medical image quality assessment methods rely on scalar score-based metrics that cannot reflect the descriptive, human-like reasoning of expert evaluation, and there is no language-based evaluation framework that models human perception and reasoning. Method: Builds the MedQ-Bench benchmark, comprising MedQ-Perception (low-level perception tasks) and MedQ-Reasoning (no-reference and comparison reasoning tasks), spanning five imaging modalities and more than 40 quality attributes; proposes a four-dimensional judging protocol and validates human-AI alignment by comparing LLM-based judgments with those of radiologists. Result: Experiments on 14 state-of-the-art MLLMs show that current models have preliminary but unstable perceptual and reasoning skills, with accuracy insufficient for reliable clinical use; the human-AI alignment analysis indicates that a substantial gap remains. Conclusion: MedQ-Bench establishes a new paradigm for language-based evaluation of medical image quality, exposes the limitations of current MLLMs on this task, and provides direction and a foundation for future targeted optimization. Abstract: Medical Image Quality Assessment (IQA) serves as the first-mile safety gate for clinical AI, yet existing approaches remain constrained by scalar, score-based metrics and fail to reflect the descriptive, human-like reasoning process central to expert evaluation. To address this gap, we introduce MedQ-Bench, a comprehensive benchmark that establishes a perception-reasoning paradigm for language-based evaluation of medical image quality with Multi-modal Large Language Models (MLLMs). MedQ-Bench defines two complementary tasks: (1) MedQ-Perception, which probes low-level perceptual capability via human-curated questions on fundamental visual attributes; and (2) MedQ-Reasoning, encompassing both no-reference and comparison reasoning tasks, aligning model evaluation with human-like reasoning on image quality. The benchmark spans five imaging modalities and over forty quality attributes, totaling 2,600 perceptual queries and 708 reasoning assessments, covering diverse image sources including authentic clinical acquisitions, images with simulated degradations via physics-based reconstructions, and AI-generated images. To evaluate reasoning ability, we propose a multi-dimensional judging protocol that assesses model outputs along four complementary axes. We further conduct rigorous human-AI alignment validation by comparing LLM-based judgement with radiologists. Our evaluation of 14 state-of-the-art MLLMs demonstrates that models exhibit preliminary but unstable perceptual and reasoning skills, with insufficient accuracy for reliable clinical use. These findings highlight the need for targeted optimization of MLLMs in medical IQA. We hope that MedQ-Bench will catalyze further exploration and unlock the untapped potential of MLLMs for medical image quality evaluation.

[124] Holistic Order Prediction in Natural Scenes

Pierre Musacchio,Hyunmin Lee,Jaesik Park

Main category: cs.CV

TL;DR: Proposes InstaFormer, a network that predicts the occlusion and depth orderings of all instances in a scene from an RGB image in a single forward pass.

Details

Motivation: Existing methods rely on expensive input formats (such as category labels and binary segmentation masks) and costly inference (a quadratic number of forward passes), making it hard to obtain geometric relations between instances efficiently. Method: InstaFormer relies on interactions between object queries and latent mask descriptors, which semantically represent the same objects while carrying complementary information, to predict holistic orderings end to end. Result: Comprehensive benchmarking and ablations confirm that the method predicts inter-instance occlusion and depth orderings effectively in a single forward pass. Conclusion: InstaFormer provides a holistic understanding of instance-wise geometric relations in visual scenes at lower cost, advancing dense spatial reasoning without complex inputs. Abstract: Even in controlled settings, understanding instance-wise geometries is a challenging task for a wide range of visual models. Although specialized systems exist, state-of-the-art methods rely on expensive input formats (category labels, binary segmentation masks) and inference costs (a quadratic number of forward passes). We mitigate these limitations by proposing InstaFormer, a network capable of holistic order prediction. That is, solely given an input RGB image, InstaFormer returns the full occlusion and depth orderings for all the instances in the scene in a single forward pass. At its core, InstaFormer relies on interactions between object queries and latent mask descriptors that semantically represent the same objects while carrying complementary information. We comprehensively benchmark and ablate our approach to highlight its effectiveness. Our code and models are open-source and available at this URL: https://github.com/SNU-VGILab/InstaOrder.

[125] PyramidStyler: Transformer-Based Neural Style Transfer with Pyramidal Positional Encoding and Reinforcement Learning

Raahul Krishna Durairaju,K. Saruladha

Main category: cs.CV

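TL;DR: PyramidStyler is a transformer-based neural style transfer framework that introduces Pyramidal Positional Encoding (PPE) and reinforcement learning for efficient, high-quality, real-time artistic image generation.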

Details

Motivation: Existing CNN and transformer models scale poorly to complex styles and high-resolution images and incur high computational cost, making real-time, high-quality style transfer difficult. Method: PyramidStyler uses Pyramidal Positional Encoding (PPE) for multi-scale feature modeling and incorporates reinforcement learning to dynamically optimize stylization, improving convergence speed and output quality. Result: Trained on COCO and WikiArt, the model reaches a content loss of 2.07 and a style loss of 0.86 after 4000 epochs with 1.39 s inference; adding reinforcement learning lowers the losses further to 2.03 and 0.75 with 1.40 s inference. Conclusion: PyramidStyler delivers efficient, scalable, high-quality style transfer that supports real-time use, with broad applications in media and design. Abstract: Neural Style Transfer (NST) has evolved from Gatys et al.'s (2015) CNN-based algorithm, enabling AI-driven artistic image synthesis. However, existing CNN and transformer-based models struggle to scale efficiently to complex styles and high-resolution inputs. We introduce PyramidStyler, a transformer framework with Pyramidal Positional Encoding (PPE): a hierarchical, multi-scale encoding that captures both local details and global context while reducing computational load. We further incorporate reinforcement learning to dynamically optimize stylization, accelerating convergence. Trained on Microsoft COCO and WikiArt, PyramidStyler reduces content loss by 62.6% (to 2.07) and style loss by 57.4% (to 0.86) after 4000 epochs--achieving 1.39 s inference--and yields further improvements (content 2.03; style 0.75) with minimal speed penalty (1.40 s) when using RL. These results demonstrate real-time, high-quality artistic rendering, with broad applications in media and design.
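
The figures reported in this abstract can be cross-checked with a short calculation. This is only a sketch: it assumes the quoted percentages are relative reductions from an unreported starting loss, so the baselines it derives are implied values and do not appear in the source.

```python
def implied_baseline(final_value, percent_reduction):
    """Starting value implied by a final value and a relative reduction in percent."""
    return final_value / (1.0 - percent_reduction / 100.0)

def relative_change(before, after):
    """Percent change from before to after (negative means a reduction)."""
    return (after - before) / before * 100.0

# Implied starting losses (assumption: reductions are relative to these).
content_baseline = implied_baseline(2.07, 62.6)
style_baseline = implied_baseline(0.86, 57.4)

# Effect of adding RL, computed from the reported numbers only.
rl_content_change = relative_change(2.07, 2.03)
rl_style_change = relative_change(0.86, 0.75)
rl_latency_change = relative_change(1.39, 1.40)

print(round(content_baseline, 2), round(style_baseline, 2))
print(round(rl_content_change, 1), round(rl_style_change, 1), round(rl_latency_change, 1))
```

Under that assumption the implied starting losses are roughly 5.53 (content) and 2.02 (style); adding RL changes content loss by about -1.9% and style loss by about -12.8%, at an inference-time cost of about +0.7%.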

[126] LoBE-GS: Load-Balanced and Efficient 3D Gaussian Splatting for Large-Scale Scene Reconstruction

Sheng-Hsiang Hung,Ting-Yu Yen,Wei-Fang Sun,Simon See,Shih-Hsuan Hung,Hung-Kuo Chu

Main category: cs.CV

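TL;DR: Proposes LoBE-GS, a load-balanced and efficient 3D Gaussian Splatting framework that substantially accelerates large-scale scene training through depth-aware partitioning, optimization-based load balancing, and lightweight techniques.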

Details

Motivation: Existing 3DGS methods face memory pressure and load imbalance on large, unbounded scenes, and their coarse-to-fine pipelines are inefficient. Method: Introduces a depth-aware partitioning method, an optimization-based strategy that balances visible Gaussians across blocks, and two lightweight techniques, visibility cropping and selective densification. Result: On large-scale urban and outdoor datasets, end-to-end training is up to 2× faster than state-of-the-art baselines while reconstruction quality is maintained. Conclusion: LoBE-GS addresses the scalability and efficiency bottlenecks of 3DGS on large-scale scenes, markedly speeding up training and supporting reconstruction of larger scenes. Abstract: 3D Gaussian Splatting (3DGS) has established itself as an efficient representation for real-time, high-fidelity 3D scene reconstruction. However, scaling 3DGS to large and unbounded scenes such as city blocks remains difficult. Existing divide-and-conquer methods alleviate memory pressure by partitioning the scene into blocks, but introduce new bottlenecks: (i) partitions suffer from severe load imbalance since uniform or heuristic splits do not reflect actual computational demands, and (ii) coarse-to-fine pipelines fail to exploit the coarse stage efficiently, often reloading the entire model and incurring high overhead. In this work, we introduce LoBE-GS, a novel Load-Balanced and Efficient 3D Gaussian Splatting framework, that re-engineers the large-scale 3DGS pipeline. LoBE-GS introduces a depth-aware partitioning method that reduces preprocessing from hours to minutes, an optimization-based strategy that balances visible Gaussians -- a strong proxy for computational load -- across blocks, and two lightweight techniques, visibility cropping and selective densification, to further reduce training cost. Evaluations on large-scale urban and outdoor datasets show that LoBE-GS consistently achieves up to 2× faster end-to-end training time than state-of-the-art baselines, while maintaining reconstruction quality and enabling scalability to scenes infeasible with vanilla 3DGS.

[127] Pack and Force Your Memory: Long-form and Consistent Video Generation

Xiaofei Wu,Guozhen Zhang,Zhiyong Xu,Yuan Zhou,Qinglin Lu,Xuming He

Main category: cs.CV

TL;DR: Proposes MemoryPack and Direct Forcing to improve temporal consistency and reduce error accumulation in long-form video generation.

Details

Motivation: Long-form video generation must capture long-range dependencies while limiting the error accumulation inherent in autoregressive decoding. Method: MemoryPack uses textual and image information as global guidance to jointly model short- and long-term dependencies; Direct Forcing uses a single-step approximation strategy to improve training-inference alignment. Result: Achieves better context consistency and lower error propagation in minute-level video generation, improving the practical usability of autoregressive video models. Conclusion: MemoryPack and Direct Forcing effectively improve the quality and reliability of long-form video generation. Abstract: Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. To address these challenges, we make two contributions. First, for dynamic context modeling, we propose MemoryPack, a learnable context-retrieval mechanism that leverages both textual and image information as global guidance to jointly model short- and long-term dependencies, achieving minute-level temporal consistency. This design scales gracefully with video length, preserves computational efficiency, and maintains linear complexity. Second, to mitigate error accumulation, we introduce Direct Forcing, an efficient single-step approximating strategy that improves training-inference alignment and thereby curtails error propagation during inference. Together, MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation, advancing the practical usability of autoregressive video models.

[128] Calibrating the Full Predictive Class Distribution of 3D Object Detectors for Autonomous Driving

Cornelius Schröder,Marius-Raphael Schlüter,Markus Lienkamp

Main category: cs.CV

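TL;DR: Studies confidence calibration for the classification task of 3D object detectors. It proposes two auxiliary regularizing loss terms and, combined with isotonic regression, achieves good calibration of both dominant and secondary class predictions on CenterPoint and PillarNet, but finds that DSVT-Pillar cannot be jointly calibrated for both with the same method.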

Details

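Motivation: Precise object detection and uncertainty estimation are critical to the safe operation of autonomous systems, yet existing 3D object detectors are poorly calibrated in classification confidence, and calibration of the full predictive confidence distribution over all classes (dominant and secondary) has received little attention. Method: Proposes two auxiliary regularizing loss terms, one that calibrates the dominant prediction and one that calibrates the full prediction vector, and evaluates a range of train-time and post-hoc calibration methods on CenterPoint, PillarNet, and DSVT-Pillar. Result: Combining the loss term for full class prediction calibration with isotonic regression gives the best calibration on CenterPoint and PillarNet for both dominant and secondary class predictions; the same combination cannot jointly calibrate the dominant and secondary predictions of DSVT-Pillar. Conclusion: Training objectives that calibrate the full prediction vector improve the confidence reliability of 3D object detectors, especially in multi-class settings, but different architectures may need different calibration strategies. Abstract: In autonomous systems, precise object detection and uncertainty estimation are critical for self-aware and safe operation. This work addresses confidence calibration for the classification task of 3D object detectors. We argue that it is necessary to regard the calibration of the full predictive confidence distribution over all classes and deduce a metric which captures the calibration of dominant and secondary class predictions. We propose two auxiliary regularizing loss terms which introduce either calibration of the dominant prediction or the full prediction vector as a training goal. We evaluate a range of post-hoc and train-time methods for CenterPoint, PillarNet and DSVT-Pillar and find that combining our loss term, which regularizes for calibration of the full class prediction, and isotonic regression leads to the best calibration of CenterPoint and PillarNet with respect to both dominant and secondary class predictions. We further find that DSVT-Pillar cannot be jointly calibrated for dominant and secondary predictions using the same method.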
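
For readers unfamiliar with calibration metrics, the sketch below computes the standard expected calibration error (ECE) for dominant (top-1) predictions. It is an illustration only: the abstract says the authors derive their own metric covering dominant and secondary predictions, which is not reproduced here, and the equal-width binning and function name are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: weighted mean gap between confidence and accuracy per bin.

    confidences: top-1 confidence for each detection, in [0, 1].
    correct: whether the top-1 class was right, one bool per detection.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(n_bins - 1, int(conf * n_bins))  # equal-width bins
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece

# Four detections at 0.9 confidence, three of them correct: gap of 0.15.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, True, True, False])
```

A post-hoc method such as isotonic regression would then be fitted on held-out (confidence, correctness) pairs to remap confidences so that this gap shrinks.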

[129] Leveraging Prior Knowledge of Diffusion Model for Person Search

Giyeol Kim,Sooyoung Yang,Jihyong Oh,Myungjoo Kang,Chanho Eom

Main category: cs.CV

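TL;DR: Proposes DiffPS, a framework that uses a pre-trained diffusion model to resolve the optimization conflict between detection and re-identification in person search; three specialized modules improve performance, reaching state of the art on CUHK-SYSU and PRW.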

Details

Motivation: Existing methods use ImageNet pre-trained backbones and share features between detection and re-identification, so they struggle to capture complex spatial context and fine-grained identity cues, and conflicting optimization objectives limit performance. Method: DiffPS leverages diffusion prior knowledge through three modules: a Diffusion-Guided Region Proposal Network (DGRPN) for better localization, a Multi-Scale Frequency Refinement Network (MSFRN) to mitigate shape bias, and a Semantic-Adaptive Feature Aggregation Network (SFAN) that exploits text-aligned diffusion features, decoupling feature learning for detection and re-identification. Result: Sets a new state of the art on the two mainstream person search datasets, CUHK-SYSU and PRW. Conclusion: DiffPS makes effective use of diffusion prior knowledge, resolves the optimization conflict between detection and re-identification, improves overall person search performance, and shows the potential of diffusion models in this area. Abstract: Person search aims to jointly perform person detection and re-identification by localizing and identifying a query person within a gallery of uncropped scene images. Existing methods predominantly utilize ImageNet pre-trained backbones, which may be suboptimal for capturing the complex spatial context and fine-grained identity cues necessary for person search. Moreover, they rely on a shared backbone feature for both person detection and re-identification, leading to suboptimal features due to conflicting optimization objectives. In this paper, we propose DiffPS (Diffusion Prior Knowledge for Person Search), a novel framework that leverages a pre-trained diffusion model while eliminating the optimization conflict between two sub-tasks. We analyze key properties of diffusion priors and propose three specialized modules: (i) Diffusion-Guided Region Proposal Network (DGRPN) for enhanced person localization, (ii) Multi-Scale Frequency Refinement Network (MSFRN) to mitigate shape bias, and (iii) Semantic-Adaptive Feature Aggregation Network (SFAN) to leverage text-aligned diffusion features. DiffPS sets a new state-of-the-art on CUHK-SYSU and PRW.

[130] Flow-Matching Guided Deep Unfolding for Hyperspectral Image Reconstruction

Yi Ai,Yuanhao Cai,Yulun Zhang,Xiaokang Yang

Main category: cs.CV

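TL;DR: Proposes the Flow-Matching-guided Unfolding network (FMU), a hyperspectral image reconstruction method that is, to the authors' knowledge, the first to bring flow matching into compressive-sensing reconstruction; it combines a generative prior with a deep unfolding structure and adds a mean velocity loss to enforce flow consistency, significantly improving reconstruction quality.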

Details

Motivation: Hyperspectral imaging is costly because of hardware limitations and the difficulty of reconstructing three-dimensional data from compressed measurements, and existing compressive sensing methods still suffer severe degradation when recovering fine spectral details. Method: Embeds the generative prior of flow matching in a deep unfolding network and introduces a mean velocity loss to enforce global consistency of the flow, combining the interpretability of optimization-based methods with the generative capacity of flow matching. Result: Extensive experiments on simulated and real datasets show that FMU significantly outperforms existing methods in reconstruction quality. Conclusion: By combining flow matching with a deep unfolding architecture, FMU improves the accuracy and robustness of hyperspectral image reconstruction and offers a new approach to efficient, high-quality reconstruction under compressive sensing. Abstract: Hyperspectral imaging (HSI) provides rich spatial-spectral information but remains costly to acquire due to hardware limitations and the difficulty of reconstructing three-dimensional data from compressed measurements. Although compressive sensing systems such as CASSI improve efficiency, accurate reconstruction is still challenged by severe degradation and loss of fine spectral details. We propose the Flow-Matching-guided Unfolding network (FMU), which, to our knowledge, is the first to integrate flow matching into HSI reconstruction by embedding its generative prior within a deep unfolding framework. To further strengthen the learned dynamics, we introduce a mean velocity loss that enforces global consistency of the flow, leading to a more robust and accurate reconstruction. This hybrid design leverages the interpretability of optimization-based methods and the generative capacity of flow matching. Extensive experiments on both simulated and real datasets show that FMU significantly outperforms existing approaches in reconstruction quality. Code and models will be available at https://github.com/YiAi03/FMU.

[131] Automated Defect Detection for Mass-Produced Electronic Components Based on YOLO Object Detection Models

Wei-Lung Mao,Chun-Chi Wang,Po-Heng Chou,Yen-Ting Liu

Main category: cs.CV

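TL;DR: Proposes an automated defect detection system for dual in-line package (DIP) components based on deep learning and ConSinGAN-generated data. Combining a YOLO model with a SCADA system, it addresses the shortage of defect samples in industrial inspection and achieves high accuracy (95.50%) with fast detection (285 ms).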

Details

Motivation: Conventional defect inspection of industrial components is time-consuming and labor-intensive, which burdens quality inspectors and makes quality control difficult, so an automated solution is needed. Method: Images are captured with a digital camera optical system; ConSinGAN generates defect data to offset the shortage of samples; four YOLO models (v3, v4, v7, and v9) are compared with and without the augmentation; and an automated inspection framework integrated with a SCADA system is built. Result: YOLOv7 with ConSinGAN outperforms the other YOLO versions, with 95.50% accuracy and a 285 ms detection time, and is far better than threshold-based approaches; the system extends readily to many defect types and data-scarce settings. Conclusion: The study achieves efficient automated defect detection for DIP components, uses a generative adversarial network to compensate for scarce data, and improves the practicality of deep learning models in industrial inspection. Abstract: Since the defect detection of conventional industry components is time-consuming and labor-intensive, it leads to a significant burden on quality inspection personnel and makes it difficult to manage product quality. In this paper, we propose an automated defect detection system for the dual in-line package (DIP) that is widely used in industry, using digital camera optics and a deep learning (DL)-based model. The two most common defect categories of DIP are examined: (1) surface defects, and (2) pin-leg defects. However, the lack of defective component images leads to a challenge for detection tasks. To solve this problem, the ConSinGAN is used to generate a suitable-sized dataset for training and testing. Four varieties of the YOLO model are investigated (v3, v4, v7, and v9), both in isolation and with the ConSinGAN augmentation. The proposed YOLOv7 with ConSinGAN is superior to the other YOLO versions, with an accuracy of 95.50% and a detection time of 285 ms, and is far superior to threshold-based approaches. In addition, the supervisory control and data acquisition (SCADA) system is developed, and the associated sensor architecture is described. The proposed automated defect detection can be easily established with numerous types of defects or insufficient defect data.

[132] Foundation Visual Encoders Are Secretly Few-Shot Anomaly Detectors

Guangyao Zhai,Yue Zhou,Xinyan Deng,Lars Heckler,Nassir Navab,Benjamin Busam

Main category: cs.CV

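TL;DR: Proposes FoundAD, a few-shot anomaly detection method that exploits the distribution of normal images learned by large-scale pre-trained visual encoders; a nonlinear projection operator maps features onto the natural image manifold, allowing anomalous regions in an image to be identified.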

Details

Motivation: Distinguishing normal from abnormal features accurately is hard with few samples, especially under category-agnostic conditions, so a method is needed that makes full use of the general image distribution knowledge in pre-trained models. Method: Based on the observed correlation between the amount of anomaly in an image and the difference in learned embeddings, a nonlinear projection operator is learned and applied to features from a foundation visual encoder to characterize and identify out-of-distribution (anomalous) regions. Result: Experiments show competitive performance on multi-class anomaly detection with substantially fewer parameters than prior methods, and compatibility with multiple foundation encoders (such as DINOv3). Conclusion: FoundAD offers a new perspective on few-shot anomaly detection with foundation model features, balancing performance against model complexity. Abstract: Few-shot anomaly detection streamlines and simplifies industrial safety inspection. However, limited samples make accurate differentiation between normal and abnormal features challenging, and even more so under category-agnostic conditions. Large-scale pre-training of foundation visual encoders has advanced many fields, as the enormous quantity of data helps to learn the general distribution of normal images. We observe that the anomaly amount in an image directly correlates with the difference in the learnt embeddings and utilize this to design a few-shot anomaly detector termed FoundAD. This is done by learning a nonlinear projection operator onto the natural image manifold. The simple operator acts as an effective tool for anomaly detection to characterize and identify out-of-distribution regions in an image. Extensive experiments show that our approach supports multi-class detection and achieves competitive performance while using substantially fewer parameters than prior methods. Backed up by evaluations with multiple foundation encoders, including the recent DINOv3, we believe this idea broadens the perspective on foundation features and advances the field of few-shot anomaly detection.

[133] ClustViT: Clustering-based Token Merging for Semantic Segmentation

Fabio Montello,Ronja Güldenring,Lazaros Nalpantidis

Main category: cs.CV

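TL;DR: Proposes ClustViT, which merges similar tokens with a trainable clustering module and restores detail with a regeneration module, substantially reducing computation while preserving semantic segmentation accuracy.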

Details

Motivation: The quadratic attention complexity of Vision Transformers limits their use on real robotic systems, and existing token merging methods perform poorly on dense prediction tasks such as semantic segmentation. Method: Adds a learnable Cluster module to a ViT backbone that merges tokens under the guidance of pseudo-clusters derived from segmentation masks, followed by a Regenerator module that restores fine details for downstream dense prediction heads. Result: Achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three datasets with comparable segmentation accuracy. Conclusion: ClustViT balances the efficiency and accuracy of Vision Transformers for semantic segmentation, making them more applicable to real-world robotic systems. Abstract: Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited due to their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, where we expand upon the Vision Transformer (ViT) backbone and address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network guided by pseudo-clusters from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three different datasets, with comparable segmentation accuracy. Our code and models will be made publicly available.
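
The merge-then-restore idea in the ClustViT entry can be sketched without any learned components. In the paper both modules are trainable; the version below is a hypothetical, non-learned stand-in that averages tokens per cluster and then copies each cluster token back to its original positions, with cluster assignments supplied by the caller.

```python
def merge_tokens(tokens, cluster_ids):
    """Average all tokens that share a cluster id; returns {cluster_id: token}."""
    sums, counts = {}, {}
    for token, cid in zip(tokens, cluster_ids):
        if cid not in sums:
            sums[cid] = [0.0] * len(token)
            counts[cid] = 0
        sums[cid] = [s + t for s, t in zip(sums[cid], token)]
        counts[cid] += 1
    return {cid: [s / counts[cid] for s in sums[cid]] for cid in sums}

def regenerate_tokens(merged, cluster_ids):
    """Restore the original sequence length by copying each cluster token back."""
    return [list(merged[cid]) for cid in cluster_ids]

def attention_cost_ratio(n_before, n_after):
    """Ratio of pairwise attention cost, which grows with the square of token count."""
    return (n_before ** 2) / (n_after ** 2)

tokens = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
cluster_ids = [0, 0, 1]  # hypothetical assignments
merged = merge_tokens(tokens, cluster_ids)
restored = regenerate_tokens(merged, cluster_ids)
```

Because self-attention cost scales quadratically with token count, merging three tokens into two already cuts the pairwise cost by a factor of 2.25 in this toy case, which is the motivation for merging before the expensive layers and restoring resolution only for the dense prediction head.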

Yongyi Su,Haojie Zhang,Shijie Li,Nanqing Liu,Jingyi Liao,Junyi Pan,Yuan Liu,Xiaofen Xing,Chong Sun,Chen Li,Nancy F. Chen,Shuicheng Yan,Xulei Yang,Xun Xu

Main category: cs.CV

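TL;DR: Proposes Patch-as-Decodable Token (PaDT), a unified paradigm that lets multimodal large language models (MLLMs) directly generate both text and diverse visual outputs. By introducing Visual Reference Tokens (VRTs) and a lightweight decoder, it achieves state-of-the-art performance on detection, segmentation, and grounding tasks.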

Details

Motivation: Existing MLLMs usually rely on indirect representations for vision tasks (such as generating coordinates as text), which limits performance and makes dense prediction tasks such as segmentation hard to support, so a unified framework that generates visual outputs directly is needed. Method: PaDT uses Visual Reference Tokens (VRTs) derived from image patch embeddings, interleaves them with the LLM's textual output tokens, and converts the LLM outputs into detection, segmentation, and grounding results with a lightweight decoder; VRTs are processed independently at each forward pass and the embedding table is expanded dynamically to improve performance. Result: Experiments on four visual perception and understanding tasks show that PaDT consistently reaches state-of-the-art performance, even compared with much larger MLLMs. Conclusion: PaDT gives MLLMs a unified and efficient way to handle both text generation and dense visual prediction, markedly improving visual understanding in multimodal models. Abstract: Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches for vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and prevents dense prediction tasks like segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images and interleaved seamlessly with LLM's output textual tokens. A lightweight decoder then transforms LLM's outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks suggest that PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLM models. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.

[135] TriAlignXA: An Explainable Trilemma Alignment Framework for Trustworthy Agri-product Grading

Jianfei Xie,Ziyang Li

Main category: cs.CV

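TL;DR: Proposes TriAlignXA, an explainable AI framework that addresses the trust problem in online fruit and vegetable e-commerce by building a "Trust Pyramid" model and a "Triangular Trust Index" (TTI). It balances the "impossible triangle" of biological characteristics, timeliness, and economic viability in agricultural product grading, and uses a pre-mapping mechanism to make quality information more transparent.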

Details

Motivation: Online transactions cannot provide direct sensory perception of fruit and vegetable quality, so consumer trust is low and fresh-produce e-commerce is held back. Traditional absolute grading standards cope poorly with the inherent variability and perishability of agricultural products, and new theoretical models and technical means are needed to rebuild consumer trust. Method: Builds a "Trust Pyramid" model through "dual-source verification"; proposes the "Triangular Trust Index" (TTI) to quantify the multi-objective trade-off in grading; designs the explainable AI framework TriAlignXA, comprising a Bio-Adaptive Engine, a Timeliness Optimization Engine, and an Economic Optimization Engine; and introduces a pre-mapping mechanism that encodes process data into QR codes to increase transparency. Result: Experiments show that the proposed model clearly outperforms baseline models on grading tasks; empirical and theoretical analysis confirms that the framework can reconcile the "impossible triangle"; and the QR code mechanism improves consumers' perceived trust in quality information. Conclusion: Through multi-objective optimization and transparent decision-making, TriAlignXA connects algorithmic decisions with consumer trust and offers a complete path, from theory to practice, toward a trustworthy online agricultural product ecosystem. Abstract: The 'trust deficit' in online fruit and vegetable e-commerce stems from the inability of digital transactions to provide direct sensory perception of product quality. This paper constructs a 'Trust Pyramid' model through 'dual-source verification' of consumer trust. Experiments confirm that quality is the cornerstone of trust. The study reveals an 'impossible triangle' in agricultural product grading, comprising biological characteristics, timeliness, and economic viability, highlighting the limitations of traditional absolute grading standards. To quantitatively assess this trade-off, we propose the 'Triangular Trust Index' (TTI). We redefine the role of algorithms from 'decision-makers' to 'providers of transparent decision-making bases', designing the explainable AI framework--TriAlignXA. This framework supports trustworthy online transactions within agricultural constraints through multi-objective optimization. Its core relies on three engines: the Bio-Adaptive Engine for granular quality description; the Timeliness Optimization Engine for processing efficiency; and the Economic Optimization Engine for cost control. Additionally, the "Pre-Mapping Mechanism" encodes process data into QR codes, transparently conveying quality information. Experiments on grading tasks demonstrate significantly higher accuracy than baseline models. Empirical evidence and theoretical analysis verify the framework's balancing capability in addressing the "impossible triangle". This research provides comprehensive support--from theory to practice--for building a trustworthy online produce ecosystem, establishing a critical pathway from algorithmic decision-making to consumer trust.

[136] 4DGS-Craft: Consistent and Interactive 4D Gaussian Splatting Editing

Lei Liu,Can Wang,Zhenghao Chen,Dong Xu

Main category: cs.CV

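TL;DR: Proposes 4DGS-Craft, a consistent and interactive 4D Gaussian Splatting editing framework. A 4D-aware InstructPix2Pix model, a multi-view grid module, and a Gaussian selection mechanism provide consistency across views, time, and non-edited regions, and an LLM-based module interprets complex user instructions.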

Details

Motivation: Existing 4D Gaussian Splatting editing methods fall short on consistency across views, time, and non-edited regions, and handle complex text instructions poorly, so a more consistent and interactive editing framework is needed. Method: Introduces a 4D-aware InstructPix2Pix model that uses 4D VGGT geometry features and a multi-view grid module to improve consistency; designs an LLM-based user intent understanding module that decomposes complex instructions into a sequence of atomic operations; and uses a Gaussian selection mechanism to optimize only the Gaussians inside the edited regions. Result: Achieves more consistent 4D scene editing across views, time, and non-edited regions, and interprets and executes complex text instructions accurately, improving controllability and consistency. Conclusion: 4DGS-Craft addresses the consistency and interactivity problems of 4D Gaussian Splatting editing and supports precise editing under complex instructions. Abstract: Recent advances in 4D Gaussian Splatting (4DGS) editing still face challenges with view, temporal, and non-editing region consistency, as well as with handling complex text instructions. To address these issues, we propose 4DGS-Craft, a consistent and interactive 4DGS editing framework. We first introduce a 4D-aware InstructPix2Pix model to ensure both view and temporal consistency. This model incorporates 4D VGGT geometry features extracted from the initial scene, enabling it to capture underlying 4D geometric structures during editing. We further enhance this model with a multi-view grid module that enforces consistency by iteratively refining multi-view input images while jointly optimizing the underlying 4D scene. Furthermore, we preserve the consistency of non-edited regions through a novel Gaussian selection mechanism, which identifies and optimizes only the Gaussians within the edited regions. Beyond consistency, facilitating user interaction is also crucial for effective 4DGS editing. Therefore, we design an LLM-based module for user intent understanding. This module employs a user instruction template to define atomic editing operations and leverages an LLM for reasoning. As a result, our framework can interpret user intent and decompose complex instructions into a logical sequence of atomic operations, enabling it to handle intricate user commands and further enhance editing performance. Compared to related works, our approach enables more consistent and controllable 4D scene editing. Our code will be made available upon acceptance.

[137] Pure-Pass: Fine-Grained, Adaptive Masking for Dynamic Token-Mixing Routing in Lightweight Image Super-Resolution

Junyu Wu,Jie Tang,Jie Liu,Gangshan Wu

Main category: cs.CV

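TL;DR: Proposes Pure-Pass (PP), a pixel-level masking mechanism that reduces computation in image super-resolution. It classifies pixels with fixed color center points to identify "pure pixels" that need no complex computation, giving fine-grained, spatially flexible, and adaptive computation routing that improves reconstruction quality and parameter efficiency at low computational cost.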

Details

Motivation: Existing lightweight image super-resolution methods such as CAMixer suffer from poor adaptability, coarse-grained masking, and spatial inflexibility, making it hard to combine efficient computation with high-quality reconstruction, so a finer and more flexible way of allocating computation is needed. Method: The Pure-Pass (PP) pixel-level masking mechanism classifies pixels using fixed color center points, identifies and skips pure pixels with simple content, performs expensive computation only where needed, and is integrated into the ATD-light model. Result: When saving a similar amount of computation, PP-ATD-light outperforms CAMixer-ATD-light in reconstruction quality and parameter efficiency. Conclusion: Pure-Pass provides fine-grained, spatially flexible, and adaptive allocation of computation, improving the performance and efficiency of lightweight super-resolution models. Abstract: Image Super-Resolution (SR) aims to reconstruct high-resolution images from low-resolution counterparts, but the computational complexity of deep learning-based methods often hinders practical deployment. CAMixer is the pioneering work to integrate the advantages of existing lightweight SR methods and proposes a content-aware mixer to route token mixers of varied complexities according to the difficulty of content recovery. However, several limitations remain, such as poor adaptability, coarse-grained masking and spatial inflexibility, among others. We propose Pure-Pass (PP), a pixel-level masking mechanism that identifies pure pixels and exempts them from expensive computations. PP utilizes fixed color center points to classify pixels into distinct categories, enabling fine-grained, spatially flexible masking while maintaining adaptive flexibility. Integrated into the state-of-the-art ATD-light model, PP-ATD-light achieves superior SR performance with minimal overhead, outperforming CAMixer-ATD-light in reconstruction quality and parameter efficiency when saving a similar amount of computation.
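
A rough sketch of pixel-level masking with fixed color centers follows. The abstract does not define when a pixel counts as "pure", so the rule used here is an assumption made for illustration: a pixel is pure when every pixel in its local window is assigned to the same color center. The grayscale input, the centers, and the function names are likewise hypothetical.

```python
def nearest_center(value, centers):
    """Index of the fixed color center closest to a grayscale value."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))

def pure_pixel_mask(image, centers, radius=1):
    """Return a boolean mask that is True where a pixel is 'pure'.

    Assumed rule: a pixel is pure if all pixels within `radius`
    (clipped at the borders) share its color-center label.
    """
    height, width = len(image), len(image[0])
    labels = [[nearest_center(v, centers) for v in row] for row in image]
    mask = [[True] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            for rr in range(max(0, r - radius), min(height, r + radius + 1)):
                for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                    if labels[rr][cc] != labels[r][c]:
                        mask[r][c] = False
    return mask

# Flat dark region on the left, bright edge on the right.
image = [[0, 0, 0, 255],
         [0, 0, 0, 255],
         [0, 0, 0, 255]]
mask = pure_pixel_mask(image, centers=[0, 128, 255])
```

In this toy image the two left columns are flat and would skip the expensive token mixer, while the pixels next to the edge would still be routed through it.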

[138] Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using GPT-4o: Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework

Nanaka Hosokawa,Ryo Takahashi,Tomoya Kitano,Yukihiro Iida,Chisako Muramatsu,Tatsuro Hayashi,Yuta Seino,Xiangrong Zhou,Takeshi Hara,Akitoshi Katsumata,Hiroshi Fujita

Main category: cs.CV

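TL;DR: Uses the multimodal capabilities of GPT-4o and proposes a Self-correction Loop with Structured Output (SLSO) framework to automatically generate jaw cyst findings on dental panoramic radiographs, improving accuracy over the conventional Chain-of-Thought method on several evaluation items.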

Details

Motivation: 提高AI在医学影像报告生成中的准确性和可靠性，减少幻觉和错误，特别是在牙科囊肿的自动发现描述中。 Method: 构建了一个10步的SLSO框架，包括图像输入、结构化数据生成、牙齿编号提取与一致性检查、不一致时迭代再生，并结合GPT-4o的多模态能力生成最终报告，与传统的Chain-of-Thought方法进行对比实验。 Result: SLSO框架在牙齿编号、牙齿移位和牙根吸收方面分别提升了66.9%、33.3%和28.6%的准确率；最多五次迭代后可实现一致的结构化输出，有效抑制了幻觉并增强了负性发现的描述，但在跨多牙的广泛病变识别上仍有局限。 Conclusion: SLSO框架显著提升了自动影像报告的准确性和结构一致性，尽管数据集较小未达统计显著性，但仍显示出潜力，未来需进一步优化以实现临床实用化。 Abstract: In this study, we utilized the multimodal capabilities of OpenAI GPT-4o to automatically generate jaw cyst findings on dental panoramic radiographs. To improve accuracy, we constructed a Self-correction Loop with Structured Output (SLSO) framework and verified its effectiveness. A 10-step process was implemented for 22 cases of jaw cysts, including image input and analysis, structured data generation, tooth number extraction and consistency checking, iterative regeneration when inconsistencies were detected, and finding generation with subsequent restructuring and consistency verification. A comparative experiment was conducted using the conventional Chain-of-Thought (CoT) method across seven evaluation items: transparency, internal structure, borders, root resorption, tooth movement, relationships with other structures, and tooth number. The results showed that the proposed SLSO framework improved output accuracy for many items, with 66.9%, 33.3%, and 28.6% improvement rates for tooth number, tooth movement, and root resorption, respectively. In the successful cases, a consistently structured output was achieved after up to five regenerations. Although statistical significance was not reached because of the small size of the dataset, the overall SLSO framework enforced negative finding descriptions, suppressed hallucinations, and improved tooth number identification accuracy. However, the accurate identification of extensive lesions spanning multiple teeth is limited. 
Nevertheless, further refinement is required to enhance overall performance and move toward a practical finding generation system.
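
The regenerate-until-consistent idea in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_structured`, `extract_tooth_numbers`, and the dictionary schema are hypothetical stand-ins for the GPT-4o calls, and only the cap of five regenerations comes from the abstract.

```python
from typing import Callable, Dict, List, Optional, Tuple

MAX_REGENERATIONS = 5  # the abstract reports consistency after up to five regenerations


def extract_tooth_numbers(text: str) -> List[int]:
    """Hypothetical extractor: collect the integer tokens in a free-text finding."""
    cleaned = text.replace(",", " ").replace(".", " ")
    return sorted({int(tok) for tok in cleaned.split() if tok.isdigit()})


def self_correction_loop(
    generate_structured: Callable[[int], Dict],
) -> Tuple[Optional[Dict], int]:
    """Call the generator until its structured tooth numbers match its own text.

    `generate_structured(attempt)` stands in for a multimodal model call and is
    assumed to return {"tooth_numbers": [...], "finding": "..."}.
    Returns (output, regenerations_used); output is None if every attempt
    was inconsistent.
    """
    for attempt in range(MAX_REGENERATIONS + 1):  # first try plus regenerations
        output = generate_structured(attempt)
        structured = sorted(set(output["tooth_numbers"]))
        if structured == extract_tooth_numbers(output["finding"]):
            return output, attempt
    return None, MAX_REGENERATIONS
```

The consistency check here only compares tooth numbers; the 10-step process described in the abstract also restructures and re-verifies the generated finding, which this sketch leaves out.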

[139] LiLa-Net: Lightweight Latent LiDAR Autoencoder for 3D Point Cloud Reconstruction

Mario Resino,Borja Pérez,Jaime Godoy,Abdulla Al-Kaff,Fernando García

Main category: cs.CV

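TL;DR: Proposes LiLa-Net, a 3D autoencoder architecture that efficiently encodes features of real traffic environments using only LiDAR point clouds; by simplifying the skip connections and reducing the number of encoder layers, it achieves high-quality reconstruction while maintaining performance, and it generalizes well.

Details

Motivation: To efficiently extract useful features from LiDAR point clouds of real traffic environments without relying on extensive resources, improving reconstruction performance and generalization. Method: Designs LiLa-Net, a lightweight skip-connection-based 3D autoencoder that reduces the number of encoder layers and simplifies the skip connections, balancing the contributions of the latent encoding and the skip information. Result: The model efficiently reconstructs the original point clouds in real traffic environments with high quality and low resource usage, and it successfully reconstructs objects unrelated to traffic, showing good generalization. Conclusion: LiLa-Net reduces network complexity while retaining efficient feature extraction and high-quality reconstruction, making it a lightweight point cloud encoding scheme suited to real traffic scenes. Abstract: This work proposes a 3D autoencoder architecture, named LiLa-Net, which encodes efficient features from real traffic environments using only LiDAR point clouds. For this purpose, we use a real semi-autonomous vehicle equipped with a Velodyne LiDAR. The system leverages the concept of skip connections to improve performance without the extensive resources required by state-of-the-art architectures. Key changes include reducing the number of encoder layers and simplifying the skip connections, while still producing an efficient and representative latent space that allows accurate reconstruction of the original point cloud. Furthermore, an effective balance has been achieved between the information carried by the skip connections and the latent encoding, leading to improved reconstruction quality without compromising performance. Finally, the model demonstrates strong generalization capabilities, successfully reconstructing objects unrelated to the original traffic environment.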

[140] kabr-tools: Automated Framework for Multi-Species Behavioral Monitoring

Jenna Kline,Maksim Kholiavchenko,Samuel Stevens,Nina van Tiel,Alison Zhong,Namrata Banerji,Alec Sheets,Sowbaranika Balasubramaniam,Isla Duporge,Matthew Thompson,Elizabeth Campolongo,Jackson Miliko,Neil Rosser,Tanya Berger-Wolf,Charles V. Stewart,Daniel I. Rubenstein

Main category: cs.CV

TL;DR: This paper presents kabr-tools, an open-source toolkit that combines drone video with machine learning to automate multi-species wildlife behavioral monitoring, markedly improving the accuracy and scale of behavior recognition.

Details

Motivation: Traditional field observation is limited in scope, time, and labor and struggles to capture complex behavioral patterns, so a scalable technical approach is needed to strengthen behavioral ecology research. Method: Develops kabr-tools, an analysis framework that integrates drone video with machine learning, comprising object detection, tracking, and behavior classification systems, to extract metrics such as time budgets, behavioral transitions, social interactions, habitat associations, and group dynamics. Result: Compared with ground-based observation, the drone-based approach reduced visibility loss by 15% and captured more transitions with higher accuracy and continuity; three case studies covering 969 behavioral sequences validated the toolkit; vigilance in both Grevy's and Plains zebras decreases with herd size, but habitat matters only for the latter, and both species show strong behavioral inertia and spatial segregation in mixed-species herds. Conclusion: kabr-tools enables automated behavioral monitoring at scale, providing a powerful tool for ecosystem-wide studies, biodiversity conservation, and ecological monitoring. Abstract: A comprehensive understanding of animal behavior ecology depends on scalable approaches to quantify and interpret complex, multidimensional behavioral patterns. Traditional field observations are often limited in scope, time-consuming, and labor-intensive, hindering the assessment of behavioral responses across landscapes. To address this, we present kabr-tools (Kenyan Animal Behavior Recognition Tools), an open-source package for automated multi-species behavioral monitoring. This framework integrates drone-based video with machine learning systems to extract behavioral, social, and spatial metrics from wildlife footage. Our pipeline leverages object detection, tracking, and behavioral classification systems to generate key metrics, including time budgets, behavioral transitions, social interactions, habitat associations, and group composition dynamics. Compared to ground-based methods, drone-based observations significantly improved behavioral granularity, reducing visibility loss by 15% and capturing more transitions with higher accuracy and continuity. We validate kabr-tools through three case studies, analyzing 969 behavioral sequences, surpassing the capacity of traditional methods for data capture and annotation. We found that, like Plains zebras, vigilance in Grevy's zebras decreases with herd size, but, unlike Plains zebras, habitat has a negligible impact. Plains and Grevy's zebras exhibit strong behavioral inertia, with rare transitions to alert behaviors, and we observed spatial segregation among Grevy's zebras, Plains zebras, and giraffes in mixed-species herds. By enabling automated behavioral monitoring at scale, kabr-tools offers a powerful tool for ecosystem-wide studies, advancing conservation, biodiversity research, and ecological monitoring.
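
Two of the metrics named above, time budgets and behavioral transitions, can be illustrated on a per-frame label sequence. The sketch below is a generic illustration and does not use the kabr-tools API; the function names and the example labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def time_budget(labels: List[str]) -> Dict[str, float]:
    """Fraction of frames spent in each behavior."""
    counts = Counter(labels)
    total = len(labels)
    return {behavior: n / total for behavior, n in counts.items()}


def transition_counts(labels: List[str]) -> Dict[Tuple[str, str], int]:
    """Count changes between consecutive frames whose behaviors differ."""
    pairs = zip(labels, labels[1:])
    return dict(Counter((a, b) for a, b in pairs if a != b))


# Hypothetical per-frame behavior labels for one tracked animal.
track = ["graze", "graze", "graze", "walk", "walk", "alert", "graze", "graze"]
budget = time_budget(track)
transitions = transition_counts(track)
```

Long runs of the same label with few entries in `transitions` would correspond to the "behavioral inertia" the abstract describes.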

[141] GaussianMorphing: Mesh-Guided 3D Gaussians for Semantic-Aware Object Morphing

Mengtian Li,Yunshu Bai,Yimin Chu,Yijun Shen,Zhongmei Li,Weifeng Ge,Zhifeng Xie,Chaofeng Chen

Main category: cs.CV

TL;DR: Proposes GaussianMorphing, a new framework for semantic-aware 3D shape and texture morphing from multi-view images. The method uses mesh-guided 3D Gaussian Splatting for high-fidelity geometry and appearance modeling, and a unified deformation strategy ensures geometric consistency and texture fidelity.

Details

Motivation: Existing methods rely on point clouds or require pre-defined homeomorphic mappings for untextured data and lack semantic consistency, so a new method is needed that requires no labeled data and preserves geometric and texture consistency. Method: Models geometry and appearance with mesh-guided 3D Gaussian Splatting (3DGS), anchoring 3D Gaussians to reconstructed mesh patches, and builds the deformation framework by combining topology-aware constraints with unsupervised semantic correspondence based on physically plausible trajectories. Result: On the proposed TexMorph benchmark it substantially outperforms prior 2D/3D methods, reducing color consistency error ($\Delta E$) by 22.2% and EI by 26.2%. Conclusion: GaussianMorphing achieves high-quality, semantically consistent 3D morphing without labeled data, with superior geometric consistency, texture fidelity, and structural integrity. Abstract: We introduce GaussianMorphing, a novel framework for semantic-aware 3D shape and texture morphing from multi-view images. Previous approaches usually rely on point clouds or require pre-defined homeomorphic mappings for untextured data. Our method overcomes these limitations by leveraging mesh-guided 3D Gaussian Splatting (3DGS) for high-fidelity geometry and appearance modeling. The core of our framework is a unified deformation strategy that anchors 3D Gaussians to reconstructed mesh patches, ensuring geometrically consistent transformations while preserving texture fidelity through topology-aware constraints. In parallel, our framework establishes unsupervised semantic correspondence by using the mesh topology as a geometric prior and maintains structural integrity via physically plausible point trajectories. This integrated approach preserves both local detail and global semantic coherence throughout the morphing process without requiring labeled data. On our proposed TexMorph benchmark, GaussianMorphing substantially outperforms prior 2D/3D methods, reducing color consistency error ($\Delta E$) by 22.2% and EI by 26.2%. Project page: https://baiyunshu.github.io/GAUSSIANMORPHING.github.io/

[142] Zero-shot Human Pose Estimation using Diffusion-based Inverse solvers

Sahil Bhandary Karnoor,Romit Roy Choudhury

Main category: cs.CV

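TL;DR: This paper proposes InPose, a zero-shot, generalizable pose estimation method that conditions a pre-trained diffusion model only on rotational measurements from body sensors and builds a likelihood term from the location measurements to generate the most likely pose sequence.

Details

Motivation: Existing pose estimation methods based on conditional diffusion models generalize poorly across users, mainly because location measurements depend heavily on the user's body size and are hard to adapt to different users. Method: Formulates pose estimation as an inverse problem: a pre-trained diffusion model conditioned only on rotational measurements provides the prior, which is then guided by a likelihood term built from the location measurements, enabling pose estimation without user-specific training. Result: InPose achieves zero-shot generalization across users and accurately estimates full-body pose even with only a few on-body sensors. Conclusion: The method addresses the cross-user generalization problem of earlier approaches and shows strong performance and application potential with sparse sensor input. Abstract: Pose estimation refers to tracking a human's full body posture, including their head, torso, arms, and legs. The problem is challenging in practical settings where the number of body sensors is limited. Past work has shown promising results using conditional diffusion models, where the pose prediction is conditioned on both location and rotation measurements from the sensors. Unfortunately, nearly all these approaches generalize poorly across users, primarily because location measurements are highly influenced by the body size of the user. In this paper, we formulate pose estimation as an inverse problem and design an algorithm capable of zero-shot generalization. Our idea utilizes a pre-trained diffusion model and conditions it on rotational measurements alone; the priors from this model are then guided by a likelihood term, derived from the measured locations. Thus, given any user, our proposed InPose method generatively estimates the highly likely sequence of poses that best explains the sparse on-body measurements.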

[143] VGDM: Vision-Guided Diffusion Model for Brain Tumor Detection and Segmentation

Arman Behnam

Main category: cs.CV

TL;DR: Proposes VGDM, a vision-transformer-based diffusion model for brain tumor detection and segmentation that combines global contextual reasoning with iterative denoising to achieve higher segmentation accuracy and better recovery of boundary detail in MRI images.

Details

Motivation: Convolutional models such as U-Net have limited capacity to capture long-range dependencies and struggle to segment complex tumor structures accurately, so a stronger model is needed to improve brain tumor segmentation. Method: Proposes the VGDM framework, which embeds a vision transformer at the core of the diffusion model, uses the transformer to model global spatial relationships throughout the diffusion process, and refines the segmentation through iterative denoising to segment MRI volumes accurately. Result: Experiments on brain tumor MRI datasets show improvements over conventional methods in both Dice similarity and Hausdorff distance, with better segmentation accuracy and boundary quality. Conclusion: By combining the global modeling capability of transformers with the fine-grained refinement of diffusion models, VGDM offers a more robust and scalable approach to brain tumor segmentation. Abstract: Accurate detection and segmentation of brain tumors from magnetic resonance imaging (MRI) are essential for diagnosis, treatment planning, and clinical monitoring. While convolutional architectures such as U-Net have long been the backbone of medical image segmentation, their limited capacity to capture long-range dependencies constrains performance on complex tumor structures. Recent advances in diffusion models have demonstrated strong potential for generating high-fidelity medical images and refining segmentation boundaries. In this work, we propose VGDM (Vision-Guided Diffusion Model for Brain Tumor Detection and Segmentation), a transformer-driven diffusion framework for brain tumor detection and segmentation. By embedding a vision transformer at the core of the diffusion process, the model leverages global contextual reasoning together with iterative denoising to enhance both volumetric accuracy and boundary precision. The transformer backbone enables more effective modeling of spatial relationships across entire MRI volumes, while diffusion refinement mitigates voxel-level errors and recovers fine-grained tumor details. This hybrid design provides a pathway toward improved robustness and scalability in neuro-oncology, moving beyond conventional U-Net baselines. Experimental validation on MRI brain tumor datasets demonstrates consistent gains in Dice similarity and Hausdorff distance, underscoring the potential of transformer-guided diffusion models to advance the state of the art in tumor segmentation.
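
The two evaluation metrics named in this abstract can be written out for small 2D binary masks. The sketch below is an illustration only: it uses plain Python lists, computes the Hausdorff distance over all foreground pixels (evaluation toolkits often use boundary pixels or a 95th-percentile variant instead), and treats two empty masks as a perfect Dice match by assumption.

```python
import math
from typing import List, Set, Tuple

Pixel = Tuple[int, int]
Mask = List[List[int]]


def foreground(mask: Mask) -> Set[Pixel]:
    """Coordinates of all non-zero pixels."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}


def dice(pred: Mask, target: Mask) -> float:
    """Dice similarity: 2|P ∩ T| / (|P| + |T|)."""
    p, t = foreground(pred), foreground(target)
    if not p and not t:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(p & t) / (len(p) + len(t))


def hausdorff(pred: Mask, target: Mask) -> float:
    """Symmetric Hausdorff distance between the two foreground pixel sets."""
    p, t = foreground(pred), foreground(target)
    if not p or not t:
        raise ValueError("Hausdorff distance is undefined for an empty mask")

    def directed(a: Set[Pixel], b: Set[Pixel]) -> float:
        return max(min(math.dist(x, y) for y in b) for x in a)

    return max(directed(p, t), directed(t, p))


pred_mask = [[1, 1, 0], [0, 0, 0]]
true_mask = [[1, 0, 0], [0, 0, 0]]
dice_score = dice(pred_mask, true_mask)            # 2*1 / (2+1)
hausdorff_dist = hausdorff(pred_mask, true_mask)   # one extra pixel, 1 unit away
```

Higher Dice and lower Hausdorff distance are better, so "gains" in the abstract presumably mean an increase in the former and a decrease in the latter.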

[144] Mapping Historic Urban Footprints in France: Balancing Quality, Scalability and AI Techniques

Walid Rabehi,Marion Le Texier,Rémi Lemoy

Main category: cs.CV

TL;DR: This study proposes a dual-pass U-Net deep learning approach that extracts nationwide urban areas in France from the Scan Histo historical maps (1925-1950), producing the first open-access national-scale urban footprint dataset for this period.

Details

Motivation: Quantitative analysis of historical urban expansion in France before the 1970s is limited by the lack of nationwide digital urban footprint data. Method: Uses a dual-pass U-Net: the first pass produces a preliminary result and identifies areas of confusion (such as text and roads) to guide data augmentation; the second pass uses the refined dataset and the binarized output of the first pass to reduce radiometric noise and false positives; 941 high-resolution tiles are processed on a high-performance computing cluster. Result: Produces an urban footprint mosaic covering all of metropolitan France with an overall accuracy of 73%, capturing diverse urban patterns and overcoming common artifacts such as labels and contour lines. Conclusion: The method offers a scalable solution for extracting information from historical maps, and the released code, training data, and nationwide urban raster will support research on long-term urbanization dynamics. Abstract: Quantitative analysis of historical urban sprawl in France before the 1970s is hindered by the lack of nationwide digital urban footprint data. This study bridges this gap by developing a scalable deep learning pipeline to extract urban areas from the Scan Histo historical map series (1925-1950), which produces the first open-access, national-scale urban footprint dataset for this pivotal period. Our key innovation is a dual-pass U-Net approach designed to handle the high radiometric and stylistic complexity of historical maps. The first pass, trained on an initial dataset, generates a preliminary map that identifies areas of confusion, such as text and roads, to guide targeted data augmentation. The second pass uses a refined dataset and the binarized output of the first model to minimize radiometric noise, which significantly reduces false positives. Deployed on a high-performance computing cluster, our method processes 941 high-resolution tiles covering the entirety of metropolitan France. The final mosaic achieves an overall accuracy of 73%, effectively capturing diverse urban patterns while overcoming common artifacts like labels and contour lines. We openly release the code, training datasets, and the resulting nationwide urban raster to support future research in long-term urbanization dynamics.

[145] When Tracking Fails: Analyzing Failure Modes of SAM2 for Point-Based Tracking in Surgical Videos

Woowon Jang,Jiwon Im,Juseung Choi,Niki Rashidian,Wesley De Neve,Utku Ozbulak

Main category: cs.CV

TL;DR: This paper systematically analyzes the failure modes of point-based tracking in laparoscopic cholecystectomy videos, compares it with segmentation mask initialization, and offers recommendations for improving tracking performance in surgical video analysis.

Details

Motivation: To understand the reliability and failure cases of point-based tracking in complex surgical environments, particularly when only minimal user input is used. Method: Compares point-based tracking with segmentation mask initialization on three surgical targets (the gallbladder, the grasper, and the L-hook electrocautery) and analyzes the failure modes. Result: Point-based tracking performs well on surgical tools but worse on anatomical targets, where tissue similarity and ambiguous boundaries cause failures. Conclusion: Provides concrete recommendations for selecting and placing tracking points to improve performance and stresses the importance of choosing a tracking approach suited to the target type. Abstract: Video object segmentation (VOS) models such as SAM2 offer promising zero-shot tracking capabilities for surgical videos using minimal user input. Among the available input types, point-based tracking offers an efficient and low-cost alternative, yet its reliability and failure cases in complex surgical environments are not well understood. In this work, we systematically analyze the failure modes of point-based tracking in laparoscopic cholecystectomy videos. Focusing on three surgical targets, the gallbladder, grasper, and L-hook electrocautery, we compare the performance of point-based tracking with segmentation mask initialization. Our results show that point-based tracking is competitive for surgical tools but consistently underperforms for anatomical targets, where tissue similarity and ambiguous boundaries lead to failure. Through qualitative analysis, we reveal key factors influencing tracking outcomes and provide several actionable recommendations for selecting and placing tracking points to improve performance in surgical video analysis.

[146] FRIEREN: Federated Learning with Vision-Language Regularization for Segmentation

Ding-Ruei Shen

Main category: cs.CV

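TL;DR: This paper introduces a new task, FFREEDG, in which federated semantic segmentation is performed with vision-language models when the source data cannot be accessed and clients hold only unlabeled data; to solve it, the FRIEREN framework combines a vision-language decoder guided by CLIP text embeddings with a weak-to-strong consistency learning strategy and performs well in cross-domain settings.

Details

Motivation: Existing federated learning methods usually assume that clients have labeled data or fail to exploit vision foundation models, so they cope poorly with real-world domain shift, especially when no labels are available. Method: Proposes the FRIEREN framework, which uses a pre-trained vision-language foundation model, guides a vision-language decoder with CLIP-generated text embeddings to improve semantic disambiguation, and applies a consistency learning strategy between weak and strong augmentations for robust local training on pseudo-labels. Result: The method is validated on several cross-domain benchmarks, including synthetic-to-real and clear-to-adverse-weather, where it is competitive with established domain generalization and adaptation methods and sets a strong baseline for unlabeled federated semantic segmentation. Conclusion: FRIEREN addresses the FFREEDG task, achieving effective federated semantic segmentation without access to source data and with unlabeled clients, and shows the potential of combining vision-language models with consistency learning. Abstract: Federated Learning (FL) offers a privacy-preserving solution for Semantic Segmentation (SS) tasks to adapt to new domains, but faces significant challenges from these domain shifts, particularly when client data is unlabeled. However, most existing FL methods unrealistically assume access to labeled data on remote clients or fail to leverage the power of modern Vision Foundation Models (VFMs). Here, we propose a novel and challenging task, FFREEDG, in which a model is pretrained on a server's labeled source dataset and subsequently trained across clients using only their unlabeled data, without ever re-accessing the source. To solve FFREEDG, we propose FRIEREN, a framework that leverages the knowledge of a VFM by integrating vision and language modalities. Our approach employs a Vision-Language decoder guided by CLIP-based text embeddings to improve semantic disambiguation and uses a weak-to-strong consistency learning strategy for robust local training on pseudo-labels. Our experiments on synthetic-to-real and clear-to-adverse-weather benchmarks demonstrate that our framework effectively tackles this new task, achieving competitive performance against established domain generalization and adaptation methods and setting a strong baseline for future research.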
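
The abstract does not spell out its weak-to-strong consistency objective, so the sketch below shows one common form of the idea (FixMatch-style confidence-thresholded pseudo-labels) rather than FRIEREN's actual loss. The function names and the 0.9 threshold are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weak_to_strong_loss(
    weak_logits: Sequence[Sequence[float]],
    strong_logits: Sequence[Sequence[float]],
    threshold: float = 0.9,  # assumed confidence cut-off, not from the paper
) -> float:
    """Mean cross-entropy of strong-view predictions against weak-view pseudo-labels.

    Each entry of `weak_logits` / `strong_logits` holds the class logits for one
    pixel under the weak / strong augmentation. Pixels whose weak-view
    confidence is below `threshold` are ignored.
    """
    losses = []
    for w, s in zip(weak_logits, strong_logits):
        probs_w = softmax(w)
        confidence = max(probs_w)
        if confidence < threshold:
            continue  # pseudo-label too uncertain to supervise the strong view
        pseudo_label = probs_w.index(confidence)
        losses.append(-math.log(softmax(s)[pseudo_label]))
    return sum(losses) / len(losses) if losses else 0.0


weak = [[5.0, 0.0], [0.1, 0.0]]    # pixel 0 is confident, pixel 1 is not
strong = [[0.0, 0.0], [3.0, 0.0]]
loss = weak_to_strong_loss(weak, strong)
```

In a federated setting each client would compute such a loss locally on its unlabeled images before the server aggregates the updates.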

[147] Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting

Shu Zou,Xinyu Tian,Lukas Wesemann,Fabian Waschkowski,Zhaoyuan Yang,Jing Zhang

Main category: cs.CV

TL;DR: Proposes ASK-Hint, a structured prompting framework based on action-centric knowledge that improves the performance of frozen vision-language models on video anomaly detection.

Details

Motivation: Existing prompts are overly abstract and overlook the fine-grained human-object interactions and action semantics that define complex anomalies. Method: Organizes prompts into semantically coherent categories and designs fine-grained guiding questions that align model predictions with discriminative visual cues. Result: Improves AUC on UCF-Crime and XD-Violence, reaching state-of-the-art performance, and generalizes well across datasets and models. Conclusion: Prompt granularity matters, and ASK-Hint is a training-free, generalizable, and interpretable approach to video anomaly detection. Abstract: Prompting has emerged as a practical way to adapt frozen vision-language models (VLMs) for video anomaly detection (VAD). Yet, existing prompts are often overly abstract, overlooking the fine-grained human-object interactions or action semantics that define complex anomalies in surveillance videos. We propose ASK-Hint, a structured prompting framework that leverages action-centric knowledge to elicit more accurate and interpretable reasoning from frozen VLMs. Our approach organizes prompts into semantically coherent groups (e.g., violence, property crimes, public safety) and formulates fine-grained guiding questions that align model predictions with discriminative visual cues. Extensive experiments on UCF-Crime and XD-Violence show that ASK-Hint consistently improves AUC over prior baselines, achieving state-of-the-art performance compared to both fine-tuned and training-free methods. Beyond accuracy, our framework provides interpretable reasoning traces toward anomalies and demonstrates strong generalization across datasets and VLM backbones. These results highlight the critical role of prompt granularity and establish ASK-Hint as a new training-free and generalizable solution for explainable video anomaly detection.

[148] GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation

Weijia Dou,Xu Zhang,Yi Bin,Jian Liu,Bo Peng,Guoqing Wang,Yang Yang,Heng Tao Shen

Main category: cs.CV

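TL;DR: This paper proposes GeoPurify, which exploits latent geometric information and a geometry-guided denoising mechanism to mitigate the noise and data-efficiency problems of transferring 2D vision-language models to 3D semantic segmentation while using only about 1.5% of the training data.

Details

Motivation: Existing 2D-to-3D feature transfer methods trade prediction quality against training cost: direct projection yields noisy, fragmented predictions, while methods that enforce geometric coherence depend on large amounts of annotated data and complex training pipelines. The root cause is that the dominant segmentation-and-matching paradigm struggles to reconcile 2D semantics with 3D geometric structure. Method: Proposes GeoPurify, which includes a small Student Affinity Network that purifies the 3D point features generated by a 2D VLM using geometric priors distilled from a 3D self-supervised teacher model, and adds a Geometry-Guided Pooling module at inference to further denoise and improve semantic and structural consistency. Result: Experiments on major 3D benchmarks show that GeoPurify matches or surpasses state-of-the-art methods while using only about 1.5% of the training data. Conclusion: By exploiting the geometric cues that survive the 2D-to-3D transfer, together with a lightweight network design and inference-time refinement, GeoPurify improves data efficiency and segmentation quality and eases the usual trade-off between performance and cost. Abstract: Recent attempts to transfer features from 2D Vision-Language Models (VLMs) to 3D semantic segmentation expose a persistent trade-off. Directly projecting 2D features into 3D yields noisy and fragmented predictions, whereas enforcing geometric coherence necessitates costly training pipelines and large-scale annotated 3D data. We argue that this limitation stems from the dominant segmentation-and-matching paradigm, which fails to reconcile 2D semantics with 3D geometric structure. The geometric cues are not eliminated during the 2D-to-3D transfer but remain latent within the noisy and view-aggregated features. To exploit this property, we propose GeoPurify, which applies a small Student Affinity Network to purify 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model. During inference, we devise a Geometry-Guided Pooling module to further denoise the point cloud and ensure semantic and structural consistency. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency. Extensive experiments on major 3D benchmarks demonstrate that GeoPurify achieves or surpasses state-of-the-art performance while utilizing only about 1.5% of the training data. Our code and checkpoints are available at [https://github.com/tj12323/GeoPurify](https://github.com/tj12323/GeoPurify).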

[149] Cross-Breed Pig Identification Using Auricular Vein Pattern Recognition: A Machine Learning Approach for Small-Scale Farming Applications

Emmanuel Nsengiyumvaa,Leonard Niyitegekaa,Eric Umuhoza

Main category: cs.CV

TL;DR: Proposes a noninvasive pig biometric identification method based on auricular vein patterns: ear images captured with a smartphone are processed with computer vision and machine learning (SVM accuracy of 98.12%) for efficient, low-cost individual identification that works for mixed-breed pigs and has potential for real-time deployment.

Details

Motivation: Existing pig identification methods (such as ear tags and microchips) are costly, easily damaged, and mostly target pure breeds, making them impractical for small-scale farmers, so a reliable, low-cost, noninvasive method that works for mixed-breed pigs is needed. Method: Collects 800 ear images from 20 mixed-breed pigs using a smartphone with back lighting; builds a multistage computer vision pipeline that enhances vein visibility, extracts structural and spatial features, and generates biometric signatures; and classifies them with machine learning models, including a Support Vector Machine (SVM). Result: The SVM model achieved 98.12% identification accuracy on the mixed-breed population with an average processing time of 8.3 seconds, supporting the system's efficiency and real-time feasibility. Conclusion: Auricular vein biometrics is a feasible, stable, and low-cost pig identification approach that can replace traditional physical identifiers and help digitize livestock management and precision farming in resource-constrained regions. Abstract: Accurate livestock identification is a cornerstone of modern farming: it supports health monitoring, breeding programs, and productivity tracking. However, common pig identification methods, such as ear tags and microchips, are often unreliable, costly, target pure breeds, and thus impractical for small-scale farmers. To address this gap, we propose a noninvasive biometric identification approach that leverages the uniqueness of auricular vein patterns. To this end, we have collected 800 ear images from 20 mixed-breed pigs (Landrace cross Pietrain and Duroc cross Pietrain), captured using a standard smartphone and simple back lighting. A multistage computer vision pipeline was developed to enhance vein visibility, extract structural and spatial features, and generate biometric signatures. These features were then classified using machine learning models. Support Vector Machines (SVM) achieved the highest accuracy: correctly identifying pigs with 98.12% precision across mixed-breed populations. The entire process from image processing to classification was completed in an average of 8.3 seconds, demonstrating feasibility for real-time farm deployment. We believe that by replacing fragile physical identifiers with permanent biological markers, this system provides farmers with a cost-effective and stress-free method of animal identification. More broadly, the findings confirm the practicality of auricular vein biometrics for digitizing livestock management, reinforcing its potential to extend the benefits of precision farming to resource-constrained agricultural communities.

[150] MMDEW: Multipurpose Multiclass Density Estimation in the Wild

Villanelle O'Reilly,Jonathan Cox,Georgios Leontidis,Marc Hanheide,Petra Bosilj,James Brown

Main category: cs.CV

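TL;DR: Proposes a multi-class density map estimation method built on a Twins pyramid vision transformer; with a Category Focus Module and multiscale decoding, it clearly outperforms existing methods in dense scenes and shows application potential in new domains such as biodiversity monitoring.

Details

Motivation: Conventional detection methods struggle to count accurately in dense, occluded scenes, so a multi-class counting method that estimates density maps effectively is needed. Method: Uses a Twins pyramid vision transformer as the backbone, combined with a specialized multi-class counting head based on multiscale decoding, and designs a segmentation-based Category Focus Module to suppress inter-category cross-talk during training. Result: Clearly outperforms prior methods on the VisDrone and iSAID benchmarks (MAE reductions of 33%, 43%, and 64%), and a comparison with YOLOv11 supports the need for density estimation in dense scenes; an application to a biodiversity monitoring dataset shows its cross-domain potential. Conclusion: The proposed method performs well on multi-class dense counting and can extend to new domains such as ecological conservation, supporting scalable ecological insights and conservation applications. Abstract: Density map estimation can be used to estimate object counts in dense and occluded scenes where discrete counting-by-detection methods fail. We propose a multicategory counting framework that leverages a Twins pyramid vision-transformer backbone and a specialised multi-class counting head built on a state-of-the-art multiscale decoding approach. A two-task design adds a segmentation-based Category Focus Module, suppressing inter-category cross-talk at training time. Training and evaluation on the VisDrone and iSAID benchmarks demonstrate superior performance versus prior multicategory crowd-counting approaches (33%, 43% and 64% reductions in MAE), and the comparison with YOLOv11 underscores the necessity of crowd counting methods in dense scenes. The method's regional loss opens up multi-class crowd counting to new domains, demonstrated through the application to a biodiversity monitoring dataset, highlighting its capacity to inform conservation efforts and enable scalable ecological insights.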
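
For readers unfamiliar with density-based counting, the sketch below shows how a per-class count is read off a density map (by summing it) and how the MAE reported above is computed from predicted and true counts. It is a generic illustration with made-up numbers and hypothetical function names, not the MMDEW code.

```python
from typing import Dict, List, Sequence

DensityMap = List[List[float]]


def counts_from_density(density_maps: Dict[str, DensityMap]) -> Dict[str, float]:
    """Per-class object count: the integral (sum) of each class's density map."""
    return {cls: sum(sum(row) for row in dmap) for cls, dmap in density_maps.items()}


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """MAE between predicted and ground-truth counts over a set of images."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical 2x2 density maps for one image with two categories.
image_maps = {
    "car": [[0.5, 0.5], [1.0, 0.0]],
    "person": [[0.25, 0.25], [0.25, 0.25]],
}
image_counts = counts_from_density(image_maps)
car_mae = mean_absolute_error([2.0, 4.0], [3, 4])  # two images, car class
```

Because the count is a sum over the map, objects that overlap or are too small to detect individually still contribute fractional mass, which is why this formulation suits dense and occluded scenes.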

[151] TempoControl: Temporal Attention Guidance for Text-to-Video Models

Shira Schiber,Ofir Lindenbaum,Idan Schwartz

Main category: cs.CV

TL;DR: TempoControl is a method that temporally aligns visual concepts in text-to-video generation without retraining or additional supervision, using cross-attention maps and three principles (correlation, energy, and entropy) to control when concepts appear.

Details

Motivation: Existing generative video models lack fine-grained control over when visual elements appear, so they cannot meet users' needs for precise temporal control. Method: Proposes TempoControl, which builds a novel optimization procedure on the cross-attention maps of text-to-video diffusion models: correlation with a control signal aligns the temporal shape, energy amplifies attention where visibility is needed, and entropy maintains spatial focus. Result: TempoControl is shown to be effective on a range of video generation tasks, including temporal reordering of single and multiple objects and action- and audio-aligned generation, achieving precise timing control while preserving quality and diversity. Conclusion: TempoControl offers a flexible and efficient way to control the timing of visual concepts in generated video without modifying model training, broadening the applications of generative video models. Abstract: Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal shape with a control signal (via correlation), amplifying it where visibility is needed (via energy), and maintaining spatial focus (via entropy). TempoControl allows precise control over timing while ensuring high video quality and diversity. We demonstrate its effectiveness across various video generation applications, including temporal reordering for single and multiple objects, as well as action and audio-aligned generation.
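
The three guidance quantities named in this abstract can be illustrated on a toy attention tensor. The abstract does not give the formulas, so the definitions below (Pearson correlation of per-frame attention mass with the control signal, control-weighted attention mass as "energy", and mean per-frame spatial entropy) are assumptions chosen for illustration, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def spatial_entropy(frame: Sequence[float]) -> float:
    """Shannon entropy of one frame's attention after normalizing it to sum to 1."""
    total = sum(frame)
    if total == 0:
        return 0.0
    return -sum((v / total) * math.log(v / total) for v in frame if v > 0)


def guidance_terms(
    attn: List[List[float]], control: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (correlation, energy, entropy) for one concept token.

    `attn[t][i]` is the cross-attention weight at frame t and spatial position i;
    `control[t]` is the desired visibility of the concept at frame t.
    """
    per_frame = [sum(frame) for frame in attn]
    correlation = pearson(per_frame, control)
    energy = sum(a * c for a, c in zip(per_frame, control))
    entropy = sum(spatial_entropy(frame) for frame in attn) / len(attn)
    return correlation, energy, entropy


# Toy example: attention ramps up over three frames, matching the control signal.
toy_attn = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
toy_control = [0.0, 0.5, 1.0]
corr, energy, entropy = guidance_terms(toy_attn, toy_control)
```

An optimizer in this spirit would push the correlation and energy terms up and the entropy term down; how the actual method weights and differentiates these terms is not stated in the abstract.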

[152] RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning

Sicheng Feng,Kaiwen Tuo,Song Wang,Lingdong Kong,Jianke Zhu,Huan Wang

Main category: cs.CV

TL;DR: This paper proposes RewardMap, a multi-stage reinforcement learning framework that improves multimodal large language models on fine-grained visual reasoning, particularly spatial reasoning in complex settings such as transit maps.

Details

Motivation: Existing multimodal large language models perform poorly on structured, information-dense visual reasoning tasks such as transit maps, and standard reinforcement learning is hampered by sparse rewards and unstable optimization. Method: Builds ReasonMap-Plus, an extended dataset with dense reward signals, and proposes the RewardMap framework, which includes a difficulty-aware reward design and a multi-stage training strategy that progresses from perception to complex reasoning. Result: Experiments on ReasonMap and ReasonMap-Plus show that each component of RewardMap improves performance and that their combination works best; models improve by 3.47% on average across 6 benchmarks, showing stronger visual understanding and reasoning. Conclusion: Through dense rewards and progressive training, RewardMap improves MLLM performance on fine-grained visual reasoning and generalizes well. Abstract: Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured and information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse rewards while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.

[153] DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing

Zihan Zhou,Shilin Lu,Shuli Leng,Shaocong Zhang,Zhuming Lian,Xinlei Yu,Adams Wai-Kin Kong

Main category: cs.CV

TL;DR: This paper proposes DragFlow, the first framework to effectively use FLUX's strong generative prior for drag-based image editing; by introducing a region-based editing paradigm, pretrained adapters, and multimodal large language models, it substantially improves editing results.

Details

Motivation: Earlier UNet-based diffusion models distort the target region in drag editing because their latent-space priors are too weak, while the newer DiT architectures have stronger priors that have not yet been used effectively for drag editing. Method: Proposes a region-based editing paradigm in which affine transformations provide more consistent feature supervision; adds IP-Adapter to improve subject consistency, uses gradient masks to preserve background fidelity, and employs multimodal large language models to resolve task ambiguities. Result: Experiments on DragBench-DR and the newly built ReD Bench show that DragFlow outperforms point-based and region-based baselines and achieves state-of-the-art performance. Conclusion: DragFlow brings the strong priors of DiT architectures to drag editing and, through region-based supervision and the combination of several modules, improves editing quality and robustness. Abstract: Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models such as Stable Diffusion are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX's rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.

[154] From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding

Guangyu Sun,Archit Singhal,Burak Uzkent,Mubarak Shah,Chen Chen,Garin Kessler

Main category: cs.CV

TL;DR: Proposes F2C, which selects key video clips rather than single frames and combines this with an adaptive resolution strategy to improve video understanding under a fixed computational budget.

Details

Motivation: Existing video large language models are limited by the "needle in a haystack" problem, and selecting only key frames discards temporal dynamics, which hurts reasoning about motion and event continuity. Method: Extends frame selection to short, temporally coherent segments (key clips) and uses an adaptive resolution strategy that balances spatial resolution against clip length to keep the token count per video constant. Result: On three long-form video benchmarks, Video-MME, LongVideoBench, and MLVU, it improves over uniform sampling by up to 8.1%, 5.6%, and 10.3%, respectively. Conclusion: Preserving temporal coherence is important for video understanding, and F2C offers an effective, training-free way to scale video LLMs to practical applications. Abstract: Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision language tasks, yet their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window. Existing solutions alleviate this issue by selecting a sparse set of frames, thereby reducing token count, but such frame-wise selection discards essential temporal dynamics, leading to suboptimal reasoning about motion and event continuity. In this work we systematically explore the impact of temporal information and demonstrate that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we propose an adaptive resolution strategy that dynamically balances spatial resolution and clip length, ensuring a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on Video-MME, LongVideoBench and MLVU benchmarks, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling Video LLMs to real world video understanding applications. Project webpage is available at https://guangyusun.com/f2c .
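
The trade-off between clip length and spatial resolution under a fixed token budget can be made concrete with a little arithmetic. The sketch below assumes square frames tokenized into non-overlapping patches (patch size 14 is an assumption, as are the function name and the budget of 8192 tokens); it is not the F2C implementation, which the abstract does not detail.

```python
import math
from typing import Tuple


def resolution_for_budget(
    token_budget: int, num_clips: int, clip_length: int, patch_size: int = 14
) -> Tuple[int, int]:
    """Pick the largest square frame size that keeps the video within the budget.

    Returns (frame_side_in_pixels, tokens_actually_used). Each frame of side
    `g * patch_size` is assumed to cost `g * g` visual tokens.
    """
    frames = num_clips * clip_length
    tokens_per_frame = token_budget // frames
    if tokens_per_frame < 1:
        raise ValueError("budget too small for the requested number of frames")
    grid = math.isqrt(tokens_per_frame)  # patches per side
    return grid * patch_size, frames * grid * grid


# Same budget and number of selections, but longer clips force lower resolution.
side_frames, used_frames = resolution_for_budget(8192, num_clips=4, clip_length=1)
side_clips, used_clips = resolution_for_budget(8192, num_clips=4, clip_length=4)
```

With these assumed numbers, going from single frames to four-frame clips roughly halves the frame side length, which is the kind of balance between spatial detail and temporal coverage the abstract refers to.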

[155] Paving the Way Towards Kinematic Assessment Using Monocular Video: A Preclinical Benchmark of State-of-the-Art Deep-Learning-Based 3D Human Pose Estimators Against Inertial Sensors in Daily Living Activities

Mario Medrano-Paredes,Carmen Fernández-González,Francisco-Javier Díaz-Pernas,Hichem Saoudi,Javier González-Alonso,Mario Martínez-Zarzuela

Main category: cs.CV

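TL;DR: This study compares monocular-video 3D human pose estimation models with inertial measurement units (IMUs) for kinematic assessment in real-world settings, using the VIDIMU dataset. MotionAGFormer performs best, and both video and IMU approaches are viable, each with trade-offs in cost, accessibility, and precision.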

Details

Motivation: Accurate assessment of human movement outside the laboratory would advance telemedicine, sports science, and rehabilitation, which calls for comparing emerging video-based pose estimation models with established IMU technology. Method: Uses the VIDIMU dataset of 13 clinically relevant daily activities, recorded from healthy subjects with commodity cameras and five IMUs; IMU data are converted to joint angles with OpenSim inverse kinematics and compared against joint angles produced by deep-learning video models (MotionAGFormer, MotionBERT, MMPose 2D-to-3D, NVIDIA BodyTrack), with RMSE, MAE, Pearson correlation, and R² as evaluation metrics. Result: MotionAGFormer performs best, with an overall RMSE of 9.27°±4.80°, MAE of 7.86°±4.18°, Pearson correlation of 0.86±0.15, and R² of 0.67±0.28; both video and IMU technologies are usable for out-of-the-lab kinematic analysis, with trade-offs in cost, accessibility, and precision. Conclusion: Off-the-shelf video pose estimation models (MotionAGFormer in particular) already show clinical promise in healthy adults and could support remote health monitoring, but their applicability to pathological populations still needs validation; the study offers guidance for developing affordable, reliable, and user-friendly telehealth systems. Abstract: Advances in machine learning and wearable sensors offer new opportunities for capturing and analyzing human movement outside specialized laboratories. Accurate assessment of human movement under real-world conditions is essential for telemedicine, sports science, and rehabilitation. This preclinical benchmark compares monocular video-based 3D human pose estimation models with inertial measurement units (IMUs), leveraging the VIDIMU dataset containing a total of 13 clinically relevant daily activities which were captured using both commodity video cameras and five IMUs. During this initial study only healthy subjects were recorded, so results cannot be generalized to pathological cohorts. Joint angles derived from state-of-the-art deep learning frameworks (MotionAGFormer, MotionBERT, MMPose 2D-to-3D pose lifting, and NVIDIA BodyTrack) were evaluated against joint angles computed from IMU data using OpenSim inverse kinematics following the Human3.6M dataset format with 17 keypoints. Among them, MotionAGFormer demonstrated superior performance, achieving the lowest overall RMSE ($9.27^{\circ} \pm 4.80^{\circ}$) and MAE ($7.86^{\circ} \pm 4.18^{\circ}$), as well as the highest Pearson correlation ($0.86 \pm 0.15$) and the highest coefficient of determination $R^{2}$ ($0.67 \pm 0.28$). The results reveal that both technologies are viable for out-of-the-lab kinematic assessment.
However, they also highlight key trade-offs between video- and sensor-based approaches including costs, accessibility, and precision. This study clarifies where off-the-shelf video models already provide clinically promising kinematics in healthy adults and where they lag behind IMU-based estimates while establishing valuable guidelines for researchers and clinicians seeking to develop robust, cost-effective, and user-friendly solutions for telehealth and remote patient monitoring.
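The four agreement metrics reported in this entry (RMSE, MAE, Pearson correlation, and R²) can be computed for a pair of joint-angle series as sketched below. This is an illustrative sketch using the standard textbook definitions; the paper's exact preprocessing and its precise R² convention are not stated in the abstract, and the sample angles are made up.

```python
import math

def rmse(reference, estimate):
    """Root-mean-square error between two equally long series."""
    return math.sqrt(sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference))

def mae(reference, estimate):
    """Mean absolute error."""
    return sum(abs(r - e) for r, e in zip(reference, estimate)) / len(reference)

def pearson(a, b):
    """Pearson correlation coefficient."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def r_squared(reference, estimate):
    """Coefficient of determination, treating `reference` as ground truth."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot

# Made-up knee-flexion angles in degrees: IMU-derived vs. video-derived.
imu_angles = [10.0, 20.0, 30.0, 40.0]
video_angles = [12.0, 18.0, 33.0, 41.0]
```

Note that Pearson correlation ignores constant offsets and scale differences between the two series, whereas RMSE, MAE, and this form of R² penalize them, which is one reason the entry reports all four.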

[156] NeuroSwift: A Lightweight Cross-Subject Framework for fMRI Visual Reconstruction of Complex Scenes

Shiyi Zhang,Dong Liang,Yihang Zhou

Main category: cs.CV

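TL;DR: Proposes NeuroSwift, a diffusion-based method for cross-subject reconstruction of visual stimuli that combines AutoKL and CLIP adapters to capture low-level features and semantics, respectively, and uses efficient fine-tuning to reach state-of-the-art performance on lightweight GPUs.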

Details

Motivation: Existing methods for cross-subject fMRI visual reconstruction suffer from low accuracy and high computational cost, owing to inter-subject differences in neural representations and the brain's abstract encoding of complex visual inputs. Method: Introduces NeuroSwift, which integrates an AutoKL adapter (low-level features) and a CLIP adapter (semantics); the CLIP Adapter is trained on Stable Diffusion generated images paired with COCO captions to emulate higher visual cortex encoding; after pretraining, only 17% of the parameters (the fully connected layers) are fine-tuned for cross-subject generalization. Result: With only three RTX 4090 GPUs and one hour of training per subject, achieves state-of-the-art cross-subject visual reconstruction and outperforms existing methods. Conclusion: Through its modular adapter design and efficient fine-tuning strategy, NeuroSwift addresses both inter-subject variability and computational overhead in cross-subject fMRI decoding, advancing diffusion-based brain-to-image reconstruction. Abstract: Reconstructing visual information from brain activity via computer vision technology provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and the brain's abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. NeuroSwift's CLIP Adapter is trained on Stable Diffusion generated images paired with COCO captions to emulate higher visual cortex encoding. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17 percent of parameters (fully connected layers) for new subjects, while freezing other components. This enables state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090), and it outperforms existing methods.

[157] microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification

Sathira Silva,Eman Ali,Chetan Arora,Muhammad Haris Khan

Main category: cs.CV

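TL;DR: Proposes microCLIP, a self-training framework for fine-grained image classification that introduces Saliency-Oriented Attention Pooling (SOAP) and a TokenFusion module, combining LLM-generated textual priors with CLIP's visual representations to improve unsupervised adaptation.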

Details

Motivation: CLIP underperforms on fine-grained image classification because it relies on coarse global features and is insensitive to subtle local cues. Existing methods do not achieve spatially precise fine-grained alignment, so a finer mechanism is needed to surface latent local semantics. Method: Proposes the microCLIP framework. At its core is Saliency-Oriented Attention Pooling (SOAP) inside the TokenFusion module, which builds a saliency-guided [FG] token from patch embeddings and fuses it with the global [CLS] token for coarse-fine alignment. A two-headed LLM-derived classifier is used: one head is frozen to provide stable pseudo-labels, the other is learnable and fine-tuned. Dynamic Knowledge Aggregation convexly combines the fixed LLM/CLIP priors with TokenFusion's evolving logits to iteratively refine the pseudo-labels. Result: Average accuracy improves by 2.90% across 13 fine-grained classification benchmarks, with only light adaptation and no additional annotation. Conclusion: microCLIP strengthens CLIP's fine-grained classification ability in the unsupervised setting; saliency-guided token fusion and dynamic knowledge aggregation make the model sensitive to subtle local features while keeping adaptation stable and generalizable. Abstract: Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP $\texttt{[CLS]}$ token; however, this approach overlooks spatial precision. We propose $\textbf{microCLIP}$, a self-training framework that jointly refines CLIP's visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided $\texttt{[FG]}$ token from patch embeddings and fuses it with the global $\texttt{[CLS]}$ token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion's evolving logits to iteratively refine pseudo-labels.
Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent $2.90\%$ average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation. Our code is available at https://github.com/sathiiii/microCLIP.
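The convex combination used in Dynamic Knowledge Aggregation can be sketched in a few lines. This is a hedged illustration, not the microCLIP code: the mixing weight `alpha`, how it might be scheduled, and the toy logits are all assumptions, since the abstract only states that fixed priors and evolving logits are combined convexly.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(prior_logits, dynamic_logits, alpha):
    """Convex combination alpha * prior + (1 - alpha) * dynamic, with 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    return [alpha * p + (1.0 - alpha) * d for p, d in zip(prior_logits, dynamic_logits)]

def pseudo_label(prior_logits, dynamic_logits, alpha):
    """Return (class index, confidence) of the aggregated prediction."""
    probs = softmax(aggregate(prior_logits, dynamic_logits, alpha))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

# Toy three-class example: the fixed prior favours class 0, the adapted head class 1.
fixed_prior = [2.0, 1.0, 0.0]
evolving = [0.0, 3.0, 0.0]
label, confidence = pseudo_label(fixed_prior, evolving, alpha=0.5)
```

With `alpha = 1` the pseudo-label comes entirely from the frozen prior and with `alpha = 0` entirely from the adapted head; intermediate values let the adapted logits shift labels only where they are confident enough to outweigh the prior.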

[158] VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL

Kyoungjun Park,Yifan Yang,Juheon Yi,Shicheng Zheng,Yifei Shen,Dongqi Han,Caihua Shan,Muhammad Muaz,Lili Qiu

Main category: cs.CV

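TL;DR: VidGuard-R1 is the first video authenticity detector based on a multimodal large language model; trained with group relative policy optimization (GRPO), it achieves high accuracy and interpretability and substantially outperforms existing methods.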

Details

Motivation: With AI-generated video advancing rapidly, effective detection tools are urgently needed to counter misinformation and other societal risks, and detectors should be interpretable to improve transparency. Method: Proposes VidGuard-R1, which fine-tunes the multimodal large language model Qwen-VL with GRPO, using two specialized reward models that target temporal artifacts and generation complexity; a high-quality dataset of 140k real and AI-generated videos is constructed. Result: Achieves state-of-the-art zero-shot detection performance on multiple benchmarks, exceeds 95% accuracy after additional training, and case studies show that its reasoning is precise and interpretable. Conclusion: VidGuard-R1 combines high accuracy with interpretability in AI-generated video detection, providing an effective tool against the societal risks posed by AI-generated content. Abstract: With the rapid advancement of AI-generated videos, there is an urgent need for effective detection tools to mitigate societal risks such as misinformation and reputational harm. In addition to accurate classification, it is essential that detection models provide interpretable explanations to ensure transparency for regulators and end users. To address these challenges, we introduce VidGuard-R1, the first video authenticity detector that fine-tunes a multi-modal large language model (MLLM) using group relative policy optimization (GRPO). Our model delivers both highly accurate judgments and insightful reasoning. We curate a challenging dataset of 140k real and AI-generated videos produced by state-of-the-art generation models, carefully designing the generation process to maximize discrimination difficulty. We then fine-tune Qwen-VL using GRPO with two specialized reward models that target temporal artifacts and generation complexity. Extensive experiments demonstrate that VidGuard-R1 achieves state-of-the-art zero-shot performance on existing benchmarks, with additional training pushing accuracy above 95%. Case studies further show that VidGuard-R1 produces precise and interpretable rationales behind its predictions. The code is publicly available at https://VidGuard-R1.github.io.
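The "group relative" part of GRPO refers to scoring each sampled response against the other responses drawn for the same prompt, rather than against a learned value function. A minimal sketch of that normalization step follows, using the commonly cited form (reward minus group mean, divided by group standard deviation). The reward values are made up, and VidGuard-R1's two reward models are not reproduced here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are judged relative to their siblings for the same prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four responses to one video: 1.0 = correct verdict.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

If every response in a group receives the same reward, all advantages are zero and that group contributes no policy-gradient signal, which is one motivation for curating examples that are hard enough to split the group.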

[159] Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Justin Cui,Jie Wu,Ming Li,Tao Yang,Xiaojie Li,Rui Wang,Andrew Bai,Yuanhao Ban,Cho-Jui Hsieh

Main category: cs.CV

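TL;DR: Proposes a simple yet effective method that uses the teacher model's knowledge to guide the student model in generating long videos, avoiding the quality degradation caused by error accumulation in the continuous latent space, without supervision from long-video teachers or retraining.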

Details

Motivation: Transformer-based diffusion models are computationally expensive for long video generation, and distilling from teacher models that cannot themselves generate long videos leads to marked quality degradation when the student extrapolates. Method: Proposes an enhanced self-forcing approach in which the teacher model provides guidance on segments sampled from the student's self-generated long videos, improving quality and consistency without retraining or a long-video teacher. Result: Video length scales to as much as 20x the teacher's capability, with generation of up to 4 minutes and 15 seconds (more than 50x longer than the baseline model), and the method substantially outperforms baselines on standard and improved benchmarks. Conclusion: The method mitigates quality degradation in long-horizon video generation, maintaining temporal consistency while greatly increasing video length and visual fidelity. Abstract: Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond teacher's capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model.
Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon videos demo can be found at https://self-forcing-plus-plus.github.io/

[160] Learning to Generate Object Interactions with Physics-Guided Video Diffusion

David Romero,Ariana Bermudez,Hao Li,Fabio Pizzati,Ivan Laptev

Main category: cs.CV

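TL;DR: Proposes KineMask, a physics-guided video generation method that combines low-level motion control with high-level text conditioning to produce more realistic rigid-body motion, interactions, and dynamics in real scenes.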

Details

Motivation: Current video generation models still fall short on physically plausible object interactions and physics-grounded control, which limits their use in robotics and embodied decision making. Method: Proposes KineMask with a two-stage training strategy: video diffusion models are trained on simple interactions in synthetic scenes, and future motion supervision is gradually removed via object masks; low-level motion control (a specified object velocity) is combined with high-level text conditioning through predictive scene descriptions. Result: Realism and controllability of object interactions improve markedly in real scenes; experiments show that KineMask outperforms recent models of comparable size, and ablations confirm the complementary roles of low- and high-level conditioning. Conclusion: KineMask achieves physics-guided video generation and opens a path for using video models as world simulators in complex dynamic scenes. Abstract: Recent models for video generation have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements of object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, leading to effective support for synthesis of complex dynamical phenomena. Extensive experiments show that KineMask achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs. Our code, model, and data will be made publicly available.

[161] MultiModal Action Conditioned Video Generation

Yichen Li,Antonio Torralba

Main category: cs.CV

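TL;DR: Proposes a feature learning paradigm for fine-grained multimodal actions that fuses proprioception, kinesthesia, force haptics, and muscle activation to improve how well robots can simulate complex interaction dynamics.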

Details

Motivation: Current video models used as world models lack fine-grained control, whereas general-purpose household robots need real-time fine motor control to handle delicate tasks and urgent situations. Method: Introduces fine-grained multimodal actions that combine several senses (proprioception, kinesthesia, force haptics, and muscle activation), designs a feature learning paradigm that aligns these modalities while preserving the unique information each provides, and proposes a regularization scheme to enhance the causality of the action trajectory features. Result: Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift; ablations and downstream applications support the method's effectiveness and practicality. Conclusion: The method improves fine-grained control in complex interactions and offers a feasible technical path toward more practical general-purpose household robots. Abstract: Current video models fail as world models because they lack fine-grained control. General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine-grained multimodal actions to capture such precise control. We consider senses of proprioception, kinesthesia, force haptics, and muscle activation. Such multimodal senses naturally enable fine-grained interactions that are difficult to simulate with text-conditioned generative models. To effectively simulate fine-grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further propose a regularization scheme to enhance causality of the action trajectory features in representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.

[162] VideoNSA: Native Sparse Attention Scales Video Understanding

Enxin Song,Wenhao Chai,Shusheng Yang,Ethan Armand,Xiaojun Shan,Haiyang Xu,Jianwen Xie,Zhuowen Tu

Main category: cs.CV

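TL;DR: VideoNSA applies Native Sparse Attention (NSA) to video-language models; by training Qwen2.5-VL end to end on a 216K video instruction dataset, it improves long-video understanding, temporal reasoning, and spatial task performance.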

Details

Motivation: Existing video understanding models are limited by context length, so they miss key transition frames and struggle to stay coherent over long time spans. Method: Proposes VideoNSA, a hardware-aware hybrid attention scheme that keeps dense attention for text and uses Native Sparse Attention (NSA) for video, trained end to end on a large-scale dataset. Result: Compared with token-compression and training-free sparse baselines, VideoNSA performs better on long-video understanding, temporal reasoning, and spatial benchmarks; it scales reliably to 128K tokens, and the analysis identifies an optimal global-local attention allocation, task-dependent branch usage patterns, and the finding that learnable sparse attention induces dynamic attention sinks. Conclusion: VideoNSA addresses long-context modeling in video understanding and demonstrates the scalability and practicality of sparse attention in multimodal large models. Abstract: Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text, while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.

[163] NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

Ruozhen He,Moayed Haji-Ali,Ziyan Yang,Vicente Ordonez

Main category: cs.CV

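TL;DR: Proposes NoiseShift, a training-free method that recalibrates the denoiser's noise level according to image resolution, addressing the inconsistent generation quality of text-to-image diffusion models across resolutions.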

Details

Motivation: High-resolution text-to-image models perform poorly when generating at low resolutions, mainly because noise schedulers have unequal perceptual effects across resolutions, which creates a train-test mismatch. Method: NoiseShift adjusts the noise level to suit each resolution; it requires no changes to model architecture or sampling schedule and works with existing models. Result: Applied to Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, NoiseShift significantly improves low-resolution generation quality, with clear FID improvements on both LAION-COCO and CelebA. Conclusion: NoiseShift mitigates resolution-dependent artifacts and improves low-resolution image generation, and it is general and practical to apply. Abstract: Text-to-image diffusion models trained on a fixed set of resolutions often fail to generalize, even when asked to generate images at lower resolutions than those seen during training. High-resolution text-to-image generators are currently unable to easily offer an out-of-the-box budget-efficient alternative to their users who might not need high-resolution images. We identify a key technical insight in diffusion models that when addressed can help tackle this limitation: Noise schedulers have unequal perceptual effects across resolutions. The same level of noise removes disproportionately more signal from lower-resolution images than from high-resolution images, leading to a train-test mismatch. We propose NoiseShift, a training-free method that recalibrates the noise level of the denoiser conditioned on resolution size. NoiseShift requires no changes to model architecture or sampling schedule and is compatible with existing models. When applied to Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, quality at low resolutions is significantly improved. On LAION-COCO, NoiseShift improves SD3.5 by 15.89%, SD3 by 8.56%, and Flux-Dev by 2.44% in FID on average. On CelebA, NoiseShift improves SD3.5 by 10.36%, SD3 by 5.19%, and Flux-Dev by 3.02% in FID on average. These results demonstrate the effectiveness of NoiseShift in mitigating resolution-dependent artifacts and enhancing the quality of low-resolution image generation.

[164] Inferring Dynamic Physical Properties from Video Foundation Models

Guanqi Zhan,Xianzheng Ma,Weidi Xie,Andrew Zisserman

Main category: cs.CV

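TL;DR: Studies the task of predicting dynamic physical properties (elasticity, viscosity, and dynamic friction) from video, introduces new synthetic and real datasets, and compares three inference approaches based on visual cues, pre-trained video models, and multimodal large language models.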

Details

Motivation: Many physical properties can only be inferred from temporal information, and existing methods generalize poorly to real scenes, so effective inference with video foundation models and multimodal large models needs to be explored. Method: Builds new video datasets for elasticity, viscosity, and friction, and predicts the properties in three ways: (a) an oracle method that extracts visual cues with classical computer vision; (b) a learnable visual-prompt readout on pre-trained video generative and self-supervised models; and (c) prompting strategies for multimodal large language models. Result: Video foundation models trained generatively or with self-supervision perform similarly; they trail the oracle but beat current multimodal large language models, whose performance improves with suitable prompting. Conclusion: Video foundation models show promise for predicting dynamic physical properties, while current MLLMs remain weaker on such tasks, although prompt engineering helps. Abstract: We study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, consisting of synthetic training and testing splits, as well as a real split for real world evaluation. (ii) We explore three ways to infer the physical property from videos: (a) an oracle method where we supply the visual cues that intrinsically reflect the property using classical computer vision techniques; (b) a simple read out mechanism using a visual prompt and trainable prompt vector for cross-attention on pre-trained video generative and self-supervised models; and (c) prompt strategies for Multi-modal Large Language Models (MLLMs). (iii) We show that video foundation models trained in a generative or self-supervised manner achieve a similar performance, though behind that of the oracle, and MLLMs are currently inferior to the other models, though their performance can be improved through suitable prompting.

[165] Clink! Chop! Thud! -- Learning Object Sounds from Real-World Interactions

Mengyu Yang,Yiming Chen,Haozheng Pei,Siddhant Agarwal,Arun Balajee Vasudevan,James Hays

Main category: cs.CV

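TL;DR: Introduces the sounding object detection task and a multimodal object-aware framework that learns from the audio and visual streams of egocentric videos, using automatically computed segmentation masks and slot attention to identify the objects that produce sound during an interaction.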

Details

Motivation: Explores whether a model can link the sounds produced by everyday object interactions to the objects directly involved, mirroring human perception. Method: Builds a multimodal framework on egocentric videos, uses an automatic pipeline to generate segmentation masks of the interacting objects to guide training, and adopts a slot attention visual encoder to reinforce an object-centric prior. Result: Achieves state-of-the-art performance on the newly proposed sounding object detection task and on existing multimodal action understanding tasks. Conclusion: The method improves a model's ability to associate sounds with the objects involved and advances object-centric multimodal learning. Abstract: Can a model distinguish between the sound of a spoon hitting a hardwood floor versus a carpeted one? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model's ability to link these sounds to the objects directly involved. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. To encourage an object-centric approach, we first develop an automatic pipeline to compute segmentation masks of the objects involved to guide the model's focus during training towards the most informative regions of the interaction. A slot attention visual encoder is used to further enforce an object prior. We demonstrate state-of-the-art performance on our new task along with existing multimodal action understanding tasks.

[166] StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions

Bo-Hsu Ke,You-Zhe Xie,Yu-Lun Liu,Wei-Chen Chiu

Main category: cs.CV

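TL;DR: Proposes a density-guided, image-level poisoning attack on 3D Gaussian Splatting (3DGS) that injects Gaussian points into low-density regions and adds an adaptive noise strategy, producing stealthy illusory objects from specific viewpoints while minimally affecting normal views.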

Details

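Motivation: As 3D scene representations such as NeRF and 3DGS become widely used for novel view synthesis, their security matters more. The robustness of 3DGS to image-level poisoning attacks has not been analyzed, so its vulnerability needs systematic evaluation. Method: Proposes a density-guided poisoning method that uses Kernel Density Estimation (KDE) to identify low-density regions and strategically injects Gaussian points there to embed viewpoint-dependent illusory objects; an adaptive noise strategy disrupts multi-view consistency to strengthen the attack; a KDE-based evaluation protocol measures attack difficulty objectively. Result: Experiments show the method is more effective than existing state-of-the-art attacks: the illusory object is clearly visible in the target view while remaining inconspicuous in non-target views, and the KDE-based protocol supports standardized benchmarking in future work. Conclusion: The work exposes a security weakness of 3DGS under adversarial poisoning, proposes an effective and stealthy attack, and provides a reproducible, comparable evaluation framework, advancing research on the security of 3D scene representations. Abstract: 3D scene representation methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have significantly advanced novel view synthesis. As these methods become prevalent, addressing their vulnerabilities becomes critical. We analyze 3DGS robustness against image-level poisoning attacks and propose a novel density-guided poisoning method. Our method strategically injects Gaussian points into low-density regions identified via Kernel Density Estimation (KDE), embedding viewpoint-dependent illusory objects clearly visible from poisoned views while minimally affecting innocent views. Additionally, we introduce an adaptive noise strategy to disrupt multi-view consistency, further enhancing attack effectiveness. We propose a KDE-based evaluation protocol to assess attack difficulty systematically, enabling objective benchmarking for future research. Extensive experiments demonstrate our method's superior performance compared to state-of-the-art techniques. Project page: https://hentci.github.io/stealthattack/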
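Identifying low-density regions with Kernel Density Estimation, as this entry describes, amounts to scoring candidate locations by how much kernel mass the existing points place on them. The sketch below is a generic isotropic Gaussian KDE over 3D points in plain Python; the bandwidth, the candidate set, and the function names are assumptions for illustration, and the paper's actual KDE configuration is not given in the abstract.

```python
import math

def kde_density(query, points, bandwidth=0.5):
    """Isotropic Gaussian kernel density estimate at a 3D `query` location."""
    norm = len(points) * (2.0 * math.pi * bandwidth ** 2) ** 1.5
    total = 0.0
    for point in points:
        sq_dist = sum((q - c) ** 2 for q, c in zip(query, point))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / norm

def lowest_density_candidate(candidates, points, bandwidth=0.5):
    """Pick the candidate location with the smallest estimated density."""
    return min(candidates, key=lambda c: kde_density(c, points, bandwidth))

# Toy scene: existing points cluster near the origin; one candidate is far away.
scene_points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1)]
candidates = [(0.05, 0.05, 0.05), (3.0, 3.0, 3.0)]
sparse_spot = lowest_density_candidate(candidates, scene_points)
```

This brute-force version costs one kernel evaluation per (candidate, point) pair, so a real scene with many Gaussians would need spatial indexing or a library implementation.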

[167] Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity

Eric Tillmann Bill,Enis Simsar,Thomas Hofmann

Main category: cs.CV

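TL;DR: Presents the first theoretical framework for improving multi-subject fidelity in text-to-image models, combining flow matching with stochastic optimal control to steer the sampling dynamics.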

Details

Motivation: Text-to-image models handling multi-subject prompts suffer from attribute leakage, identity entanglement, and subject omission, and there is no principled objective to optimize against. Method: Treats flow matching as a stochastic optimal control problem and derives two algorithms, a training-free test-time controller and the lightweight fine-tuning method Adjoint Matching, and introduces FOCUS for subject disentanglement. Result: Validated on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, the approach markedly improves multi-subject alignment while preserving the base model's style; test-time control is efficient, and fine-tuned controllers generalize well. Conclusion: The work provides an optimizable theoretical framework for multi-subject generation, unifies disentanglement control with existing attention heuristics, and offers the first fine-tuning route designed specifically for multi-subject fidelity. Abstract: Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions, often showing attribute leakage, identity entanglement, and subject omissions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity. Viewing flow matching (FM) through stochastic optimal control (SOC), we formulate subject disentanglement as control over a trained FM sampler. This yields two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. The same formulation unifies prior attention heuristics, extends to diffusion models via a flow-diffusion correspondence, and provides the first fine-tuning route explicitly designed for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and fine-tuned controllers trained on limited prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.
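The idea of a controller that perturbs the base velocity of a flow sampler can be pictured with a one-dimensional toy. The sketch below is only a generic Euler integrator with an additive control term; the constant base velocity, the proportional control law, and all names are hypothetical stand-ins, and nothing here reproduces the paper's controller or its optimal-control derivation.

```python
def euler_sample(x0, velocity, control=None, steps=10):
    """Integrate dx/dt = velocity(x, t) + control(x, t) from t=0 to t=1 with Euler steps."""
    dt = 1.0 / steps
    x = x0
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        if control is not None:
            v += control(x, t)  # additive perturbation of the base velocity
        x = x + dt * v
    return x

def base_velocity(x, t):
    """Stand-in for a trained velocity field: constant drift (assumption)."""
    return 1.0

def toward_target(x, t, target=2.0, strength=1.0):
    """Hypothetical control that nudges the state toward a desired target."""
    return strength * (target - x)

uncontrolled = euler_sample(0.0, base_velocity)
controlled = euler_sample(0.0, base_velocity, control=toward_target)
```

In this toy the controlled trajectory ends nearer the target than the uncontrolled one while the base velocity field itself is untouched, which mirrors the training-free character of test-time control.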

Table of Contents

cs.CL [Back]

[1] Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset

[2] Towards Open-Ended Discovery for Low-Resource NLP

[3] Discourse vs emissions: Analysis of corporate narratives, symbolic practices, and mimicry through LLMs

[4] Context Matters: Comparison of commercial large language tools in veterinary medicine

[5] ClaimCheck: Real-Time Fact-Checking with Small Language Models

[6] EEFSUVA: A New Mathematical Olympiad Benchmark

[7] Who is In Charge? Dissecting Role Conflicts in Instruction Following

[8] Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision

[9] Geometric Structures and Patterns of Meaning: A PHATE Manifold Analysis of Chinese Character Embeddings

[10] Trustworthy Summarization via Uncertainty Quantification and Risk Awareness in Large Language Models

[11] Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks

[12] Computational Social Linguistics for Telugu Cultural Preservation: Novel Algorithms for Chandassu Metrical Pattern Recognition

[13] LLMRank: Understanding LLM Strengths for Model Routing

[14] GRPO++: Enhancing Dermatological Reasoning under Low Resource Settings

[15] Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation

[16] Silent Tokens, Loud Effects: Padding in LLMs

[17] CIFLEX: Contextual Instruction Flow for Sub-task Execution in Multi-Turn Interactions with a Single On-Device LLM

[18] SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation

[19] Redundancy-as-Masking: Formalizing the Artificial Age Score (AAS) to Model Memory Aging in Generative AI

[20] Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing

[21] Feasibility of Structuring Stress Documentation Using an Ontology-Guided Large Language Model

[22] SeMob: Semantic Synthesis for Dynamic Urban Mobility Prediction

[23] A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering

[24] Let's Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models' Understanding of Sports

[25] SSTAG: Structure-Aware Self-Supervised Learning Method for Text-Attributed Graphs

[26] LOCA: Logical Chain Augmentation for Scientific Corpus Cleaning

[27] GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages

[28] Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data

[29] GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models

[30] Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs

[31] Longitudinal Monitoring of LLM Content Moderation of Social Issues

[32] RJE: A Retrieval-Judgment-Exploration Framework for Efficient Knowledge Graph Question Answering with LLMs

[33] Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse

[34] In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b

[35] OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language

[36] AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees

[37] Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection

[38] TraceDet: Hallucination Detection from the Decoding Trace of Diffusion Large Language Models

[39] LLM Based Sentiment Classification From Bangladesh E-Commerce Reviews

[40] TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture

[41] Evaluation Sheet for Deep Research: A Use Case for Academic Survey Writing

[42] HiSpec: Hierarchical Speculative Decoding for LLMs

[43] TAG-EQA: Text-And-Graph for Event Question Answering via Structured Prompting Strategies

[44] A-VERT: Agnostic Verification with Embedding Ranking Targets

[45] One More Question is Enough, Expert Question Decomposition (EQD) Model for Domain Quantitative Reasoning

[46] ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning

[47] CLUE: Non-parametric Verification from Experience via Hidden-State Clustering

[48] A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation

[49] RAG-BioQA Retrieval-Augmented Generation for Long-Form Biomedical Question Answering

[50] Efficient Training of Robust Traditional Chinese LLaMA-1B on a Single Consumer GPU: Continual Pre-training, SFT, and DPO

[51] AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System

[52] NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT

[53] Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention

[54] SoK: Measuring What Matters for Closed-Loop Security Agents

[55] MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization

[56] FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol

[57] How Do Language Models Compose Functions?

[58] Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation

[59] What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?

[60] Machine-interpretable Engineering Design Standards for Valve Specification

[61] Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

[62] Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction

[63] Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network

[64] Syntactic Blind Spots: How Misalignment Leads to LLMs Mathematical Errors

[65] SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning

[66] Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models

[67] REPAIR: Robust Editing via Progressive Adaptive Intervention and Reintegration

[68] Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey

[69] Inverse Language Modeling towards Robust and Grounded LLMs

[70] Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning

[71] Taking a SEAT: Predicting Value Interpretations from Sentiment, Emotion, Argument, and Topic Annotations

[72] Exploring Database Normalization Effects on SQL Generation

[73] LLM-Based Multi-Task Bangla Hate Speech Detection: Type, Severity, and Target

[74] Style Over Story: A Process-Oriented Study of Authorial Creativity in Large Language Models

[75] Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage

[76] From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens

[77] Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems

[78] The Disparate Impacts of Speculative Decoding