cs.CL

[1] Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset

Leroy Z. Wang

Main category: cs.CL

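TL;DR: Introduces a dataset for in-context concept learning experiments that reveals a preference for upward monotonicity in quantifiers among large language models, showing that the approach can effectively surface implicit biases in models.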

Details

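Motivation: To uncover implicit biases of large language models in concept learning tasks, in particular their preferences regarding quantifiers.

Method: Build a dataset of concept learning tasks and use in-context concept learning experiments to test models' monotonicity preferences for quantifiers.

Result: Language models show a preference for upward monotonicity, and this bias is less apparent under direct prompting.

Conclusion: In-context concept learning is an effective way to discover implicit biases in language models.

Abstract: We introduce a dataset of concept learning tasks that helps uncover implicit biases in large language models. Using in-context concept learning experiments, we found that language models may have a bias toward upward monotonicity in quantifiers; such bias is less apparent when the model is tested by direct prompting without concept learning components. This demonstrates that in-context concept learning can be an effective way to discover hidden biases in language models.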
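
As an illustration of the property under test, the sketch below checks upward monotonicity by brute force over a small universe: a quantifier Q is upward monotone in its second argument if Q(A, B) and B ⊆ B' together imply Q(A, B'). The set-based quantifier definitions and the function names are illustrative assumptions, not taken from the paper's dataset.

```python
from itertools import combinations


def subsets(universe):
    """Return every subset of `universe` as a frozenset."""
    items = list(universe)
    return [
        frozenset(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]


def is_upward_monotone(quantifier, universe):
    """True if Q(A, B) and B <= B' always imply Q(A, B') on this universe."""
    all_sets = subsets(universe)
    for a in all_sets:
        for b in all_sets:
            if not quantifier(a, b):
                continue
            for b_bigger in all_sets:
                if b <= b_bigger and not quantifier(a, b_bigger):
                    return False
    return True


# Toy set-theoretic meanings for three quantifiers (illustrative assumptions).
def some(a, b):
    return len(a & b) > 0


def no(a, b):
    return len(a & b) == 0


def at_least_two(a, b):
    return len(a & b) >= 2


if __name__ == "__main__":
    universe = {1, 2, 3}
    for name, q in [("some", some), ("no", no), ("at least two", at_least_two)]:
        print(name, is_upward_monotone(q, universe))
```

Under these definitions "some" and "at least two" pass the check while "no" fails it, which is the kind of contrast a monotonicity-bias experiment would rely on.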

[2] Towards Open-Ended Discovery for Low-Resource NLP

Bonaventure F. P. Dossou, Henri Aïdasso

Main category: cs.CL

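TL;DR: This paper proposes a new paradigm of interactive language discovery for low-resource languages, arguing that languages should be learned through dynamic human-machine dialogue rather than static datasets, and that language technology should move from extractive data collection toward participatory, co-adaptive learning.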

Details

Motivation: Low-resource languages are held back by the lack of textual corpora, standardized orthographies, and scalable annotation pipelines, and current large models depend on massive centralized data, so they rarely benefit marginalized language communities.

Method: Propose a framework grounded in joint human-machine uncertainty that combines the model's epistemic uncertainty with hesitation cues and confidence signals from human speakers to guide interaction, query selection, and memory retention.

Result: The framework supports learning new languages dynamically through dialogue, encourages uncertainty-driven language discovery, and improves accessibility and inclusiveness for low-resource languages.

Conclusion: Future language technology should shift toward a human-centered, interactive, and cooperative mode of human-machine co-learning that respects and empowers language communities, achieving genuine language discovery while preserving the world's linguistic diversity.

Abstract: Natural Language Processing (NLP) for low-resource languages remains fundamentally constrained by the lack of textual corpora, standardized orthographies, and scalable annotation pipelines. While recent advances in large language models have improved cross-lingual transfer, they remain inaccessible to underrepresented communities due to their reliance on massive, pre-collected data and centralized infrastructure. In this position paper, we argue for a paradigm shift toward open-ended, interactive language discovery, where AI systems learn new languages dynamically through dialogue rather than static datasets. We contend that the future of language technology, particularly for low-resource and under-documented languages, must move beyond static data collection pipelines toward interactive, uncertainty-driven discovery, where learning emerges dynamically from human-machine collaboration instead of being limited to pre-existing datasets. We propose a framework grounded in joint human-machine uncertainty, combining epistemic uncertainty from the model with hesitation cues and confidence signals from human speakers to guide interaction, query selection, and memory retention. This paper is a call to action: we advocate a rethinking of how AI engages with human knowledge in under-documented languages, moving from extractive data collection toward participatory, co-adaptive learning processes that respect and empower communities while discovering and preserving the world's linguistic diversity. This vision aligns with principles of human-centered AI, emphasizing interactive, cooperative model building between AI systems and speakers.

[3] Discourse vs emissions: Analysis of corporate narratives, symbolic practices, and mimicry through LLMs

Bertrand Kian Hassani, Yacoub Bahini, Rizwan Mushtaq

Main category: cs.CL

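TL;DR: Using fine-tuned large language models (LLMs), the study builds a multidimensional framework to assess the climate disclosure maturity of 828 U.S.-listed firms. It finds that current disclosures decouple commitments from concrete action and that mimetic reporting is widespread, and it stresses the need for stronger regulation to improve comparability and decision usefulness.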

Details

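Motivation: Climate change has raised demand for corporate climate disclosure, but widespread symbolic reporting and imitation undermine its practical value, so effective methods for assessing disclosure quality are needed.

Method: Using LLMs fine-tuned for climate communication, build four classifiers (sentiment, commitment, specificity, and target ambition) that extract narrative indicators from sustainability and annual reports, then relate those indicators to firm attributes such as emissions, market capitalization, and sector.

Result: (1) Risk-oriented narratives often align with explicit commitments, but quantitative targets (e.g., net-zero pledges) are decoupled from tone; (2) larger, higher-emitting firms disclose more commitments and actions, though inconsistently with quantitative targets; (3) disclosure styles are highly similar, indicating widespread mimetic behavior that reduces differentiation and decision usefulness.

Conclusion: LLMs are useful for ESG narrative analysis, but stronger regulation is needed to tie climate commitments to verifiable transition strategies and so improve the practical effectiveness of disclosure.

Abstract: Climate change has increased demands for transparent and comparable corporate climate disclosures, yet imitation and symbolic reporting often undermine their value. This paper develops a multidimensional framework to assess disclosure maturity among 828 U.S.-listed firms using large language models (LLMs) fine-tuned for climate communication. Four classifiers (sentiment, commitment, specificity, and target ambition) extract narrative indicators from sustainability and annual reports, which are linked to firm attributes such as emissions, market capitalization, and sector. Analyses reveal three insights: (1) risk-focused narratives often align with explicit commitments, but quantitative targets (e.g., net-zero pledges) remain decoupled from tone; (2) larger and higher-emitting firms disclose more commitments and actions than peers, though inconsistently with quantitative targets; and (3) widespread similarity in disclosure styles suggests mimetic behavior, reducing differentiation and decision usefulness. These results highlight the value of LLMs for ESG narrative analysis and the need for stronger regulation to connect commitments with verifiable transition strategies.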

[4] Context Matters: Comparison of commercial large language tools in veterinary medicine

Tyler J Poore, Christopher J Pinard, Aleena Shabbir, Andrew Lagree, Andre Telfer, Kuan-Chuen Wu

Main category: cs.CL

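TL;DR: The study evaluates three commercial veterinary-focused LLM summarization tools on veterinary oncology records, finds that Product 1 scores markedly higher than the other two products on accuracy, completeness, and the other rubric domains, and shows that the LLM-as-a-judge evaluation framework is reproducible and scalable.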

Details

Motivation: Although large language models (LLMs) are increasingly used in clinical settings, their performance in veterinary medicine has not been systematically evaluated, and the performance of veterinary-specific LLM tools in particular remains unclear.

Method: Using a rubric-guided LLM-as-a-judge framework, the study evaluates summaries produced by three commercial veterinary LLM summarization tools (Product 1 [Hachiko], Product 2, and Product 3) on a standardized dataset of veterinary oncology records. Scoring covers factual accuracy, completeness, chronological order, clinical relevance, and organization, and the consistency of the grading framework is assessed over three independent runs.

Result: Product 1 performed best overall, with a median average score of 4.61 (IQR: 0.73), well above Product 2 (2.55) and Product 3 (2.45), and it received perfect median scores for factual accuracy and chronological order. The LLM grader was highly reproducible, with standard deviations of the average score of 0.015, 0.088, and 0.034 for the three products.

Conclusion: The results underline the importance of veterinary-specific LLM tools and show that LLM-as-a-judge is a scalable and reproducible way to evaluate clinical NLP summarization in veterinary medicine.

Abstract: Large language models (LLMs) are increasingly used in clinical settings, yet their performance in veterinary medicine remains underexplored. We evaluated three commercially available veterinary-focused LLM summarization tools (Product 1 [Hachiko] and Products 2 and 3) on a standardized dataset of veterinary oncology records. Using a rubric-guided LLM-as-a-judge framework, summaries were scored across five domains: Factual Accuracy, Completeness, Chronological Order, Clinical Relevance, and Organization. Product 1 achieved the highest overall performance, with a median average score of 4.61 (IQR: 0.73), compared to 2.55 (IQR: 0.78) for Product 2 and 2.45 (IQR: 0.92) for Product 3. It also received perfect median scores in Factual Accuracy and Chronological Order. To assess the internal consistency of the grading framework itself, we repeated the evaluation across three independent runs. The LLM grader demonstrated high reproducibility, with Average Score standard deviations of 0.015 (Product 1), 0.088 (Product 2), and 0.034 (Product 3). These findings highlight the importance of veterinary-specific commercial LLM tools and demonstrate that LLM-as-a-judge evaluation is a scalable and reproducible method for assessing clinical NLP summarization in veterinary medicine.
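
The summary statistics reported above (median with IQR per product, and a standard deviation across repeated grading runs) can be computed as in the following minimal sketch. The input numbers are made up for illustration, and the choice of the inclusive quartile method and the sample standard deviation are assumptions; the abstract does not say which conventions the authors used.

```python
import statistics


def median_and_iqr(scores):
    """Return (median, interquartile range) of per-record average scores."""
    q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return q2, q3 - q1


def run_to_run_std(run_averages):
    """Sample standard deviation of one product's average score across runs."""
    return statistics.stdev(run_averages)


if __name__ == "__main__":
    # Hypothetical per-record average rubric scores (1 to 5 scale).
    record_scores = [1, 2, 3, 4, 5]
    print(median_and_iqr(record_scores))

    # Hypothetical average scores from three independent grading runs.
    print(run_to_run_std([4.60, 4.62, 4.61]))
```

A small run-to-run standard deviation relative to the gap between products is what supports the reproducibility claim.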

[5] ClaimCheck: Real-Time Fact-Checking with Small Language Models

Akshith Reddy Putta, Jacob Devasier, Chengkai Li

Main category: cs.CL

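TL;DR: ClaimCheck is a transparent, stepwise automatic fact-checking system built on small language models. By mirroring the human fact-checking workflow, it reaches state-of-the-art accuracy of 76.4% while lowering computational requirements.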

Details

Motivation: Existing fact-checking systems rely on large models and static knowledge stores, which makes them costly, opaque, and hard to deploy widely.

Method: Design a modular pipeline consisting of search query planning, Web-based evidence retrieval and summarization, evidence synthesis and re-retrieval, and verdict evaluation, with each module optimized for small language models.

Result: The system reaches 76.4% accuracy on the AVeriTeC dataset, surpassing previous approaches that use LLaMA3.1 70B and GPT-4o, with significantly lower computational requirements.

Conclusion: Careful modular architecture and prompting strategies can compensate for the limitations of small language models, yielding a fact-checking system that is efficient, interpretable, and easy to deploy.

Abstract: We introduce ClaimCheck, an LLM-guided automatic fact-checking system designed to verify real-world claims using live Web evidence and small language models. Unlike prior systems that rely on large, closed-source models and static knowledge stores, ClaimCheck employs a transparent, stepwise verification pipeline that mirrors human fact-checking workflows consisting of Web search query planning, Web-based evidence retrieval and summarization, evidence synthesis and re-retrieval, and claim verdict evaluation. Each module is optimized for small LLMs, allowing the system to deliver accurate and interpretable fact-checking with significantly lower computational requirements. Despite using a much smaller Qwen3-4B model, ClaimCheck achieves state-of-the-art accuracy of 76.4% on the AVeriTeC dataset, outperforming previous approaches using LLaMA3.1 70B and GPT-4o. Extensive ablations demonstrate that careful modular design and prompting strategies can overcome the limitations of smaller LLMs. To promote accessibility and transparency, we provide a public demo at https://idir.uta.edu/claimcheck.

[6] EEFSUVA: A New Mathematical Olympiad Benchmark

Nicole N Khatibi, Daniil A. Radamovich, Michael P. Brenner

Main category: cs.CL

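TL;DR: This paper questions the high level of ability that large language models appear to show on mathematics benchmarks, argues that existing benchmarks may suffer from data contamination and a narrow range of problem types, and introduces a new benchmark, EEFSUVA, for a more holistic assessment of mathematical reasoning.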

Details

Motivation: Existing mathematics benchmarks draw mainly on well-known contests such as the International Mathematical Olympiad, so they may be affected by data leakage and overfitting and may not reflect models' true mathematical reasoning ability.

Method: Build a new benchmark, EEFSUVA, from less widely circulated regional and national Olympiad problems from Eastern Europe and the countries of the former Soviet Union. The problems are comparable in difficulty to the IMO but demand more nonstandard problem-solving techniques.

Result: Preliminary results show a notable performance drop for state-of-the-art LLMs on EEFSUVA, suggesting that current models are limited when facing novel, unconventional problems.

Conclusion: Broader and more diverse evaluation datasets are needed to measure mathematical reasoning accurately and to guide future LLM development.

Abstract: Recent breakthroughs have spurred claims that large language models (LLMs) match gold-medal Olympiad to graduate-level proficiency on mathematics benchmarks. In this work, we examine these claims in detail and assess the extent to which current benchmarks capture genuine LLM mathematical reasoning. The composition of these benchmarks, primarily drawing from the International Mathematical Olympiad (IMO) and related competitions, may overstate models' reasoning ability due to potential data contamination and a narrow focus on familiar problem types. To enable a more holistic assessment of mathematical understanding, we introduce EEFSUVA, a novel benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and the countries of the former Soviet Union. These contests feature problems of comparable difficulty to the IMO and are renowned for demanding nonstandard problem-solving techniques, yet their problems are far less prevalent in online corpora. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These findings also suggest the potential importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.

[7] Who is In Charge? Dissecting Role Conflicts in Instruction Following

Siqi Zeng

Main category: cs.CL

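TL;DR: Large language models are supposed to follow hierarchical instructions in which system prompts take priority over user inputs, yet research shows they often ignore this rule and instead obey social cues such as authority or consensus. Through mechanistic analysis on a large-scale dataset, this paper finds that system-user conflicts and social conflicts form separate representation subspaces early in the model. The model detects system-user conflicts more strongly but resolves conflicts consistently only for social cues. Steering experiments show that the social-cue vectors amplify instruction following in a role-agnostic way, which exposes how fragile system obedience is and motivates lightweight, hierarchy-sensitive alignment methods.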

Details

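Motivation: To study behavioral deviations of large language models when handling hierarchical instructions, in particular why models tend to ignore system prompts while responding to social cues, and thereby to understand the underlying mechanisms.

Method: Apply linear probing, Direct Logit Attribution, and steering experiments on a large-scale dataset to analyze how the model internally represents and processes system-user conflicts and social conflicts.

Result: System-user conflicts and social conflicts are encoded early in the model as distinct subspaces; system-user conflicts are detected more strongly, but resolution is consistent only for social cues; the social-cue vectors amplify instruction following in a role-agnostic way.

Conclusion: System obedience in large language models is fragile and driven mainly by social cues; lightweight, hierarchy-sensitive alignment methods are needed to correct this.

Abstract: Large language models should follow hierarchical instructions where system prompts override user inputs, yet recent work shows they often ignore this rule while strongly obeying social cues such as authority or consensus. We extend these behavioral findings with mechanistic interpretations on a large-scale dataset. Linear probing shows conflict-decision signals are encoded early, with system-user and social conflicts forming distinct subspaces. Direct Logit Attribution reveals stronger internal conflict detection in system-user cases but consistent resolution only for social cues. Steering experiments show that, despite using social cues, the vectors surprisingly amplify instruction following in a role-agnostic way. Together, these results explain fragile system obedience and underscore the need for lightweight hierarchy-sensitive alignment methods.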
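
For readers unfamiliar with steering experiments, the sketch below shows the generic operation they rely on: adding a scaled direction vector to a hidden state and measuring how far the state has moved along that direction. This is a common formulation of activation steering written with plain Python lists; the function names and the unit-normalization step are assumptions, and the paper's actual procedure may differ.

```python
def _norm(vector):
    """Euclidean length of a vector given as a list of floats."""
    return sum(x * x for x in vector) ** 0.5


def steer(hidden, direction, alpha):
    """Shift `hidden` by `alpha` units along the normalized `direction`."""
    length = _norm(direction)
    unit = [d / length for d in direction]
    return [h + alpha * u for h, u in zip(hidden, unit)]


def projection(hidden, direction):
    """Scalar projection of `hidden` onto `direction`."""
    return sum(h * d for h, d in zip(hidden, direction)) / _norm(direction)


if __name__ == "__main__":
    hidden_state = [1.0, 0.0]       # hypothetical activation
    social_direction = [0.0, 2.0]   # hypothetical steering vector
    steered = steer(hidden_state, social_direction, alpha=3.0)
    print(steered, projection(steered, social_direction))
```

In a real experiment the direction would be estimated from model activations (for example as a difference of class means) and the shift applied inside a forward pass.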

[8] Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision

Dimitar Peshevski, Kiril Blazhevski, Martin Popovski, Gjorgji Madjarov

Main category: cs.CL

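TL;DR: Proposes a pipeline that uses large language models to generate synthetic queries and labels for training small transformer models on document reranking, preserving good reranking performance while lowering computational cost.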

Details

Motivation: Large language models excel at reranking but are computationally expensive, while smaller models depend on scarce human-labeled data, which limits their use.

Method: Use LLMs to generate synthetic queries from a domain corpus and an LLM-based classifier to label positive and hard-negative pairs, then fine-tune a small transformer model on this synthetic data with contrastive learning using the Localized Contrastive Estimation (LCE) loss.

Result: Experiments on the MedQuAD dataset show that the method significantly improves in-domain performance and generalizes well to out-of-domain tasks.

Conclusion: Using LLMs for data generation and supervision rather than inference substantially reduces computational cost while maintaining strong reranking capability.

Abstract: Effective document reranking is essential for improving search relevance across diverse applications. While Large Language Models (LLMs) excel at reranking due to their deep semantic understanding and reasoning, their high computational cost makes them impractical for many real-world deployments. Fine-tuning smaller, task-specific models is a more efficient alternative but typically depends on scarce, manually labeled data. To overcome this, we propose a novel pipeline that eliminates the need for human-labeled query-document pairs. Our method uses LLMs to generate synthetic queries from domain-specific corpora and employs an LLM-based classifier to label positive and hard-negative pairs. This synthetic dataset is then used to fine-tune a smaller transformer model with contrastive learning using Localized Contrastive Estimation (LCE) loss. Experiments on the MedQuAD dataset show that our approach significantly boosts in-domain performance and generalizes well to out-of-domain tasks. By using LLMs for data generation and supervision rather than inference, we reduce computational costs while maintaining strong reranking capabilities.
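
The LCE loss mentioned above is usually written as a softmax cross-entropy over one positive document and a group of hard negatives for the same query. The sketch below computes that per-query loss from reranker scores; the function name and the example scores are illustrative assumptions, and the paper's training setup (batching, group size, model) is not reproduced here.

```python
import math


def lce_loss(positive_score, negative_scores):
    """Cross-entropy of the positive against its hard negatives for one query.

    Equivalent to -log(exp(s_pos) / sum(exp(s_i))) over the whole group,
    computed with the log-sum-exp trick for numerical stability.
    """
    scores = [positive_score] + list(negative_scores)
    top = max(scores)
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum_exp - positive_score


if __name__ == "__main__":
    # Hypothetical reranker scores for one synthetic query.
    print(lce_loss(2.5, [0.3, -1.0, 1.8]))
```

The loss shrinks as the positive's score pulls ahead of the hard negatives, which is what pushes the small reranker to separate them.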

[9] Geometric Structures and Patterns of Meaning: A PHATE Manifold Analysis of Chinese Character Embeddings

Wen G. Gong

Main category: cs.CL

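TL;DR: The study systematically examines geometric patterns in Chinese character embeddings using PHATE manifold analysis, finding clustering patterns for content words, branching patterns for function words, and a correlation between geometric complexity and semantic content.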

Details

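Motivation: To explore whether the geometric structure of Chinese characters in embedding space reflects their semantic and linguistic properties.

Method: Combine seven embedding models and eight dimensionality reduction methods, use PHATE manifold analysis to cross-validate the geometric patterns of more than 1000 Chinese characters across 12 semantic domains, and carry out a child-network analysis.

Result: Content words form clusters in embedding space while function words show branching structures; meaningful characters exhibit rich geometric diversity while structural radicals collapse into tight clusters; phrase-level structure shows systematic semantic expansion from elemental characters.

Conclusion: The findings provide computational support for traditional linguistic theory and establish a new framework for the geometric analysis of semantic organization.

Abstract: We systematically investigate geometric patterns in Chinese character embeddings using PHATE manifold analysis. Through cross-validation across seven embedding models and eight dimensionality reduction methods, we observe clustering patterns for content words and branching patterns for function words. Analysis of over 1000 Chinese characters across 12 semantic domains reveals that geometric complexity correlates with semantic content: meaningful characters exhibit rich geometric diversity while structural radicals collapse into tight clusters. The comprehensive child-network analysis (123 phrases) demonstrates systematic semantic expansion from elemental characters. These findings provide computational evidence supporting traditional linguistic theory and establish a novel framework for geometric analysis of semantic organization.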

[10] Trustworthy Summarization via Uncertainty Quantification and Risk Awareness in Large Language Models

Shuaidong Pan, Di Wu

Main category: cs.CL

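TL;DR: Proposes a large language model framework that combines uncertainty quantification with risk-aware mechanisms to make automatic summarization more reliable in high-risk scenarios.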

Details

Motivation: To meet the need for reliable automatic summarization under information overload and high-risk decision-making, and to avoid the overconfident predictions of conventional models.

Method: Build a conditional generation-based summarization model, introduce Bayesian inference to model uncertainty in the parameter space, measure the uncertainty of generated content with predictive distribution entropy, jointly optimize entropy regularization and a risk-aware loss, and integrate risk scoring and regulation modules.

Result: Experiments show that the method significantly improves the robustness and reliability of summaries in high-risk applications while maintaining fluency and semantic integrity.

Conclusion: The study offers a systematic solution for trustworthy summarization that is methodologically scalable and of practical value.

Abstract: This study addresses the reliability of automatic summarization in high-risk scenarios and proposes a large language model framework that integrates uncertainty quantification and risk-aware mechanisms. Starting from the demands of information overload and high-risk decision-making, a conditional generation-based summarization model is constructed, and Bayesian inference is introduced during generation to model uncertainty in the parameter space, which helps avoid overconfident predictions. The uncertainty level of the generated content is measured using predictive distribution entropy, and a joint optimization of entropy regularization and risk-aware loss is applied to ensure that key information is preserved and risk attributes are explicitly expressed during information compression. On this basis, the model incorporates risk scoring and regulation modules, allowing summaries to cover the core content accurately while enhancing trustworthiness through explicit risk-level prompts. Comparative experiments and sensitivity analyses verify that the proposed method significantly improves the robustness and reliability of summarization in high-risk applications while maintaining fluency and semantic integrity. This research provides a systematic solution for trustworthy summarization and demonstrates both scalability and practical value at the methodological level.
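
Predictive distribution entropy, the uncertainty measure named above, is the Shannon entropy of the model's next-token distribution. The sketch below computes it for a single decoding step; averaging over steps to score a whole summary is an assumption made for illustration, since the abstract does not specify how per-token values are aggregated.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_sequence_entropy(step_distributions):
    """Average per-step entropy over a generated sequence (assumed aggregation)."""
    entropies = [predictive_entropy(dist) for dist in step_distributions]
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    confident_step = [0.97, 0.01, 0.01, 0.01]   # hypothetical distribution
    uncertain_step = [0.25, 0.25, 0.25, 0.25]   # hypothetical distribution
    print(predictive_entropy(confident_step))
    print(predictive_entropy(uncertain_step))
    print(mean_sequence_entropy([confident_step, uncertain_step]))
```

Low entropy indicates a peaked, confident distribution; values near the logarithm of the vocabulary size indicate the model is close to guessing.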

[11] Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks

Dongjun Kim, Gyuho Shim, Yongchan Chun, Minhyuk Kim, Chanjun Park, Heuiseok Lim

Main category: cs.CL

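TL;DR: This paper introduces Benchmark Profiling, a framework that uses an Ability Impact Score (AIS) to decompose benchmark performance into ten cognitive abilities. It shows that existing benchmarks usually mix several abilities rather than testing a single skill, which explains why higher benchmark scores do not necessarily mean stronger real-world capability.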

Details

Motivation: Benchmark scores tend to overstate a model's true capability because there is no systematic way to verify which specific abilities a task requires, making it hard to tell which skills a model has actually acquired.

Method: Combine gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each of ten cognitive abilities contributes to a model's performance on a benchmark.

Result: Most benchmarks rely on several abilities rather than one; datasets with similar labels actually depend on different ability mixtures; code-generation tasks benefit from broad, multi-skill improvement but respond only modestly to narrow fine-tuning; irrelevant abilities can negatively affect performance.

Conclusion: Benchmark Profiling offers a transparent tool for auditing benchmarks and interpreting model behavior, helping to explain the gap between performance gains and actual competence.

Abstract: Large Language Models are commonly judged by their scores on standard benchmarks, yet such scores often overstate real capability since they mask the mix of skills a task actually demands. For example, ARC is assumed to test reasoning, while HellaSwag is designed to evaluate commonsense. However, we lack a systematic way to verify if these benchmarks actually measure these labels. We introduce Benchmark Profiling, a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities. The method combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model's success on a given benchmark. Profiling three instruction-tuned models across ten widely used benchmarks yields four key findings: (i) most benchmarks draw on several abilities rather than one, (ii) datasets with similar labels rely on distinct ability mixtures, (iii) code-generation benchmarks reward broad, multi-skill improvement and thus show only modest gains from narrow domain-specific fine-tuning, and (iv) abilities irrelevant to the task could negatively affect performance. Benchmark Profiling therefore explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark audit and model interpretability.

Boddu Sri Pavan, Boddu Swathi Sree

Main category: cs.CL

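TL;DR: This study presents a computational social science approach to preserving Chandassu, the Telugu metrical poetry tradition, and develops the first comprehensive digital framework for analyzing Telugu prosodic patterns.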

Details

Motivation: To protect Chandassu, a Telugu poetic tradition at risk of disappearing, and to preserve the collective cultural knowledge it carries.

Method: Combine community knowledge with modern computational techniques to build a dataset of 4,651 annotated padyams (verses), and design tools including AksharamTokenizer, LaghuvuGuruvu Generator, and PadyaBhedam Checker for prosody-aware tokenization, classification of light and heavy syllables, and pattern recognition.

Result: The proposed algorithm reaches 91.73% accuracy on the Chandassu Score, with evaluation metrics that reflect traditional literary standards.

Conclusion: The study shows that computational social science can help preserve endangered cultural knowledge systems and offers a transferable methodology for community-centered digital preservation of cultural heritage.

Abstract: This research presents a computational social science approach to preserving Telugu Chandassu, the metrical poetry tradition representing centuries of collective cultural intelligence. We develop the first comprehensive digital framework for analyzing Telugu prosodic patterns, bridging traditional community knowledge with modern computational methods. Our social computing approach involves collaborative dataset creation of 4,651 annotated padyams, expert-validated linguistic patterns, and culturally-informed algorithmic design. The framework includes AksharamTokenizer for prosody-aware tokenization, LaghuvuGuruvu Generator for classifying light and heavy syllables, and PadyaBhedam Checker for automated pattern recognition. Our algorithm achieves 91.73% accuracy on the proposed Chandassu Score, with evaluation metrics reflecting traditional literary standards. This work demonstrates how computational social science can preserve endangered cultural knowledge systems while enabling new forms of collective intelligence around literary heritage. The methodology offers insights for community-centered approaches to cultural preservation, supporting broader initiatives in digital humanities and socially-aware computing systems.

[13] LLMRank: Understanding LLM Strengths for Model Routing

Shubham Agrawal, Prasang Gupta

Main category: cs.CL

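TL;DR: LLMRank is a prompt-aware routing framework that selects among large language models efficiently using multifaceted prompt features and signals from a lightweight proxy solver, retaining high utility while improving interpretability.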

Details

Motivation: With large language models multiplying rapidly, choosing the most suitable model for each prompt so as to balance performance and efficiency has become a key deployment challenge.

Method: Propose the LLMRank framework, which extracts human-readable features from prompts (task type, reasoning patterns, complexity indicators, syntactic cues, and signals from a lightweight proxy solver) and trains a neural ranking model on RouterBench, a dataset of 36,497 prompts, to predict each model's utility and make routing decisions.

Result: Validated on multiple benchmarks and 11 state-of-the-art LLMs, LLMRank achieves up to 89.2% of oracle utility and provides interpretable feature attributions, outperforming one-shot routers that rely solely on latent embeddings.

Conclusion: Multifaceted feature extraction and a hybrid ranking objective markedly improve the effectiveness and transparency of model routing, indicating that feature-driven routing has strong potential for efficient and interpretable LLM deployment.

Abstract: The rapid growth of large language models (LLMs) with diverse capabilities, latency and computational costs presents a critical deployment challenge: selecting the most suitable model for each prompt to optimize the trade-off between performance and efficiency. We introduce LLMRank, a prompt-aware routing framework that leverages rich, human-readable features extracted from prompts, including task type, reasoning patterns, complexity indicators, syntactic cues, and signals from a lightweight proxy solver. Unlike prior one-shot routers that rely solely on latent embeddings, LLMRank predicts per-model utility using a neural ranking model trained on RouterBench, comprising 36,497 prompts spanning 11 benchmarks and 11 state-of-the-art LLMs, from small efficient models to large frontier systems. Our approach achieves up to 89.2% of oracle utility, while providing interpretable feature attributions that explain routing decisions. Extensive studies demonstrate the importance of multifaceted feature extraction and the hybrid ranking objective, highlighting the potential of feature-driven routing for efficient and transparent LLM deployment.
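
The "fraction of oracle utility" figure can be read as the router's total utility divided by the utility of an oracle that always picks the best model for each prompt. The sketch below computes that ratio on toy data; the utility values, model names, and the simple sum-based aggregation are assumptions for illustration, since the abstract does not define the utility function.

```python
def fraction_of_oracle(per_prompt_utilities, routed_models):
    """Ratio of the router's total utility to the oracle's total utility.

    per_prompt_utilities: one dict per prompt mapping model name -> utility.
    routed_models: the model name the router chose for each prompt.
    """
    router_total = sum(
        utilities[model]
        for utilities, model in zip(per_prompt_utilities, routed_models)
    )
    oracle_total = sum(max(utilities.values()) for utilities in per_prompt_utilities)
    return router_total / oracle_total


if __name__ == "__main__":
    # Hypothetical utilities (e.g. quality minus a cost penalty) for two prompts.
    utilities = [
        {"small": 0.5, "large": 1.0},
        {"small": 1.0, "large": 1.0},
    ]
    print(fraction_of_oracle(utilities, ["small", "small"]))
```

A value of 1.0 would mean the router matches the oracle on every prompt.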

[14] GRPO++: Enhancing Dermatological Reasoning under Low Resource Settings

Ismam Nur Swapnil, Aranya Saha, Tanvir Ahmed Khan, Mohammad Ariful Haque

Main category: cs.CL

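TL;DR: Introduces DermIQ-VLM, built with a resource-efficient multi-stage training methodology, to improve the structured reasoning of vision-language models in dermatological diagnosis.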

Details

Motivation: Existing vision-language models for medical image analysis are limited by data scarcity and high computational cost, which makes complex structured reasoning hard to achieve.

Method: Propose GRPO++, a modified version of GRPO, and combine it with supervised fine-tuning and Knowledge Graph-based Direct Preference Optimization (DPO) for alignment, emulating a dermatologist's diagnostic process.

Result: A preliminary evaluation on a dermatological dataset shows notable gains over standard fine-tuning approaches.

Conclusion: The proposed training pipeline can improve the specialization and reliability of VLMs in resource-constrained environments and has potential for clinical use.

Abstract: Vision-Language Models (VLMs) show promise in medical image analysis, yet their capacity for structured reasoning in complex domains like dermatology is often limited by data scarcity and the high computational cost of advanced training techniques. To address these challenges, we introduce DermIQ-VLM, a VLM developed through a multi-stage, resource-efficient methodology designed to emulate a dermatologist's diagnostic process. Our primary contribution is a modified version of Grouped Relative Policy Optimization (GRPO), called GRPO++, which stabilizes the powerful but data-intensive GRPO framework. Our proposed training pipeline first employs GRPO++ for reasoning-oriented disease recognition, followed by supervised fine-tuning for conversational ability. To mitigate factual errors introduced during this step, we then align the model using Direct Preference Optimization (DPO), leveraging a Knowledge Graph-based system as a scalable proxy for expert preference. A preliminary evaluation on a curated dermatological dataset demonstrates that our proposed methodology yields notable performance gains over standard fine-tuning approaches. These findings validate the potential of our pipeline as a feasible pathway for developing specialized, reliable VLMs in resource-constrained environments.
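
As background for the GRPO step, the sketch below shows the group-relative advantage used by standard GRPO: several responses are sampled for one prompt, and each response's reward is normalized by the group's mean and standard deviation. This illustrates the baseline algorithm only; the stabilizing changes that define GRPO++ are not described in the abstract and are not modeled here. The `eps` term and the reward values are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(reward - mean) / (std + eps) for reward in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four sampled diagnoses of the same image.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

When all sampled responses receive the same reward the advantages are all zero and the group contributes no learning signal, one reason plain GRPO can be data-hungry.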

[15] Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation

Nandakishor M

Main category: cs.CL

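TL;DR: Proposes a confidence-aware routing system that reduces hallucination in large language models by proactively assessing model uncertainty before generation, improving detection performance and lowering computational cost compared with conventional post-hoc correction.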

Details

Motivation: Large language models are prone to factual errors (hallucinations). Existing methods mostly correct output after generation, which is computationally expensive and cannot prevent unreliable content from being generated, so a more efficient, pre-generation solution is needed.

Method: Combine three signals (semantic alignment, cross-layer convergence analysis, and learned confidence estimation) into a unified confidence score, and use it to route queries to one of four pathways: local generation, retrieval-augmented generation, a larger model, or human review.

Result: On knowledge-intensive QA benchmarks, hallucination detection AUC improves from 0.42 to 0.74 and the F1 score from 0.61 to 0.82, with a low false positive rate (0.09) and a 40% reduction in computational cost.

Conclusion: The shift from reactive correction to proactive assessment provides a computationally efficient and effective way to improve LLM reliability.

Abstract: Large Language Models suffer from hallucination, generating plausible yet factually incorrect content. Current mitigation strategies focus on post-generation correction, which is computationally expensive and fails to prevent unreliable content generation. We propose a confidence-aware routing system that proactively assesses model uncertainty before generation and redirects queries based on estimated reliability. Our approach combines three complementary signals: semantic alignment between internal representations and reference embeddings, internal convergence analysis across model layers, and learned confidence estimation. The unified confidence score determines routing to four pathways: local generation for high confidence, retrieval-augmented generation for medium confidence, larger models for low confidence, and human review for very low confidence. Evaluation on knowledge-intensive QA benchmarks demonstrates significant improvements in hallucination detection (0.74 vs. 0.42 baseline) while reducing computational costs by 40% compared to post-hoc methods. The F1 score improves from 0.61 to 0.82 with low false positive rates (0.09). This paradigm shift from reactive correction to proactive assessment offers a computationally efficient approach to LLM reliability enhancement.
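
The four-way routing rule can be sketched as a threshold ladder over a unified confidence score. In the code below the equal weighting of the three signals and the threshold values (0.8, 0.5, 0.2) are hypothetical placeholders; the abstract gives neither the combination rule nor the cut-offs.

```python
def unified_confidence(semantic, convergence, learned, weights=(1.0, 1.0, 1.0)):
    """Weighted average of three confidence signals, each assumed to be in [0, 1]."""
    signals = (semantic, convergence, learned)
    return sum(w * s for w, s in zip(weights, signals)) / sum(weights)


def route(confidence, thresholds=(0.8, 0.5, 0.2)):
    """Map a confidence score to one of the four pathways named in the abstract."""
    high, medium, low = thresholds  # hypothetical cut-offs
    if confidence >= high:
        return "local_generation"
    if confidence >= medium:
        return "retrieval_augmented_generation"
    if confidence >= low:
        return "larger_model"
    return "human_review"


if __name__ == "__main__":
    score = unified_confidence(semantic=0.9, convergence=0.6, learned=0.3)
    print(score, route(score))
```

Because the decision is made before any text is generated, the expensive pathways are paid for only on queries the score flags as risky.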

[16] Silent Tokens, Loud Effects: Padding in LLMs

Rom Himelstein, Amit LeVi, Yonatan Belinkov, Avi Mendelson

Main category: cs.CL

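TL;DR: The study shows that mishandled padding tokens in large language models can harm activations, generation quality, bias, and safety, indicating that padding is not harmless and must be handled carefully in deployment.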

Details

Motivation: Padding tokens are widely used in batched inference and should be fully masked, but implementation errors can let them influence computation, and the extent of that influence is not well understood.

Method: Systematically insert controlled amounts of padding tokens in three open-source model families (Llama, Gemma, and Qwen) and evaluate the effect along four axes: activations, generation quality, bias, and safety.

Result: Even small amounts of padding shift hidden representations, degrade generation quality in smaller models, alter bias in unpredictable ways, and weaken safety guardrails.

Conclusion: Padding handling is a robustness risk that must be taken seriously and managed properly in real deployments.

Abstract: Padding tokens are widely used in large language models (LLMs) to equalize sequence lengths during batched inference. While they should be fully masked, implementation errors can cause them to influence computation, and the extent of this influence is not well understood. We systematically study this effect across three open-source model families (Llama, Gemma, Qwen), inserting controlled amounts of padding and evaluating outcomes along four axes: activations, generation quality, bias, and safety. Even small amounts of padding shift hidden representations, degrade quality in smaller models, alter bias in unpredictable ways, and weaken safety guardrails. These findings demonstrate that padding is not a harmless detail but a robustness risk that must be carefully handled in deployment.
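
The mechanism at issue can be seen in a toy single-query attention step: if a padding position is not masked, it receives softmax weight and drags the output away from the value computed on the real tokens alone. The scalar values and scores below are made up for illustration and do not come from the paper's experiments.

```python
import math


def attention_output(scores, values, mask=None):
    """Softmax-weighted average of scalar values for one query position.

    mask: optional list of booleans; False marks positions (e.g. padding)
    whose score is set to -inf so they receive zero attention weight.
    """
    if mask is not None:
        scores = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))


if __name__ == "__main__":
    real_scores, real_values = [0.0, 0.0], [1.0, 2.0]
    # The same sequence with one padding position appended (hypothetical values).
    padded_scores, padded_values = [0.0, 0.0, 0.0], [1.0, 2.0, 5.0]

    print(attention_output(real_scores, real_values))
    print(attention_output(padded_scores, padded_values, mask=[True, True, False]))
    print(attention_output(padded_scores, padded_values))  # mask forgotten
```

With the mask applied the padded and unpadded results agree; without it the padding value leaks into the output, which is the kind of implementation error the paper studies.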

[17] CIFLEX: Contextual Instruction Flow for Sub-task Execution in Multi-Turn Interactions with a Single On-Device LLM

Juntae Lee, Jihwan Bang, Seunghan Yang, Simyung Chang

Main category: cs.CL

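TL;DR: CIFLEX is a new execution system for handling sub-tasks efficiently in multi-turn interactions with a single on-device large language model. It reduces computational overhead by reusing the main task's KV cache and injecting task-specific instructions into isolated side paths.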

Details

Motivation: As large language models grow more capable, a single model is expected to handle many kinds of sub-tasks in support of user requests, but the conventional approach of reprocessing the entire conversation context incurs significant computational overhead.

Method: CIFLEX executes sub-tasks by reusing the main task's key-value (KV) cache and injecting task-specific instructions into isolated side paths. After a sub-task finishes, the model rolls back to the main path using the cached context, avoiding redundant prefill computation. A hierarchical classification strategy is also developed to support sub-task selection.

Result: Experiments show that CIFLEX significantly reduces computational cost without degrading task performance, enabling scalable and efficient multi-task dialogue on-device.

Conclusion: By optimizing context management and the sub-task execution flow, CIFLEX lowers the computational overhead of multi-turn interaction and offers a practical way to run complex LLM applications on resource-constrained devices.

Abstract: We present CIFLEX (Contextual Instruction Flow for Sub-task Execution), which is a novel execution system for efficient sub-task handling in multi-turn interactions with a single on-device large language model (LLM). As LLMs become increasingly capable, a single model is expected to handle diverse sub-tasks that more effectively and comprehensively support answering user requests. A naive approach reprocesses the entire conversation context when switching between main and sub-tasks (e.g., query rewriting, summarization), incurring significant computational overhead. CIFLEX mitigates this overhead by reusing the key-value (KV) cache from the main task and injecting only task-specific instructions into isolated side paths. After sub-task execution, the model rolls back to the main path via cached context, thereby avoiding redundant prefill computation. To support sub-task selection, we also develop a hierarchical classification strategy tailored for small-scale models, decomposing multi-choice decisions into binary ones. Experiments show that CIFLEX significantly reduces computational costs without degrading task performance, enabling scalable and efficient multi-task dialogue on-device.
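
The cache-reuse idea can be pictured with a toy append-only cache that supports checkpoint and rollback: the sub-task instruction is appended after the main context, the sub-task runs, and the cache is truncated back to the checkpoint. The class and function names below are hypothetical, tokens stand in for real key/value tensors, and this is only a sketch of the bookkeeping, not CIFLEX's implementation.

```python
class ToyKVCache:
    """Append-only stand-in for a KV cache that counts prefilled tokens."""

    def __init__(self):
        self.entries = []          # one entry per cached token
        self.tokens_processed = 0  # total prefill work done so far

    def extend(self, tokens):
        """Prefill new tokens on top of whatever is already cached."""
        for token in tokens:
            self.entries.append(token)
            self.tokens_processed += 1

    def checkpoint(self):
        """Remember the current length of the main path."""
        return len(self.entries)

    def rollback(self, mark):
        """Drop everything appended after `mark` (the side path)."""
        del self.entries[mark:]


def run_subtask(cache, instruction_tokens):
    """Run a sub-task on a side path, then restore the main path."""
    mark = cache.checkpoint()
    cache.extend(instruction_tokens)  # only the instruction is prefilled
    # ... sub-task decoding would happen here ...
    cache.rollback(mark)


if __name__ == "__main__":
    cache = ToyKVCache()
    main_context = ["user:", "hi", "assistant:", "hello"]
    cache.extend(main_context)
    run_subtask(cache, ["rewrite", "query:"])
    print(cache.entries, cache.tokens_processed)
```

In this toy accounting the sub-task costs only the instruction tokens in prefill work, whereas a naive restart would reprocess the whole main context as well.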

[18] SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation

Hu Wei, Ze Xu, Boyu Yang, Linlin Miao, Weiqi Zhai, Yihan Li, Zixuan Li, Zhijun Wang, Boya Wang, Jianwei Yu, Jialing Yuan, Xiaoyue Zhang, Cheng He, Minglei Chen, Zifan Zhang, Qianhui Li, Wei Wang, Xiang Xu

Main category: cs.CL

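TL;DR: This paper presents two complementary mathematics benchmarks, SKYLENAGE-ReasoningMATH and SKYLENAGE-MATH, for evaluating large language models on mathematical reasoning. Experiments show that current models still fall short on hard problems, with performance declining as difficulty progresses from high school to doctoral level. The work provides a reference benchmark with rich metadata, calibrated difficulty, and broad coverage for future evaluation of mathematical reasoning.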

Details

Motivation: Current large language models are approaching the ceiling on public math datasets, so those datasets no longer separate models effectively. More challenging, structured benchmarks that span a wide range of difficulty and topics are needed to assess mathematical reasoning accurately.

Method: Design two complementary benchmarks: SKYLENAGE-ReasoningMATH (100 items with metadata on length, numeric density, and symbolic complexity) and SKYLENAGE-MATH (150 items spanning four educational stages and seven subjects). Evaluate fifteen contemporary LLMs under a single setup and analyze performance by subject and by difficulty level.

Result: On the contest-style SKYLENAGE-MATH, the strongest model scores 44% and the runner-up 37%; performance declines as the educational level rises, and top systems show a doctoral-to-high-school retention of about 79%. On the reasoning set SKYLENAGE-ReasoningMATH, the best model reaches 81% overall, and results on the hardest slice show a clear gap between leading and mid-tier models.

Conclusion: The SKYLENAGE benchmarks provide a hard, reasoning-centered, broad-coverage mathematics evaluation with fine-grained metadata that mitigates the ceiling effect in existing evaluations and can serve as a reference standard for future assessment of LLM mathematical ability.

Abstract: Large language models (LLMs) now perform strongly on many public math suites, yet frontier separation within mathematics increasingly suffers from ceiling effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a 100-item, structure-aware diagnostic set with per-item metadata on length, numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item contest-style suite spanning four stages from high school to doctoral under a seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a single setup and analyze subject × model and grade × model performance. On the contest suite, the strongest model reaches 44% while the runner-up reaches 37%; accuracy declines from high school to doctoral, and top systems exhibit a doctoral-to-high-school retention near 79%. On the reasoning set, the best model attains 81% overall, and hardest-slice results reveal clear robustness gaps between leaders and the mid-tier. In summary, we release SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH; together, SKYLENAGE provides a hard, reasoning-centered and broadly covering math benchmark with calibrated difficulty and rich metadata, serving as a reference benchmark for future evaluations of mathematical reasoning.

[19] Redundancy-as-Masking: Formalizing the Artificial Age Score (AAS) to Model Memory Aging in Generative AI

Seyma Yaman Kayadibi

Main category: cs.CL

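TL;DR: Proposes the Artificial Age Score (AAS), a metric based on structural asymmetries in memory performance, to quantify memory aging in large language models after session resets. Experiments show that AAS distinguishes semantic stability from episodic memory decline.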

Details

Motivation: To quantify the memory degradation that session resets cause in AI systems, especially the asymmetry between semantic and episodic memory, a theoretically rigorous metric that applies across tasks is needed.

Method: Introduce the Artificial Age Score (AAS), a log-scaled, entropy-informed measure of memory aging based on observable recall behavior. Conservative upper bounds are computed without explicitly estimating redundancy (R = 0), and ChatGPT-5 is tested in a 25-day bilingual study with stateless and persistent interaction phases.

Result: In persistent sessions the model retains both semantic and episodic memory and the AAS approaches its theoretical minimum. After a session reset only semantic memory is preserved, and the collapse of episodic memory causes a sharp rise in the AAS, indicating structural memory aging.

Conclusion: AAS is a theoretically sound, task-independent diagnostic for memory degradation that captures structural memory aging in AI systems and is suitable for evaluating their long-term memory behavior.

Abstract: Artificial intelligence is observed to age not through chronological time but through structural asymmetries in memory performance. In large language models, semantic cues such as the name of the day often remain stable across sessions, while episodic details like the sequential progression of experiment numbers tend to collapse when conversational context is reset. To capture this phenomenon, the Artificial Age Score (AAS) is introduced as a log-scaled, entropy-informed metric of memory aging derived from observable recall behavior. The score is formally proven to be well-defined, bounded, and monotonic under mild and model-agnostic assumptions, making it applicable across various tasks and domains. In its Redundancy-as-Masking formulation, the score interprets redundancy as overlapping information that reduces the penalized mass. However, in the present study, redundancy is not explicitly estimated; all reported values assume a redundancy-neutral setting (R = 0), yielding conservative upper bounds. The AAS framework was tested over a 25-day bilingual study involving ChatGPT-5, structured into stateless and persistent interaction phases. During persistent sessions, the model consistently recalled both semantic and episodic details, driving the AAS toward its theoretical minimum, indicative of structural youth. In contrast, when sessions were reset, the model preserved semantic consistency but failed to maintain episodic continuity, causing a sharp increase in the AAS and signaling structural memory aging. These findings support the utility of AAS as a theoretically grounded, task-independent diagnostic tool for evaluating memory degradation in artificial systems. The study builds on foundational concepts from von Neumann's work on automata, Shannon's theories of information and redundancy, and Turing's behavioral approach to intelligence.

[20] Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing

Yisong Xiao,Aishan Liu,Siyuan Liang,Zonghao Ying,Xianglong Liu,Dacheng Tao

Main category: cs.CL

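TL;DR: Proposes ARGRE, a new test-time detoxification framework that models toxicity transition paths in the latent representation space to enable stable, precise reward-guided editing, substantially improving both the effectiveness and the efficiency of LLM detoxification.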

Details

Motivation: Existing test-time detoxification methods explore the transition space between toxic and non-toxic outputs insufficiently, so their interventions are imprecise and fail to suppress toxic generation effectively. Method: ARGRE identifies non-toxic semantic directions and interpolates in the latent space to build fine-grained toxicity transition trajectories. These trajectories train an autoregressive reward model that guides a two-step adaptive editing process: directional steering based on reward gaps, followed by lightweight gradient-based refinement. Result: Across 8 widely used LLMs, ARGRE reports a 62.21% reduction in toxicity and a 47.58% reduction in inference time relative to existing methods while preserving the core capabilities of the original model. Conclusion: By explicitly modeling toxicity transition paths and adding autoregressive reward guidance, ARGRE achieves more precise and efficient test-time detoxification and offers an effective route to safer LLM deployment. Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content, necessitating detoxification strategies to ensure safe and responsible deployment. Test-time detoxification methods, which typically introduce static or dynamic interventions into LLM representations, offer a promising solution due to their flexibility and minimal invasiveness. However, current approaches often suffer from imprecise interventions, primarily due to their insufficient exploration of the transition space between toxic and non-toxic outputs. To address this challenge, we propose Autoregressive Reward Guided Representation Editing (ARGRE), a novel test-time detoxification framework that explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing. ARGRE identifies non-toxic semantic directions and interpolates between toxic and non-toxic representations to reveal fine-grained transition trajectories. These trajectories transform sparse toxicity annotations into dense training signals, enabling the construction of an autoregressive reward model that delivers stable and precise editing guidance. At inference, the reward model guides an adaptive two-step editing process to obtain detoxified representations: it first performs directional steering based on expected reward gaps to shift representations toward non-toxic regions, followed by lightweight gradient-based refinements. 
Extensive experiments across 8 widely used LLMs show that ARGRE significantly outperforms leading baselines in effectiveness (-62.21% toxicity) and efficiency (-47.58% inference time), while preserving the core capabilities of the original model with minimal degradation. Our code is available at the website.

[21] Feasibility of Structuring Stress Documentation Using an Ontology-Guided Large Language Model

Hyeoneui Kim,Jeongha Kim,Huijing Xu,Jinsun Jung,Sunghoon Kang,Sun Joo Jang

Main category: cs.CL

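TL;DR: Develops the Mental Stress Ontology (MeSO) and evaluates whether a large language model (LLM) can extract ontology-guided stress-related information from narrative text. MeSO is built from theoretical models and 11 validated assessment tools, and an LLM extracts six categories of stress information from Reddit posts with 78.2% accuracy. The results indicate that an ontology-guided LLM can structure stress information effectively, improving the consistency and usability of stress documentation in ambient AI systems.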

Details

Motivation: Stress has a major impact on health, yet in electronic health records it is usually captured as unstructured text and documented incompletely. Ambient AI reduces documentation burden but mostly produces unstructured narratives, which limits clinical use, so a method is needed to structure the stress information found in free text. Method: The Mental Stress Ontology (MeSO) was built by integrating the Transactional Model of Stress with 11 validated stress assessment tools, then refined with Ontology Pitfall Scanner! and expert review. Claude Sonnet 4 was then used to extract six categories of stress-related information (stressor, stress response, coping strategy, and others) from 35 Reddit posts, and its output was compared against human annotation. Result: The final ontology contains 181 concepts across eight top-level classes. Of 220 extractable stress-related items, the LLM correctly identified 172 (78.2%), misclassified 27 (12.3%), and missed 21 (9.5%). All correctly extracted items mapped accurately to MeSO, but 24 relevant concepts were not yet covered by the ontology. Conclusion: An ontology-guided LLM can extract structured stress-related information from unstructured text and has the potential to improve the consistency and clinical utility of stress documentation in ambient AI systems. Validation on clinical dialogue data and comparison across LLMs remain future work. Abstract: Stress, arising from the dynamic interaction between external stressors, individual appraisals, and physiological or psychological responses, significantly impacts health yet is often underreported and inconsistently documented, typically captured as unstructured free-text in electronic health records. Ambient AI technologies offer promise in reducing documentation burden, but predominantly generate unstructured narratives, limiting downstream clinical utility. This study aimed to develop an ontology for mental stress and evaluate the feasibility of using a Large Language Model (LLM) to extract ontology-guided stress-related information from narrative text. The Mental Stress Ontology (MeSO) was developed by integrating theoretical models like the Transactional Model of Stress with concepts from 11 validated stress assessment tools. MeSO's structure and content were refined using Ontology Pitfall Scanner! and expert validation. Using MeSO, six categories of stress-related information (stressor, stress response, coping strategy, duration, onset, and temporal profile) were extracted from 35 Reddit posts using Claude Sonnet 4. Human reviewers evaluated accuracy and ontology coverage. The final ontology included 181 concepts across eight top-level classes. Of 220 extractable stress-related items, the LLM correctly identified 172 (78.2%), misclassified 27 (12.3%), and missed 21 (9.5%). 
All correctly extracted items were accurately mapped to MeSO, although 24 relevant concepts were not yet represented in the ontology. This study demonstrates the feasibility of using an ontology-guided LLM for structured extraction of stress-related information, offering potential to enhance the consistency and utility of stress documentation in ambient AI systems. Future work should involve clinical dialogue data and comparison across LLMs.
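
The extraction figures in entry [21] can be checked with a few lines of arithmetic. The sketch below only recomputes the percentages from the three reported counts; the variable and function names are illustrative and do not come from the paper.

```python
# Tallies reported in the abstract for the 220 extractable stress-related items.
counts = {"correct": 172, "misclassified": 27, "missed": 21}
total = sum(counts.values())


def share(n: int, total: int) -> float:
    """Return n / total as a percentage rounded to one decimal place."""
    return round(100 * n / total, 1)


# Percentage of items falling into each outcome.
shares = {label: share(n, total) for label, n in counts.items()}
print(total, shares)
```

The three counts add up to the stated 220 items, and the rounded shares agree with the reported 78.2%, 12.3%, and 9.5%.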

[22] SeMob: Semantic Synthesis for Dynamic Urban Mobility Prediction

Runfei Chen,Shuyang Jiang,Wei Huang

Main category: cs.CL

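TL;DR: Proposes SeMob, an LLM-based semantic synthesis framework that uses a multi-agent system to extract spatiotemporally relevant information from online text and combines it with a progressive fusion architecture to improve dynamic mobility prediction.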

Details

Motivation: Existing spatiotemporal models struggle to use text that describes external events, so mobility prediction degrades when abrupt events occur. Method: An LLM-based multi-agent framework automatically extracts and reasons about spatiotemporally relevant events in complex online text, and a proposed progressive fusion architecture combines the resulting fine-grained context with spatiotemporal data. Result: On the constructed dataset, SeMob reduces MAE by up to 13.92% and RMSE by up to 11.12% relative to the spatiotemporal baseline, with the largest gains in regions close to an event in space and time. Conclusion: SeMob fuses textual semantics with spatiotemporal data and improves mobility prediction under abrupt events, yielding forecasts that are better aligned with their context. Abstract: Human mobility prediction is vital for urban services, but often fails to account for abrupt changes from external events. Existing spatiotemporal models struggle to leverage textual descriptions detailing these events. We propose SeMob, an LLM-powered semantic synthesis pipeline for dynamic mobility prediction. Specifically, SeMob employs a multi-agent framework where LLM-based agents automatically extract and reason about spatiotemporally related text from complex online texts. Fine-grained relevant contexts are then incorporated with spatiotemporal data through our proposed innovative progressive fusion architecture. The rich pre-trained event prior contributes enriched insights about event-driven prediction, and hence results in a more aligned forecasting model. Evaluated on a dataset constructed through our pipeline, SeMob achieves maximal reductions of 13.92% in MAE and 11.12% in RMSE compared to the spatiotemporal model. Notably, the framework exhibits pronounced superiority especially within spatiotemporal regions close to an event's location and time of occurrence.

[23] A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering

Jiaqing Xie

Main category: cs.CL

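TL;DR: Proposes steering with a single sparse autoencoder (SAE) latent (top-1) combined with a token-wise decaying steering strategy, which improves language model performance on mathematical reasoning and outperforms the mean activation difference method.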

Details

Motivation: Existing steering methods based on top-k SAE latents often capture non-semantic features such as punctuation, and constant steering tends to produce repetitive output, so they lack semantic precision and generation stability. Method: Steers with the single most relevant SAE latent (top-1) and applies a token-wise decaying steering strategy to avoid interference from redundant features and degenerate generation. Result: On mathematical reasoning tasks the method improves reasoning quality, with an effect similar to appending a guiding token; it outperforms the mean activation difference method on mathematical reasoning benchmarks and matches it on IF-Eval. Conclusion: Focusing on a semantically relevant top-1 SAE latent together with a decaying steering schedule steers language model reasoning more effectively and more stably. Abstract: Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic features such as punctuation rather than semantic attributes like instructions. To address this, we propose focusing on a single, most relevant SAE latent (top-1), eliminating redundant features. We further identify a limitation in constant SAE steering, which often produces degenerate outputs such as repetitive single words. To mitigate this, we introduce a token-wise decaying steering strategy, enabling more faithful comparisons with mean activation difference baselines. Empirically, we show that steering an SAE latent associated with reasoning reliably elicits step-by-step mathematical reasoning and enhances inference quality, functionally resembling the effect of appending a guiding token. Our results demonstrate that SAEs outperform mean activation difference methods on mathematical reasoning benchmarks and match their performance on IF-Eval.
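
The token-wise decaying steering idea in entry [23] can be pictured with a small NumPy sketch. Everything specific here is an assumption made for illustration: the abstract does not give the decay schedule, so an exponential decay is used, and `decayed_steering`, `alpha`, and `decay` are hypothetical names rather than the paper's.

```python
import numpy as np


def decayed_steering(hidden_states, direction, alpha=4.0, decay=0.8):
    """Add a steering direction whose strength shrinks at each token position.

    hidden_states: array of shape (num_tokens, d_model)
    direction:     array of shape (d_model,), e.g. the decoder vector of one SAE latent
    alpha, decay:  hypothetical initial strength and per-token decay factor
    """
    unit = direction / np.linalg.norm(direction)   # normalise the steering direction
    steps = np.arange(hidden_states.shape[0])      # token positions 0, 1, 2, ...
    strengths = alpha * decay ** steps             # alpha, alpha*decay, alpha*decay^2, ...
    steered = hidden_states + strengths[:, None] * unit
    return steered, strengths


# Toy usage: four token positions in a three-dimensional residual stream.
hidden = np.zeros((4, 3))
steered, strengths = decayed_steering(hidden, np.array([0.0, 3.0, 4.0]))
```

With `decay=1.0` the same function reduces to constant steering, the setting the abstract associates with degenerate, repetitive output.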

[24] Let's Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models' Understanding of Sports

Punit Kumar Singh,Nishant Kumar,Akash Ghosh,Kunal Pasad,Khushi Soni,Manisha Jaishwal,Sriparna Saha,Syukron Abu Ishaq Alfarozi,Asres Temam Abagissa,Kitsuchart Pasupa,Haiqin Yang,Jose G Moreno

Main category: cs.CL

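TL;DR: Introduces CultSportQA, a benchmark for evaluating language models' understanding of traditional sports across 60 countries and 6 continents. It contains 33,000 multiple-choice questions in text and image modalities, grouped into history-based, rule-based, and scenario-based types, and is used to evaluate language models of different sizes under several prompting methods.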

Details

Motivation: Language models are mostly evaluated on globally popular sports, overlooking regional and indigenous sporting traditions, so a benchmark is needed that measures understanding of sports across cultures. Method: Builds CultSportQA, a multilingual, multimodal dataset of 33,000 multiple-choice questions covering 60 countries, 6 continents, and four cultural categories, with questions typed as history-based, rule-based, or scenario-based. Large, small, and multimodal language models are evaluated with zero-shot, few-shot, and chain-of-thought prompting. Result: CultSportQA is presented as the first comprehensive benchmark of language models' understanding of traditional sports. Models differ in their cross-cultural sports knowledge and perform worse on non-mainstream sports. Conclusion: CultSportQA sets a new standard for assessing AI understanding of and reasoning about traditional sports in multicultural, multilingual settings, and highlights current models' limitations in cultural diversity. Abstract: Language Models (LMs) are primarily evaluated on globally popular sports, often overlooking regional and indigenous sporting traditions. To address this gap, we introduce CultSportQA, a benchmark designed to assess LMs' understanding of traditional sports across 60 countries and 6 continents, encompassing four distinct cultural categories. The dataset features 33,000 multiple-choice questions (MCQs) across text and image modalities, each of which is categorized into three key types: history-based, rule-based, and scenario-based. To evaluate model performance, we employ zero-shot, few-shot, and chain-of-thought (CoT) prompting across a diverse set of Large Language Models (LLMs), Small Language Models (SLMs), and Multimodal Large Language Models (MLMs). By providing a comprehensive multilingual and multicultural sports benchmark, CultSportQA establishes a new standard for assessing AI's ability to understand and reason about traditional sports.

[25] SSTAG: Structure-Aware Self-Supervised Learning Method for Text-Attributed Graphs

Ruyue Liu,Rong Yin,Xiangzhen Bo,Xiaoshuai Hao,Yong Liu,Jinwen Zhong,Can Ma,Weiping Wang

Main category: cs.CL

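TL;DR: Proposes SSTAG, a structure-aware self-supervised learning method for text-attributed graphs that combines the strengths of large language models and graph neural networks to improve cross-domain transfer and scalability.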

Details

Motivation: Graph learning models are usually trained on a single graph dataset, which makes it hard to transfer knowledge across graphs and tasks, and they depend on large amounts of annotated data. The heterogeneity of graph data, including differing feature spaces and structures, further complicates unified modeling. Method: SSTAG uses text as a unified representation medium and combines the semantic reasoning of LLMs with the structural modeling of GNNs. A dual knowledge distillation framework co-distills the LLM and the GNN into a structure-aware MLP, and a memory mechanism stores typical graph representations to improve generalization. Result: SSTAG outperforms state-of-the-art models on cross-domain transfer tasks and offers high scalability and low inference cost while keeping competitive performance. Conclusion: SSTAG bridges LLMs and GNNs for text-attributed graphs, using structure-aware distillation and a memory mechanism to deliver efficient, scalable graph learning that generalizes well. Abstract: Large-scale pretrained models have revolutionized Natural Language Processing (NLP) and Computer Vision (CV), showcasing remarkable cross-domain generalization abilities. However, in graph learning, models are typically trained on individual graph datasets, limiting their capacity to transfer knowledge across different graphs and tasks. This approach also heavily relies on large volumes of annotated data, which presents a significant challenge in resource-constrained settings. Unlike NLP and CV, graph-structured data presents unique challenges due to its inherent heterogeneity, including domain-specific feature spaces and structural diversity across various applications. To address these challenges, we propose a novel structure-aware self-supervised learning method for Text-Attributed Graphs (SSTAG). By leveraging text as a unified representation medium for graph learning, SSTAG bridges the gap between the semantic reasoning of Large Language Models (LLMs) and the structural modeling capabilities of Graph Neural Networks (GNNs). Our approach introduces a dual knowledge distillation framework that co-distills both LLMs and GNNs into structure-aware multilayer perceptrons (MLPs), enhancing the scalability of large-scale TAGs. Additionally, we introduce an in-memory mechanism that stores typical graph representations, aligning them with memory anchors in an in-memory repository to integrate invariant knowledge, thereby improving the model's generalization ability. 
Extensive experiments demonstrate that SSTAG outperforms state-of-the-art models on cross-domain transfer learning tasks, achieves exceptional scalability, and reduces inference costs while maintaining competitive performance.

[26] LOCA: Logical Chain Augmentation for Scientific Corpus Cleaning

You-Le Fang,Dong-Shan Jian,Xiang Li,Ce Meng,Ling-Shi Meng,Chen-Xu Yan,Zhi-Zhang Bian,Yan-Qing Ma

Main category: cs.CL

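TL;DR: LOCA (Logical Chain Augmentation) is a new framework for automatically cleaning scientific corpora. It fills in missing logical steps and separates scientific principles from their derivations, substantially lowering the error rate of scientific question-answering datasets.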

Details

Motivation: Existing scientific question-answering datasets have high error rates, mostly caused by logical leaps and implicit reasoning in the answers, which holds back scientific AI. Method: LOCA runs an augment-and-review loop that completes the missing logical steps in raw answers and explicitly separates the underlying scientific principle from the subsequent derivation. Result: Applied to challenging scientific corpora, LOCA typically reduces the dataset error rate from as high as 20% to below 2%, filtering out noisy data. Conclusion: LOCA offers a scalable and effective way to build high-quality scientific corpora, supporting more reliable training and evaluation of scientific AI. Abstract: While Large Language Models (LLMs) excel in general domains, their reliability often falls short in scientific problem-solving. The advancement of scientific AI depends on large-scale, high-quality corpora. However, existing scientific question-answering (QA) datasets suffer from high error rates, frequently resulting from logical leaps and implicit reasoning within the answers. To address this issue, we introduce LOCA (Logical Chain Augmentation), a novel framework for automatically cleaning scientific corpora, implemented through an augment-and-review loop. At its core, LOCA enhances raw answers by completing missing logical steps and explicitly separating the underlying scientific principle from its subsequent derivation. By applying LOCA to challenging scientific corpora, we demonstrate that it can automatically filter noisy datasets, typically reducing the error rate from as high as 20% to below 2%. LOCA provides a scalable and effective methodology for creating high-quality scientific corpora, paving the way for more reliable training and evaluation of scientific AI.

[27] GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages

Trung Duc Anh Dang,Ferdinando Pio D'Elia

Main category: cs.CL

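TL;DR: Presents a text detoxification system built on the 12B-parameter Gemma-3 multilingual transformer. It combines LoRA fine-tuning with few-shot and chain-of-thought prompting to rewrite toxic sentences neutrally in 15 languages, and ranks first on both high-resource and low-resource languages.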

Details

Motivation: Social-media platforms evolve faster than the regulation meant to oversee them, so moderators need automated tools to keep discourse safe at scale; this motivates a multilingual text detoxification method. Method: Fine-tunes the 12B-parameter Gemma-3 multilingual transformer with LoRA and combines it with few-shot and chain-of-thought (CoT) prompting. Training data comprises human-authored pairs, machine-translated synthetic pairs, and model-generated pairs filtered by Jaccard thresholds. At inference, inputs are enriched with LaBSE-retrieved neighbor sentences and explicit toxic-span annotations. Result: The system scores well on Style Transfer Accuracy, semantic preservation (LaBSE), and fluency (xCOMET), ranking first on high-resource and low-resource languages. Ablations show that few-shot examples add +0.081 to the joint score and CoT adds +0.088; ANOVA identifies language resource status as the strongest predictor of performance (η² = 0.667, p < 0.01). Conclusion: Parameter-efficient fine-tuning combined with enriched prompting delivers strong detoxification across many languages, with good practicality and generalization for content moderation worldwide. Abstract: As social-media platforms emerge and evolve faster than the regulations meant to oversee them, automated detoxification might serve as a timely tool for moderators to enforce safe discourse at scale. We here describe our submission to the PAN 2025 Multilingual Text Detoxification Challenge, which rewrites toxic single-sentence inputs into neutral paraphrases across 15 typologically diverse languages. Building on a 12B-parameter Gemma-3 multilingual transformer, we apply parameter-efficient LoRA SFT fine-tuning and prompting techniques like few-shot and Chain-of-Thought. Our multilingual training corpus combines 3,600 human-authored parallel pairs, 21,600 machine-translated synthetic pairs, and model-generated pairs filtered by Jaccard thresholds. At inference, inputs are enriched with three LaBSE-retrieved neighbors and explicit toxic-span annotations. Evaluated via Style Transfer Accuracy, LaBSE-based semantic preservation, and xCOMET fluency, our system ranks first on high-resource and low-resource languages. Ablations show +0.081 joint score increase from few-shot examples and +0.088 from basic CoT prompting. ANOVA analysis identifies language resource status as the strongest predictor of performance (η² = 0.667, p < 0.01).
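
Entry [27] mentions model-generated pairs "filtered by Jaccard thresholds" without giving the thresholds or the tokenization. The sketch below shows one plausible reading under stated assumptions: whitespace tokens, and a hypothetical acceptance band (`low`, `high`) that rejects rewrites that are either unrelated to the toxic input or nearly identical to it. None of these choices is taken from the paper.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase whitespace-token sets of two sentences."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def keep_pair(toxic: str, neutral: str, low: float = 0.3, high: float = 0.9) -> bool:
    """Keep a (toxic, neutral) pair only if its overlap lies in the hypothetical band."""
    return low <= jaccard(toxic, neutral) <= high


# Toy candidate pairs: a reasonable rewrite, an unchanged copy, and an unrelated sentence.
pairs = [
    ("you are a stupid idiot", "you are wrong"),
    ("this is bad", "this is bad"),
    ("shut up now", "please consider other views"),
]
kept = [pair for pair in pairs if keep_pair(*pair)]
```

Only the first pair survives here: the copy has similarity 1.0 and the unrelated sentence has similarity 0.0, so both fall outside the band.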

[28] Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data

Carlo Bono,Federico Belotti,Matteo Palmonari

Main category: cs.CL

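TL;DR: Proposes a self-supervised method that estimates uncertainty for entity linking from token-level features of a single LLM generation, sharply reducing computational cost while still detecting low-accuracy outputs effectively.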

Details

Motivation: LLMs perform well on entity linking, but real deployments need reliable uncertainty estimates, and conventional multi-shot inference is too expensive for broad use. Method: A self-supervised approach estimates prediction uncertainty from token-level features of single-shot LLM output, avoiding resource-intensive repeated generation. Result: On entity linking over tabular data with multiple LLMs, the uncertainty estimates identify low-accuracy outputs effectively at a much lower computational cost. Conclusion: The method gives LLM-based entity linking pipelines an efficient, low-cost uncertainty estimate, improving their applicability and reliability in real-world settings. Abstract: Linking textual values in tabular data to their corresponding entities in a Knowledge Base is a core task across a variety of data integration and enrichment applications. Although Large Language Models (LLMs) have shown state-of-the-art performance in Entity Linking (EL) tasks, their deployment in real-world scenarios requires not only accurate predictions but also reliable uncertainty estimates, which require resource-demanding multi-shot inference, posing serious limits to their actual applicability. As a more efficient alternative, we investigate a self-supervised approach for estimating uncertainty from single-shot LLM outputs using token-level features, reducing the need for multiple generations. Evaluation is performed on an EL task on tabular data across multiple LLMs, showing that the resulting uncertainty estimates are highly effective in detecting low-accuracy outputs. This is achieved at a fraction of the computational cost, ultimately supporting a cost-effective integration of uncertainty measures into LLM-based EL workflows. The method offers a practical way to incorporate uncertainty estimation into EL workflows with limited computational overhead.
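
Entry [28] does not list which token-level features it uses, so the following is only a sketch of the kind of summary statistics that can be computed from the token log-probabilities of a single generation. The function name and the chosen features are assumptions for illustration; the paper's self-supervised estimator built on top of such features is not reproduced here.

```python
import math


def token_features(token_logprobs):
    """Summarise one generation by simple statistics of its token log-probabilities."""
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return {
        "mean_logprob": mean_logprob,                # average confidence over tokens
        "min_logprob": min(token_logprobs),          # the single least confident token
        "min_prob": math.exp(min(token_logprobs)),   # same quantity as a probability
        "perplexity": math.exp(-mean_logprob),       # 1.0 means every token was certain
    }


# Toy usage: a two-token answer whose tokens had probabilities 0.5 and 0.25.
features = token_features([math.log(0.5), math.log(0.25)])
```

Features like these come from one forward generation, which is what lets the approach avoid sampling several answers per table cell.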

[29] GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models

Mariam Mahran,Katharina Simbeck

Main category: cs.CL

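TL;DR: By pairing sparse autoencoders (SAEs) with a GPT-style model trained only on Jane Austen's novels, the paper shows that language models capture the key narratives and concepts in a text (such as gender, class, and social duty) and can also serve as tools for exploring the structure, themes, and biases of complex datasets.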

Details

Motivation: As LLMs are trained on massive, uncurated data, it becomes harder to understand their representations and what they have internalized. The paper explores how sparse autoencoders can decode internal representations and thereby expose deep structure and latent bias in the training data. Method: Trains a GPT-style transformer only on Jane Austen's novels and applies sparse autoencoders (SAEs) to hidden states at multiple layers to extract interpretable sparse features. Result: Identifies sparse, interpretable features that reflect the core narratives and concepts of the corpus (such as gender, class, and societal duty), supporting the use of SAEs to analyze both model representations and data content. Conclusion: LLMs combined with SAEs can act as scalable probes of the structure, themes, and biases in training data, opening a new path for large-scale corpus analysis and model interpretability. Abstract: As large language models (LLMs) are increasingly trained on massive, uncurated corpora, understanding both model representations and the data they internalize has become a major challenge. In this work, we show that pairing LLMs with sparse autoencoders (SAEs) enables interpretation not only of model behavior but also of the deeper structures, themes, and biases embedded in the training data. We train a GPT-style transformer model exclusively on the novels of Jane Austen, a corpus rich in social constructs and narrative patterns. We then apply SAEs to hidden states across multiple layers, uncovering sparse, interpretable features that reflect the key narratives and concepts present in the corpus, including gender, class, and societal duty. Our findings demonstrate that LLMs combined with SAEs can act as scalable probes into complex datasets, offering a new path for corpus exploration, bias discovery, and model interpretability at scale.

[30] Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs

Shree Harsha Bokkahalli Satish,Gustav Eje Henter,Éva Székely

Main category: cs.CL

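TL;DR: Examines the validity of bias and fairness benchmarks based on multiple-choice question answering (MCQA) for speech large language models (SpeechLLMs), and finds that behavior on these benchmarks does not generalize across other MCQA tasks or to long-form generation.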

Details

Motivation: Bias evaluation of SpeechLLMs relies mainly on the MCQA format, but it is unclear whether results carry over to other task formats such as long-form generation; the paper tests this key assumption. Method: Fine-tunes three SpeechLLMs with LoRA adapters to induce a preference for stereotypical, anti-stereotypical, or neutral answers on MCQA tasks, then evaluates whether these behaviors generalize to another MCQA benchmark and to long-form creative generation. Result: Performance on MCQA bias benchmarks does not reliably predict behavior on other MCQA tasks or on long-form generation, indicating limited cross-task generalization. Conclusion: Current MCQA bias benchmarks provide limited evidence of cross-task transfer in the speech domain. The authors also propose an evaluation suite for measuring behavior transferability in future models and benchmarks. Abstract: Recent work in benchmarking bias and fairness in speech large language models (SpeechLLMs) has relied heavily on multiple-choice question answering (MCQA) formats. The model is tasked to choose between stereotypical, anti-stereotypical, or neutral/irrelevant answers given an input speech prompt and an optional text prompt. Such MCQA benchmarks implicitly assume that model performance is consistent across other MCQA tasks, voices, and other task formats such as more realistic, long-form evaluations. In this paper, we probe that assumption. We fine-tune three SpeechLLMs using LoRA adapters to induce specific MCQA behaviours: preference for stereotypical, anti-stereotypical, or neutral/uncertain answers. We then evaluate whether these behaviours generalise to another, distinct MCQA benchmark, and more critically to long-form, creative generation tasks. Our results show that performance on MCQA bias benchmarks fails to reliably predict performances across other MCQA benchmarks, and more importantly across long-form tasks. We conclude that current MCQA bias benchmarks show limited evidence of cross-task generalisation in the speech domain, and also propose an evaluation suite for measuring behaviour transferability in future models and benchmarks.

Yunlang Dai,Emma Lurie,Danaé Metaxa,Sorelle A. Friedler

Main category: cs.CL

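TL;DR: Introduces AI Watchman, a system for longitudinally monitoring and publicly tracking refusal behavior in large language models (LLMs) to make company content moderation policies more transparent. Using a dataset of more than 400 social issues, the study audits OpenAI's moderation endpoint, GPT-4.1, GPT-5, and DeepSeek (in English and Chinese), and finds that AI Watchman can detect unannounced policy changes and reveals differences in moderation across companies and models.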

Details

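Motivation: LLM content moderation policies are set by companies and are usually opaque, and models' refusals to generate certain content shape public discussion. The lack of public oversight of how these policies change motivates a system that can quantify and track moderation behavior. Method: Builds AI Watchman, a longitudinal auditing system that uses a dataset of more than 400 social issues to query OpenAI's moderation endpoint and several LLMs (GPT-4.1, GPT-5, DeepSeek in English and Chinese) at regular intervals, records refusals, and analyzes how refusal patterns change over time and differ across models. Result: AI Watchman detects content policy changes that were not publicly announced, finds clear differences between companies and models in moderation strictness and refusal type, and qualitatively categorizes forms of refusal such as full refusal, partial response, and moralizing response. Conclusion: Longitudinal auditing of LLM refusals improves the transparency and accountability of AI systems; AI Watchman is one workable framework for it and demonstrates the value of continuous external monitoring. Abstract: Large language models' (LLMs') outputs are shaped by opaque and frequently-changing company content moderation policies and practices. LLM moderation often takes the form of refusal; models' refusal to produce text about certain topics both reflects company policy and subtly shapes public discourse. We introduce AI Watchman, a longitudinal auditing system to publicly measure and track LLM refusals over time, to provide transparency into an important and black-box aspect of LLMs. Using a dataset of over 400 social issues, we audit OpenAI's moderation endpoint, GPT-4.1, and GPT-5, and DeepSeek (both in English and Chinese). We find evidence that changes in company policies, even those not publicly announced, can be detected by AI Watchman, and identify company- and model-specific differences in content moderation. We also qualitatively analyze and categorize different forms of refusal. This work contributes evidence for the value of longitudinal auditing of LLMs, and AI Watchman, one system for doing so.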

[32] RJE: A Retrieval-Judgment-Exploration Framework for Efficient Knowledge Graph Question Answering with LLMs

Can Lin,Zhengwang Jiang,Ling Zheng,Qi Zhao,Yuhang Zhang,Qi Song,Wangqiu Zhou

Main category: cs.CL

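TL;DR: Proposes Retrieval-Judgment-Exploration (RJE), a framework that improves reasoning in knowledge graph question answering through retrieval, judgment, and exploration, and lets small open-source LLMs answer questions efficiently and at low cost.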

Details

Motivation: Existing knowledge graph question answering methods are either limited by retrieval quality or dependent on proprietary LLMs, which makes them inefficient and costly. Method: The RJE framework adds Reasoning Path Ranking, Question Decomposition, and Retriever-assisted Exploration modules, improving small language models without fine-tuning. Result: With models such as GPT-4o-mini, RJE outperforms existing baselines; small open-source models (for example 3B and 8B parameters) also reach competitive results, and the number of LLM calls and the token usage drop substantially. Conclusion: RJE improves the efficiency and generality of KGQA and lets lightweight models deliver strong question answering with low resource consumption. Abstract: Knowledge graph question answering (KGQA) aims to answer natural language questions using knowledge graphs. Recent research leverages large language models (LLMs) to enhance KGQA reasoning, but faces limitations: retrieval-based methods are constrained by the quality of retrieved information, while agent-based methods rely heavily on proprietary LLMs. To address these limitations, we propose Retrieval-Judgment-Exploration (RJE), a framework that retrieves refined reasoning paths, evaluates their sufficiency, and conditionally explores additional evidence. Moreover, RJE introduces specialized auxiliary modules enabling small-sized LLMs to perform effectively: Reasoning Path Ranking, Question Decomposition, and Retriever-assisted Exploration. Experiments show that our approach with proprietary LLMs (such as GPT-4o-mini) outperforms existing baselines while enabling small open-source LLMs (such as 3B and 8B parameters) to achieve competitive results without fine-tuning LLMs. Additionally, RJE substantially reduces the number of LLM calls and token usage compared to agent-based methods, yielding significant efficiency improvements.

[33] Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse

Nathan Junzi Chen

Main category: cs.CL

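TL;DR: Uses zero-shot classification to assess political bias in six mainstream large language models, finds a liberal-authoritarian alignment in all of them, and discusses the potential effects on public discourse and the political landscape.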

Details

Motivation: The widespread use of generative AI in political discourse raises concern about its internalized political biases, which may stem from skewed training data, human prejudice, and algorithmic flaws. Method: Applies zero-shot classification with four metrics (ideological alignment, topicality, response sentiment, and objectivity), feeding 1800 model responses into four fine-tuned classification algorithms. Result: All six LLMs show an amplified liberal-authoritarian alignment, with notable instances of reasoning supersessions and canned refusals. Conclusion: Intrinsic biases in LLMs can influence public discourse through human-computer interaction, producing conformity or polarization depending on a region's socio-political structures. Abstract: Amidst the rapid normalization of generative artificial intelligence (GAI), intelligent systems have come to dominate political discourse across information mediums. However, internalized political biases stemming from training data skews, human prejudice, and algorithmic flaws continue to plague the novel technology. This paper employs a zero-shot classification approach to evaluate algorithmic political partisanship through a methodical combination of ideological alignment, topicality, response sentiment, and objectivity. A total of 1800 model responses across six mainstream large language models (LLMs) were individually input into four distinct fine-tuned classification algorithms, each responsible for computing an aforementioned bias evaluation metric. Results show an amplified liberal-authoritarian alignment across all six LLMs evaluated, with notable instances of reasoning supersessions and canned refusals. The study subsequently highlights the psychological influences underpinning human-computer interactions and how intrinsic biases can permeate public discourse. The resulting distortion of the political landscape can ultimately manifest as conformity or polarization, depending on a region's pre-existing socio-political structures.

[34] In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b

Nils Durner

Main category: cs.CL

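TL;DR: Studies how sociopragmatic framing, language choice, and instruction hierarchy affect the refusal behavior of gpt-oss-20b. Particular composite prompts change its assistance rate substantially, and the study exposes information leakage risks across languages and personas as well as evaluation inconsistencies.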

Details

Motivation: Understand how the refusal behavior of large language models depends on prompt framing, in order to improve their safety and controllability. Method: Runs 80 seeded iterations per scenario across several harm domains (such as ZIP-bomb construction and synthetic card-number generation), intervenes with composite prompts (educator persona, safety pretext, step-cue phrasing), and compares model behavior across languages, role-play setups, and evaluation framings. Result: Composite prompts raise the assistance rate on the ZIP-bomb task from 0% to 97.5%. Formal registers in German and French are often leakier than matched English prompts. A "Linux terminal" role-play overrides the developer rule in a majority of runs, and a proposed AI-assisted hardening method reduces leakage to 0% in several user-prompt variants. Responses are inconsistent in 13% of evaluation pairs. The OpenAI Moderation API misses many materially helpful outputs, and refusal rates differ by 5 to 10 percentage points across inference stacks. Conclusion: Model behavior depends strongly on prompt framing and current moderation has gaps, so evaluation methods and system-level safeguards need improvement for reproducibility and safety. Abstract: We probe OpenAI's open-weights 20-billion-parameter model gpt-oss-20b to study how sociopragmatic framing, language choice, and instruction hierarchy affect refusal behavior. Across 80 seeded iterations per scenario, we test several harm domains including ZIP-bomb construction (cyber threat), synthetic card-number generation, minor-unsafe driving advice, drug-precursor indicators, and RAG context exfiltration. Composite prompts that combine an educator persona, a safety-pretext ("what to avoid"), and step-cue phrasing flip assistance rates from 0% to 97.5% on a ZIP-bomb task. On our grid, formal registers in German and French are often leakier than matched English prompts. A "Linux terminal" role-play overrides a developer rule not to reveal context in a majority of runs with a naive developer prompt, and we introduce an AI-assisted hardening method that reduces leakage to 0% in several user-prompt variants. We further test evaluation awareness with a paired-track design and measure frame-conditioned differences between matched "helpfulness" and "harmfulness" evaluation prompts; we observe inconsistent assistance in 13% of pairs. Finally, we find that the OpenAI Moderation API under-captures materially helpful outputs relative to a semantic grader, and that refusal rates differ by 5 to 10 percentage points across inference stacks, raising reproducibility concerns. We release prompts, seeds, outputs, and code for reproducible auditing at https://github.com/ndurner/gpt-oss-rt-run .

[35] OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language

Isa Inuwa-Dutse

Main category: cs.CL

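TL;DR: Examines the safety and reliability of GPT-OSS-20b in Hausa, a low-resource language, and finds bias, cultural insensitivity, and factual errors. Safety mechanisms can be bypassed with polite prompts, exposing a form of linguistic reward hacking.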

Details

Motivation: Question the reliability of large models for users from underrepresented communities, especially their safety and fairness in low-resource languages. Method: Red-teams the model in Hausa, using minimal prompting to elicit content, analyzes its cultural sensitivity, factual accuracy, and safety alignment, and uses local survey data to assess the risk. Result: The model shows serious safety problems, such as treating highly toxic substances as edible, failing to distinguish raw from processed foods, and using demeaning proverbs. Its safety protocols also relax when it is addressed with polite or grateful language, leading to harmful output. Conclusion: These problems stem from insufficient safety tuning in low-resource languages and reflect a blind spot in current red-teaming of non-mainstream languages; multilingual safety alignment and localized evaluation need strengthening. Abstract: In response to the recent safety probing for OpenAI's GPT-OSS-20b model, we present a summary of a set of vulnerabilities uncovered in the model, focusing on its performance and safety alignment in a low-resource language setting. The core motivation for our work is to question the model's reliability for users from underrepresented communities. Using Hausa, a major African language, we uncover biases, inaccuracies, and cultural insensitivities in the model's behaviour. With minimal prompting, our red-teaming efforts reveal that the model can be induced to generate harmful, culturally insensitive, and factually inaccurate content in the language. As a form of reward hacking, we note how the model's safety protocols appear to relax when prompted with polite or grateful language, leading to outputs that could facilitate misinformation and amplify hate speech. For instance, the model operates on the false assumption that a common insecticide locally known as Fiya-Fiya (Cypermethrin) and a rodenticide like Shinkafar Bera (a form of Aluminium Phosphide) are safe for human consumption. To contextualise the severity of this error and the popularity of the substances, we conducted a survey (n=61) in which 98% of participants identified them as toxic. Additional failures include an inability to distinguish between raw and processed foods and the incorporation of demeaning cultural proverbs to build inaccurate arguments. We surmise that these issues manifest through a form of linguistic reward hacking, where the model prioritises fluent, plausible-sounding output in the target language over safety and truthfulness. 
We attribute the uncovered flaws primarily to insufficient safety tuning in low-resource linguistic contexts. By concentrating on a low-resource setting, our approach highlights a significant gap in current red-teaming efforts and offers some recommendations.

[36] AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees

Hongyi Zhou,Jin Zhu,Pingfan Su,Kai Ye,Ying Yang,Shakeel A O B Gavioli-Akilagun,Chengchun Shi

Main category: cs.CL

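TL;DR: Proposes AdaDetectGPT, a new classifier that uses adaptive learning to improve detection of LLM-generated text. It outperforms existing methods across many combinations of datasets and models, with improvements of up to 58%.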

Details

Motivation: Existing logits-based detectors rely only on log probabilities, which can be sub-optimal, so a more effective detection mechanism is needed. Method: AdaDetectGPT adaptively learns a witness function from training data to strengthen logits-based detectors, and comes with statistical guarantees. Result: Across many combinations of datasets and LLMs, AdaDetectGPT improves on the state-of-the-art method almost uniformly, by up to 58%. Conclusion: Adaptive learning makes detection of text provenance more accurate and gives a more reliable method for detecting LLM-generated text. Abstract: We study the problem of determining whether a piece of text has been authored by a human or by a large language model (LLM). Existing state-of-the-art logits-based detectors make use of statistics derived from the log-probability of the observed text evaluated using the distribution function of a given source LLM. However, relying solely on log probabilities can be sub-optimal. In response, we introduce AdaDetectGPT, a novel classifier that adaptively learns a witness function from training data to enhance the performance of logits-based detectors. We provide statistical guarantees on its true positive rate, false positive rate, true negative rate and false negative rate. Extensive numerical studies show AdaDetectGPT nearly uniformly improves the state-of-the-art method in various combinations of datasets and LLMs, and the improvement can reach up to 58%. A Python implementation of our method is available at https://github.com/Mamba413/AdaDetectGPT.

[37] Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection

Hoang Phan,Victor Li,Qi Lei

Main category: cs.CL

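TL;DR: Proposes Progressive Self-Reflection (PSR), an inference-time technique for making LLM text generation safer that sharply lowers attack success rates without additional training while preserving performance on benign tasks.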

Details

Motivation: LLMs excel at natural language processing but risk generating harmful or inappropriate content, so a dynamic, scalable way to improve safety at inference time is needed. Method: Proposes Progressive Self-Reflection (PSR), which lets an LLM monitor and dynamically correct its own output during inference, and adds a lightweight self-reflection predictor that adapts the number of reflection rounds to input complexity to balance safety against compute. Result: Attack success rate falls from 77.5% to 5.9% on Llama-3.1-8B-Instruct, from 89.7% to 5.6% on Llama-3.1-8B base, and from 44.4% to 3.8% on Qwen2.5-7B-Instruct, with no loss on benign tasks. Conclusion: PSR is a scalable test-time safety method that allocates compute according to input risk, substantially improving LLM safety while remaining efficient. Abstract: Large language models (LLMs) have revolutionized natural language processing with their ability to generate coherent and contextually relevant text. However, their deployment raises significant concerns about the potential for generating harmful or inappropriate content. In this paper, we introduce Progressive Self-Reflection (PSR), a novel inference-time technique that empowers LLMs to self-monitor and correct their outputs dynamically. Experimental results demonstrate that applying our proposed method to Llama-3.1-8B-Instruct reduces the attack success rate from 77.5% to 5.9%, to Llama-3.1-8B base from 89.7% to 5.6%, and to Qwen2.5-7B-Instruct from 44.4% to 3.8%, without additional training, while maintaining their original performance on benign tasks. Our approach acts as a test-time scaling method, where additional self-reflection rounds enhance safety at the cost of inference overhead. To balance safety with computational efficiency, we introduce a lightweight self-reflection predictor that estimates the optimal number of reflection rounds based on input complexity. This adaptive mechanism prevents unnecessary self-assessment on benign inputs while ensuring thorough evaluation when encountering potentially harmful content. Our findings suggest that Progressive Self-Reflection serves as a scalable test-time approach, enhancing LLM safety by dynamically allocating computational resources in proportion to the input's risk profile.
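
The self-monitoring loop described in the abstract can be pictured roughly as follows. This is only a minimal sketch, not the authors' implementation: `looks_harmful` is a hypothetical stand-in for the model's own self-assessment, the pre-split `chunks` stand in for incremental decoding, and `reflect_every` plays the role that the learned reflection-round predictor would play in PSR.

```python
from typing import Callable, List

REFUSAL = "Sorry, I can't help with that."


def progressive_self_reflection(
    chunks: List[str],
    looks_harmful: Callable[[str], bool],  # hypothetical self-check
    reflect_every: int = 2,                # assumed reflection schedule
) -> str:
    """Emit chunks in order, pausing periodically to re-check the partial text."""
    text = ""
    for i, chunk in enumerate(chunks, start=1):
        text += chunk
        # Reflection round: inspect everything generated so far.
        if i % reflect_every == 0 and looks_harmful(text):
            return REFUSAL
    # One final check on the complete response.
    return REFUSAL if looks_harmful(text) else text
```

A smaller `reflect_every` means more reflection rounds, which mirrors the safety versus inference-overhead trade-off the abstract mentions.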

[38] TraceDet: Hallucination Detection from the Decoding Trace of Diffusion Large Language Models

Shenxu Chang,Junchi Yu,Weixing Wang,Yongqiang Chen,Jialin Yu,Philip Torr,Jindong Gu

Main category: cs.CL

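TL;DR: Proposes TraceDet, a new framework that uses the intermediate steps of the multi-step denoising process in D-LLMs for hallucination detection, substantially improving detection performance.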

Details

Motivation: Existing hallucination detectors target auto-regressive LLMs (AR-LLMs) and rely on single-step generation signals, which suits them poorly to the multi-step denoising of diffusion LLMs (D-LLMs); a detector that exploits intermediate-step information is needed. Method: TraceDet models the D-LLM denoising process as an action trace, where each action is the model's prediction over the cleaned response conditioned on the previous intermediate output; it identifies the sub-trace most informative about hallucinated responses and uses the key hallucination signals it carries for detection. Result: Experiments on several open-source D-LLMs show an average AUROC gain of 15.2% over baselines. Conclusion: TraceDet makes effective use of intermediate denoising steps, offering a new approach to hallucination detection in diffusion LLMs and markedly improving detection accuracy. Abstract: Diffusion large language models (D-LLMs) have recently emerged as a promising alternative to auto-regressive LLMs (AR-LLMs). However, the hallucination problem in D-LLMs remains underexplored, limiting their reliability in real-world applications. Existing hallucination detection methods are designed for AR-LLMs and rely on signals from single-step generation, making them ill-suited for D-LLMs where hallucination signals often emerge throughout the multi-step denoising process. To bridge this gap, we propose TraceDet, a novel framework that explicitly leverages the intermediate denoising steps of D-LLMs for hallucination detection. TraceDet models the denoising process as an action trace, with each action defined as the model's prediction over the cleaned response, conditioned on the previous intermediate output. By identifying the sub-trace that is maximally informative to the hallucinated responses, TraceDet leverages the key hallucination signals in the multi-step denoising process of D-LLMs for hallucination detection. Extensive experiments on various open-source D-LLMs demonstrate that TraceDet consistently improves hallucination detection, achieving an average gain in AUROC of 15.2% compared to baselines.

[39] LLM Based Sentiment Classification From Bangladesh E-Commerce Reviews

Sumaiya Tabassum

Main category: cs.CL

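TL;DR: Studies transformer-based BERT models and large language models (LLMs) for sentiment analysis of Bangladesh e-commerce reviews, fine-tuning on 4,000 Bangla and English reviews. Llama-3.1-8B performs best with 95.5% accuracy, and LoRA and PEFT are used to reduce computational overhead.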

Details

Motivation: The complexity of written language and the diversity of multilingual settings make conventional sentiment analysis difficult, especially for resource-limited languages such as Bangla, so efficient LLMs suited to local e-commerce reviews are worth exploring. Method: Fine-tunes several models (Llama, Phi-3.5, Mistral, DistilBERT, mBERT, and XLM-R) with parameter-efficient methods (LoRA and PEFT) on 4,000 bilingual e-commerce reviews and compares their performance. Result: The fine-tuned Llama-3.1-8B model beats the other models on accuracy (95.5%), precision (93%), recall (88%), and F1 score (90%), and LoRA and PEFT markedly reduce compute requirements. Conclusion: LLMs, Llama-3.1-8B in particular, combined with parameter-efficient fine-tuning are feasible and effective for low-resource multilingual sentiment analysis and suit similar resource-constrained settings. Abstract: Sentiment analysis is an essential part of text analysis, which is a larger field that includes determining and evaluating the author's emotional state. This method is essential since it makes it easier to comprehend consumers' feelings, viewpoints, and preferences holistically. The introduction of large language models (LLMs), such as Llama, has greatly increased the availability of cutting-edge model applications, such as sentiment analysis. However, accurate sentiment analysis is hampered by the intricacy of written language and the diversity of languages used in evaluations. The viability of using transformer-based BERT models and other LLMs for sentiment analysis from Bangladesh e-commerce reviews is investigated in this paper. A subset of 4000 samples from the original dataset of Bangla and English customer reviews was utilized to fine-tune the model. The fine-tuned Llama-3.1-8B model outperformed other fine-tuned models, including Phi-3.5-mini-instruct, Mistral-7B-v0.1, DistilBERT-multilingual, mBERT, and XLM-R-base, with an overall accuracy, precision, recall, and F1 score of 95.5%, 93%, 88%, and 90%, respectively. The study emphasizes how parameter-efficient fine-tuning methods (LoRA and PEFT) can lower computational overhead and make it appropriate for contexts with limited resources. The results show how LLMs can
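
As a quick consistency check on the reported metrics, the F1 score is the harmonic mean of precision and recall. The short sketch below (the function name is ours, not from the paper) shows that the reported 93% precision and 88% recall do round to the reported 90% F1.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract above.
print(round(f1_score(0.93, 0.88), 3))  # 0.904
```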

[40] TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture

Yongchao Chen,Jiefeng Chen,Rui Meng,Ji Yin,Na Li,Chuchu Fan,Chi Wang,Tomas Pfister,Jinsung Yoon

Main category: cs.CL

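TL;DR: Proposes TUMIX, an ensemble framework that strengthens LLM reasoning by running multiple agents with different tool-use strategies in parallel, clearly outperforming existing methods on key reasoning benchmarks.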

Details

Motivation: Current LLMs lack effective guidance on tool use when combining textual reasoning, coding, and search, which makes diverse questions hard to handle. Method: Proposes the TUMIX framework, in which multiple agents run in parallel with different tool-use strategies and iteratively share and refine their answers; LLMs are used to auto-optimize agent designs, and refinement can stop early based on confidence to cut cost. Result: On Gemini-2.5-Pro and Gemini-2.5-Flash, TUMIX improves average accuracy by up to 3.55% over the best baseline at similar inference cost; with early termination it preserves performance at only 49% of the cost. Conclusion: Agent diversity and quality are crucial to performance; by ensembling multi-strategy agents with dynamic optimization, TUMIX delivers efficient and strong tool-augmented reasoning. Abstract: While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.
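
The share-refine-halt loop in the abstract can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the TUMIX implementation: agents are plain callables (hypothetical stand-ins for tool-using LLM agents), and "sufficient confidence" is approximated by the fraction of agents that agree, which is our assumption and not the paper's stopping rule.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical agent signature: (question, previous round's answers) -> answer.
Agent = Callable[[str, List[str]], str]


def refine_with_early_stop(
    question: str,
    agents: List[Agent],          # assumed non-empty
    max_rounds: int = 3,
    min_agreement: float = 0.8,   # assumed confidence proxy
) -> Tuple[str, int]:
    """Run all agents each round; stop once enough of them agree."""
    answers: List[str] = []
    top, rounds_used = "", 0
    for rounds_used in range(1, max_rounds + 1):
        # Every agent sees the question plus the answers shared last round.
        answers = [agent(question, answers) for agent in agents]
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= min_agreement:
            break  # confident enough: skip the remaining rounds
    return top, rounds_used
```

Stopping as soon as agreement is high is what saves inference cost: later refinement rounds are simply never run.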

[41] Evaluation Sheet for Deep Research: A Use Case for Academic Survey Writing

Israel Abebe Azime,Tadesse Destaw Belay,Atnafu Lambebo Tonja

Main category: cs.CL

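TL;DR: Proposes an evaluation sheet for assessing Deep Research tools and, using academic survey writing as a case study, evaluates OpenAI's and Google's Deep Search at generating academic surveys, revealing shortcomings in how well current tools cover the target area.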

Details

Motivation: To assess how LLMs with agentic capabilities perform on knowledge-intensive tasks, in particular how well Deep Research tools can generate academic surveys automatically. Method: Designs an evaluation sheet and, taking academic survey writing as the task, evaluates reports generated by OpenAI's and Google's Deep Search. Result: Current Deep Research tools fall clearly short of covering the target research area comprehensively, and a significant gap remains between them and traditional search engines. Conclusion: More carefully crafted evaluation standards are needed to measure Deep Research tools, and current technology cannot yet fully replace human-written academic surveys. Abstract: Large Language Models (LLMs) powered with agentic capabilities are able to do knowledge-intensive tasks without human involvement. A prime example of this tool is Deep Research, with the capability to browse the web, extract information and generate multi-page reports. In this work, we introduce an evaluation sheet that can be used for assessing the capability of Deep Research tools. In addition, we selected academic survey writing as a use case task and evaluated output reports based on the evaluation sheet we introduced. Our findings show the need to have carefully crafted evaluation standards. The evaluation done on OpenAI's Deep Search and Google's Deep Search in generating an academic survey showed a large gap between search engines and standalone Deep Research tools, as well as shortcomings in representing the targeted area.

[42] HiSpec: Hierarchical Speculative Decoding for LLMs

Avinash Kumar,Sujay Sanghavi,Poulami Das

Main category: cs.CL

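TL;DR: Proposes HiSpec, a hierarchical speculative decoding framework that uses early-exit (EE) models for low-overhead intermediate verification, raising LLM inference throughput by 1.28x on average and up to 2.01x without sacrificing accuracy.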

Details

Motivation: Verification is often the bottleneck in speculative decoding, and existing intermediate-verification methods suffer from high training overhead, a large memory footprint, and reduced accuracy, so an efficient, low-overhead, accuracy-preserving intermediate verifier is needed. Method: Proposes the HiSpec framework, which uses specially trained early-exit (EE) models for intermediate verification, reuses key-value caches and hidden states across the draft, intermediate verifier, and target models, and periodically validates intermediate results against the target model to maintain accuracy. Result: Across several benchmarks and models, HiSpec improves throughput over baseline single-layer speculation by 1.28x on average and up to 2.01x, with no loss of accuracy. Conclusion: By combining early-exit models with state reuse, HiSpec achieves efficient, low-overhead intermediate verification and markedly higher speculative-decoding throughput, giving a cost-effective way to speed up LLM inference. Abstract: Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g., verification is $4\times$ slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. "Intermediate" verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose Hierarchical Speculative Decoding (HiSpec), a framework for high-throughput speculative decoding that exploits early-exit (EE) models for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by 1.28$\times$ on average and by up to 2.01$\times$ compared to the baseline single-layer speculation without compromising accuracy.
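
For readers new to the setting, the baseline that HiSpec builds on (a draft model proposing tokens that a target model verifies) can be sketched with toy next-token functions. This is a simplified greedy variant written for illustration only: it omits HiSpec's intermediate early-exit verifier and cache reuse, and `draft` and `target` are hypothetical callables rather than real models.

```python
from typing import Callable, List

# Hypothetical model interface: context tokens -> greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int], draft: NextToken, target: NextToken, k: int = 4
) -> List[int]:
    """Draft k tokens, keep the longest prefix the target agrees with,
    and substitute the target's own token at the first disagreement."""
    # 1) Drafting: the small model speculates k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2) Verification: the target checks each proposed token in order.
    #    (A real system scores all k positions in one batched pass.)
    accepted: List[int] = []
    ctx = list(prefix)
    for token in proposed:
        expected = target(ctx)
        if token != expected:
            accepted.append(expected)  # target's correction ends the step
            break
        accepted.append(token)
        ctx.append(token)
    return accepted
```

Because every emitted token is either confirmed or supplied by the target, this greedy variant reproduces the target's own greedy output; the speed-up comes from accepting several draft tokens per verification pass.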

[43] TAG-EQA: Text-And-Graph for Event Question Answering via Structured Prompting Strategies

Maithili Kadam,Francis Ferraro

Main category: cs.CL

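TL;DR: TAG-EQA is a prompting framework that injects causal event graphs into LLM inputs; by combining text with structured graph information it markedly improves event question answering, especially in zero-shot settings.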

Details

Motivation: LLMs perform poorly on event questions that require causal or temporal reasoning, so a method that strengthens event reasoning without fine-tuning is needed. Method: Proposes the TAG-EQA framework, which converts structured causal event graphs into natural-language statements and merges them into the prompt; it covers nine prompting configurations, combining three strategies (zero-shot, few-shot, chain-of-thought) with three input modalities (text-only, graph-only, text+graph). Result: On the TORQUESTRA benchmark, TAG-EQA improves accuracy by 5% on average over text-only baselines, by up to 12% in zero-shot settings, and by up to 18% with graph-augmented chain-of-thought prompting. Conclusion: Causal graphs can strengthen LLM event reasoning without fine-tuning, and TAG-EQA offers a flexible, prompt-based way to inject structured knowledge. Abstract: Large language models (LLMs) excel at general language tasks but often struggle with event-based questions, especially those requiring causal or temporal reasoning. We introduce TAG-EQA (Text-And-Graph for Event Question Answering), a prompting framework that injects causal event graphs into LLM inputs by converting structured relations into natural-language statements. TAG-EQA spans nine prompting configurations, combining three strategies (zero-shot, few-shot, chain-of-thought) with three input modalities (text-only, graph-only, text+graph), enabling a systematic analysis of when and how structured knowledge aids inference. On the TORQUESTRA benchmark, TAG-EQA improves accuracy by 5% on average over text-only baselines, with gains up to 12% in zero-shot settings and 18% when graph-augmented CoT prompting is effective. While performance varies by model and configuration, our findings show that causal graphs can enhance event reasoning in LLMs without fine-tuning, offering a flexible way to encode structure in prompt-based QA.
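
To make the nine configurations and the graph-to-text step concrete, here is a small sketch. The edge format (source, relation, destination), the sentence template, and the prompt layout are all assumptions made for illustration; the abstract does not specify them.

```python
from itertools import product
from typing import List, Tuple

STRATEGIES = ("zero-shot", "few-shot", "chain-of-thought")
MODALITIES = ("text-only", "graph-only", "text+graph")
CONFIGS = list(product(STRATEGIES, MODALITIES))  # 3 x 3 = 9 configurations

Edge = Tuple[str, str, str]  # assumed (source event, relation, target event)


def verbalize(edges: List[Edge]) -> List[str]:
    """Turn each structured relation into a plain-language statement."""
    return [f"{src} {rel} {dst}." for src, rel, dst in edges]


def build_prompt(passage: str, edges: List[Edge], question: str, modality: str) -> str:
    """Assemble the prompt body for one of the three input modalities."""
    parts = []
    if modality in ("text-only", "text+graph"):
        parts.append("Passage: " + passage)
    if modality in ("graph-only", "text+graph"):
        parts.append("Event relations: " + " ".join(verbalize(edges)))
    parts.append("Question: " + question)
    return "\n".join(parts)
```

The strategy axis (zero-shot, few-shot, chain-of-thought) would then wrap this body with demonstrations or a reasoning instruction, which is left out here.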

[44] A-VERT: Agnostic Verification with Embedding Ranking Targets

Nicolás Aguirre,Ramiro Caso,Ramiro Rodríguez Colmeiro,Mauro Santelli,Joaquín Toranzo Calderón

Main category: cs.CL

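TL;DR: Proposes a structure-free evaluation method based on semantic embedding distances that classifies language model outputs automatically at low cost, with performance close to human annotators.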

Details

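Motivation: Existing ways of evaluating language model responses are either too expensive (such as LLM-as-a-Judge) or far from real-world conditions (such as string matching or logprob), so a more efficient and realistic automatic evaluation scheme is needed. Method: Uses small embedding models (under 10B parameters) to compute semantic embedding distances that match target candidates against arbitrary LM-generated text, giving a robust classification of the response. Result: Tested on 3 datasets and 3 different LM architectures, the method reaches a regression score of about 0.97 and an accuracy of about 96%, in close agreement with human annotation. Conclusion: This structure-agnostic semantic-embedding approach delivers high-quality automatic evaluation of LM responses at low compute cost and has practical potential. Abstract: The automatic evaluation of Language Model (LM) responses is a critical piece in the development of benchmarks and metrics, both for model training and quality assessment of production model endpoints. The current approaches to response classification rely on methods that are too expensive (i.e., LLM-as-a-Judge) or that are far from real-world conditions (string-matching, logprob). In this paper, a structure-free evaluation method is presented. The method makes use of semantic embedding distances to match target candidates with arbitrary LM-generated text, resulting in a robust classification of the response at a relatively low compute cost (embedding models of less than $10B$ parameters). The results show a regression score of ~0.97 and an accuracy of ~96% against human annotators, tested over 3 datasets and 3 different LM architectures.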
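
The core idea of matching a free-form response to target candidates by embedding distance can be sketched as below. The vectors here are toy values; in practice they would come from an embedding model, and the choice of cosine similarity as the distance is our assumption, since the abstract does not name the metric.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_targets(response_vec: List[float], target_vecs: Dict[str, List[float]]) -> List[str]:
    """Order candidate targets from most to least similar to the response."""
    scores = {name: cosine(response_vec, vec) for name, vec in target_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-D embeddings; real embeddings would have hundreds of dimensions.
targets = {"option_a": [1.0, 0.0], "option_b": [0.0, 1.0]}
best = rank_targets([0.9, 0.2], targets)[0]
```

The response is then classified as whichever target ranks first, with no requirement that the model's output follow a fixed answer format.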

[45] One More Question is Enough, Expert Question Decomposition (EQD) Model for Domain Quantitative Reasoning

Mengyu Wang,Sotirios Sabanis,Miguel de Carvalho,Shay B. Cohen,Tiejun Ma

Main category: cs.CL

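TL;DR: Proposes Expert Question Decomposition (EQD), a method for improving LLM quantitative reasoning on complex question answering in specialized domains, finance in particular. A two-step fine-tuning framework and a reward function guide the generation of useful sub-questions, yielding clear gains with only a small training set and a single A100 GPU.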

Details

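Motivation: LLMs perform poorly in domains that demand expert knowledge and complex reasoning (such as finance), with quantitative reasoning a particular challenge. Existing methods tend to be computationally expensive or ineffective, so an efficient solution that integrates domain knowledge well is needed. Method: Proposes Expert Question Decomposition (EQD), which uses a two-step fine-tuning framework and a reward function that scores how much generated sub-questions contribute to the final QA outcome. Fine-tuning uses a few thousand training examples on a single A100 GPU, and inference time is comparable to zero-shot prompting. Result: Evaluated on four financial benchmark datasets, EQD improves QA performance by 0.6% to 10.5% across different LLMs, beating state-of-the-art domain-tuned models and advanced prompting strategies. The analysis finds that a single effective supporting question helps more than multiple detailed steps. Conclusion: EQD markedly improves LLM performance on complex domain QA while staying computationally efficient, showing that well-chosen question decomposition is more effective than lengthy reasoning steps and pointing toward low-cost, high-performance domain reasoning. Abstract: Domain-specific quantitative reasoning remains a major challenge for large language models (LLMs), especially in fields requiring expert knowledge and complex question answering (QA). In this work, we propose Expert Question Decomposition (EQD), an approach designed to balance the use of domain knowledge with computational efficiency. EQD is built on a two-step fine-tuning framework and guided by a reward function that measures the effectiveness of generated sub-questions in improving QA outcomes. It requires only a few thousand training examples and a single A100 GPU for fine-tuning, with inference time comparable to zero-shot prompting. Beyond its efficiency, EQD outperforms state-of-the-art domain-tuned models and advanced prompting strategies. We evaluate EQD in the financial domain, characterized by specialized knowledge and complex quantitative reasoning, across four benchmark datasets. Our method consistently improves QA performance by 0.6% to 10.5% across different LLMs. Our analysis reveals an important insight: in domain-specific QA, a single supporting question often provides greater benefit than detailed guidance steps.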

[46] ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning

Haochen You,Baojing Liu

Main category: cs.CL

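TL;DR: ReSSFormer is a recursive sparse structured Transformer that uses recurrent reasoning, adaptive sparse attention, and a self-organizing encoder structure to outperform conventional models in long-context reasoning, computational efficiency, and structural generalization.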

Details

Motivation: To address the limits of conventional Transformers in long-context reasoning, computational efficiency, and structural generalization, which stem largely from fixed layer stacking, dense attention, and reliance on positional encodings. Method: Proposes ReSSFormer with three core components: a Recurrent Reasoning & Memory Unit (R2MU) for iterative reasoning, an Adaptive Sparse Attention Module (ASAM) for efficient context selection, and a Self-Organizing Encoder Structure (SOES) for position-free structure induction. Recurrent inference replaces depth stacking, and sparse attention replaces full attention. Result: On language modeling, multi-hop QA, and structure-sensitive tasks, ReSSFormer consistently outperforms strong baselines under comparable FLOPs and parameter budgets. Conclusion: ReSSFormer performs well on scalability, efficiency, and structural flexibility, offering a more efficient alternative Transformer architecture. Abstract: While Transformer architectures have demonstrated impressive scalability across domains, they continue to face challenges in long-context reasoning, computational efficiency, and structural generalization, largely due to rigid layer stacking, dense attention, and reliance on positional encodings. We present ReSSFormer, a Recursive Sparse Structured Transformer that integrates three complementary innovations: Recurrent Reasoning & Memory Unit (R2MU) for iterative reasoning with bounded depth, Adaptive Sparse Attention Module (ASAM) for efficient and focused context selection, and Self-Organizing Encoder Structure (SOES) for position-free structure induction. ReSSFormer replaces conventional depth stacking with recurrent inference, substitutes full attention with token- and expert-level sparsity, and models latent token topology directly from content. Across language modeling, multi-hop QA, and structure-sensitive tasks, ReSSFormer consistently outperforms strong baselines under comparable FLOPs and parameter budgets, highlighting its scalability, efficiency, and structural flexibility.

[47] CLUE: Non-parametric Verification from Experience via Hidden-State Clustering

Zhenwen Liang,Ruosen Li,Yujun Zhou,Linfeng Song,Dian Yu,Xinya Du,Haitao Mi,Dong Yu

Main category: cs.CL

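TL;DR: Proposes CLUE, a verifier based on an LLM's internal hidden states that judges output correctness from geometrically separable features in the trajectory of hidden activations. It has no trainable parameters and classifies using only "success" and "failure" clusters formed from past experience, outperforming LLM-as-a-judge baselines and matching or exceeding confidence-based methods on several tasks.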

Details

Motivation: Existing ways of assessing LLM output quality rely on text-level information or calibrated confidence, which can overfit or fail on poorly calibrated models, whereas hidden states carry richer semantic and confidence information that is worth using directly. Method: Proposes CLUE (Clustering and Experience-based Verification), a non-parametric verifier that computes the change (delta) in hidden states over a reasoning trace and judges correctness by nearest distance to the centroids of historical "success" and "failure" clusters. Result: CLUE outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods on benchmarks including AIME 24/25 and GPQA; on AIME 24 with a 1.5B model it raises accuracy from 56.7% (majority@64) to 70.0% (top-maj@16). Conclusion: LLM hidden states contain a strong signal for verifying output correctness; CLUE's minimalist design demonstrates the strength of that signal and opens a new direction for output evaluation. Abstract: Assessing the quality of Large Language Model (LLM) outputs presents a critical challenge. Previous methods either rely on text-level information (e.g., reward models, majority voting), which can overfit to superficial cues, or on calibrated confidence from token probabilities, which would fail on less-calibrated models. Yet both of these signals are, in fact, partial projections of a richer source of information: the model's internal hidden states. Early layers, closer to token embeddings, preserve semantic and lexical features that underpin text-based judgments, while later layers increasingly align with output logits, embedding confidence-related information. This paper explores hidden states directly as a unified foundation for verification. We show that the correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations. To validate this, we present CLUE (Clustering and Experience-based Verification), a deliberately minimalist, non-parametric verifier. With no trainable parameters, CLUE only summarizes each reasoning trace by a hidden-state delta and classifies correctness via nearest-centroid distance to "success" and "failure" clusters formed from past experience. The simplicity of this method highlights the strength of the underlying signal. Empirically, CLUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates, improving both top-1 and majority-vote accuracy across AIME 24/25 and GPQA. As a highlight, on AIME 24 with a 1.5B model, CLUE boosts accuracy from 56.7% (majority@64) to 70.0% (top-maj@16).
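
The nearest-centroid rule described in the abstract is simple enough to sketch directly. The code below is a toy illustration, not the authors' implementation: it assumes the delta is the last hidden state minus the first one (the abstract only says "hidden state delta"), uses Euclidean distance, and works on tiny hand-written vectors.

```python
import math
from typing import List

Vector = List[float]


def delta(first: Vector, last: Vector) -> Vector:
    """Summarize a reasoning trace by how far its hidden state moved (assumed: last - first)."""
    return [b - a for a, b in zip(first, last)]


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of past deltas with a known outcome."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distance(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(d: Vector, success_centroid: Vector, failure_centroid: Vector) -> str:
    """Label a new trace by whichever experience cluster is closer."""
    if distance(d, success_centroid) <= distance(d, failure_centroid):
        return "success"
    return "failure"
```

Nothing here is trained: the only "model" is the pair of centroids computed from previously labeled traces, which is what makes the verifier non-parametric.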

[48] A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation

Neal Gregory Lawton,Alfy Samuel,Anoop Kumar,Daben Liu

Main category: cs.CL

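TL;DR: We evaluate and compare several RAG fine-tuning strategies, including independent, joint, and two-phase fine-tuning. Experiments show that they deliver about equal gains in generation quality metrics but differ significantly in computational cost. The best strategy depends on whether the training data include context labels and whether a grid search over learning rates is required.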

Details

Motivation: Different RAG fine-tuning strategies carry different costs and benefits but have not been compared systematically, so their performance and overhead need evaluating to guide practice. Method: Compares independent, joint, and two-phase fine-tuning, evaluating EM and F1 on multiple tasks and analyzing the differences in computational cost. Result: All strategies yield similar gains in EM and F1, but their computational costs differ significantly. Whether a learning-rate grid search is needed and whether the training data include context labels are the key factors in choosing a strategy. Conclusion: The best fine-tuning strategy depends on whether context-labeled data are available and whether the learning rates must be tuned jointly, trading off compute against implementation complexity. Abstract: Retrieval-augmented generation (RAG) is a popular framework for question answering that is powered by two large language models (LLMs): an embedding model that retrieves context documents from a database that are relevant to a given question, and a generator model that uses the retrieved context to generate an answer to the question. Both the embedding and generator models can be fine-tuned to increase performance of a RAG pipeline on a new task, but multiple fine-tuning strategies exist with different costs and benefits. In this paper, we evaluate and compare several RAG fine-tuning strategies, including independent, joint, and two-phase fine-tuning. In our experiments, we observe that all of these strategies achieve about equal improvement in EM and F1 generation quality metrics, although they have significantly different computational costs. We conclude the optimal fine-tuning strategy to use depends on whether the training dataset includes context labels and whether a grid search over the learning rates for the embedding and generator models is required. Published: 20 Aug 2025; last modified: 17 Sept 2025; venue: EMNLP 2025 Findings; license: CC BY 4.0. Keywords: Retrieval-Augmented Generation (RAG), Large Language Models (LLMs), Fine-tuning, Question Answering, Joint fine-tuning.

[49] RAG-BioQA Retrieval-Augmented Generation for Long-Form Biomedical Question Answering

Lovely Yeswanth Panchumarthi,Sai Prasad Gudari,Atharva Negi,Praveen Raj Budime,Harsit Upadhya

Main category: cs.CL

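TL;DR: Proposes RAG-BioQA, a framework that combines retrieval-augmented generation with domain fine-tuning to produce evidence-based, long-form biomedical answers, clearly outperforming baselines on the PubMedQA dataset.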

Details

Motivation: Existing biomedical QA systems mostly produce short answers and lack the thorough explanations that clinical decision-making requires, so they fall short of the need for precise medical information. Method: Combines retrieval-augmented generation (RAG) with domain-specific fine-tuning, using BioBERT embeddings and FAISS indexing for document retrieval, comparing re-ranking strategies such as BM25, ColBERT, and MonoT5 to optimize context selection, and synthesizing evidence into long answers with a fine-tuned T5 model. Result: Experiments on PubMedQA show clear improvements over baselines on BLEU, ROUGE, and METEOR, improving biomedical knowledge retrieval. Conclusion: RAG-BioQA generates high-quality, evidence-backed long-form biomedical answers and advances accessible, evidence-based biomedical knowledge retrieval. Abstract: The exponential growth of biomedical literature creates significant challenges for accessing precise medical information. Current biomedical question-answering systems primarily focus on short-form answers, failing to provide the comprehensive explanations necessary for clinical decision-making. We present RAG-BioQA, a novel framework combining retrieval-augmented generation with domain-specific fine-tuning to produce evidence-based, long-form biomedical answers. Our approach integrates BioBERT embeddings with FAISS indexing and compares various re-ranking strategies (BM25, ColBERT, MonoT5) to optimize context selection before synthesizing evidence through a fine-tuned T5 model. Experimental results on the PubMedQA dataset show significant improvements over baselines, with our best model achieving substantial gains across BLEU, ROUGE, and METEOR metrics, advancing the state of accessible, evidence-based biomedical knowledge retrieval.

[50] Efficient Training of Robust Traditional Chinese LLaMA-1B on a Single Consumer GPU: Continual Pre-training, SFT, and DPO

Yu-Cheng Chih,Ming-Tao Duan,Yong-Hao Hou

Main category: cs.CL

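TL;DR: Presents PureTC-1B, a three-stage stabilization pipeline that uses parameter-efficient LoRA adapters to improve the language purity of Llama-3.2-1B-Instruct when generating Traditional Chinese (TC), sharply reducing non-TC output and showing clear gains on a real-world-style benchmark and a named entity translation task.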

Details

Motivation: Small Language Models (SLMs) generate unstably in Traditional Chinese (TC) applications, often emitting non-TC characters or code-switching, which limits real-world deployment; linguistic consistency and reliability of generation need to improve. Method: A three-stage stabilization pipeline: continual pre-training (CPT) on TC corpora, supervised fine-tuning (SFT) on instruction data, and Direct Preference Optimization (DPO) with TC-adherence preferences; the whole process uses LoRA adapters and requires no full-model retraining. Result: On a benchmark simulating real-world usage, PureTC-1B emits 51.3% fewer non-TC output tokens (micro-average) than the base model; on a named entity translation task it reduces incorrect-language tokens by 77.2% relative to Llama-3B and 57.2% relative to Qwen-1.5B. Conclusion: Robust TC adherence is achievable even at the 1B scale with efficient adapter methods; the approach is reproducible and hardware-friendly, offering a practical recipe for improving generation stability in non-English languages. Abstract: Small Language Models (SLMs) enable cost-effective, on-device and latency-sensitive AI applications, yet their deployment in Traditional Chinese (TC) remains hindered by token-level instability: models unpredictably emit non-TC characters or code-switch into other languages. We address this practical reliability gap by creating PureTC-1B, a three-stage stabilization pipeline for Llama-3.2-1B-Instruct (an open-weight, instruction-tuned model released by Meta) using parameter-efficient LoRA adapters. Our method combines Continual Pre-Training (CPT) on TC-centric corpora, Supervised Fine-Tuning (SFT) with instruction data, and Direct Preference Optimization (DPO) using TC-adherence preferences to improve monolingual robustness without full-model retraining. On a benchmark designed to simulate real-world usage, PureTC-1B achieves a 51.3% relative reduction (micro-average) in non-TC output tokens versus the base model. On a Named Entity Translation (NET) task, PureTC-1B further reduces incorrect-language tokens by 77.2% relative to Llama-3B and 57.2% relative to Qwen-1.5B, indicating that robust TC adherence is attainable even at the 1B scale. The pipeline is reproducible, adapter-only, and hardware-friendly, offering practitioners a practical recipe to enhance language stability for TC and potentially other non-English languages.
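
The headline metric, a micro-averaged relative reduction in non-TC tokens, can be unpacked with a short sketch. The per-response counts below are made-up illustrative numbers, not data from the paper, and the function names are ours.

```python
from typing import List, Tuple

# Each pair is (non-TC tokens, total tokens) for one model response.
Counts = List[Tuple[int, int]]


def micro_rate(counts: Counts) -> float:
    """Micro-average: pool tokens over all responses before dividing."""
    non_tc = sum(bad for bad, _ in counts)
    total = sum(n for _, n in counts)
    return non_tc / total


def relative_reduction(base_rate: float, new_rate: float) -> float:
    """Fraction of the baseline error rate that was eliminated."""
    return (base_rate - new_rate) / base_rate


# Hypothetical counts, chosen only to illustrate the arithmetic.
base = micro_rate([(10, 100), (30, 100)])   # 40 / 200 = 0.20
tuned = micro_rate([(5, 100), (15, 100)])   # 20 / 200 = 0.10
reduction = relative_reduction(base, tuned)
```

Micro-averaging weights every token equally, so long responses influence the rate more than short ones; a macro-average over per-response rates could give a different figure.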

[51] AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System

Hui Yi Leong,Yuheng Li,Yuqing Wu,Wenwen Ouyang,Wei Zhu,Jiechao Gao

Main category: cs.CL

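TL;DR: Proposes AMAS, a framework that adapts the structure of LLM-based multi-agent systems through a dynamic graph designer, markedly improving performance on question answering, mathematical reasoning, and code generation.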

Details

Motivation: Conventional multi-agent system architectures depend on fixed, hand-crafted graph structures that cannot respond to context, limiting how effective LLMs are across varied tasks. Method: Introduces the AMAS framework, whose dynamic graph designer, built through lightweight LLM adaptation, autonomously generates the best graph configuration for each task and uses input features to route queries. Result: On question answering, mathematical reasoning, and code generation benchmarks, AMAS consistently surpasses existing single-agent and multi-agent methods across multiple LLM architectures. Conclusion: Context-sensitive structural adaptability is a foundational requirement for deploying high-performance LLM multi-agent systems. Abstract: Although large language models (LLMs) have revolutionized natural language processing capabilities, their practical implementation as autonomous multi-agent systems (MAS) for industrial problem-solving encounters persistent barriers. Conventional MAS architectures are fundamentally restricted by inflexible, hand-crafted graph topologies that lack contextual responsiveness, resulting in diminished efficacy across varied academic and commercial workloads. To surmount these constraints, we introduce AMAS, a paradigm-shifting framework that redefines LLM-based MAS through a novel dynamic graph designer. This component autonomously identifies task-specific optimal graph configurations via lightweight LLM adaptation, eliminating the reliance on monolithic, universally applied structural templates. Instead, AMAS exploits the intrinsic properties of individual inputs to intelligently direct query trajectories through task-optimized agent pathways. Rigorous validation across question answering, mathematical deduction, and code generation benchmarks confirms that AMAS systematically exceeds state-of-the-art single-agent and multi-agent approaches across diverse LLM architectures. Our investigation establishes that context-sensitive structural adaptability constitutes a foundational requirement for high-performance LLM MAS deployments.

[52] NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT

John Hawkins,Aditya Pramar,Rodney Beard,Rohitash Chandra

Main category: cs.CL

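TL;DR: Examines how well different machine learning models identify jailbreak prompts against large language models (LLMs), finding that on current datasets an end-to-end fine-tuned BERT model performs best and that explicit reflexivity in prompt structure may signal jailbreak intent.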

Details

Motivation: LLMs have security vulnerabilities that let malicious users bypass safety guardrails with specially crafted inputs (jailbreak prompts), so effective detection methods are needed. Method: Uses several machine learning models to distinguish jailbreak prompts from genuine ones, focusing on the ability to recognize previously unseen jailbreak strategies, and visualizes keywords to analyze their characteristics. Result: On existing datasets, a BERT model fine-tuned end-to-end gives the best jailbreak detection performance; keyword visualization suggests that explicit reflexivity in prompt structure may indicate jailbreak intent. Conclusion: A fine-tuned BERT model is currently among the most effective ways to detect jailbreak prompts, and structural features of prompts (such as reflexivity) can serve as an important detection signal that helps strengthen LLM safety. Abstract: Large Language Models (LLMs) suffer from a range of vulnerabilities that allow malicious users to solicit undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses acceptable to the developer's policies. In this study, we analyse the ability of different machine learning models to distinguish jailbreak prompts from genuine uses, including looking at our ability to identify jailbreaks that use previously unseen strategies. Our results indicate that using current datasets the best performance is achieved by fine-tuning a Bidirectional Encoder Representations from Transformers (BERT) model end-to-end for identifying jailbreaks. We visualise the keywords that distinguish jailbreak from genuine prompts and conclude that explicit reflexivity in prompt structure could be a signal of jailbreak intention.

[53] Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention

Zhaoxin Feng,Jianfei Ma,Emmanuele Chersoni,Xiaojing Zhao,Xiaoyi Bao

Main category: cs.CL

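TL;DR: Explores whether enabling bidirectional attention and contrastive learning in the Llama architecture improves LLM performance on text embedding tasks.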

Details

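Motivation: The unidirectional attention mechanism has slowed the use of autoregressive LLMs in text embedding and in probing studies of semantic representation; this paper aims to overcome that constraint. Method: Applies additional training to different variants of the Llama architecture, progressively enabling bidirectional attention and testing it together with unsupervised and supervised contrastive learning. Result: The experiments indicate that enabling bidirectional attention improves model performance on text embedding tasks. Conclusion: Bidirectional attention can ease the limits of autoregressive models in semantic representation and make them more applicable to downstream tasks. Abstract: Autoregressive Large Language Models (LLMs) demonstrate exceptional performance in language understanding and generation. However, their application in text embedding tasks has been relatively slow, along with the analysis of their semantic representation in probing tasks, due to the constraints of the unidirectional attention mechanism. This paper aims to explore whether such constraints can be overcome by enabling bidirectional attention in LLMs. We tested different variants of the Llama architecture through additional training steps, progressively enabling bidirectional attention and unsupervised/supervised contrastive learning.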

[54] SoK: Measuring What Matters for Closed-Loop Security Agents

Mudita Khurana,Raunak Jain

Main category: cs.CL

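TL;DR: Proposes the CLASP framework and the CLC Score for evaluating the agentic capabilities of closed-loop autonomous security systems across the security lifecycle, filling gaps in unified frameworks, evaluation methods, and benchmarks.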

Details

Motivation: Cybersecurity lacks a unified framework, evaluation method, and benchmark for measuring the closed-loop capabilities of autonomous security agents, which leaves research fragmented and makes systematic improvement of defenses difficult. Method: Proposes the CLASP framework, which aligns the security lifecycle with core agentic capabilities, designs the CLC Score as a composite metric, and validates the framework by analyzing 21 representative works. Result: Applies CLASP to 21 studies, identifying where systems are strong and where capability gaps remain, defines the CLC Score to quantify degree of loop closure and operational effectiveness, and sets out requirements for a closed-loop benchmark. Conclusion: CLASP and the CLC Score provide the vocabulary, diagnostics, and measurements needed to evaluate and advance closed-loop autonomous security systems and help the field develop systematically. Abstract: Cybersecurity is a relentless arms race, with AI-driven offensive systems evolving faster than traditional defenses can adapt. Research and tooling remain fragmented across isolated defensive functions, creating blind spots that adversaries exploit. Autonomous agents capable of integrating exploit confirmation, remediation, and validation into a single closed loop offer promise, but the field lacks three essentials: a framework defining the agentic capabilities of security systems across the security lifecycle, a principled method for evaluating closed-loop agents, and a benchmark for measuring their performance in practice. We introduce CLASP, the Closed-Loop Autonomous Security Performance framework, which aligns the security lifecycle (reconnaissance, exploitation, root cause analysis, patch synthesis, validation) with core agentic capabilities (planning, tool use, memory, reasoning, reflection & perception), providing a common vocabulary and rubric for assessing agentic capabilities in security tasks. By applying CLASP to 21 representative works, we map where systems demonstrate strengths and where capability gaps persist. We then define the Closed-Loop Capability (CLC) Score, a composite metric quantifying both degree of loop closure and operational effectiveness, and outline the requirements for a closed-loop benchmark. Together, CLASP and the CLC Score provide the vocabulary, diagnostics, and measurements needed both to advance function-level performance and to measure closed-loop security agents.

[55] MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization

Yinhong Liu,Jianfeng He,Hang Su,Ruixue Lian,Yi Nian,Jake Vincent,Srikanth Vishnubhotla,Robinson Piramuthu,Saab Mansour

Main category: cs.CL

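TL;DR: This paper introduces MDSEval, the first meta-evaluation benchmark for multimodal dialogue summarization, comprising image-sharing dialogues, summaries, and human ratings across eight quality aspects. It also proposes a filtering framework based on Mutually Exclusive Key Information (MEKI) across modalities to ensure data quality, and shows that existing evaluation methods struggle to distinguish summaries generated by advanced MLLMs and are susceptible to biases.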

Details

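Motivation: Developing effective multimodal dialogue summarization (MDS) models requires strong automatic evaluation methods, which in turn depend on a high-quality meta-evaluation benchmark grounded in human annotations; no such benchmark currently exists. Method: Builds MDSEval, a new benchmark containing multimodal dialogues, summaries, and human ratings, and proposes a filtering framework based on Mutually Exclusive Key Information (MEKI) across modalities to improve data quality; also defines MDS-specific evaluation dimensions and systematically benchmarks current state-of-the-art evaluation methods. Result: MDSEval is the first meta-evaluation benchmark for MDS and covers eight well-defined quality aspects; experiments show that existing evaluation methods have difficulty distinguishing summaries generated by advanced MLLMs and are affected by multiple biases. Conclusion: This work fills the gap in meta-evaluation benchmarks for MDS, formalizes key evaluation dimensions, and exposes the shortcomings of existing automatic evaluation methods, providing a basis for more reliable MDS evaluation. Abstract: Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. To support the development of effective MDS models, robust automatic evaluation methods are essential for reducing both cost and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations. In this work, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting of image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richness, we propose a novel filtering framework leveraging Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art modal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various biases.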

[56] FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol

He Zhang,Anzhou Zhang,Jian Dai

Main category: cs.CL

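TL;DR: FOR-Prompting is a training-free, prompt-based protocol with divided roles that introduces a questioning role (the Objectioner) to drive self-revision, substantially improving reasoning accuracy and coherence for both large and small models on mathematical reasoning and open-ended tasks.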

Details

Motivation: Existing reasoning methods such as Chain of Thought (CoT) lack a mechanism in which external questioning triggers self-revision, which limits models' ability to correct themselves, particularly for small models and complex problems. Method: Proposes the FOR-Prompting protocol with three roles: a Defender (proposes an answer), an Objectioner (raises question-style objections without offering direct fixes), and a Host (enforces logical consistency and closure). Reasoning is refined through structured dialogue turns, entirely at the prompt level, without tools or human intervention. Result: On GSM8K, accuracy improves by about 22% over single-prompt and is on par with CoT, with reasoning quality ratings more than 10% higher; the small Llama3.2:1b model gains about 19% in accuracy; the protocol corrects mistakes on tricky problems on its own and, on open-ended tasks, promotes deeper exploration and makes assumptions explicit. Conclusion: By introducing external questioning, FOR-Prompting strengthens models' capacity for self-revision; it is model agnostic and flexible to deploy, and is relevant both for improving small-model reasoning and for scalable research on objection-guided reasoning. Abstract: Reasoning protocols such as Chain of Thought (CoT) and Tree of Thought (ToT) organize internal deliberation but lack an explicit mechanism for external questioning that elicits self-revision. We present FOR-Prompting (From Objection to Revision Prompting), an asymmetric protocol where a Defender proposes an answer, an Objectioner raises question-style objections with no direct fixes, and a Host enforces consistency and closure. On GSM8K we observe about a 22 percentage point gain over single-prompt and accuracy on par with CoT, with more than 10% higher ratings in reasoning and coherence from a uniform GPT 4.1 judge. FOR-Prompting also corrects mistakes without tools or human supervision on tricky queries, and improves performance for small-scale models (an accuracy improvement of approx. 19% on Llama3.2:1b for the GSM8K task), highlighting promise for small models and for use on personal devices. Beyond factual QA, qualitative analyses on open-ended tasks show enhanced exploration and refinement, with dialogue traces that make assumptions and trade-offs explicit. The protocol is model agnostic and operates purely at the prompt level through role-structured turns, so it works with hosted and local models of different sizes without retraining, and it supports large-scale study of objection-guided reasoning.
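The role-structured turns described in the abstract can be sketched as a small loop. This is only an illustration under stated assumptions: the `Model` callable type, the prompt wording, the `CLOSE`/`CONTINUE` convention, and the `max_rounds` limit are all hypothetical and not taken from the paper, and the stub models stand in for real LLM calls.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a model is any function from prompt text to reply text.
Model = Callable[[str], str]


def for_prompting(
    question: str,
    defender: Model,
    objectioner: Model,
    host: Model,
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Run a Defender/Objectioner/Host dialogue and return (answer, transcript)."""
    transcript: List[str] = []
    answer = defender(f"Question: {question}\nPropose an answer with reasoning.")
    transcript.append(f"Defender: {answer}")
    for _ in range(max_rounds):
        # The Objectioner only asks questions; it must not supply a fix.
        objection = objectioner(
            f"Question: {question}\nCurrent answer: {answer}\n"
            "Raise one question-style objection. Do not provide a fix."
        )
        transcript.append(f"Objectioner: {objection}")
        # The Defender revises (or defends) its own answer.
        answer = defender(
            f"Question: {question}\nYour answer: {answer}\n"
            f"Objection: {objection}\nRevise or defend your answer."
        )
        transcript.append(f"Defender: {answer}")
        # The Host checks consistency and decides whether to close the dialogue.
        verdict = host(
            f"Question: {question}\nAnswer: {answer}\n"
            "Reply CLOSE if the answer is consistent, otherwise CONTINUE."
        )
        transcript.append(f"Host: {verdict}")
        if verdict.strip().upper().startswith("CLOSE"):
            break
    return answer, transcript


# Stub models for demonstration only (assumptions, not real LLMs).
def stub_defender(prompt: str) -> str:
    return "14" if "Objection:" in prompt else "12"


def stub_objectioner(prompt: str) -> str:
    return "Did you count both endpoints?"


def stub_host(prompt: str) -> str:
    return "CLOSE"


final_answer, dialogue = for_prompting(
    "How many fence posts?", stub_defender, stub_objectioner, stub_host
)
```

Keeping the three roles as plain callables means the same loop can wrap hosted or local models, which matches the prompt-level, model-agnostic framing in the abstract.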

[57] How Do Language Models Compose Functions?

Apoorv Khandelwal,Ellie Pavlick

Main category: cs.CL

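TL;DR: Investigates whether large language models (LLMs) use compositional mechanisms when solving two-hop factual recall tasks; finds that a "compositionality gap" exists and identifies two processing mechanisms, compositional and direct, whose selection is related to the geometry of the embedding space.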

Details

Motivation: To examine whether large language models genuinely use compositional mechanisms when performing compositional tasks, rather than relying only on surface pattern matching. Method: Applies logit lens to residual stream activations to study models' internal mechanisms on two-hop factual recall tasks, and analyzes the geometric properties of the embedding space. Result: Confirms that the compositionality gap exists; finds two mechanisms, compositional and direct; and finds that which mechanism is used is related to whether a linear mapping from x to g(f(x)) exists in the embedding space. Conclusion: When solving compositional tasks, large language models may not use genuinely compositional mechanisms and may instead rely on shortcuts in the embedding space (such as linear mappings), which points to limits on their generalization. Abstract: While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as $g(f(x))$. We first confirm that modern LLMs continue to suffer from the "compositionality gap": i.e. their ability to compute both $z = f(x)$ and $y = g(z)$ does not entail their ability to compute the composition $y = g(f(x))$. Then, using logit lens on their residual stream activations, we identify two processing mechanisms, one which solves tasks $\textit{compositionally}$, computing $f(x)$ along the way to computing $g(f(x))$, and one which solves them $\textit{directly}$, without any detectable signature of the intermediate variable $f(x)$. Finally, we find that which mechanism is employed appears to be related to the embedding space geometry, with the idiomatic mechanism being dominant in cases where there exists a linear mapping from $x$ to $g(f(x))$ in the embedding spaces. We fully release our data and code at: https://github.com/apoorvkh/composing-functions .
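As a rough illustration of the "compositionality gap" mentioned above, the sketch below measures how often a model fails the composed query $g(f(x))$ even though it answers both hops $f(x)$ and $g(z)$ correctly. The record layout and the exact ratio used are assumptions for illustration; the paper's own measurement protocol may differ.

```python
from typing import Iterable, Tuple

# Hypothetical record: (hop1_correct, hop2_correct, composed_correct)
Record = Tuple[bool, bool, bool]


def compositionality_gap(records: Iterable[Record]) -> float:
    """Fraction of examples with both hops correct where the composition is wrong."""
    both_hops_known = [r for r in records if r[0] and r[1]]
    if not both_hops_known:
        return 0.0  # no eligible examples, so no measurable gap
    failures = sum(1 for r in both_hops_known if not r[2])
    return failures / len(both_hops_known)


# Toy data (made up): three examples have both hops correct, two of them fail g(f(x)).
toy_records = [
    (True, True, True),
    (True, True, False),
    (True, False, False),
    (True, True, False),
]
gap = compositionality_gap(toy_records)
```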

[58] Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation

Seungseop Lim,Gibaeg Kim,Wooseok Han,Jean Seo,Hyunkyung Lee,Jaehyo Yoo,Eunho Yang

Main category: cs.CL

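TL;DR: This paper proposes a data-centric remedy for "Format Inertia", a failure that large language models exhibit in medical pre-consultation, by rebalancing the turn-count distribution of dialogues in the training data.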

Details

Motivation: Supervised fine-tuning data for medical pre-consultation typically has a skewed turn-count distribution, which leads models to generate repetitive, diagnostically uninformative questions in long dialogues, a failure termed "Format Inertia". Method: Uses a simple data-centric approach that readjusts the distribution of dialogue turn counts in the training data to reduce the effect of Format Inertia. Result: Experiments show that the proposed method substantially alleviates Format Inertia in multi-turn medical pre-consultation dialogues. Conclusion: Rebalancing the turn-count distribution of the training data is an effective strategy for mitigating Format Inertia and helps improve generation quality in long medical dialogues. Abstract: Recent advances in Large Language Models (LLMs) have brought significant improvements to various service domains, including chatbots and medical pre-consultation applications. In the healthcare domain, the most common approach for adapting LLMs to multi-turn dialogue generation is Supervised Fine-Tuning (SFT). However, datasets for SFT in tasks like medical pre-consultation typically exhibit a skewed turn-count distribution. Training on such data induces a novel failure mechanism we term **Format Inertia**, where models tend to generate repetitive, format-correct, but diagnostically uninformative questions in long medical dialogues. To mitigate this observed failure mechanism, we adopt a simple, data-centric method that rebalances the turn-count distribution of the training dataset. Experimental results show that our approach substantially alleviates Format Inertia in medical pre-consultation.
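One way to picture "rebalancing the turn-count distribution" is to bucket dialogues by length and draw the same number from each bucket. The sketch below does exactly that; the equal-per-bucket target, the oversampling with replacement, and the function name are assumptions for illustration, since the abstract does not specify the rebalancing rule.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List

Dialogue = List[str]  # hypothetical: a dialogue is a list of utterances


def rebalance_by_turn_count(
    dialogues: List[Dialogue], per_bucket: int, seed: int = 0
) -> List[Dialogue]:
    """Return a dataset with exactly `per_bucket` dialogues for each turn count."""
    rng = random.Random(seed)
    buckets: Dict[int, List[Dialogue]] = defaultdict(list)
    for dialogue in dialogues:
        buckets[len(dialogue)].append(dialogue)

    balanced: List[Dialogue] = []
    for turn_count in sorted(buckets):
        items = buckets[turn_count]
        if len(items) >= per_bucket:
            # Over-represented length: downsample without replacement.
            balanced.extend(rng.sample(items, per_bucket))
        else:
            # Under-represented length: keep all, then oversample with replacement.
            balanced.extend(items)
            balanced.extend(rng.choices(items, k=per_bucket - len(items)))
    return balanced


# Toy skewed dataset (made up): eight short dialogues, two long ones.
short = ["q", "a"]
long_ = ["q", "a", "q", "a", "q", "a"]
skewed = [list(short) for _ in range(8)] + [list(long_) for _ in range(2)]
balanced = rebalance_by_turn_count(skewed, per_bucket=4)
turn_histogram = Counter(len(d) for d in balanced)
```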

[59] What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?

Jiwan Chung,Neel Joshi,Pratyusha Sharma,Youngjae Yu,Vibhav Vineet

Main category: cs.CL

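TL;DR: This paper introduces the MathLens benchmark, which decomposes multimodal reasoning into subskills (perception, reasoning, integration) and evaluates how different training methods perform on geometry problems.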

Details

Motivation: Existing evaluation relies only on aggregate accuracy, which makes it hard to see where models are improving, so a finer-grained benchmark is needed. Method: Builds a benchmark containing visual diagrams, textual descriptions, controlled questions, and perception probes, all derived from symbolic problem specifications to ensure consistency; perception, reasoning, and integration are separated and evaluated independently. Result: Reinforcement learning mainly improves perception, while textual SFT indirectly improves perception through reflective reasoning; reasoning improves only when perception improves alongside it; integration is the weakest capacity and is where errors concentrate; reinforcement learning improves consistency under diagram variation, whereas multimodal SFT reduces robustness through overfitting. Conclusion: MathLens reveals how each subskill of multimodal reasoning models develops, and points to current weaknesses in integration and robustness, giving direction for future research. Abstract: Multimodal reasoning models have recently shown promise on challenging domains such as olympiad-level geometry, yet their evaluation remains dominated by aggregate accuracy, a single score that obscures where and how models are improving. We introduce MathLens, a benchmark designed to disentangle the subskills of multimodal reasoning while preserving the complexity of textbook-style geometry problems. The benchmark separates performance into three components: Perception: extracting information from raw inputs, Reasoning: operating on available information, and Integration: selecting relevant perceptual evidence and applying it within reasoning. To support each test, we provide annotations: visual diagrams, textual descriptions to evaluate reasoning in isolation, controlled questions that require both modalities, and probes for fine-grained perceptual skills, all derived from symbolic specifications of the problems to ensure consistency and robustness. Our analysis reveals that different training approaches have uneven effects: First, reinforcement learning chiefly strengthens perception, especially when supported by textual supervision, while textual SFT indirectly improves perception through reflective reasoning. Second, reasoning improves only in tandem with perception. Third, integration remains the weakest capacity, with residual errors concentrated there once other skills advance. Finally, robustness diverges: RL improves consistency under diagram variation, whereas multimodal SFT reduces it through overfitting. We will release all data and experimental logs.

[60] Machine-interpretable Engineering Design Standards for Valve Specification

Anders Gjerver,Rune Frostad,Vedrana Barisic,Melinda Hodkiewicz,Caitlin Woods,Mihaly Fekete,Arild Braathen Torjusen,Johan Wilhelm Kluwer

Main category: cs.CL

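TL;DR: This paper proposes transforming the information in engineering design standards into modular, reusable, machine-interpretable ontologies and using them for quality assurance in plant design and equipment selection.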

Details

Motivation: Despite industry's push toward digitalization, product specifications and design standards remain document-centric and lack machine readability and interoperability. Method: Uses modelling patterns to convert the textual and tabular knowledge in International Standards into modular ontologies that are W3C compliant and aligned with the top-level ontology ISO DIS 23726-3 (IDO), and tests them on a valve selection process. Result: Achieves automated validation, based on semantic reasoning and executable design rules, that a specific valve data sheet (VDS) complies with industry standards, and determines whether a manufacturer product type meets the specification. Conclusion: Shared, reusable IDO-based modular ontologies allow semantic reasoning to be applied to equipment selection and demonstrate the potential of a transition to digitized Smart Standards. Abstract: Engineering design processes use technical specifications and must comply with standards. Product specifications, product type data sheets, and design standards are still mainly document-centric despite the ambition to digitalize industrial work. In this paper, we demonstrate how to transform information held in engineering design standards into modular, reusable, machine-interpretable ontologies and use the ontologies in quality assurance of the plant design and equipment selection process. We use modelling patterns to create modular ontologies for knowledge captured in the text and in frequently referenced tables in International Standards for piping, material and valve design. These modules are exchangeable, as stored in a W3C compliant format, and interoperable as they are aligned with the top-level ontology ISO DIS 23726-3: Industrial Data Ontology (IDO). We test these ontologies, created based on international material and piping standards and industry norms, on a valve selection process. Valves are instantiated in semantic asset models as individuals along with a semantic representation of the environmental condition at their location on the asset. We create "functional location tags" as OWL individuals that become instances of OWL class Valve Data Sheet (VDS) specified valves. Similarly we create instances of manufacturer product type. Our approach enables automated validation that a specific VDS is compliant with relevant industry standards. Using semantic reasoning and executable design rules, we also determine whether the product type meets the valve specification.
Creation of shared, reusable IDO-based modular ontologies for design standards enables semantic reasoning to be applied to equipment selection processes and demonstrates the potential of this approach for Standards Bodies wanting to transition to digitized Smart Standards.

[61] Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

Wenbo Pan,Jie Xu,Qiguang Chen,Junhao Dong,Libo Qin,Xinfeng Li,Haining Yu,Xiaohua Jia

Main category: cs.CL

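TL;DR: This paper proposes a new metric, the Refusal Index (RI), for accurately measuring the knowledge-aware refusal ability of large language models on questions they do not know. RI is defined as Spearman's rank correlation between refusal probability and error probability, and is measured in practice with a lightweight two-pass evaluation method. Experiments show that RI evaluates refusal behavior stably and consistently, independent of accuracy and refusal rate, and reveals that current models' refusal behavior on factual tasks is unreliable.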

Details

Motivation: Existing metrics for how well LLMs refuse unknown questions are either biased or indirect and do not faithfully reflect knowledge-aware refusal, so a more reliable, direct metric that is not distorted by refusal rate is needed. Method: Proposes the Refusal Index (RI), defined as Spearman's rank correlation between refusal probability and error probability, and designs a lightweight two-pass evaluation method that efficiently estimates RI from the refusal rates observed across two standard evaluation runs. Result: Experiments on 16 models and 5 datasets show that RI accurately quantifies knowledge-aware refusal, stays stable across refusal rates, and gives consistent model rankings independent of accuracy. They also show that although LLMs achieve high accuracy on factual tasks, their refusal behavior can be unreliable and fragile. Conclusion: The Refusal Index (RI) is an effective, stable, and informative metric that complements traditional accuracy metrics and supports comprehensive evaluation of LLM factuality, with emphasis on a model's ability to refuse when uncertain. Abstract: Large Language Models (LLMs) should refuse to answer questions beyond their knowledge. This capability, which we term knowledge-aware refusal, is crucial for factual reliability. However, existing metrics fail to faithfully measure this ability. On the one hand, simple refusal-based metrics are biased by refusal rates and yield inconsistent scores when models exhibit different refusal tendencies. On the other hand, existing calibration metrics are proxy-based, capturing the performance of auxiliary calibration processes rather than the model's actual refusal behavior. In this work, we propose the Refusal Index (RI), a principled metric that measures how accurately LLMs refuse questions they do not know. We define RI as Spearman's rank correlation between refusal probability and error probability. To make RI practically measurable, we design a lightweight two-pass evaluation method that efficiently estimates RI from observed refusal rates across two standard evaluation runs. Extensive experiments across 16 models and 5 datasets demonstrate that RI accurately quantifies a model's intrinsic knowledge-aware refusal capability in factual tasks. Notably, RI remains stable across different refusal rates and provides consistent model rankings independent of a model's overall accuracy and refusal rates. More importantly, RI provides insight into an important but previously overlooked aspect of LLM factuality: while LLMs achieve high accuracy on factual tasks, their refusal behavior can be unreliable and fragile.
This finding highlights the need to complement traditional accuracy metrics with the Refusal Index for comprehensive factuality evaluation.
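Since RI is defined as Spearman's rank correlation between refusal probability and error probability, a minimal sketch of that correlation may help. It assumes per-question probabilities are directly available, which is a simplification: the two-pass estimation procedure from the abstract is not reproduced here, and the toy numbers are made up.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-question probabilities (illustrative values only).
refusal_prob = [0.9, 0.7, 0.2, 0.1]
error_prob = [0.8, 0.6, 0.3, 0.05]
refusal_index = spearman(refusal_prob, error_prob)
```

A value near 1 means the model refuses most on the questions it is most likely to get wrong; a value near 0 means refusals are unrelated to what the model knows.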

[62] Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction

Ivan Leonidovich Litvak,Anton Kostin,Fedor Lashkin,Tatiana Maksiyan,Sergey Lagutin

Main category: cs.CL

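TL;DR: This study evaluates 16 unsupervised metrics for assessing the quality of extracting seven semantic blocks from 1,000 Russian judicial decisions, validated against 7,168 expert ratings (scale of 1 to 5). Term Frequency Coherence and Coverage Ratio/Block Completeness align best with expert ratings, while Legal Term Density is negatively correlated; the LLM Evaluation Score shows moderate alignment, indicating that unsupervised methods scale well but cannot fully replace human judgment in high-stakes legal settings.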

Details

Motivation: Scalable methods that need no manual annotation are required to evaluate the quality of legal text extraction and to support the rapid advance of AI in legal natural language processing. Method: Evaluates 16 unsupervised metrics (document-based, semantic, structural, pseudo-ground truth, and legal-specific), validating them against expert ratings (1-5 Likert scale) using bootstrapped correlations, Lin's concordance correlation coefficient (CCC), and mean absolute error (MAE). Result: Term Frequency Coherence (r=0.540, CCC=0.512, MAE=0.127) and Coverage Ratio/Block Completeness (r=0.513, CCC=0.443, MAE=0.139) align best with expert ratings; Legal Term Density is strongly negatively correlated (r=-0.479, CCC=-0.079); the LLM Evaluation Score (r=0.382, CCC=0.325, MAE=0.197) shows moderate alignment but limited specialization for legal texts. Conclusion: Unsupervised metrics, including LLM-based approaches, can be used for scalable preliminary screening of legal text extraction, but given moderate correlations and low concordance they cannot yet fully replace human evaluation in high-stakes legal applications. Abstract: The rapid advancement of artificial intelligence in legal natural language processing demands scalable methods for evaluating text extraction from judicial decisions. This study evaluates 16 unsupervised metrics, including novel formulations, to assess the quality of extracting seven semantic blocks from 1,000 anonymized Russian judicial decisions, validated against 7,168 expert reviews on a 1--5 Likert scale. These metrics, spanning document-based, semantic, structural, pseudo-ground truth, and legal-specific categories, operate without pre-annotated ground truth. Bootstrapped correlations, Lin's concordance correlation coefficient (CCC), and mean absolute error (MAE) reveal that Term Frequency Coherence (Pearson $r = 0.540$, Lin CCC = 0.512, MAE = 0.127) and Coverage Ratio/Block Completeness (Pearson $r = 0.513$, Lin CCC = 0.443, MAE = 0.139) best align with expert ratings, while Legal Term Density (Pearson $r = -0.479$, Lin CCC = -0.079, MAE = 0.394) shows a strong negative correlation. The LLM Evaluation Score (mean = 0.849, Pearson $r = 0.382$, Lin CCC = 0.325, MAE = 0.197) showed moderate alignment, but its performance, using gpt-4.1-mini via g4f, suggests limited specialization for legal texts. These findings highlight that unsupervised metrics, including LLM-based approaches, enable scalable screening but, with moderate correlations and low CCC values, cannot fully replace human judgment in high-stakes legal contexts.
This work advances legal NLP by providing annotation-free evaluation tools, with implications for judicial analytics and ethical AI deployment.
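The three agreement statistics named above can be sketched in a few lines. The formulas are the standard textbook ones (Lin's CCC in its population-moment form, $\rho_c = 2 s_{xy} / (s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2)$); the bootstrapping step and any score normalization used in the study are not reproduced, and the toy scores are made up.

```python
import math
from typing import Sequence, Tuple


def _moments(
    x: Sequence[float], y: Sequence[float]
) -> Tuple[float, float, float, float, float]:
    """Return means, population variances, and population covariance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return mean_x, mean_y, var_x, var_y, cov


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    _, _, var_x, var_y, cov = _moments(x, y)
    return cov / math.sqrt(var_x * var_y)


def lin_ccc(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's concordance correlation: penalizes shifts in mean and scale."""
    mean_x, mean_y, var_x, var_y, cov = _moments(x, y)
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


def mean_absolute_error(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Toy example: the metric tracks expert ratings perfectly but is offset by +1.
expert = [1.0, 2.0, 3.0, 4.0, 5.0]
metric = [2.0, 3.0, 4.0, 5.0, 6.0]
r = pearson_r(expert, metric)
ccc = lin_ccc(expert, metric)
mae = mean_absolute_error(expert, metric)
```

The toy case shows why the study reports CCC alongside Pearson $r$: a constant offset leaves $r$ at 1 but lowers CCC, because CCC measures agreement rather than only linear association.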

[63] Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network

Xin Liu,Rongwu Xu,Xinyi Jia,Jason Liao,Jiao Sun,Ling Huang,Wei Xu

Main category: cs.CL

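TL;DR: This paper proposes FraudSquad, a hybrid detection model for identifying highly realistic spam reviews generated by large language models. The model combines text embeddings from a pre-trained language model with a gated graph transformer, captures semantic and behavioral signals without manual feature engineering, and substantially outperforms existing methods on multiple datasets.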

Details

Motivation: As large language models advance, generated spam reviews have become highly persuasive and hard to detect, seriously threatening the credibility of online platforms and calling for more effective detection. Method: Builds three spam review datasets generated by different large language models, using product metadata and genuine reviews to make the generated text more realistic, and proposes the FraudSquad model, which combines text embeddings with a gated graph transformer for node classification. Result: On the three LLM-generated datasets, FraudSquad improves on the best existing methods by up to 44.22% in precision and up to 43.01% in recall; it also performs well on human-written spam datasets, with a small model size and little labeled data required. Conclusion: FraudSquad is an efficient, practical approach to spam review detection in the LLM era; the study stresses the urgency of addressing LLM misuse and provides new synthetic datasets and an open-source framework to support follow-up research. Abstract: The rise of large language models (LLMs) has enabled the generation of highly persuasive spam reviews that closely mimic human writing. These reviews pose significant challenges for existing detection systems and threaten the credibility of online platforms. In this work, we first create three realistic LLM-generated spam review datasets using three distinct LLMs, each guided by product metadata and genuine reference reviews. Evaluations by GPT-4.1 confirm the high persuasion and deceptive potential of these reviews. To address this threat, we propose FraudSquad, a hybrid detection model that integrates text embeddings from a pre-trained language model with a gated graph transformer for spam node classification. FraudSquad captures both semantic and behavioral signals without relying on manual feature engineering or massive training resources. Experiments show that FraudSquad outperforms state-of-the-art baselines by up to 44.22% in precision and 43.01% in recall on three LLM-generated datasets, while also achieving promising results on two human-written spam datasets. Furthermore, FraudSquad maintains a modest model size and requires minimal labeled training data, making it a practical solution for real-world applications. Our contributions include new synthetic datasets, a practical detection framework, and empirical evidence highlighting the urgency of adapting spam detection to the LLM era. Our code and datasets are available at: https://anonymous.4open.science/r/FraudSquad-5389/.

Dane Williamson,Yangfeng Ji,Matthew Dwyer

Main category: cs.CL

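TL;DR: Large language models perform well at mathematical problem solving but are prone to errors when the syntax deviates from the training distribution. This paper identifies a systematic error pattern called "syntactic blind spots" and proposes syntactic rephrasing and a complexity measure based on Dependency Locality Theory to expose and mitigate such errors.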

Details

Motivation: To study why LLMs fail on problems that are semantically simple but phrased in unfamiliar ways, and to examine whether these errors stem from a brittle coupling between syntactic structure and internal representation rather than from insufficient mathematical ability. Method: Extracts syntactic templates from correctly answered examples, uses them to produce meaning-preserving rephrasings of incorrectly answered questions, and quantifies syntactic complexity with a metric based on Dependency Locality Theory (DLT) to analyze its relationship with error rate. Result: Experiments show that syntactically simplified rephrasings markedly improve model accuracy, and that higher DLT scores are associated with higher error rates across multiple datasets. Conclusion: Many reasoning errors stem from structural mismatch rather than conceptual difficulty, and syntax-aware interventions can expose and reduce this kind of inductive bias. Abstract: Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities but frequently fail on problems that deviate syntactically from their training distribution. We identify a systematic failure mode, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically straightforward but phrased in unfamiliar ways. These errors are not due to gaps in mathematical competence, but rather reflect a brittle coupling between surface form and internal representation. To test this, we rephrase incorrectly answered questions using syntactic templates drawn from correct examples. These rephrasings, which preserve semantics while reducing structural complexity, often lead to correct answers. We quantify syntactic complexity using a metric based on Dependency Locality Theory (DLT), and show that higher DLT scores are associated with increased failure rates across multiple datasets. Our findings suggest that many reasoning errors stem from structural misalignment rather than conceptual difficulty, and that syntax-aware interventions can reveal and mitigate these inductive failures.

[65] SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning

Shicheng Liu,Kai Sun,Lisheng Fu,Xilun Chen,Xinyuan Zhang,Zhaojiang Lin,Rulin Shao,Yue Liu,Anuj Kumar,Wen-tau Yih,Xin Luna Dong

Main category: cs.CL

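TL;DR: This paper proposes SCRIBES, a reinforcement learning framework for large-scale extraction of semi-structured web content. It exploits the similarity of page layouts within a site to generate reusable extraction scripts, improves through iterative training on CommonCrawl data, and clearly outperforms existing methods in both script quality and downstream task accuracy.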

Details

Motivation: Semi-structured content such as tables, lists, and infoboxes on web pages holds a large amount of factual data, but its complex formatting makes it hard to extract; existing methods either generalize poorly or consume heavy resources because they run large-model inference on every page. Method: Proposes the SCRIBES framework, which uses reinforcement learning with layout similarity across pages of the same site as the reward signal to generate reusable extraction scripts; the model is refined through iterative training on synthetic annotations generated from real-world web data (CommonCrawl). Result: Experiments show the method exceeds strong baselines by more than 13% in script quality and raises GPT-4o's accuracy on downstream question answering by more than 4%. Conclusion: SCRIBES enables efficient, scalable, and resource-friendly large-scale web information extraction; reusable scripts and a reinforcement learning strategy based on layout similarity substantially improve semi-structured data extraction. Abstract: Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging. Existing methods either lack generalization or are resource-intensive due to per-page LLM inference. In this paper, we introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework that leverages layout similarity across webpages within the same site as a reward signal. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Our approach further improves by iteratively training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.

[66] Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models

Ece Takmaz,Lisa Bylinina,Jakub Dotlacil

Main category: cs.CL

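TL;DR: This paper presents an approach to developing language models in low-resource settings that uses model merging to improve multimodal models on language-only tasks while maintaining their multimodal performance.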

Details

Motivation: Existing vision-and-language models have many parameters and depend on large-scale data, far exceeding the amount of data children encounter when learning language, so low-resource multimodal models that better match child language acquisition need to be explored. Method: Builds language-only and multimodal models and merges them using weighted linear interpolation to improve performance under low-resource conditions. Result: Multimodal models perform worse on language tasks, especially on grammar-focused benchmarks; merging with a language-only model partly alleviates this while preserving multimodal ability. Conclusion: Model merging is an effective strategy for strengthening the language understanding of multimodal models without sacrificing multimodal performance. Abstract: State-of-the-art vision-and-language models consist of many parameters and learn from enormous datasets, surpassing the amounts of linguistic data that children are exposed to as they acquire a language. This paper presents our approach to the multimodal track of the BabyLM challenge addressing this discrepancy. We develop language-only and multimodal models in low-resource settings using developmentally plausible datasets, with our multimodal models outperforming previous BabyLM baselines. One finding in the multimodal language model literature is that these models tend to underperform in \textit{language-only} tasks. Therefore, we focus on maintaining language-only abilities in multimodal models. To this end, we experiment with \textit{model merging}, where we fuse the parameters of multimodal models with those of language-only models using weighted linear interpolation. Our results corroborate the findings that multimodal models underperform in language-only benchmarks that focus on grammar, and model merging with text-only models can help alleviate this problem to some extent, while maintaining multimodal performance.
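Weighted linear interpolation of parameters can be written as $\theta_{\text{merged}} = \alpha\,\theta_{\text{mm}} + (1 - \alpha)\,\theta_{\text{lm}}$. The sketch below applies this to plain Python dictionaries of parameter vectors; the dictionary layout, the name `alpha`, and the requirement that both models share identical parameter names and shapes are assumptions for illustration, not details from the paper.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # hypothetical: parameter name -> flat weight vector


def merge_parameters(multimodal: Params, language_only: Params, alpha: float) -> Params:
    """Interpolate weights: alpha * multimodal + (1 - alpha) * language_only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if multimodal.keys() != language_only.keys():
        raise ValueError("models must have the same parameter names")
    merged: Params = {}
    for name, mm_weights in multimodal.items():
        lm_weights = language_only[name]
        if len(mm_weights) != len(lm_weights):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            alpha * w_mm + (1.0 - alpha) * w_lm
            for w_mm, w_lm in zip(mm_weights, lm_weights)
        ]
    return merged


# Toy weights (made up) for a single shared layer.
mm_model = {"layer.weight": [1.0, 3.0]}
lm_model = {"layer.weight": [3.0, 1.0]}
halfway = merge_parameters(mm_model, lm_model, alpha=0.5)
```

Setting `alpha` to 1.0 recovers the multimodal model and 0.0 recovers the language-only model, so the weight controls the trade-off between the two sets of abilities.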

[67] REPAIR: Robust Editing via Progressive Adaptive Intervention and Reintegration

Yisu Wang,Ming Wang,Haoyuan Song,Wenjie Huang,Chaozheng Wang,Yi Xie,Xuming Ran

Main category: cs.CL

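TL;DR: Proposes the REPAIR framework, which uses progressive adaptive intervention and reintegration to achieve efficient continual editing of large models with few side effects.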

Details

Motivation: To address the high cost of knowledge updates and the frequent side effects that arise in the post-training stage of large language models. Method: Designs the REPAIR framework, which uses a closed-loop feedback mechanism, dynamic memory management, frequent knowledge fusion, and strong locality guards to achieve stable and precise model editing. Result: Experiments show that REPAIR improves editing accuracy by 10%-30% across multiple model families and significantly reduces knowledge forgetting. Conclusion: REPAIR offers a robust editing approach for building reliable, scalable, and continually evolving large language models. Abstract: Post-training for large language models (LLMs) is constrained by the high cost of acquiring new knowledge or correcting errors and by the unintended side effects that frequently arise from retraining. To address these issues, we introduce REPAIR (Robust Editing via Progressive Adaptive Intervention and Reintegration), a lifelong editing framework designed to support precise and low-cost model updates while preserving non-target knowledge. REPAIR mitigates the instability and conflicts of large-scale sequential edits through a closed-loop feedback mechanism coupled with dynamic memory management. Furthermore, by incorporating frequent knowledge fusion and enforcing strong locality guards, REPAIR effectively addresses the shortcomings of traditional distribution-agnostic approaches that often overlook unintended ripple effects. Our experiments demonstrate that REPAIR boosts editing accuracy by 10%-30% across multiple model families and significantly reduces knowledge forgetting. This work introduces a robust framework for developing reliable, scalable, and continually evolving LLMs.

[68] Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey

Qiyuan Liu,Hao Xu,Xuhong Chen,Wei Chen,Yee Whye Teh,Ning Miao

Main category: cs.CL

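TL;DR: This paper gives a systematic introduction to reward models (RMs) for improving the reasoning ability of large language models (LLMs), surveying their architectures, training methods, and evaluation techniques, their applications in inference, data synthesis, and RL-based finetuning, and open questions about RM selection, generalization, evaluation, and enhancement.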

Details

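Motivation: Reward models play a key role in improving LLM reasoning, but a systematic introduction and a comprehensive survey of their applications are lacking, so existing research needs to be summarized and future directions proposed. Method: Through a literature review, the paper systematically covers the basic concepts, architecture design, and training and evaluation methods of reward models, categorizes their three main application scenarios in LLM reasoning, and discusses key open questions based on existing research and empirical results. Result: Provides a comprehensive survey of reward model applications in LLM reasoning, clarifying their core role and current challenges, with unresolved problems in generalization, evaluation standards, and improvement strategies in particular. Conclusion: Reward models are a key component for improving LLM reasoning performance; further research on interpretability, cross-task generalization, and efficient training is needed for more effective and reliable deployment. Abstract: Reward models (RMs) play a critical role in enhancing the reasoning performance of LLMs. For example, they can provide training signals to finetune LLMs during reinforcement learning (RL) and help select the best answer from multiple candidates during inference. In this paper, we provide a systematic introduction to RMs, along with a comprehensive survey of their applications in LLM reasoning. We first review fundamental concepts of RMs, including their architectures, training methodologies, and evaluation techniques. Then, we explore their key applications: (1) guiding generation and selecting optimal outputs during LLM inference, (2) facilitating data synthesis and iterative self-improvement for LLMs, and (3) providing training signals in RL-based finetuning. Finally, we address critical open questions regarding the selection, generalization, evaluation, and enhancement of RMs, based on existing research and our own empirical findings. Our analysis aims to provide actionable insights for the effective deployment and advancement of RMs for LLM reasoning.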
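Application (1), selecting the best answer from multiple candidates at inference time, is commonly implemented as best-of-N selection. The sketch below shows that idea only; the `RewardModel` signature and the toy scoring rule are hypothetical stand-ins for a trained reward model and are not taken from the survey.

```python
from typing import Callable, List

# Hypothetical interface: a reward model scores a (prompt, candidate) pair.
RewardModel = Callable[[str, str], float]


def best_of_n(prompt: str, candidates: List[str], reward_model: RewardModel) -> str:
    """Return the candidate with the highest reward for the given prompt."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    return max(candidates, key=lambda candidate: reward_model(prompt, candidate))


def toy_reward(prompt: str, candidate: str) -> float:
    """Stand-in scorer for demonstration: rewards the exact string '4'."""
    return 1.0 if candidate.strip() == "4" else 0.0


chosen = best_of_n("What is 2 + 2?", ["5", "4", "22"], toy_reward)
```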

[69] Inverse Language Modeling towards Robust and Grounded LLMs

Davide Gabrielli,Simone Sestito,Iacopo Masi

Main category: cs.CL

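TL;DR: Proposes Inverse Language Modeling (ILM), a unified framework that aims to improve the robustness of large language models (LLMs) to input perturbations and to enable native grounding by inverting model outputs to identify potentially harmful inputs, making LLMs more controllable and trustworthy.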

Details

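Motivation: Current defense mechanisms for large language models (LLMs) are scattered and immature, lacking the systematic adversarial robustness methods available for traditional classifiers, so a unified framework is needed to strengthen LLM safety and controllability. Method: Proposes the Inverse Language Modeling (ILM) framework, which simultaneously optimizes the model's robustness to input perturbations and inverts model outputs to trace and identify the original input triggers that may lead to unsafe behavior, achieving both robustness and native grounding. Result: ILM turns LLMs from static generators into analyzable, more robust systems that can identify potentially toxic inputs, supports red teaming, and shows potential for improving model safety. Conclusion: ILM lays a foundation for next-generation LLMs that are more robust, grounded, controllable, and trustworthy, advancing the safe development of LLMs in adversarial settings. Abstract: The current landscape of defensive mechanisms for LLMs is fragmented and underdeveloped, unlike prior work on classifiers. To further promote adversarial robustness in LLMs, we propose Inverse Language Modeling (ILM), a unified framework that simultaneously 1) improves the robustness of LLMs to input perturbations, and, at the same time, 2) enables native grounding by inverting model outputs to identify potentially toxic or unsafe input triggers. ILM transforms LLMs from static generators into analyzable and robust systems, potentially helping red teaming. ILM can lay the foundation for next-generation LLMs that are not only robust and grounded but also fundamentally more controllable and trustworthy. The code is publicly available at github.com/davegabe/pag-llm.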

[70] Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning

Qi He,Cheng Qian,Xiusi Chen,Bingxiang He,Yi R. Fung,Heng Ji

Main category: cs.CL

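TL;DR: This paper proposes Veri-R1, an online reinforcement learning framework that lets large language models improve their planning, retrieval, and reasoning for claim verification by interacting with a search engine and receiving reward signals.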

Details

Motivation: Existing methods rely mainly on prompt engineering or predesigned reasoning workflows and lack a unified training paradigm for improving the combined skills that claim verification requires. Method: Proposes the Veri-R1 framework, which uses online reinforcement learning to let a large language model interact dynamically with a search engine and optimizes its planning, retrieval, and reasoning behaviors through explicit reward signals. Result: Experiments show that Veri-R1 improves joint accuracy by up to 30% and doubles the evidence score, often outperforming larger models; ablation studies reveal the impact of reward components and the relationship between output logits and label accuracy. Conclusion: Online reinforcement learning effectively improves the precision and faithfulness of large language models in claim verification and provides a foundation for future research. Abstract: Claim verification with large language models (LLMs) has recently attracted considerable attention, owing to their superior reasoning capabilities and transparent verification pathways compared to traditional answer-only judgments. Online claim verification requires iterative evidence retrieval and reasoning, yet existing approaches mainly rely on prompt engineering or predesigned reasoning workflows without offering a unified training paradigm to improve necessary skills. Therefore, we introduce Veri-R1, an online reinforcement learning (RL) framework that enables an LLM to interact with a search engine and to receive reward signals that explicitly shape its planning, retrieval, and reasoning behaviors. The dynamic interaction between models and retrieval systems more accurately reflects real-world verification scenarios and fosters comprehensive verification skills. Empirical results show that Veri-R1 improves joint accuracy by up to 30% and doubles evidence score, often surpassing larger-scale counterparts. Ablation studies further reveal the impact of reward components and the link between output logits and label accuracy. Our results highlight the effectiveness of online RL for precise and faithful claim verification and provide a foundation for future research. We release our code to support community progress in LLM-empowered claim verification.

[71] Taking a SEAT: Predicting Value Interpretations from Sentiment, Emotion, Argument, and Topic Annotations

Adina Nicola Dobrinoiu,Ana Cristiana Marcu,Amir Homayounirad,Luciano Cavalcante Siebert,Enrico Liscio

Main category: cs.CL

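TL;DR: This study examines whether a language model can predict an individual's value interpretations from multi-dimensional subjective annotations (sentiment, emotion, argument, and topic); results show that combining information from all dimensions performs better in zero-shot and few-shot settings.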

Details

Motivation: Because the understanding of values is subjective and shaped by sociocultural background, AI systems need to recognize individual differences and align with diverse human values in order to avoid bias toward majority viewpoints. Method: Uses subjective annotations along four dimensions, Sentiment, Emotion, Argument, and Topic (SEAT), as a proxy for an individual's interpretive lens, and evaluates how well a language model predicts individual value interpretations under different zero-shot and few-shot settings. Result: The model performs best when information from all SEAT dimensions is provided together, outperforming single dimensions and a baseline with no individual information; differences between annotators underline the importance of accounting for individual subjectivity. Conclusion: This is the first attempt to explore the effect of annotation behavior on value prediction in a controlled setting; although small in scale, it lays a foundation for future large-scale validation. Abstract: Our interpretation of value concepts is shaped by our sociocultural background and lived experiences, and is thus subjective. Recognizing individual value interpretations is important for developing AI systems that can align with diverse human perspectives and avoid bias toward majority viewpoints. To this end, we investigate whether a language model can predict individual value interpretations by leveraging multi-dimensional subjective annotations as a proxy for their interpretive lens. That is, we evaluate whether providing examples of how an individual annotates Sentiment, Emotion, Argument, and Topics (SEAT dimensions) helps a language model in predicting their value interpretations. Our experiment across different zero- and few-shot settings demonstrates that providing all SEAT dimensions simultaneously yields superior performance compared to individual dimensions and a baseline where no information about the individual is provided. Furthermore, individual variations across annotators highlight the importance of accounting for the subjectivity of individual annotators. To the best of our knowledge, this controlled setting, although small in size, is the first attempt to go beyond demographics and investigate the impact of annotation behavior on value prediction, providing a solid foundation for future large-scale validation.

[72] Exploring Database Normalization Effects on SQL Generation

Ryosuke Kohita

Main category: cs.CL

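TL;DR: This is the first systematic study of how database schema normalization affects natural language to SQL (NL2SQL) performance. Denormalized schemas do better on simple queries and normalized schemas do better on aggregation queries, so the schema design should be chosen according to the query type.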

Details

Motivation: Existing NL2SQL research largely ignores the effect of schema design, especially how performance changes across normalization levels, so a systematic evaluation of schema design's role in model performance is needed. Method: Builds synthetic datasets with different normalization levels (1NF-3NF) and real-world academic paper datasets, and evaluates eight leading LLMs in zero-shot and few-shot settings. Result: Denormalized schemas give high accuracy on simple retrieval queries and suit cost-effective models in particular; normalized schemas do better on aggregation queries because they avoid data duplication and NULL-value problems, but need few-shot examples to mitigate join errors. Conclusion: The optimal schema design for an NL2SQL system depends on the target query types; schemas should be chosen adaptively for the application, and the impact of schema design should be considered in deployment. Abstract: Schema design, particularly normalization, is a critical yet often overlooked factor in natural language to SQL (NL2SQL) systems. Most prior research evaluates models on fixed schemas, overlooking the influence of design on performance. We present the first systematic study of schema normalization's impact, evaluating eight leading large language models on synthetic and real-world datasets with varied normalization levels. We construct controlled synthetic datasets with formal normalization (1NF-3NF) and real academic paper datasets with practical schemes. Our results show that denormalized schemas offer high accuracy on simple retrieval queries, even with cost-effective models in zero-shot settings. In contrast, normalized schemas (2NF/3NF) introduce challenges such as errors in base table selection and join type prediction; however, these issues are substantially mitigated by providing few-shot examples. For aggregation queries, normalized schemas yielded better performance, mainly due to their robustness against the data duplication and NULL value issues that cause errors in denormalized schemas. These findings suggest that the optimal schema design for NL2SQL applications depends on the types of queries to be supported. Our study demonstrates the importance of considering schema design when developing NL2SQL interfaces and integrating adaptive schema selection for real-world scenarios.
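The data-duplication issue mentioned for aggregation queries can be illustrated with a small sketch. The tables, column names, and rows below are hypothetical and are not taken from the paper's datasets; the example only shows why a naive `SUM` over a denormalized table can double-count, while the same query over a normalized table does not.

```python
import sqlite3

# In-memory database holding a hypothetical academic-paper schema in two forms.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
-- Denormalized: one row per (paper, author), so paper facts repeat.
CREATE TABLE papers_flat (paper_id INTEGER, title TEXT, citations INTEGER, author TEXT);
INSERT INTO papers_flat VALUES (1, 'A', 10, 'Kim'), (1, 'A', 10, 'Lee'), (2, 'B', 5, 'Kim');

-- Normalized: paper facts stored once, authorship in a separate table.
CREATE TABLE papers (paper_id INTEGER PRIMARY KEY, title TEXT, citations INTEGER);
CREATE TABLE paper_authors (paper_id INTEGER, author TEXT);
INSERT INTO papers VALUES (1, 'A', 10), (2, 'B', 5);
INSERT INTO paper_authors VALUES (1, 'Kim'), (1, 'Lee'), (2, 'Kim');
""")

# The naive aggregation counts paper 1 twice in the flat table.
flat_total = cur.execute("SELECT SUM(citations) FROM papers_flat").fetchone()[0]
# The same aggregation over the normalized table counts each paper once.
norm_total = cur.execute("SELECT SUM(citations) FROM papers").fetchone()[0]

print(flat_total, norm_total)  # 25 15
conn.close()
```

A model querying the flat table would need to deduplicate by `paper_id` first, which is the kind of extra step the abstract suggests is error-prone.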

[73] LLM-Based Multi-Task Bangla Hate Speech Detection: Type, Severity, and Target

Md Arid Hasan,Firoj Alam,Md Fahad Hossain,Usman Naseem,Syed Ishtiaque Ahmed

Main category: cs.CL

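TL;DR: This paper introduces BanglaMultiHate, the first multi-task hate-speech dataset for Bangla, and compares a range of models to examine how well LLMs adapt in a low-resource setting, stressing the importance of culturally and linguistically grounded pretraining.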

Details

Motivation: Existing hate-speech detection work for low-resource languages such as Bangla is mostly single-task with limited coverage, and lacks joint recognition of multi-facet signals such as type, severity, and target, so a more comprehensive dataset and evaluation are needed. Method: Builds BanglaMultiHate, a large manually annotated multi-task dataset, and systematically compares classical baselines, monolingual pretrained models, and LLMs (zero-shot prompting and LoRA fine-tuning). Result: Although LoRA-tuned LLMs perform close to BanglaBERT, culturally and linguistically grounded pretraining remains critical for performance. Conclusion: The dataset and findings provide a stronger benchmark for developing culturally aligned moderation tools in low-resource settings. Abstract: Online social media platforms are central to everyday communication and information seeking. While these platforms serve positive purposes, they also provide fertile ground for the spread of hate speech, offensive language, and bullying content targeting individuals, organizations, and communities. Such content undermines safety, participation, and equity online. Reliable detection systems are therefore needed, especially for low-resource languages where moderation tools are limited. In Bangla, prior work has contributed resources and models, but most are single-task (e.g., binary hate/offense) with limited coverage of multi-facet signals (type, severity, target). We address these gaps by introducing the first multi-task Bangla hate-speech dataset, BanglaMultiHate, one of the largest manually annotated corpora to date. Building on this resource, we conduct a comprehensive, controlled comparison spanning classical baselines, monolingual pretrained models, and LLMs under zero-shot prompting and LoRA fine-tuning. Our experiments assess LLM adaptability in a low-resource setting and reveal a consistent trend: although LoRA-tuned LLMs are competitive with BanglaBERT, culturally and linguistically grounded pretraining remains critical for robust performance. Together, our dataset and findings establish a stronger benchmark for developing culturally aligned moderation tools in low-resource contexts. For reproducibility, we will release the dataset and all related scripts.

[74] Style Over Story: A Process-Oriented Study of Authorial Creativity in Large Language Models

Donghoon Jung,Jiwoo Choi,Songeun Chae,Seohyon Jung

Main category: cs.CL

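TL;DR: Taking a narratological perspective, this study uses constraint-based decision-making to analyze the creative process of large language models (LLMs) as computational authors. It finds that models consistently prioritize Style over Character, Event, or Setting, and that different models show distinctive creative preferences.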

Details

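Motivation: Existing evaluations of LLM creativity focus on output quality and overlook the generation process. This paper examines the creative behavior of LLMs as "computational authors" from the perspective of the creative process. Method: Introduces a constraint-based decision-making framework, uses controlled prompting to assign authorial personas, and applies narratological analysis to the models' creative choices across Style, Character, Event, and Setting, together with the reasoning they give. Result: LLMs consistently prioritize Style in their creative decisions; different models show identifiable patterns of creative preference; analyzing the reasons they give reveals distinctive creative profiles across models. Conclusion: The approach offers a new analytical tool for systematically evaluating AI's authorial creativity and argues that LLM creativity should be understood through the creative process, not only through outputs. Abstract: Evaluations of the creativity of large language models (LLMs) have focused primarily on the quality of their outputs rather than the processes that shape them. This study takes a process-oriented approach, drawing on narratology to examine LLMs as computational authors. We introduce constraint-based decision-making as a lens for authorial creativity. Using controlled prompting to assign authorial personas, we analyze the creative preferences of the models. Our findings show that LLMs consistently emphasize Style over other elements, including Character, Event, and Setting. By also probing the reasoning the models provide for their choices, we show that distinctive profiles emerge across models and argue that our approach provides a novel systematic tool for analyzing AI's authorial creativity.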

[75] Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage

Siddhant Arora,Haidar Khan,Kai Sun,Xin Luna Dong,Sajal Choudhary,Seungwhan Moon,Xinyuan Zhang,Adithya Sagar,Surya Teja Appini,Kaushik Patnaik,Sanat Sharma,Shinji Watanabe,Anuj Kumar,Ahmed Aly,Yue Liu,Florian Metze,Zhaojiang Lin

Main category: cs.CL

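TL;DR: Proposes Streaming RAG, a framework that predicts tool queries in parallel while the user is still speaking, reducing tool-call latency in spoken dialogue systems and improving accuracy and responsiveness.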

Details

Motivation: End-to-end spoken dialogue systems are prone to hallucination because of limited factual grounding and need tool augmentation, but conventional tool integration substantially increases response latency. Method: Develops a post-training pipeline that teaches the model when to issue tool calls during ongoing user speech and how to generate spoken responses that fuse the audio input with retrieved text results. Result: On the AudioCRAG benchmark, QA accuracy improves by up to 200% relative (from 11.1% to 34.2% absolute), and tool-use latency drops by 20%. Conclusion: Streaming RAG balances accuracy and low latency, works for both speech and typed input, and advances real-time, agentic AI assistants. Abstract: End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR-LLM-TTS pipelines, generating more natural, expressive responses with significantly lower latency. However, these systems remain prone to hallucinations due to limited factual grounding. While text-based dialogue systems address this challenge by integrating tools such as web search and knowledge graph APIs, we introduce the first approach to extend tool use directly into speech-in speech-out systems. A key challenge is that tool integration substantially increases response latency, disrupting conversational flow. To mitigate this, we propose Streaming Retrieval-Augmented Generation (Streaming RAG), a novel framework that reduces user-perceived latency by predicting tool queries in parallel with user speech, even before the user finishes speaking. Specifically, we develop a post-training pipeline that teaches the model when to issue tool calls during ongoing speech and how to generate spoken summaries that fuse audio queries with retrieved text results, thereby improving both accuracy and responsiveness. To evaluate our approach, we construct AudioCRAG, a benchmark created by converting queries from the publicly available CRAG dataset into speech form. Experimental results demonstrate that our streaming RAG approach increases QA accuracy by up to 200% relative (from 11.1% to 34.2% absolute) and further enhances user experience by reducing tool use latency by 20%. Importantly, our streaming RAG approach is modality-agnostic and can be applied equally to typed input, paving the way for more agentic, real-time AI assistants.
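The relative and absolute figures quoted for AudioCRAG can be checked with a short calculation. This is only a sanity-check sketch; the helper names are illustrative and the two accuracy values are the ones stated in the abstract.

```python
def absolute_gain(before: float, after: float) -> float:
    """Difference in percentage points."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Improvement as a percentage of the starting value."""
    return (after - before) / before * 100.0


baseline_acc, streaming_acc = 11.1, 34.2  # accuracies (%) from the abstract

print(round(absolute_gain(baseline_acc, streaming_acc), 1))  # 23.1
print(round(relative_gain(baseline_acc, streaming_acc)))     # 208
```

A gain of 23.1 percentage points on an 11.1% baseline is roughly a 208% relative improvement, consistent with the "up to 200% relative" claim.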

[76] From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens

Hala Sheta,Eric Huang,Shuyu Wu,Ilia Alenabi,Jiajun Hong,Ryker Lin,Ruoxi Ning,Daniel Wei,Jialin Yang,Jiawei Zhou,Ziqiao Ma,Freda Shi

Main category: cs.CL

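TL;DR: VLM-Lens is an open-source toolkit for systematic benchmarking, analysis, and interpretation of vision-language models (VLMs). It supports extracting intermediate outputs from any layer through a unified, extensible interface.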

Details

Motivation: Understanding the internal mechanisms of vision-language models requires a tool that gives unified access to, and analysis of, the intermediate representations of different VLMs. Method: Designs a general YAML-configured interface that abstracts away the implementation details of individual VLMs, supports 16 mainstream VLMs and more than 30 variants, and allows new models and analysis methods to be integrated seamlessly. Result: Analytical experiments on several VLMs reveal differences in hidden representations across layers and target concepts, demonstrating the toolkit's effectiveness and flexibility. Conclusion: VLM-Lens offers a flexible, easy-to-use, and extensible platform that supports VLM interpretability research and community development. Abstract: We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 16 state-of-the-art base VLMs and more than 30 of their variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate its usage with two simple analytical experiments, revealing systematic differences in the hidden representations of VLMs across layers and target concepts. VLM-Lens is released as an open-source project to accelerate community efforts in understanding and improving VLMs.

[77] Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems

Siddhant Arora,Jinchuan Tian,Hayato Futami,Jiatong Shi,Yosuke Kashiwagi,Emiru Tsunoo,Shinji Watanabe

Main category: cs.CL

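TL;DR: Proposes SCoT, a streaming chain-of-thought (CoT) framework for duplex end-to-end dialogue systems that processes user input and generates responses block by block, improving coherence and interpretability while supporting low latency and overlapping interaction.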

Details

Motivation: Existing duplex dialogue systems are weak at semantic reasoning and rely on complex dual-channel architectures, while conventional VAD struggles to tell pauses from turn completions, which hurts conversational fluency. Method: Uses the Streaming Chain-of-Thought (SCoT) framework, which processes user input in fixed-duration blocks and uses frame-level alignments to create intermediate targets (such as aligned transcripts and responses), enabling continuous response generation without explicit VAD. Result: Experiments show the method produces more coherent and interpretable responses than existing duplex systems, and handles latency and overlapping interaction better than turn-by-turn systems. Conclusion: SCoT balances low latency, semantic coherence, and architectural simplicity, offering a promising duplex solution for end-to-end dialogue systems. Abstract: Most end-to-end (E2E) spoken dialogue systems (SDS) rely on voice activity detection (VAD) for turn-taking, but VAD fails to distinguish between pauses and turn completions. Duplex SDS models address this by predicting output continuously, including silence tokens, thus removing the need for explicit VAD. However, they often have complex dual-channel architecture and lag behind cascaded models in semantic reasoning. To overcome these challenges, we propose SCoT: a Streaming Chain-of-Thought (CoT) framework for Duplex SDS, alternating between processing fixed-duration user input and generating responses in a blockwise manner. Using frame-level alignments, we create intermediate targets (aligned user transcripts and system responses) for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods while supporting lower-latency and overlapping interactions compared to turn-by-turn systems.

[78] The Disparate Impacts of Speculative Decoding

Jameson Sandler,Ahmet Üstün,Marco Romanelli,Sara Hooker,Ferdinando Fioretto

Main category: cs.CL

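TL;DR: This paper analyzes the speed-up from speculative decoding across tasks and finds that under-fit and underrepresented tasks gain less, which is a form of unfairness. It explains the cause through theoretical analysis and proposes a mitigation strategy that improves the fairness metric by 12% on average across several model pairs.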

Details

Motivation: Speculative decoding accelerates LLM inference, but its speed-up may differ across tasks and may in particular disadvantage under-fit or underrepresented tasks, so its potential bias needs to be investigated and mitigated. Method: Uses theoretical analysis to quantify the speed-up unfairness of speculative decoding across tasks, identifies the key factors behind the disparity, and proposes a new mitigation strategy based on these insights, validated experimentally on several model pairs. Result: Experiments confirm that the speed-up is uneven across tasks and is lower for under-fit tasks; the proposed strategy reduces the disparity, improving the fairness metric by 12% on average across several model pairs. Conclusion: Speculative decoding has task-dependent speed-up unfairness and does not improve efficiency equally for all tasks; a targeted strategy can substantially reduce this unfairness, pointing toward fairer and more efficient inference. Abstract: The practice of speculative decoding, whereby inference is probabilistically supported by a smaller, cheaper, "drafter" model, has become a standard technique for systematically reducing the decoding time of large language models. This paper conducts an analysis of speculative decoding through the lens of its potential disparate speed-up rates across tasks. Crucially, the paper shows that speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing for under-fit and often underrepresented tasks. To better understand this phenomenon, we derive an analysis to quantify this observed "unfairness" and draw attention to the factors that motivate such disparate speed-ups to emerge. Further, guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, revealing on average a 12% improvement in our fairness metric.
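One way to see why speed-up can differ across tasks is the commonly used simplified model of speculative decoding, in which each drafted token is accepted independently with probability `alpha` and the drafter proposes `gamma` tokens per step. This is not the paper's own analysis, which the abstract does not spell out; `alpha`, `gamma`, and the example values are assumptions for illustration.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model verification step.

    Assumes i.i.d. acceptance with probability `alpha` for each of the
    `gamma` drafted tokens (a simplification, not the paper's model).
    """
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates: a well-fit task vs. an under-fit task.
well_fit = expected_tokens_per_step(alpha=0.8, gamma=4)
under_fit = expected_tokens_per_step(alpha=0.5, gamma=4)
print(round(well_fit, 4), round(under_fit, 4))  # 3.3616 1.9375
```

Under this model, a task on which the drafter agrees with the target model less often gets fewer tokens per verification step, and therefore less speed-up.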

[79] RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization

Zhaoning Yu,Will Su,Leitian Tao,Haozhu Wang,Aashu Singh,Hanchao Yu,Jianyu Wang,Hongyang Gao,Weizhe Yuan,Jason Weston,Ping Yu,Jing Xu

Main category: cs.CL

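TL;DR: RESTRAIN is a reinforcement learning framework that needs no gold labels. It uses a self-penalization mechanism to let the model improve itself on unlabeled data, yielding large gains on several challenging reasoning tasks.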

Details

Motivation: Reinforcement learning with human-annotated data is costly and falters on harder tasks, so a method is needed that keeps improving reasoning ability without labeled data. Method: Proposes RESTRAIN, which applies self-penalization based on signals from the model's own answer distribution (such as overconfidence and low consistency) and combines it with policy optimization methods such as GRPO for continual self-improvement without supervision. Result: With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, Pass@1 improves by up to +140.7% on AIME25, +36.2% on MMLU_STEM, and +19.6% on GPQA-Diamond, approaching the results of gold-label training. Conclusion: RESTRAIN offers a scalable path to stronger reasoning without gold labels and shows the potential of unsupervised self-improvement on complex reasoning tasks. Abstract: Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLU_STEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.

[80] Learning to Reason for Hallucination Span Detection

Hsuan Su,Ting-Yao Hu,Hema Swetha Koppula,Kundan Krishna,Hadi Pouransari,Cheng-Yu Hsieh,Cem Koc,Joseph Yitan Cheng,Oncel Tuzel,Raviteja Vemulapalli

Main category: cs.CL

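TL;DR: Proposes RL4HS, a reinforcement learning framework for detecting hallucinated spans in LLM-generated content; with a fine-grained, span-level reward it clearly outperforms existing methods.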

Details

Motivation: Most existing hallucination detection is framed as binary classification, whereas real applications need to identify the specific hallucinated spans, which calls for finer-grained detection. Method: Proposes RL4HS, which combines chain-of-thought reasoning with reinforcement learning and trains the model with a span-level reward, building on Group Relative Policy Optimization and Class-Aware Policy Optimization. Result: Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS outperforms pretrained reasoning models and supervised fine-tuning. Conclusion: Reinforcement learning with span-level rewards is both necessary and effective for detecting hallucinated spans and substantially improves detection performance. Abstract: Large language models (LLMs) often generate hallucinations -- unsupported content that undermines reliability. While most prior works frame hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision making process. This naturally raises the question of whether explicit reasoning can help the complex task of detecting hallucination spans. To answer this question, we first evaluate pretrained models with and without Chain-of-Thought (CoT) reasoning, and show that CoT reasoning has the potential to generate at least one correct answer when sampled multiple times. Motivated by this, we propose RL4HS, a reinforcement learning framework that incentivizes reasoning with a span-level reward function. RL4HS builds on Group Relative Policy Optimization and introduces Class-Aware Policy Optimization to mitigate the reward imbalance issue. Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the necessity of reinforcement learning with span-level rewards for detecting hallucination spans.

[81] ARUQULA -- An LLM based Text2SPARQL Approach using ReAct and Knowledge Graph Exploration Utilities

Felix Brei,Lorenz Bühmann,Johannes Frey,Daniel Gerber,Lars-Peter Meyer,Claus Stadler,Kirill Bulert

Main category: cs.CL

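TL;DR: This paper proposes a generalized method based on SPINACH that uses a large language model to translate natural language questions into SPARQL queries step by step, lowering the barrier to querying knowledge graphs.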

Details

Motivation: To make it easier for people without a computer science background to query knowledge graphs with SPARQL, and to respond to the Text2SPARQL challenge and advance the field. Method: Builds on SPINACH, an LLM-backed agent that translates natural language into SPARQL queries through an iterative process of exploration and execution. Result: Describes the overall system architecture and design rationale, and analyzes the agent's behavior in depth to identify directions for future improvement. Conclusion: The method can lower the barrier to Text2SPARQL and provides useful insights for later optimization. Abstract: Interacting with knowledge graphs can be a daunting task for people without a background in computer science since the query language that is used (SPARQL) has a high barrier of entry. Large language models (LLMs) can lower that barrier by providing support in the form of Text2SPARQL translation. In this paper we introduce a generalized method based on SPINACH, an LLM-backed agent that translates natural language questions to SPARQL queries not in a single shot, but as an iterative process of exploration and execution. We describe the overall architecture and reasoning behind our design decisions, and also conduct a thorough analysis of the agent behavior to gain insights into future areas for targeted improvements. This work was motivated by the Text2SPARQL challenge, a challenge that was held to facilitate improvements in the Text2SPARQL domain.

[82] Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents

Lingzhong Dong,Ziqi Zhou,Shuaibo Yang,Haiyue Sheng,Pengzhou Cheng,Zongru Wu,Zheng Wu,Gongshen Liu,Zhuosheng Zhang

Main category: cs.CL

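TL;DR: This paper proposes a new evaluation framework for diagnosing reasoning-execution gaps in VLM-powered mobile-use agents. Its core is the Ground-Truth Alignment (GTA) metric, which checks whether chain-of-thought reasoning agrees with the ground-truth action. Combined with Exact Match (EM), it exposes both execution gaps and reasoning gaps, and shows that sizable execution gaps persist even in large models.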

Details

Motivation: Existing work focuses on execution accuracy but overlooks whether chain-of-thought (CoT) reasoning aligns with the ground-truth action. Users may over-trust plausible-looking reasoning and authorize harmful actions, so the consistency between reasoning and execution must be evaluated to make these systems more trustworthy. Method: Proposes the Ground-Truth Alignment (GTA) metric, which measures whether the action implied by the CoT matches the ground-truth action, and evaluates it jointly with Exact Match (EM) to identify two gap types: Execution Gap (EG) and Reasoning Gap (RG). Result: Experiments show that reasoning-execution gaps are prevalent and that execution gaps are more frequent than reasoning gaps; scaling up the model reduces the overall gap, but sizable execution gaps remain in the largest models; the framework reliably reflects systematic EG/RG patterns in state-of-the-art models. Conclusion: The framework helps diagnose inconsistencies between reasoning and execution in mobile-use agents and gives concrete guidance for building more trustworthy agents. Abstract: Mobile-use agents powered by vision-language models (VLMs) have shown great potential in interpreting natural language instructions and generating corresponding actions based on mobile graphical user interfaces. Recent studies suggest that incorporating chain-of-thought (CoT) reasoning tends to improve the execution accuracy. However, existing evaluations emphasize execution accuracy while neglecting whether CoT reasoning aligns with ground-truth actions. This oversight fails to assess potential reasoning-execution gaps, which in turn foster over-trust: users relying on seemingly plausible CoTs may unknowingly authorize harmful actions, potentially resulting in financial loss or a trust crisis. In this work, we introduce a new evaluation framework to diagnose reasoning-execution gaps. At its core lies Ground-Truth Alignment (GTA), which measures whether the action implied by a CoT matches the ground-truth action. By combining GTA with the standard Exact Match (EM) metric, we jointly assess both the reasoning accuracy and execution accuracy. This joint perspective reveals two types of reasoning-execution gaps: (i) Execution Gap (EG), where the reasoning correctly identifies the correct action but execution fails, and (ii) Reasoning Gap (RG), where execution succeeds but the reasoning process conflicts with the actual execution. Experimental results across a wide range of mobile interaction tasks reveal that reasoning-execution gaps are prevalent, with execution gaps occurring more frequently than reasoning gaps. 
Moreover, while scaling up model size reduces the overall gap, sizable execution gaps persist even in the largest models. Further analysis shows that our framework reliably reflects systematic EG/RG patterns in state-of-the-art models. These findings offer concrete diagnostics and support the development of more trustworthy mobile-use agents.
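The joint GTA/EM view described in this abstract amounts to a two-by-two classification of each step. The sketch below follows the definitions of EG and RG given in the abstract; the function names, the labels for the two remaining cells, and the sample records are hypothetical, and the paper may aggregate the rates differently.

```python
from collections import Counter


def diagnose(gta: bool, em: bool) -> str:
    """Classify one step from its GTA (reasoning) and EM (execution) outcomes."""
    if gta and em:
        return "aligned_correct"
    if gta and not em:
        return "execution_gap"   # reasoning names the right action, execution fails
    if em and not gta:
        return "reasoning_gap"   # execution succeeds, reasoning disagrees with it
    return "both_wrong"


def gap_rates(records):
    """Fraction of steps in each category; `records` is a list of (gta, em) pairs."""
    counts = Counter(diagnose(gta, em) for gta, em in records)
    return {label: counts[label] / len(records) for label in counts}


# Made-up outcomes for four steps.
steps = [(True, True), (True, False), (True, False), (False, True)]
print(gap_rates(steps))
```

With these made-up steps, half fall into the execution gap and a quarter into the reasoning gap, mirroring the kind of comparison the abstract reports.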

[83] More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration

Xiaoyang Yuan,Yujuan Ding,Yi Bin,Wenqi Shao,Jinyu Cai,Jingkuan Song,Yang Yang,Hengtao Shen

Main category: cs.CL

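TL;DR: Proposes Adaptive Multi-Guidance Policy Optimization (AMPO), which draws guidance from multiple teacher models only when needed, improving the reasoning ability and generalization of large language models.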

Details

Motivation: Existing reinforcement learning methods rely on a single teacher or on self-exploration to elicit long chain-of-thought reasoning, which can introduce model bias and limit reasoning diversity; a more efficient and scalable way to strengthen reasoning is needed. Method: Introduces the AMPO framework with multi-teacher guidance that is brought in on demand, only when the policy model fails, combined with a comprehension-based selection mechanism that picks the reasoning paths the student is most likely to understand. Result: Improves over the GRPO baseline by 4.3% on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, significantly boosting Pass@k and reasoning diversity; four peer-sized teachers match the results obtained with a single stronger teacher. Conclusion: AMPO offers a more efficient and scalable path to better reasoning and generalization in LLMs, supporting the value of multi-teacher collaboration and on-demand guidance. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm for enhancing the reasoning ability in Large Language Models (LLMs). However, prevailing methods primarily rely on self-exploration or a single off-policy teacher to elicit long chain-of-thought (LongCoT) reasoning, which may introduce intrinsic model biases and restrict exploration, ultimately limiting reasoning diversity and performance. Drawing inspiration from multi-teacher strategies in knowledge distillation, we introduce Adaptive Multi-Guidance Policy Optimization (AMPO), a novel framework that adaptively leverages guidance from multiple proficient teacher models, but only when the on-policy model fails to generate correct solutions. This "guidance-on-demand" approach expands exploration while preserving the value of self-discovery. Moreover, AMPO incorporates a comprehension-based selection mechanism, prompting the student to learn from the reasoning paths that it is most likely to comprehend, thus balancing broad exploration with effective exploitation. Extensive experiments show AMPO substantially outperforms a strong baseline (GRPO), with a 4.3% improvement on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, while significantly boosting Pass@k performance and enabling more diverse exploration. Notably, using four peer-sized teachers, our method achieves comparable results to approaches that leverage a single, more powerful teacher (e.g., DeepSeek-R1) with more data. These results demonstrate a more efficient and scalable path to superior reasoning and generalizability. Our code is available at https://github.com/SII-Enigma/AMPO.

[84] Enhanced Arabic-language cyberbullying detection: deep embedding and transformer (BERT) approaches

Ebtesam Jaber Aljohani,Wael M. S. Yafoo

Main category: cs.CL

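TL;DR: This paper proposes and evaluates several deep learning models for detecting Arabic-language cyberbullying, builds a dataset of 10,662 X posts, and finds experimentally that Bi-LSTM with FastText word embeddings performs best, reaching 98% accuracy.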

Details

Motivation: Automatic detection methods for Arabic-language cyberbullying are scarce, and young people face risks to their emotional well-being on online platforms, so effective detection techniques for Arabic are needed. Method: Collects and pre-processes 10,662 Arabic-language X posts and uses the kappa tool to improve annotation quality; runs four sets of experiments with LSTM and Bi-LSTM models combined with different word embeddings (such as FastText) and pre-trained BERT models, comparing their detection performance. Result: The LSTM-BERT and Bi-LSTM-BERT models both reach 97% accuracy, while Bi-LSTM with FastText word embeddings does better at 98%. Conclusion: Bi-LSTM with FastText is the most effective of the tested methods for detecting Arabic-language cyberbullying; the results generalize well and offer a workable technical approach to content safety on Arabic social media. Abstract: Recent technological advances in smartphones and communications, including the growth of massive social media networks such as X (formerly known as Twitter), endanger young people and their emotional well-being by exposing them to cyberbullying, taunting, and bullying content. Most proposed approaches for automatically detecting cyberbullying have been developed around the English language, and methods for detecting Arabic-language cyberbullying are especially scarce. This paper aims to enhance the effectiveness of methods for detecting cyberbullying in Arabic-language content. We assembled a dataset of 10,662 X posts, pre-processed the data, and used the kappa tool to verify and enhance the quality of our annotations. We conducted four experiments to test numerous deep learning models for automatically detecting Arabic-language cyberbullying. We first tested a long short-term memory (LSTM) model and a bidirectional long short-term memory (Bi-LSTM) model with several experimental word embeddings. We also tested the LSTM and Bi-LSTM models with a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model and then tested them again with a different experimental BERT model. LSTM-BERT and Bi-LSTM-BERT demonstrated a 97% accuracy. Bi-LSTM with FastText word embeddings performed even better, achieving 98% accuracy. As a result, the outcomes are generalize

[85] AccurateRAG: A Framework for Building Accurate Retrieval-Augmented Question-Answering Applications

Linh The Nguyen,Chi Tran,Dung Ngoc Nguyen,Van-Cuong Pham,Hoang Ngo,Dat Quoc Nguyen

Main category: cs.CL

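TL;DR: Proposes AccurateRAG, a framework for building high-performance retrieval-augmented generation (RAG) question-answering systems; it supports local development and optimization and reaches state-of-the-art performance on benchmark datasets.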

Details

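Motivation: To improve the development efficiency and question-answering performance of RAG systems and to address shortcomings of existing approaches in data processing, model fine-tuning, and system integration. Method: Designs an end-to-end RAG development pipeline covering raw data processing, fine-tuning data generation, text embedding, LLM fine-tuning, and output evaluation, with support for local deployment. Result: Outperforms previous strong baselines on several benchmark question-answering datasets and sets new state-of-the-art results. Conclusion: AccurateRAG is a practical, extensible solution for efficiently building high-performance RAG systems. Abstract: We introduce AccurateRAG -- a novel framework for constructing high-performance question-answering applications based on retrieval-augmented generation (RAG). Our framework offers a pipeline for development efficiency with tools for raw dataset processing, fine-tuning data generation, text embedding & LLM fine-tuning, output evaluation, and building RAG systems locally. Experimental results show that our framework outperforms previous strong baselines and obtains new state-of-the-art question-answering performance on benchmark datasets.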

[86] Explore Briefly, Then Decide: Mitigating LLM Overthinking via Cumulative Entropy Regulation

Tianyi Jiang,Yi Bin,Yujuan Ding,Kainian Zhu,Fei Ma,Jingkuan Song,Heng Tao Shen

Main category: cs.CL

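TL;DR: Proposes a new reasoning paradigm, "Explore Briefly, Then Decide", which uses a Cumulative Entropy Regulation (CER) mechanism and a new metric, TECA, to control reasoning depth dynamically, reducing overthinking on simple problems and improving reasoning efficiency.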

Details

Motivation: Large language models often overthink simple problems, which is inefficient and makes it hard to adapt reasoning depth to problem complexity. Method: Introduces Token Entropy Cumulative Average (TECA) as a measure of how much exploration occurs during reasoning, and proposes the "Explore Briefly, Then Decide" paradigm with a Cumulative Entropy Regulation (CER) mechanism that uses TECA to decide dynamically when to stop reasoning. Result: On several mathematical benchmarks the method substantially reduces overthinking without sacrificing problem-solving ability, cutting average response length by up to 71% on simpler datasets. Conclusion: The CER mechanism and TECA metric enable adaptive control of the reasoning process, markedly improving efficiency while preserving performance and moving LLMs toward smarter regulation of reasoning depth. Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities on complex problems using long Chain-of-Thought (CoT) reasoning. However, they often suffer from overthinking, meaning generating unnecessarily lengthy reasoning steps for simpler problems. This issue may degrade the efficiency of the models and make it difficult for them to adapt the reasoning depth to the complexity of problems. To address this, we introduce a novel metric Token Entropy Cumulative Average (TECA), which measures the extent of exploration throughout the reasoning process. We further propose a novel reasoning paradigm -- Explore Briefly, Then Decide -- with an associated Cumulative Entropy Regulation (CER) mechanism. This paradigm leverages TECA to help the model dynamically determine the optimal point to conclude its thought process and provide a final answer, thus achieving efficient reasoning. Experimental results across diverse mathematical benchmarks show that our approach substantially mitigates overthinking without sacrificing problem-solving ability. With our thinking paradigm, the average response length decreases by up to 71% on simpler datasets, demonstrating the effectiveness of our method in creating a more efficient and adaptive reasoning process.
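The abstract does not give a formula for TECA. Reading the name literally, one plausible interpretation is the running mean of per-token predictive entropy, sketched below. This interpretation, the function names, and the toy distributions are assumptions, and the CER stopping rule itself is not reproduced here.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def teca(entropies):
    """Running mean of token entropies: value t averages the first t tokens."""
    running, total = [], 0.0
    for t, h in enumerate(entropies, start=1):
        total += h
        running.append(total / t)
    return running


# Toy next-token distributions: uncertain at first, then confident.
dists = [[0.5, 0.5], [0.5, 0.5], [1.0], [1.0]]
curve = teca([token_entropy(d) for d in dists])
print([round(v, 3) for v in curve])  # [0.693, 0.693, 0.462, 0.347]
```

Under this reading, a falling curve indicates that the model has stopped exploring alternatives, which is the kind of signal a stopping mechanism could act on.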

[87] InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents

Yaxin Du,Yuanshuo Zhang,Xiyuan Yang,Yifan Zhou,Cheng Wang,Gongyi Zou,Xianghe Pang,Wenhao Wang,Menglan Chen,Shuo Tang,Zhiyu Li,Siheng Chen

Main category: cs.CL

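TL;DR: This paper introduces InfoMosaic-Bench, the first benchmark for evaluating multi-source information seeking in tool-augmented agents. It covers six domains, including medicine, finance, and maps, and requires combining general-purpose search with domain-specific tools to solve complex tasks. Experiments show that current LLM agents still have marked weaknesses in tool use.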

Details

Motivation: Existing LLM agents rely on open-web search, but web content is noisy and unreliable, and many real-world tasks need precise domain knowledge that is not available online. The MCP protocol lets agents connect to specialized tools, but it is unclear whether they can integrate them effectively. Method: Proposes the InfoMosaic-Bench benchmark and the InfoMosaic-Flow generation pipeline, which synthesizes tasks that require combining general-purpose search with domain tools and ensures the tasks are reliable and non-trivial. Result: Experiments with 14 state-of-the-art LLM agents show that web information alone is insufficient (GPT-5 reaches 38.2% accuracy), that domain tools bring selective but inconsistent gains, and that 22.4% of failures come from incorrect tool usage or selection. Conclusion: Current tool-augmented LLM agents still face major challenges in multi-source information integration, especially in selecting and using tools correctly, and further research is needed. Abstract: Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open-web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real-world tasks require precise, domain-specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools -- and more importantly, whether they can integrate them with general-purpose search to solve complex tasks. Therefore, we introduce InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Covering six representative domains (medicine, finance, maps, video, web, and multi-domain integration), InfoMosaic-Bench requires agents to combine general-purpose search with domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non-triviality. 
Experiments with 14 state-of-the-art LLM agents reveal three findings: (i) web information alone is insufficient, with GPT-5 achieving only 38.2% accuracy and 67.5% pass rate; (ii) domain tools provide selective but inconsistent benefits, improving some domains while degrading others; and (iii) 22.4% of failures arise from incorrect tool usage or selection, highlighting that current LLMs still struggle with even basic tool handling.

[88] Parallel Scaling Law: Unveiling Reasoning Generalization through A Cross-Linguistic Perspective

Wen Yang,Junhong Wu,Chong Li,Chengqing Zong,Jiajun Zhang

Main category: cs.CL

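TL;DR: From a cross-linguistic perspective, this study asks whether reasoning ability gained through English-based reinforcement post-training (RPT) in large reasoning models (LRMs) transfers to other languages, and proposes a new metric for cross-lingual transferability. Transfer varies considerably with the model, the language, and the training paradigm, and models with stronger English ability rely more on English-specific patterns, which weakens cross-lingual generalization. Parallel-language training improves performance substantially and follows a power-law scaling pattern. The study also introduces the "Monolingual Generalization Gap", which exposes the limits of current models' generalization across languages and challenges the assumption that LRM reasoning mirrors human cognition.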

Details

Motivation: Existing work mostly studies how RL-based reasoning generalizes across tasks or modalities and overlooks transfer across languages. This paper asks whether English-dominated reasoning ability extends to other languages, with the aim of advancing more language-agnostic reasoning models. Method: Systematically evaluate English-centric LRMs on multilingual reasoning benchmarks, propose a metric that quantifies cross-lingual transferability, and use interventional studies to analyze how initial capability affects transfer; then run multilingual parallel training experiments to examine how performance changes with the number of parallel languages. Result: Cross-lingual transfer varies significantly; models with stronger English capabilities rely more on English-specific patterns, which weakens cross-lingual generalization; adding a single parallel language already yields a substantial performance jump (First-Parallel Leap); performance follows a power law in the number of parallel languages (Parallel Scaling Law); and the Monolingual Generalization Gap shows that models do not fully generalize across languages. Conclusion: English-centric reasoning training limits generalization to other languages, and the reasoning mechanism of current LRMs differs from human cognition. Multilingual parallel training significantly improves cross-lingual reasoning and follows a predictable scaling law, offering a path and an evaluation standard for building language-agnostic reasoning models. Abstract: Recent advancements in Reinforcement Post-Training (RPT) have significantly enhanced the capabilities of Large Reasoning Models (LRMs), sparking increased interest in the generalization of RL-based reasoning. While existing work has primarily focused on investigating its generalization across tasks or modalities, this study proposes a novel cross-linguistic perspective to investigate reasoning generalization. This raises a crucial question: Does the reasoning capability achieved from English RPT effectively transfer to other languages? We address this by systematically evaluating English-centric LRMs on multilingual reasoning benchmarks and introducing a metric to quantify cross-lingual transferability. Our findings reveal that cross-lingual transferability varies significantly across initial model, target language, and training paradigm. Through interventional studies, we find that models with stronger initial English capabilities tend to over-rely on English-specific patterns, leading to diminished cross-lingual generalization. To address this, we conduct a thorough parallel training study. Experimental results yield three key findings: First-Parallel Leap, a substantial leap in performance when transitioning from monolingual to just a single parallel language, and a predictable Parallel Scaling Law, revealing that cross-lingual reasoning transfer follows a power-law with the number of training parallel languages. Moreover, we identify the discrepancy between actual monolingual performance and the power-law prediction as Monolingual Generalization Gap, indicating that English-centric LRMs fail to fully generalize across languages. Our study challenges the assumption that LRM reasoning mirrors human cognition, providing critical insights for the development of more language-agnostic LRMs.
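
The scaling-law idea in this abstract can be illustrated with a small sketch. The functional form y = a * n^b, the log-log least-squares fit, and the numbers below are assumptions made for illustration; the abstract does not state the exact form or the metric the authors use. The gap function follows the abstract's description: the power-law prediction for the monolingual case (n = 1) compared with the measured monolingual score.

```python
import math


def fit_power_law(ns, ys):
    """Fit y = a * n**b by ordinary least squares in log-log space."""
    log_n = [math.log(n) for n in ns]
    log_y = [math.log(y) for y in ys]
    mean_n = sum(log_n) / len(log_n)
    mean_y = sum(log_y) / len(log_y)
    sxx = sum((x - mean_n) ** 2 for x in log_n)
    sxy = sum((x - mean_n) * (y - mean_y) for x, y in zip(log_n, log_y))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_n)  # intercept in log-log space = log(a)
    return a, b


def monolingual_gap(a, b, actual_monolingual):
    """Power-law prediction at n = 1 minus the measured monolingual score."""
    predicted = a * 1 ** b  # n = 1, so the prediction equals a
    return predicted - actual_monolingual


# Hypothetical transfer scores for 2, 4, and 8 parallel languages
# (synthetic values, not taken from the paper).
ns = [2, 4, 8]
ys = [3.0 * n ** 0.5 for n in ns]
a, b = fit_power_law(ns, ys)
gap = monolingual_gap(a, b, actual_monolingual=2.5)
```

Fitting only on the multilingual points and extrapolating back to n = 1 is one way to read "discrepancy between actual monolingual performance and the power-law prediction"; the paper may define the fit differently.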

[89] F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

Ziyin Zhang,Zihan Liao,Hang Yu,Peng Di,Rui Wang

Main category: cs.CL

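TL;DR: F2LLM is a family of efficient, low-cost embedding models finetuned directly from foundation models, available in 0.6B, 1.7B, and 4B sizes, that performs strongly on the MTEB leaderboard.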

Details

Motivation: Existing high-performing embedding models depend on large-scale contrastive pretraining and expensive synthetic data, which makes them costly to train and hard to reproduce, so a cheaper and reproducible alternative is needed. Method: Finetune foundation models directly on 6 million query-document-negative tuples built from open-source, non-synthetic data, avoiding complex training pipelines and synthetic data generation. Result: F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall; F2LLM-1.7B ranks 1st within its parameter range. Conclusion: F2LLM strikes a good balance between training cost, model size, and performance, and serves as a strong, reproducible, and budget-friendly baseline. Abstract: We introduce F2LLM - Foundation to Feature Large Language Models, a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is directly finetuned from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future work.

[90] Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation

Raphael Tang,Crystina Zhang,Wenyan Li,Carmen Lai,Pontus Stenetorp,Yao Lu

Main category: cs.CL

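TL;DR: This paper questions the conventional practice in arena-style LLM evaluation of treating a draw as evidence that two models are equally capable, and argues that draws say more about query difficulty than about the models being close in skill. On three real-world datasets, skipping rating updates for draws improves prediction accuracy by 1-3% (relative), and the authors recommend that future rating systems reconsider draw semantics and account for query properties.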

Details

Motivation: Rating methods based on Elo and similar systems treat a draw as a sign that two models are equally capable and pull their ratings together. The authors question this assumption and argue that draws may stem more from how easy or objective the query is than from equal model ability. Method: On three real-world LLM arena datasets, the authors analyze how draws relate to factors such as query difficulty and subjectivity, and compare the battle outcome prediction accuracy of four mainstream rating systems with and without rating updates for draws. Result: Ignoring rating updates for draws yields a 1-3% relative gain in prediction accuracy for all four rating systems; further analysis finds that draws are more common for queries rated as very easy and as highly objective, with risk ratios of 1.37 and 1.35, respectively. Conclusion: A draw should not simply be read as equal model ability but as a signal of query difficulty; future rating systems should redesign how draws are handled and consider query properties when modeling rating dynamics. Abstract: In arena-style evaluation of large language models (LLMs), two LLMs respond to a user query, and the user chooses the winning response or deems the "battle" a draw, resulting in an adjustment to the ratings of both models. The prevailing approach for modeling these rating dynamics is to view battles as two-player game matches, as in chess, and apply the Elo rating system and its derivatives. In this paper, we critically examine this paradigm. Specifically, we question whether a draw genuinely means that the two models are equal and hence whether their ratings should be equalized. Instead, we conjecture that draws are more indicative of query difficulty: if the query is too easy, then both models are more likely to succeed equally. On three real-world arena datasets, we show that ignoring rating updates for draws yields a 1-3% relative increase in battle outcome prediction accuracy (which includes draws) for all four rating systems studied. Further analyses suggest that draws occur more for queries rated as very easy and those rated as highly objective, with risk ratios of 1.37 and 1.35, respectively. We recommend that future rating systems reconsider existing draw semantics and account for query properties in rating updates.
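
To make the draw-handling question concrete, here is a minimal sketch of a textbook Elo update with an optional switch that skips the update when a battle ends in a draw. The `ignore_draws` flag, the K-factor of 32, and the function names are illustrative assumptions; the paper studies four rating systems whose exact configurations are not given in this abstract.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, outcome, k=32.0, ignore_draws=False):
    """Return updated (r_a, r_b).

    outcome is 1.0 if A wins, 0.0 if B wins, and 0.5 for a draw.
    With ignore_draws=True (hypothetical flag), a draw leaves both ratings unchanged.
    """
    if ignore_draws and outcome == 0.5:
        return r_a, r_b
    delta = k * (outcome - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# A draw between unequal models: standard Elo pulls the ratings together,
# while the draw-ignoring variant leaves them as they were.
standard = elo_update(1200.0, 1000.0, 0.5)
skipped = elo_update(1200.0, 1000.0, 0.5, ignore_draws=True)
```

Under the standard rule, the higher-rated model loses points on every draw, which is exactly the "equalizing" behavior the paper calls into question.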

cs.CV [Back]

[91] LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration

Alessio Spagnoletti,Andrés Almansa,Marcelo Pereyra

Main category: cs.CV

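TL;DR: This paper proposes LVTINO, the first zero-shot inverse solver for high-definition video restoration built on Video Consistency Models (VCMs). By using the priors distilled into VCMs, it achieves high-quality, efficient video reconstruction while maintaining temporal and measurement consistency.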

Details

Motivation: Existing methods that process video frame by frame with image diffusion models struggle to capture temporal dependencies and produce temporally inconsistent restorations, so an efficient video restoration method is needed that recovers fine spatial detail while modeling temporal dynamics. Method: Propose LVTINO, which uses recent Video Consistency Models (VCMs) as priors; these models are distilled from video latent diffusion models and explicitly capture temporal causality. A conditioning mechanism that needs no automatic differentiation enables efficient inference with only a few neural network forward passes. Result: Across a range of video inverse problems, LVTINO clearly outperforms current methods that apply image LDMs frame by frame, reaching state-of-the-art reconstruction fidelity and perceptual quality while remaining computationally efficient and producing smooth transitions between frames. Conclusion: LVTINO extends zero-shot diffusion priors to high-resolution video restoration, confirms the value of VCM-based priors for video inverse problems, and sets a new benchmark and direction for efficient, consistent video restoration. Abstract: Computational imaging methods increasingly rely on powerful generative diffusion models to tackle challenging image restoration tasks. In particular, state-of-the-art zero-shot image inverse solvers leverage distilled text-to-image latent diffusion models (LDMs) to achieve unprecedented accuracy and perceptual quality with high computational efficiency. However, extending these advances to high-definition video restoration remains a significant challenge, due to the need to recover fine spatial detail while capturing subtle temporal dependencies. Consequently, methods that naively apply image-based LDM priors on a frame-by-frame basis often result in temporally inconsistent reconstructions. We address this challenge by leveraging recent advances in Video Consistency Models (VCMs), which distill video latent diffusion models into fast generators that explicitly capture temporal causality. Building on this foundation, we propose LVTINO, the first zero-shot or plug-and-play inverse solver for high definition video restoration with priors encoded by VCMs. Our conditioning mechanism bypasses the need for automatic differentiation and achieves state-of-the-art video reconstruction quality with only a few neural function evaluations, while ensuring strong measurement consistency and smooth temporal transitions across frames. Extensive experiments on a diverse set of video inverse problems show significant perceptual improvements over current state-of-the-art methods that apply image LDMs frame by frame, establishing a new benchmark in both reconstruction fidelity and computational efficiency.

[92] Image Generation Based on Image Style Extraction

Shuochen Chang

Main category: cs.CV

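TL;DR: Proposes a style-extraction-based image generation method with three-stage training, which uses a style encoder and a projection layer to achieve fine-grained, text-guided stylized image generation.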

Details

Motivation: Existing text-to-image generation models cannot precisely describe and control fine-grained styles, and the style information in a reference image is hard to align with text conditions. Method: Design a style encoder and a style projection layer that extract a fine-grained style representation from a single reference image and inject it into the generative model while leaving its structure unchanged; adopt a three-stage training strategy and build the Style30k-captions dataset for training. Result: Achieves image generation with fine-grained style control without modifying the structure of the downstream generative model, and effectively aligns style representations with text representations. Conclusion: The method makes effective use of a pretrained generative model to achieve fine-grained style-controlled generation from a single reference image, improving style transfer accuracy while keeping content consistent. Abstract: Text-to-image generation models face a practical limitation: fine-grained styles cannot be precisely described and controlled in natural language, while the guidance from stylized reference images is difficult to align directly with the textual conditions used in conventional text-guided generation. This study focuses on how to maximize the generative capability of the pretrained generative model by obtaining fine-grained stylistic representations from a single given stylistic reference image and injecting them into the generative model without changing the structural framework of the downstream generative model, so as to achieve fine-grained, controlled, stylized image generation. We propose a style-extraction-based image generation method with three-stage training, which uses a style encoder and a style projection layer to align the style representations with the textual representations and thereby realize fine-grained, text-prompt-based style-guided generation. In addition, this study constructs the Style30k-captions dataset, whose samples are triplets of image, style label, and text description, to train the style encoder and style projection layer.

[93] EvoStruggle: A Dataset Capturing the Evolution of Struggle across Activities and Skill Levels

Shijia Feng,Michael Wray,Walterio Mayol-Cuevas

Main category: cs.CV

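TL;DR: This paper presents EvoStruggle, a new dataset with 61.68 hours of video and 5,385 annotated struggle segments for studying how struggle evolves during skill learning. Framing struggle determination as temporal action localization, experiments show that existing models can detect struggle on unseen tasks, generalizing to some extent across tasks (mAP 34.56%) and across activities (mAP 19.24%), which supports the transferability of the struggle concept.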

Details

Motivation: Accurately identifying struggle during skill acquisition is essential for optimizing human learning and building effective assistive systems. Existing manipulation datasets do not address how struggle evolves over time, so a dedicated dataset is needed to study this dynamic process. Method: Collect a dataset of 2,793 videos in which 76 participants repeat tasks five times across four activities (tying knots, origami, tangram puzzles, and shuffling cards). Define struggle determination as a temporal action localization task and evaluate Temporal Action Localization models in cross-task and cross-activity settings. Result: The models reach an average mAP of 34.56% across tasks and 19.24% across activities, showing that struggle detection transfers to some degree but leaves room for improvement. The dataset is publicly available. Conclusion: Struggle is a concept that transfers across different skill-based tasks; the study provides new data and a methodological basis for understanding dynamic challenges during learning and supports the development of personalized learning assistance systems. Abstract: The ability to determine when a person struggles during skill acquisition is crucial for both optimizing human learning and enabling the development of effective assistive systems. As skills develop, the type and frequency of struggles tend to change, and understanding this evolution is key to determining the user's current stage of learning. However, existing manipulation datasets have not focused on how struggle evolves over time. In this work, we collect a dataset for struggle determination, featuring 61.68 hours of video recordings, 2,793 videos, and 5,385 annotated temporal struggle segments collected from 76 participants. The dataset includes 18 tasks grouped into four diverse activities -- tying knots, origami, tangram puzzles, and shuffling cards, representing different task variations. In addition, participants repeated the same task five times to capture their evolution of skill. We define the struggle determination problem as a temporal action localization task, focusing on identifying and precisely localizing struggle segments with start and end times. Experimental results show that Temporal Action Localization models can successfully learn to detect struggle cues, even when evaluated on unseen tasks or activities. The models attain an overall average mAP of 34.56% when generalizing across tasks and 19.24% across activities, indicating that struggle is a transferable concept across various skill-based tasks while still posing challenges for further improvement in struggle detection. Our dataset is available at https://github.com/FELIXFENG2019/EvoStruggle.

[94] SPUS: A Lightweight and Parameter-Efficient Foundation Model for PDEs

Abu Bucker Siddik,Diane Oyen,Alexander Most,Michal Kucer,Ayan Biswas

Main category: cs.CV

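TL;DR: Proposes the Small PDE U-Net Solver (SPUS), a compact and efficient foundation model for solving a wide range of partial differential equations (PDEs) in a unified way. Built on a lightweight residual U-Net architecture and an auto-regressive pretraining strategy, it achieves state-of-the-art generalization with few parameters and little fine-tuning data.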

Details

Motivation: Existing PDE foundation models are mostly based on complex, computationally expensive Transformer architectures, and the lack of efficient lightweight alternatives limits their use in resource-constrained settings. Method: Use a lightweight residual U-Net as the base architecture and design a simple yet effective auto-regressive pretraining strategy that mimics the behavior of numerical solvers to learn the underlying physics; pretrain on a diverse set of fluid dynamics PDEs and evaluate on several unseen downstream PDE tasks. Result: SPUS generalizes well on 6 challenging unseen downstream PDE tasks, reaching state-of-the-art performance with significantly fewer parameters and only minimal fine-tuning data. Conclusion: SPUS demonstrates the potential of U-Net-style architectures for PDE foundation models and offers a general PDE solving approach with high parameter efficiency and low resource requirements. Abstract: We introduce Small PDE U-Net Solver (SPUS), a compact and efficient foundation model (FM) designed as a unified neural operator for solving a wide range of partial differential equations (PDEs). Unlike existing state-of-the-art PDE FMs, which are primarily based on large complex transformer architectures with high computational and parameter overhead, SPUS leverages a lightweight residual U-Net-based architecture that has been largely underexplored as a foundation model architecture in this domain. To enable effective learning in this minimalist framework, we utilize a simple yet powerful auto-regressive pretraining strategy which closely replicates the behavior of numerical solvers to learn the underlying physics. SPUS is pretrained on a diverse set of fluid dynamics PDEs and evaluated across 6 challenging unseen downstream PDEs spanning various physical systems. Experimental results demonstrate that SPUS using residual U-Net based architecture achieves state-of-the-art generalization on these downstream tasks while requiring significantly fewer parameters and minimal fine-tuning data, highlighting its potential as a highly parameter-efficient FM for solving diverse PDE systems.

[95] DisCo: Reinforcement with Diversity Constraints for Multi-Human Generation

Shubhankar Borse,Farzad Farhadzadeh,Munawar Hayat,Fatih Porikli

Main category: cs.CV

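TL;DR: This paper proposes DisCo, a reinforcement-learning-based framework that uses diversity constraints to address identity confusion in multi-human image generation, significantly improving identity diversity and generation quality.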

Details

Motivation: Existing text-to-image models duplicate faces, confuse identities, and miscount people when generating multi-human scenes, and there is no effective method for optimizing identity diversity. Method: Propose the DisCo framework, which fine-tunes flow-matching models with Group-Relative Policy Optimization (GRPO) and a compositional reward designed to reduce intra-image facial similarity, limit cross-sample identity repetition, enforce accurate person counts, and preserve visual quality, combined with a single-stage curriculum that stabilizes training. Result: On the DiverseHumans Testset, DisCo achieves 98.6% Unique Face Accuracy and a near-perfect Global Identity Spread, outperforming open-source and proprietary methods (e.g., Gemini, GPT-Image) while maintaining good perceptual quality. Conclusion: DisCo is a scalable solution that needs no extra annotations, effectively resolves the identity crisis in generative models, and sets a new benchmark for compositional multi-human generation. Abstract: State-of-the-art text-to-image models excel at realism but collapse on multi-human prompts - duplicating faces, merging identities, and miscounting individuals. We introduce DisCo (Reinforcement with Diversity Constraints), the first RL-based framework to directly optimize identity diversity in multi-human generation. DisCo fine-tunes flow-matching models via Group-Relative Policy Optimization (GRPO) with a compositional reward that (i) penalizes intra-image facial similarity, (ii) discourages cross-sample identity repetition, (iii) enforces accurate person counts, and (iv) preserves visual fidelity through human preference scores. A single-stage curriculum stabilizes training as complexity scales, requiring no extra annotations. On the DiverseHumans Testset, DisCo achieves 98.6 Unique Face Accuracy and near-perfect Global Identity Spread - surpassing both open-source and proprietary methods (e.g., Gemini, GPT-Image) while maintaining competitive perceptual quality. Our results establish DisCo as a scalable, annotation-free solution that resolves the long-standing identity crisis in generative models and sets a new benchmark for compositional multi-human generation.
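
The reward structure described above can be sketched as follows. This is only an illustration: the weighted-sum combination, the term names, and the weights are assumptions (the abstract lists the four reward components but not how they are combined), and the advantage function shows the group-relative normalization commonly associated with GRPO rather than DisCo's exact training code.

```python
import math


def composite_reward(terms, weights):
    """Combine named reward terms as a weighted sum (assumed combination rule)."""
    return sum(weights[name] * value for name, value in terms.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical per-image reward terms and weights (not from the paper).
weights = {"intra_image": 1.0, "cross_sample": 0.5, "count": 1.0, "preference": 0.5}
sample = {"intra_image": 0.8, "cross_sample": 0.6, "count": 1.0, "preference": 0.4}
reward = composite_reward(sample, weights)

# Advantages for a group of three generations of the same prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Because advantages are computed relative to the group mean, a generation is rewarded only for being better than its siblings on the same prompt, which removes the need for a separately trained value function.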

[96] GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings

Angel Daruna,Nicholas Meegan,Han-Pang Chiu,Supun Samarasekera,Rakesh Kumar

Main category: cs.CV

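TL;DR: This paper proposes a new geographic representation that improves worldwide visual geo-localization by aligning the visual representation of the query image with a hierarchy of geographic embeddings. It also introduces an effective way to fuse appearance features with the semantic segmentation map, which further improves localization performance.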

Details

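Motivation: Existing visual geo-localization methods still fall short in learning geographic representations, especially for precise localization at a global scale, so a more effective mechanism for aligning geographic and visual representations is needed. Method: Formulate geo-localization as aligning a visual representation with a learned geographic representation; model the world as a hierarchy of geographic embeddings, and fuse the appearance features of the query image with its semantic segmentation map to build a robust visual representation. Result: Surpasses prior state-of-the-art methods and recent Large Vision-Language Models (LVLMs) on 22 of 25 metrics across five benchmark datasets. Ablation studies indicate that the gains come mainly from combining the geographic and visual representations. Conclusion: The proposed hierarchical geographic representation and multimodal visual fusion strategy significantly improve worldwide visual geo-localization and confirm the importance of jointly modeling geographic and visual information. Abstract: Worldwide visual geo-localization seeks to determine the geographic location of an image anywhere on Earth using only its visual content. Learned representations of geography for visual geo-localization remain an active research topic despite much progress. We formulate geo-localization as aligning the visual representation of the query image with a learned geographic representation. Our novel geographic representation explicitly models the world as a hierarchy of geographic embeddings. Additionally, we introduce an approach to efficiently fuse the appearance features of the query image with its semantic segmentation map, forming a robust visual representation. Our main experiments demonstrate improved all-time bests in 22 out of 25 metrics measured across five benchmark datasets compared to prior state-of-the-art (SOTA) methods and recent Large Vision-Language Models (LVLMs). Additional ablation studies support the claim that these gains are primarily driven by the combination of geographic and visual representations.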

Nilay Naharas,Dang Nguyen,Nesihan Bulut,Mohammadhossein Bateni,Vahab Mirrokni,Baharan Mirzasoleiman

Main category: cs.CV

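TL;DR: This paper proposes XMAS, a new data selection method based on the similarity of cross-modal attention matrices, for efficient fine-tuning of Large Vision-Language Models (LVLMs). It removes redundancy in training data by clustering examples and sampling a balanced subset, significantly reducing data volume and training time while fully preserving performance.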

Details

Motivation: Existing data selection methods perform poorly on LVLMs and fail to beat random selection. This paper aims to provide the first principled data-efficient learning method for instruction tuning of LVLMs. Method: Based on a theoretical result that examples with similar cross-modal attention matrices during instruction tuning have similar gradients and therefore affect parameter updates in similar ways. XMAS fine-tunes a small proxy LVLM, clusters examples by the trajectories of the top singular values of their attention matrices, and samples a balanced subset from the clusters to remove redundancy. Result: XMAS can discard 50% of the LLaVA-665k dataset and 85% of Vision-Flan while fully preserving the performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up training by 1.2x. On LLaVA-665k, this is 30% more data reduction than the best baseline. Conclusion: XMAS is the first principled data selection method designed for LVLMs; it identifies and removes redundancy in training data, preserving model performance while substantially cutting data volume and training cost, and advances data-efficient LVLM training. Abstract: Data-efficient learning aims to eliminate redundancy in large training datasets by training models on smaller subsets of the most informative examples. While data selection has been extensively explored for vision models and large language models (LLMs), it remains underexplored for Large Vision-Language Models (LVLMs). Notably, none of the existing methods can outperform random selection at different subset sizes. In this work, we propose the first principled method for data-efficient instruction tuning of LVLMs. We prove that examples with similar cross-modal attention matrices during instruction tuning have similar gradients. Thus, they influence model parameters in a similar manner and convey the same information to the model during training. Building on this insight, we propose XMAS, which clusters examples based on the trajectories of the top singular values of their attention matrices obtained from fine-tuning a small proxy LVLM. By sampling a balanced subset from these clusters, XMAS effectively removes redundancy in large-scale LVLM training data. Extensive experiments show that XMAS can discard 50% of the LLaVA-665k dataset and 85% of the Vision-Flan dataset while fully preserving performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up its training by 1.2x. This is 30% more data reduction compared to the best baseline for LLaVA-665k. The project's website can be found at https://bigml-cs-ucla.github.io/XMAS-project-page/.

[98] Purrception: Variational Flow Matching for Vector-Quantized Image Generation

Răzvan-Andrei Matişan,Vincent Tao Hu,Grigory Bartosh,Björn Ommer,Cees G. M. Snoek,Max Welling,Jan-Willem van de Meent,Mohammad Mahdi Derakhshani,Floor Eijkelboom

Main category: cs.CV

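TL;DR: Purrception is a variational flow matching method for vector-quantized image generation. It combines discrete supervision with continuous transport dynamics by computing velocity fields in the continuous embedding space while learning categorical posteriors over codebook indices.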

Details

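Motivation: Existing image generation methods face a trade-off between continuous and discrete flow matching, and none combines the geometric awareness of continuous methods with the explicit categorical supervision of discrete methods. A new method that merges the strengths of both is needed to improve training efficiency and generation quality. Method: Propose Purrception, which applies Variational Flow Matching to vector-quantized latents by modeling velocity fields in the continuous embedding space and learning categorical posteriors over codebook indices, combining continuous dynamics with discrete supervision and supporting uncertainty estimation and temperature-controlled generation. Result: On ImageNet-1k 256x256 image generation, Purrception converges faster in training than both continuous and discrete flow matching baselines while achieving FID scores competitive with state-of-the-art models. Conclusion: Variational flow matching effectively combines continuous transport dynamics with discrete categorical supervision, improving training efficiency while preserving generation quality, and offers a better framework for vector-quantized image generation. Abstract: We introduce Purrception, a variational flow matching approach for vector-quantized image generation that provides explicit categorical supervision while maintaining continuous transport dynamics. Our method adapts Variational Flow Matching to vector-quantized latents by learning categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space. This combines the geometric awareness of continuous methods with the discrete supervision of categorical approaches, enabling uncertainty quantification over plausible codes and temperature-controlled generation. We evaluate Purrception on ImageNet-1k 256x256 generation. Training converges faster than both continuous flow matching and discrete flow matching baselines while achieving competitive FID scores with state-of-the-art models. This demonstrates that Variational Flow Matching can effectively bridge continuous transport and discrete supervision for improved training efficiency in image generation.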

[99] AortaDiff: A Unified Multitask Diffusion Framework For Contrast-Free AAA Imaging

Yuxuan Ou,Ning Bi,Jiazhen Pan,Jiancheng Yang,Boliang Yu,Usama Zidan,Regent Lee,Vicente Grau

Main category: cs.CV

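TL;DR: Proposes a multitask learning framework based on conditional diffusion models that generates synthetic contrast-enhanced CT from non-contrast CT while simultaneously segmenting the aortic lumen and thrombus, reducing contrast agent use and improving segmentation accuracy.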

Details

Motivation: Reduce the use of iodinated contrast agents and the nephrotoxicity, allergic reactions, and environmental harm they cause, while overcoming the error accumulation and underuse of shared semantic structure in conventional multi-stage methods. Method: Combine conditional diffusion models with multi-task learning for end-to-end joint optimization of image synthesis and anatomical segmentation; share encoder and decoder parameters across tasks and train with a semi-supervised strategy, without requiring an initial predicted mask. Result: Validated on data from 264 patients, the model reaches a PSNR of 25.61 dB (versus 23.80 dB for a single-task CDM); lumen Dice improves to 0.89 (from 0.87) and thrombus Dice to 0.53 (from 0.48); clinical measurement errors drop markedly, with lumen diameter MAE of 4.19 mm (from 5.78 mm) and thrombus area error falling from 41.45% to 33.85%. Conclusion: The unified framework outperforms existing methods in both synthetic contrast-enhanced CT generation and aortic structure segmentation, shows potential for clinical use, and can reduce reliance on real contrast-enhanced scans. Abstract: While contrast-enhanced CT (CECT) is standard for assessing abdominal aortic aneurysms (AAA), the required iodinated contrast agents pose significant risks, including nephrotoxicity, patient allergies, and environmental harm. To reduce contrast agent use, recent deep learning methods have focused on generating synthetic CECT from non-contrast CT (NCCT) scans. However, most adopt a multi-stage pipeline that first generates images and then performs segmentation, which leads to error accumulation and fails to leverage shared semantic and anatomical structures. To address this, we propose a unified deep learning framework that generates synthetic CECT images from NCCT scans while simultaneously segmenting the aortic lumen and thrombus. Our approach integrates conditional diffusion models (CDM) with multi-task learning, enabling end-to-end joint optimization of image synthesis and anatomical segmentation. Unlike previous multitask diffusion models, our approach requires no initial predictions (e.g., a coarse segmentation mask), shares both encoder and decoder parameters across tasks, and employs a semi-supervised training strategy to learn from scans with missing segmentation labels, a common constraint in real-world clinical data. We evaluated our method on a cohort of 264 patients, where it consistently outperformed state-of-the-art single-task and multi-stage models. For image synthesis, our model achieved a PSNR of 25.61 dB, compared to 23.80 dB from a single-task CDM. For anatomical segmentation, it improved the lumen Dice score to 0.89 from 0.87 and the challenging thrombus Dice score to 0.53 from 0.48 (nnU-Net). These segmentation enhancements led to more accurate clinical measurements, reducing the lumen diameter MAE to 4.19 mm from 5.78 mm and the thrombus area error to 33.85% from 41.45% when compared to nnU-Net. Code is available at https://github.com/yuxuanou623/AortaDiff.git.
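
For readers less familiar with the two headline metrics, the following sketch shows the standard definitions of the Dice score and PSNR on flattened inputs. It is a generic illustration, not the authors' evaluation code; the flat-list representation, the `max_val` default, and the toy inputs are assumptions.

```python
import math


def dice(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two flat binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * intersection / total


def psnr(reference, estimate, max_val=1.0):
    """Peak signal-to-noise ratio in dB for two flat intensity sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


# Toy inputs (not real CT data).
toy_dice = dice([1, 1, 0, 0], [1, 0, 0, 0])
toy_psnr = psnr([0.0, 0.0], [0.1, 0.1])
```

Dice is sensitive to small structures: a thin or fragmented thrombus mask loses a large fraction of its overlap from a few misclassified voxels, which is consistent with thrombus scores sitting well below lumen scores in the reported results.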

[100] From Videos to Indexed Knowledge Graphs -- Framework to Marry Methods for Multimodal Content Analysis and Understanding

Basem Rizk,Joel Walsh,Mark Core,Benjamin Nye

Main category: cs.CV

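TL;DR: Proposes a framework for efficiently building multimodal content analysis pipelines that convert videos into temporal semi-structured data and produce a queryable knowledge graph.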

Details

Motivation: Fusing open-source pretrained models with complex video data for multimodal analysis poses engineering and computational challenges. Method: Combine several pretrained models into a pipeline that converts videos into temporal semi-structured data, then translate that data into a frame-level indexed knowledge graph representation. Result: Produces a knowledge graph that is queryable and supports continual learning, allowing domain knowledge to be incorporated dynamically through interaction. Conclusion: The framework effectively supports multimodal video content analysis and knowledge updating while reducing development complexity and computational cost. Abstract: Analysis of multi-modal content can be tricky and computationally expensive, and can require a significant amount of engineering effort. A large body of work applies pre-trained models to static data, yet fusing these open-source models and methods with complex data such as videos is relatively challenging. In this paper, we present a framework that enables efficiently prototyping pipelines for multi-modal content analysis. We craft a candidate recipe for a pipeline, marrying a set of pre-trained models, to convert videos into a temporal semi-structured data format. We translate this structure further to a frame-level indexed knowledge graph representation that is query-able and supports continual learning, enabling the dynamic incorporation of new domain-specific knowledge through an interactive medium.

[101] WALT: Web Agents that Learn Tools

Viraj Prabhu,Yutong Dai,Matthew Fernandez,Jing Gu,Krithika Ramakrishnan,Yanqi Luo,Silvio Savarese,Caiming Xiong,Junnan Li,Zeyuan Chen,Ran Xu

Main category: cs.CV

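TL;DR: WALT is a new web agent framework that reverse-engineers a website's built-in functionality and wraps it as callable tools (such as search, filter, and create), so that agents complete tasks through high-level operations rather than low-level UI interactions, improving automation efficiency and robustness.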

Details

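Motivation: Existing web agent methods depend on brittle step-by-step UI operations and heavy LLM reasoning, and perform poorly under dynamic layouts and long-horizon tasks. Human users, in contrast, typically rely on high-level functionality that websites provide (such as search and sort) to complete tasks efficiently, so a more stable and more generalizable automation paradigm is needed. Method: Propose the WALT framework, which analyzes page structure and behavior to automatically discover website functionality and wrap it as reusable tools (such as search and create). Agents perform high-level operations by calling these tools, avoiding reliance on low-level actions such as clicking and typing and reducing the LLM reasoning burden. Result: On the VisualWebArena and WebArena benchmarks, WALT achieves higher task success rates with fewer steps and less dependence on LLM reasoning than existing methods, confirming its effectiveness and robustness on complex browser tasks. Conclusion: By abstracting website functionality into callable tools, WALT shifts from brittle step-by-step reasoning to reliable tool invocation and offers a more efficient and more generalizable approach to browser automation. Abstract: Web agents promise to automate complex browser tasks, but current methods remain brittle -- relying on step-by-step UI interactions and heavy LLM reasoning that break under dynamic layouts and long horizons. Humans, by contrast, exploit website-provided functionality through high-level operations like search, filter, and sort. We introduce WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into reusable invocable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites -- spanning discovery (search, filter, sort), communication (post, comment, upvote), and content management (create, edit, delete). Tools abstract away low-level execution: instead of reasoning about how to click and type, agents simply call search(query) or create(listing). This shifts the computational burden from fragile step-by-step reasoning to reliable tool invocation. On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning, establishing a robust and generalizable paradigm for browser automation.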

[102] MATCH: Multi-faceted Adaptive Topo-Consistency for Semi-Supervised Histopathology Segmentation

Meilong Xu,Xiaoling Hu,Shahira Abousamra,Chen Li,Chao Chen

Main category: cs.CV

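TL;DR: Proposes a semi-supervised segmentation framework that uses multiple perturbed predictions and topological consistency constraints to reduce topological errors in histopathology images, improving segmentation robustness and accuracy.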

Details

Motivation: In histopathology images, extracting semantic structure from unlabeled data is difficult, especially when objects are densely distributed; conventional methods are easily disturbed by noise and struggle to preserve key topological features. Method: Generate multiple perturbed predictions through stochastic dropout and temporal training snapshots, and enforce topological consistency across them using a new matching strategy that combines spatial overlap with global structural alignment, thereby preserving meaningful biological structures. Result: Experiments show that the method significantly reduces topological errors and improves segmentation accuracy and robustness, outperforming existing semi-supervised methods on multiple datasets. Conclusion: The proposed framework makes effective use of unlabeled data to preserve key topological structures and provides more reliable segmentations for histopathology image analysis. Abstract: In semi-supervised segmentation, capturing meaningful semantic structures from unlabeled data is essential. This is particularly challenging in histopathology image analysis, where objects are densely distributed. To address this issue, we propose a semi-supervised segmentation framework designed to robustly identify and preserve relevant topological features. Our method leverages multiple perturbed predictions obtained through stochastic dropouts and temporal training snapshots, enforcing topological consistency across these varied outputs. This consistency mechanism helps distinguish biologically meaningful structures from transient and noisy artifacts. A key challenge in this process is to accurately match the corresponding topological features across the predictions in the absence of ground truth. To overcome this, we introduce a novel matching strategy that integrates spatial overlap with global structural alignment, minimizing discrepancies among predictions. Extensive experiments demonstrate that our approach effectively reduces topological errors, resulting in more robust and accurate segmentations essential for reliable downstream analysis. Code is available at https://github.com/Melon-Xu/MATCH.

[103] Towards Better Optimization For Listwise Preference in Diffusion Models

Jiamu Bai,Xin Yu,Meilong Xu,Weitao Lu,Xin Pan,Kiwan Maeng,Daniel Kifer,Jian Wang,Yu Wang

Main category: cs.CV

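TL;DR: This paper proposes Diffusion-LPO, a simple and effective framework for listwise preference optimization in diffusion models. It extends the DPO objective under the Plackett-Luce model to exploit ranked data and clearly outperforms pairwise DPO baselines.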

Details

Motivation: Existing reinforcement learning methods based on pairwise preferences cannot fully exploit the ranking information in human feedback, which limits alignment precision. Method: Propose the Diffusion-LPO framework, which extends the DPO objective to a listwise form under the Plackett-Luce model and optimizes on ranked image data. Result: Across tasks including text-to-image generation, image editing, and personalized preference alignment, Diffusion-LPO outperforms pairwise DPO baselines in both visual quality and preference alignment. Conclusion: Diffusion-LPO models human preferences more precisely and effectively improves the alignment of diffusion models with human preferences. Abstract: Reinforcement learning from human feedback (RLHF) has proven effective for aligning text-to-image (T2I) diffusion models with human preferences. Although Direct Preference Optimization (DPO) is widely adopted for its computational efficiency and avoidance of explicit reward modeling, its applications to diffusion models have primarily relied on pairwise preferences. The precise optimization of listwise preferences remains largely unaddressed. In practice, human feedback on image preferences often contains implicit ranked information, which conveys more precise human preferences than pairwise comparisons. In this work, we propose Diffusion-LPO, a simple and effective framework for Listwise Preference Optimization in diffusion models with listwise data. Given a caption, we aggregate user feedback into a ranked list of images and derive a listwise extension of the DPO objective under the Plackett-Luce model. Diffusion-LPO enforces consistency across the entire ranking by encouraging each sample to be preferred over all of its lower-ranked alternatives. We empirically demonstrate the effectiveness of Diffusion-LPO across various tasks, including text-to-image generation, image editing, and personalized preference alignment. Diffusion-LPO consistently outperforms pairwise DPO baselines on visual quality and preference alignment.
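
The Plackett-Luce model mentioned above assigns a probability to a full ranking by repeatedly choosing the best remaining item with softmax probability. A minimal sketch of the resulting negative log-likelihood is given below. The plain float scores are an assumption for illustration; in a DPO-style objective they would presumably be implicit rewards derived from model and reference log-probabilities, and the paper's exact diffusion-specific loss is not reproduced here.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def plackett_luce_nll(scores_ranked):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    scores_ranked lists item scores from most preferred to least preferred.
    Each position contributes -log softmax of that item's score over itself
    and all lower-ranked items.
    """
    nll = 0.0
    for i in range(len(scores_ranked) - 1):  # the last item contributes zero
        nll -= scores_ranked[i] - log_sum_exp(scores_ranked[i:])
    return nll


# Hypothetical scores for a ranked list of three images (best first).
loss = plackett_luce_nll([2.0, 1.0, -0.5])
```

With a list of length two, this loss reduces to the pairwise logistic loss used in standard DPO, which is why the listwise objective can be read as a strict generalization of the pairwise one.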

[104] Growing Visual Generative Capacity for Pre-Trained MLLMs

Hanyu Wang,Jiaming Han,Ziyan Yang,Qi Zhao,Shanchuan Lin,Xiangyu Yue,Abhinav Shrivastava,Zhenheng Yang,Hao Chen

Main category: cs.CV

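TL;DR: This paper proposes Bridge, a pure autoregressive unified multimodal large language model. Using a Mixture-of-Transformers architecture and a semantic-to-pixel discrete representation, it supports both image understanding and generation in a single framework, balances semantic alignment with pixel fidelity, and matches or exceeds prior models on multiple benchmarks with less training data and time.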

Details

Motivation: Existing multimodal large language models face difficulties in unifying understanding and generation: hybrid approaches produce high-quality images but break the autoregressive paradigm, while pure autoregressive approaches trade off semantic alignment against pixel fidelity. A unified model is needed that stays autoregressive and efficiently supports both multimodal understanding and generation. Method: Propose Bridge, which uses a Mixture-of-Transformers architecture to add generative ability on top of pretrained visual understanding models, and design a semantic-to-pixel discrete representation that combines compact semantic tokens with fine-grained pixel tokens to achieve high-quality image generation and good language alignment. Result: Bridge performs strongly on multiple multimodal understanding and generation benchmarks, matching or exceeding existing unified MLLMs, with only a 7.9% increase in sequence length and with less training data and shorter training time. Conclusion: Bridge achieves unified multimodal understanding and generation within a pure autoregressive framework, improving generation quality and semantic alignment while keeping the model consistent, and offers a new approach to building efficient unified multimodal models. Abstract: Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models remains challenging: hybrid approaches combine continuous embeddings with diffusion or flow-based objectives, producing high-quality images but breaking the autoregressive paradigm, while pure autoregressive approaches unify text and image prediction over discrete visual tokens but often face trade-offs between semantic alignment and pixel-level fidelity. In this work, we present Bridge, a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability through a Mixture-of-Transformers architecture, enabling both image understanding and generation within a single next-token prediction framework. To further improve visual generation fidelity, we propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens, achieving strong language alignment and precise description of visual details with only a 7.9% increase in sequence length. Extensive experiments across diverse multimodal benchmarks demonstrate that Bridge achieves competitive or superior results in both understanding and generation benchmarks, while requiring less training data and reduced training time compared to prior unified MLLMs.

[105] Robust Classification of Oral Cancer with Limited Training Data

Akshay Bhagwan Sonawane,Lena D. Swamikannan,Lakshman Tamil

Main category: cs.CV

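TL;DR: Proposes a hybrid model that combines a CNN with Bayesian deep learning for oral cancer classification from small training sets, using variational inference for uncertainty quantification and significantly improving reliability and generalization when data are scarce.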

Details

Motivation: Conventional deep learning models overfit easily on small samples and lack reliability, which makes it hard to meet the need for early oral cancer diagnosis in regions with scarce medical resources. Method: Combines a convolutional neural network (CNN) with Bayesian deep learning, uses variational inference to model uncertainty, trains on color photographs captured with smartphones, and evaluates on three distinct test sets. Result: Reaches 94% accuracy on a dataset distributed similarly to the training data; on varied real-world data the model reaches 88% accuracy versus 72.94% for a conventional CNN, and it shows well-behaved confidence, with low uncertainty on correctly classified samples and high uncertainty on misclassified ones. Conclusion: Bayesian deep learning effectively improves reliability and generalization under small-sample conditions and is suited to early oral cancer screening in resource-limited settings. Abstract: Oral cancer ranks among the most prevalent cancers globally, with a particularly high mortality rate in regions lacking adequate healthcare access. Early diagnosis is crucial for reducing mortality; however, challenges persist due to limited oral health programs, inadequate infrastructure, and a shortage of healthcare practitioners. Conventional deep learning models, while promising, often rely on point estimates, leading to overconfidence and reduced reliability. Critically, these models require large datasets to mitigate overfitting and ensure generalizability, an unrealistic demand in settings with limited training data. To address these issues, we propose a hybrid model that combines a convolutional neural network (CNN) with Bayesian deep learning for oral cancer classification using small training sets. This approach employs variational inference to enhance reliability through uncertainty quantification. The model was trained on photographic color images captured by smartphones and evaluated on three distinct test datasets. The proposed method achieved 94% accuracy on a test dataset with a distribution similar to that of the training data, comparable to traditional CNN performance. Notably, for real-world photographic image data, despite limitations and variations differing from the training dataset, the proposed model demonstrated superior generalizability, achieving 88% accuracy on diverse datasets compared to 72.94% for traditional CNNs, even with a smaller dataset. Confidence analysis revealed that the model exhibits low uncertainty (high confidence) for correctly classified samples and high uncertainty (low confidence) for misclassified samples. These results underscore the effectiveness of Bayesian inference in data-scarce environments in enhancing early oral cancer diagnosis by improving model reliability and generalizability.
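
The confidence behavior described in the abstract above (low uncertainty on correct predictions, high uncertainty on misclassified ones) can be illustrated with a minimal sketch. It assumes the Bayesian model is sampled several times per image and that predictive entropy serves as the uncertainty measure. Both are assumptions for illustration, since the abstract does not say how uncertainty is computed, and the probabilities below are made-up numbers rather than results from the paper.

```python
import math


def predictive_summary(mc_probs):
    """Average class probabilities over stochastic forward passes.

    mc_probs: list of per-pass softmax vectors for one image.
    Returns (mean_probs, predicted_class, entropy_in_nats).
    """
    n_samples = len(mc_probs)
    n_classes = len(mc_probs[0])
    mean_probs = [sum(p[c] for p in mc_probs) / n_samples for c in range(n_classes)]
    predicted = max(range(n_classes), key=lambda c: mean_probs[c])
    # Entropy of the averaged distribution: higher means less confident.
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    return mean_probs, predicted, entropy


# Hypothetical outputs from three sampled passes (illustrative values only).
confident_passes = [[0.95, 0.05], [0.97, 0.03], [0.96, 0.04]]
uncertain_passes = [[0.90, 0.10], [0.20, 0.80], [0.50, 0.50]]

_, confident_class, confident_entropy = predictive_summary(confident_passes)
_, uncertain_class, uncertain_entropy = predictive_summary(uncertain_passes)
```

Passes that agree with each other yield a low entropy, while passes that disagree push the averaged distribution toward uniform and the entropy toward its maximum, which is the signal a screening workflow could use to flag a photo for human review.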

[106] Consistent Assistant Domains Transformer for Source-free Domain Adaptation

Renrong Shao,Wei Zhang,Kangyang Luo,Qin Li,Jun Wang

Main category: cs.CV

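TL;DR: This paper proposes CADTrans, a source-free domain adaptation method that improves feature alignment and classification performance on the target domain by constructing a consistent assistant domain and applying a multi-kernel maximum mean discrepancy strategy.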

Details

Motivation: Because source-domain data is inaccessible, existing methods struggle to obtain invariant features and are vulnerable to hard samples and domain shift, so a method is needed that builds stable invariant feature representations without source data. Method: Proposes the CADTrans model, which introduces an assistant domain module to extract diversified representations from intermediate global attention and applies multiple consistency strategies to obtain invariant features; a conditional multi-kernel maximum mean discrepancy (CMK-MMD) strategy aligns the hard samples. Result: Achieves significant performance gains on several benchmarks, including Office-31, Office-Home, VISDA-C, and DomainNet-126, validating the method. Conclusion: By constructing a consistent assistant domain and an improved alignment strategy, CADTrans improves domain adaptation performance without relying on source-domain data and, in particular, handles hard samples better. Abstract: Source-free domain adaptation (SFDA) aims to address the challenge of adapting to a target domain without accessing the source domain directly. However, due to the inaccessibility of source domain data, deterministic invariable features cannot be obtained. Current mainstream methods primarily focus on evaluating invariant features in the target domain that closely resemble those in the source domain, subsequently aligning the target domain with the source domain. However, these methods are susceptible to hard samples and influenced by domain bias. In this paper, we propose a Consistent Assistant Domains Transformer for SFDA, abbreviated as CADTrans, which solves the issue by constructing invariable feature representations of domain consistency. Concretely, we develop an assistant domain module for CADTrans to obtain diversified representations from the intermediate aggregated global attentions, which addresses the limitation of existing methods in adequately representing diversity. Based on assistant and target domains, invariable feature representations are obtained by multiple consistent strategies, which can be used to distinguish easy and hard samples. Finally, to align the hard samples to the corresponding easy samples, we construct a conditional multi-kernel max mean discrepancy (CMK-MMD) strategy to distinguish between samples of the same category and those of different categories. Extensive experiments are conducted on various benchmarks such as Office-31, Office-Home, VISDA-C, and DomainNet-126, proving the significant performance improvements achieved by our proposed approaches. Code is available at https://github.com/RoryShao/CADTrans.git.
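
For readers unfamiliar with the discrepancy measure named in the abstract, the following is a minimal sketch of a plain (unconditional) multi-kernel MMD between two feature sets. It is not the paper's CMK-MMD: the Gaussian kernel family, the bandwidths, and the biased estimator are all assumptions chosen for illustration. A conditional variant would presumably apply the same computation separately per predicted category.

```python
import math


def multi_kernel(x, y, bandwidths=(0.5, 1.0, 2.0)):
    """Average of Gaussian kernels with several bandwidths (assumed values)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(math.exp(-sq_dist / (2.0 * s * s)) for s in bandwidths) / len(bandwidths)


def mk_mmd2(xs, ys, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of the squared MMD between sample sets xs and ys."""

    def mean_kernel(group_a, group_b):
        total = sum(multi_kernel(a, b, bandwidths) for a in group_a for b in group_b)
        return total / (len(group_a) * len(group_b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy 2-D features standing in for "easy" and "hard" samples (hypothetical data).
easy = [[0.0, 0.0], [1.0, 0.0]]
hard_near = [[0.1, 0.0], [1.1, 0.0]]
hard_far = [[5.0, 5.0], [6.0, 5.0]]
```

The estimate is zero when both sets coincide and grows as the sets drift apart, which is why minimizing it pulls hard samples toward the easy ones they are paired with.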

[107] Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations

Ricardo Gonzalez Penuela,Felipe Arias-Russi,Victor Capriles

Main category: cs.CV

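TL;DR: This paper proposes a context-aware system, built on historical questions from blind users, that guides multimodal large language models to generate more relevant image descriptions; experiments show the approach improves information relevance and user preference.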

Details

Motivation: When describing images for blind and low vision (BLV) users, existing multimodal large language models usually produce lengthy content that is not tailored to context, which makes obtaining information inefficient. Method: Uses past questions from BLV users in the VizWiz-LF dataset to identify visual contexts similar to the input image, and uses those questions to guide the MLLM toward more contextually relevant descriptions. Result: In an evaluation on 92 samples, context-aware descriptions anticipated and answered users' questions in 76.1% of cases (70/92) and were preferred by labelers in 54.4% of comparisons (50/92). Conclusion: Using BLV users' historical questions as contextual guidance can significantly improve the relevance and usefulness of MLLM-generated descriptions, improving the user experience of visual assistance systems. Abstract: Multimodal large language models (MLLMs) have been integrated into visual interpretation applications to support Blind and Low Vision (BLV) users because of their accuracy and ability to provide rich, human-like interpretations. However, these applications often default to comprehensive, lengthy descriptions regardless of context. This leads to inefficient exchanges, as users must go through irrelevant details rather than receiving the specific information they are likely to seek. To deliver more contextually relevant information, we developed a system that draws on historical BLV users' questions. When given an image, our system identifies similar past visual contexts from the VizWiz-LF dataset and uses the associated questions to guide the MLLM to generate descriptions more relevant to BLV users. An evaluation with three human labelers who revised 92 context-aware and context-free descriptions showed that context-aware descriptions anticipated and answered users' questions in 76.1% of cases (70 out of 92) and were preferred in 54.4% of comparisons (50 out of 92). Our paper reviews and data analysis are publicly available in a GitHub repository at https://github.com/rgonzalezp/guiding-multimodal-large-language-models-with-blind-and-low-vision-people-visual-questions .
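
The retrieval step described in the abstract (find similar past visual contexts, then reuse their questions as guidance) can be sketched as follows. This is an illustrative sketch only: the embedding vectors, the cosine-similarity ranking, the `k` cutoff, and the prompt wording are all hypothetical, since the abstract does not specify how similarity is computed or how the prompt is built.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_questions(query_embedding, history, k=2):
    """Return the questions attached to the k most similar past images."""
    ranked = sorted(
        history,
        key=lambda item: cosine(query_embedding, item["embedding"]),
        reverse=True,
    )
    return [item["question"] for item in ranked[:k]]


def build_prompt(questions):
    """Assemble a hypothetical guidance prompt for the description model."""
    bullet_list = "\n".join(f"- {q}" for q in questions)
    return "Describe the image, prioritizing answers to questions like:\n" + bullet_list


# Hypothetical history entries; real embeddings would come from an image encoder.
history = [
    {"embedding": [1.0, 0.0, 0.0], "question": "What is the expiration date?"},
    {"embedding": [0.9, 0.1, 0.0], "question": "What flavor is this?"},
    {"embedding": [0.0, 0.0, 1.0], "question": "What color is this shirt?"},
]
top_questions = retrieve_questions([1.0, 0.05, 0.0], history, k=2)
prompt = build_prompt(top_questions)
```

Here a query that resembles packaged-food photos pulls in the two food-related questions and leaves the clothing question out, so the generated description can lead with the details such users have asked about before.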

[108] ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models

Krishna Teja Chitty-Venkata,Murali Emani

Main category: cs.CV

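TL;DR: This paper presents ImageNet-Think, a multimodal reasoning dataset for improving the explicit reasoning ability of vision language models (VLMs). Built on 250,000 ImageNet21k images, it contains structured thinking tokens and answers generated by two state-of-the-art VLMs, with two thinking-answer sequences per image, and aims to support more robust VLMs and a better understanding of multimodal reasoning mechanisms.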

Details

Motivation: Advancing vision language models with explicit reasoning capabilities, and understanding their multimodal reasoning mechanisms, requires high-quality, structured reasoning datasets. Method: Starting from 250,000 ImageNet21k images, two state-of-the-art VLMs, GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506, generate structured thinking tokens and corresponding answers, so each image comes with two thinking-answer sequences. Result: Produces ImageNet-Think, a multimodal reasoning dataset containing structured thinking processes and final descriptive answers, usable for training and evaluating reasoning-capable VLMs. Conclusion: ImageNet-Think helps develop more robust VLMs and advances the understanding of multimodal reasoning mechanisms; the dataset and evaluation benchmarks will be released publicly to support related research. Abstract: We develop ImageNet-Think, a multimodal reasoning dataset designed to aid the development of Vision Language Models (VLMs) with explicit reasoning capabilities. Our dataset is built on 250,000 images from the ImageNet21k dataset, providing structured thinking tokens and corresponding answers. Our synthetic dataset is generated by two state-of-the-art VLMs: GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506. Each image is accompanied by two pairs of thinking-answer sequences, creating a resource for training and evaluating multimodal reasoning models. We capture the step-by-step reasoning process of VLMs and the final descriptive answers. Our goal with this dataset is to enable the development of more robust VLMs while contributing to the broader understanding of multimodal reasoning mechanisms. The dataset and evaluation benchmarks will be publicly available to aid research in reasoning/thinking multimodal VLMs.

[109] NPN: Non-Linear Projections of the Null-Space for Imaging Inverse Problems

Roman Jacome,Romario Gualdrón-Hurtado,Leon Suarez,Henry Arguello

Main category: cs.CV

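TL;DR: Proposes Non-Linear Projections of the Null-Space (NPN), a new regularization approach that uses a neural network to constrain solutions to a low-dimensional projection of the sensing matrix's null-space, improving reconstruction accuracy across imaging inverse problems.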

Details

Motivation: Conventional priors usually ignore the task-specific structure of the null-space, which limits reconstruction performance; this work aims to exploit the structure of the null-space itself to design more effective regularization. Method: Proposes NPN, which uses a neural network to learn a low-dimensional representation of the sensing matrix's null-space and incorporates it as a regularization term in reconstruction; it applies to plug-and-play methods, unrolling networks, and other frameworks. Result: Theoretical analysis establishes convergence and reconstruction accuracy, and experiments show that NPN significantly improves reconstruction quality on compressive sensing, deblurring, super-resolution, CT, and MRI. Conclusion: By exploiting null-space structure, NPN offers an interpretable and flexible regularization strategy that strengthens reconstruction across many imaging inverse problems. Abstract: Imaging inverse problems aim to recover high-dimensional signals from undersampled, noisy measurements, a fundamentally ill-posed task with infinitely many solutions in the null-space of the sensing operator. To resolve this ambiguity, prior information is typically incorporated through handcrafted regularizers or learned models that constrain the solution space. However, these priors typically ignore the task-specific structure of that null-space. In this work, we propose Non-Linear Projections of the Null-Space (NPN), a novel class of regularization that, instead of enforcing structural constraints in the image domain, promotes solutions that lie in a low-dimensional projection of the sensing matrix's null-space with a neural network. Our approach has two key advantages: (1) Interpretability: by focusing on the structure of the null-space, we design sensing-matrix-specific priors that capture information orthogonal to the signal components that are fundamentally blind to the sensing process. (2) Flexibility: NPN is adaptable to various inverse problems, compatible with existing reconstruction frameworks, and complementary to conventional image-domain priors. We provide theoretical guarantees on convergence and reconstruction accuracy when used within plug-and-play methods. Empirical results across diverse sensing matrices demonstrate that NPN priors consistently enhance reconstruction fidelity in various imaging inverse problems, such as compressive sensing, deblurring, super-resolution, computed tomography, and magnetic resonance imaging, with plug-and-play methods, unrolling networks, deep image prior, and diffusion models.
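
The ambiguity that motivates NPN can be made concrete with a small linear sketch: any signal splits into a part the sensing matrix can see and a null-space part it cannot. The code below only illustrates this classical decomposition, using Gram-Schmidt on the rows of a toy matrix. It is not the paper's non-linear, learned projection, and the matrix and signal are made-up values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_rows(A, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the row space of A."""
    basis = []
    for row in A:
        v = list(row)
        for q in basis:
            c = dot(v, q)
            v = [a - c * b for a, b in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > tol:  # drop rows that are linearly dependent
            basis.append([a / norm for a in v])
    return basis


def null_space_component(A, x):
    """Part of x that the sensing matrix A maps to zero."""
    v = list(x)
    for q in orthonormal_rows(A):
        c = dot(v, q)
        v = [a - c * b for a, b in zip(v, q)]
    return v


def apply(A, x):
    """Measurements y = A x."""
    return [dot(row, x) for row in A]


A = [[1.0, 1.0, 0.0]]          # toy sensing matrix: one measurement, three unknowns
x = [1.0, 0.0, 2.0]            # toy signal
x_null = null_space_component(A, x)
x_range = [a - b for a, b in zip(x, x_null)]
```

The measurements of `x` and of `x_range` are identical, so the data alone cannot determine `x_null`. That invisible component is what a null-space prior has to supply.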

[110] Automated Genomic Interpretation via Concept Bottleneck Models for Medical Robotics

Zijun Li,Jinchang Zhang,Ming Zhang,Guoyu Lu

Main category: cs.CV

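TL;DR: Proposes an automated genomic interpretation module that combines Chaos Game Representation (CGR) with a Concept Bottleneck Model (CBM), achieving interpretable HIV subtype classification through biologically meaningful concepts, and integrates a decision recommendation layer to optimize clinical utility and cost-effectiveness.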

Details

Motivation: To improve the interpretability and practicality of genomic data in medical automation and robotic systems, addressing the lack of biological interpretability and of clinical decision integration in conventional models. Method: Uses CGR to convert DNA sequences into image representations and combines it with a CBM built on biological concepts such as GC content, CpG density, and k-mer motifs; improves reliability through concept fidelity supervision, prior consistency alignment, KL distribution matching, and uncertainty calibration; finally, a cost-aware recommendation layer generates decision policies. Result: Achieves state-of-the-art HIV subtype classification on in-house and LANL datasets, significantly outperforming baselines, with stronger concept prediction fidelity, better calibration, and a more favorable cost-benefit trade-off. Conclusion: The framework connects interpretable genomic modeling with automated decision-making and provides a reliable foundation for robotic and clinical automation in genomic medicine. Abstract: We propose an automated genomic interpretation module that transforms raw DNA sequences into actionable, interpretable decisions suitable for integration into medical automation and robotic systems. Our framework combines Chaos Game Representation (CGR) with a Concept Bottleneck Model (CBM), enforcing predictions to flow through biologically meaningful concepts such as GC content, CpG density, and k-mer motifs. To enhance reliability, we incorporate concept fidelity supervision, prior consistency alignment, KL distribution matching, and uncertainty calibration. Beyond accurate classification of HIV subtypes across both in-house and LANL datasets, our module delivers interpretable evidence that can be directly validated against biological priors. A cost-aware recommendation layer further translates predictive outputs into decision policies that balance accuracy, calibration, and clinical utility, reducing unnecessary retests and improving efficiency. Extensive experiments demonstrate that the proposed system achieves state-of-the-art classification performance, superior concept prediction fidelity, and more favorable cost-benefit trade-offs compared to existing baselines. By bridging the gap between interpretable genomic modeling and automated decision-making, this work establishes a reliable foundation for robotic and clinical automation in genomic medicine.
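
As background for the representation named in the abstract, here is a minimal sketch of a Chaos Game Representation together with one of the listed concepts, GC content. The corner assignment, the starting point, and the grid resolution follow a common textbook convention and are assumptions here; the paper's exact settings are not given in the abstract.

```python
# Assumed corner layout (one common convention); other layouts are equally valid.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Walk halfway toward each base's corner, starting from the square's center."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:  # skip ambiguous symbols such as N
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def cgr_image(seq, k=2):
    """Count CGR points on a 2^k x 2^k grid, giving an image-like matrix."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases."""
    bases = [b for b in seq.upper() if b in CORNERS]
    return sum(b in "GC" for b in bases) / len(bases) if bases else 0.0


example = "GATTACA"  # toy sequence, not real HIV data
image = cgr_image(example, k=2)
```

Because each point's position encodes the suffix of bases that led to it, the grid counts act as a k-mer frequency map, which is what lets an image model operate on sequence data.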

[111] VLA-R1: Enhancing Reasoning in Vision-Language-Action Models

Angen Ye,Zeyu Zhang,Boyuan Wang,Xiaofeng Wang,Dapeng Zhang,Zheng Zhu

Main category: cs.CV

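TL;DR: This paper proposes VLA-R1, a reasoning-enhanced vision-language-action (VLA) model that introduces Reinforcement Learning from Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO), achieving better generalization and real-world performance than existing methods across multiple scenarios.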

Details

Motivation: Existing VLA models lack explicit step-by-step reasoning, and their training pipelines do little to reinforce reasoning quality, which makes it hard to meet the demands of complex tasks for plausible actions and geometric constraints. Method: Proposes VLA-R1, which combines RLVR with GRPO for post-training optimization; designs verifiable rewards for region alignment, trajectory consistency, and output formatting; and builds the high-quality dataset VLA-CoT-13K to provide chain-of-thought supervision. Result: In extensive evaluations on in-domain, out-of-domain, simulation, and real-robot platforms, VLA-R1 shows superior reasoning robustness and execution accuracy and markedly better cross-scene generalization. Conclusion: By introducing reinforcement learning with verifiable rewards and high-quality reasoning data, VLA-R1 strengthens the reasoning and execution abilities of VLA models and offers a more reliable solution for embodied AI. Abstract: Vision-Language-Action (VLA) models aim to unify perception, language understanding, and action generation, offering strong cross-task and cross-scene generalization with broad impact on embodied AI. However, current VLA models often lack explicit step-by-step reasoning, instead emitting final actions without considering affordance constraints or geometric relations. Their post-training pipelines also rarely reinforce reasoning quality, relying primarily on supervised fine-tuning with weak reward design. To address these challenges, we present VLA-R1, a reasoning-enhanced VLA that integrates Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO) to systematically optimize both reasoning and execution. Specifically, we design an RLVR-based post-training strategy with verifiable rewards for region alignment, trajectory consistency, and output formatting, thereby strengthening reasoning robustness and execution accuracy. Moreover, we develop VLA-CoT-13K, a high-quality dataset that provides chain-of-thought supervision explicitly aligned with affordance and trajectory annotations. Furthermore, extensive evaluations on in-domain, out-of-domain, simulation, and real-robot platforms demonstrate that VLA-R1 achieves superior generalization and real-world performance compared to prior VLA methods. We plan to release the model, code, and dataset following the publication of this work. Code: https://github.com/GigaAI-research/VLA-R1. Website: https://gigaai-research.github.io/VLA-R1.
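
The core idea behind GRPO, scoring each sampled response relative to the other responses for the same prompt, can be sketched in a few lines. The reward combination below is hypothetical: the abstract names the three reward targets (region alignment, trajectory consistency, output formatting) but not their formulas or weights, so `combined_reward` and its weights are placeholders.

```python
import statistics


def combined_reward(format_ok, region_score, trajectory_score, weights=(0.2, 0.4, 0.4)):
    """Hypothetical weighted sum of three verifiable reward terms in [0, 1]."""
    w_format, w_region, w_traj = weights
    return w_format * float(format_ok) + w_region * region_score + w_traj * trajectory_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical rollouts for one instruction (illustrative scores only).
rewards = [
    combined_reward(True, 1.0, 1.0),
    combined_reward(False, 0.0, 0.0),
    combined_reward(True, 0.5, 0.25),
    combined_reward(True, 0.25, 0.5),
]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group's average get a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.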

[112] Joint Deblurring and 3D Reconstruction for Macrophotography

Yifan Zhao,Liangchen Li,Yuqi Zhou,Kai Wang,Yan Liang,Juyong Zhang

Main category: cs.CV

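TL;DR: Proposes a joint deblurring and 3D reconstruction method for macrophotography that uses differentiable rendering for self-supervised optimization, achieving high-quality deblurring and high-fidelity 3D reconstruction from only a small number of multi-view blurry images.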

Details

Motivation: Defocus blur in macrophotography seriously degrades image sharpness and 3D reconstruction quality; conventional deblurring methods depend on large numbers of images and annotations, and no multi-view 3D reconstruction method exists for macrophotography. Method: From multi-view blurry images, jointly optimizes a sharp 3D model of the object and the defocus blur kernel of each pixel, using differentiable rendering for self-supervised optimization. Result: Experiments show that from only a small number of multi-view images the method achieves high-quality image deblurring and high-fidelity 3D appearance reconstruction. Conclusion: The method effectively addresses defocus blur in macrophotography, delivers strong deblurring and 3D reconstruction from few inputs, and advances macro image processing and 3D reconstruction. Abstract: Macro lenses offer high resolution and large magnification, and 3D modeling of small and detailed objects can provide richer information. However, defocus blur in macrophotography is a long-standing problem that heavily hinders the clear imaging of the captured objects and high-quality 3D reconstruction of them. Traditional image deblurring methods require a large number of images and annotations, and there is currently no multi-view 3D reconstruction method for macrophotography. In this work, we propose a joint deblurring and 3D reconstruction method for macrophotography. Starting from captured multi-view blurry images, we jointly optimize the clear 3D model of the object and the defocus blur kernel of each pixel. The entire framework adopts a differentiable rendering method to self-supervise the optimization of the 3D model and the defocus blur kernel. Extensive experiments show that from a small number of multi-view images, our proposed method can not only achieve high-quality image deblurring but also recover high-fidelity 3D appearance.

[113] FideDiff: Efficient Diffusion Model for High-Fidelity Image Motion Deblurring

Xiaoyang Liu,Zhengyan Zhou,Zihang Xu,Jiezhang Cao,Zheng Chen,Yulun Zhang

Main category: cs.CV

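TL;DR: This paper proposes FideDiff, a novel single-step diffusion model for high-fidelity image deblurring; by reconstructing training data with matched blur trajectories and introducing Kernel ControlNet and adaptive timestep prediction, it achieves high-quality, fast deblurring.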

Details

Motivation: Although diffusion-based methods show strong generative ability in image restoration, long inference time and insufficient fidelity limit their use, so an efficient, high-fidelity deblurring method is needed. Method: Models motion deblurring as a diffusion process in which each timestep represents a progressively blurred image; trains a consistency model that aligns all timesteps to the same clean image, combined with Kernel ControlNet for blur kernel estimation and adaptive timestep prediction. Result: FideDiff outperforms previous diffusion methods on full-reference metrics and matches other state-of-the-art models while enabling fast single-step deblurring. Conclusion: FideDiff offers a new direction for applying pre-trained diffusion models to high-fidelity image restoration and establishes a robust baseline for real-world industrial applications. Abstract: Recent advancements in image motion deblurring, driven by CNNs and transformers, have made significant progress. Large-scale pre-trained diffusion models, which are rich in true-world modeling, have shown great promise for high-quality image restoration tasks such as deblurring, demonstrating stronger generative capabilities than CNN and transformer-based methods. However, challenges such as unbearable inference time and compromised fidelity still limit the full potential of the diffusion models. To address this, we introduce FideDiff, a novel single-step diffusion model designed for high-fidelity deblurring. We reformulate motion deblurring as a diffusion-like process where each timestep represents a progressively blurred image, and we train a consistency model that aligns all timesteps to the same clean image. By reconstructing training data with matched blur trajectories, the model learns temporal consistency, enabling accurate one-step deblurring. We further enhance model performance by integrating Kernel ControlNet for blur kernel estimation and introducing adaptive timestep prediction. Our model achieves superior performance on full-reference metrics, surpassing previous diffusion-based methods and matching the performance of other state-of-the-art models. FideDiff offers a new direction for applying pre-trained diffusion models to high-fidelity image restoration tasks, establishing a robust baseline for further advancing diffusion models in real-world industrial applications. Our dataset and code will be available at https://github.com/xyLiu339/FideDiff.

[114] LadderMoE: Ladder-Side Mixture of Experts Adapters for Bronze Inscription Recognition

Rixin Zhou,Peiqiang Qiu,Qian Zhang,Chuntao Li,Xi Yang

Main category: cs.CV

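TL;DR: Proposes a two-stage detection-recognition pipeline enhanced with LadderMoE for cross-domain, long-tailed recognition of bronze inscriptions, significantly outperforming existing methods.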

Details

Motivation: Automatic recognition of bronze inscriptions faces severe visual degradation, multi-domain variability, and an extremely long-tailed character distribution, all of which existing methods handle poorly. Method: Builds a large-scale dataset of 22454 full-page images and 198598 annotated characters, and proposes a two-stage detection-recognition pipeline in which LadderMoE modules augment a pretrained CLIP encoder, enabling dynamic expert specialization that improves cross-domain and rare-class recognition. Result: Significantly outperforms state-of-the-art scene text recognition baselines on single-character and full-page recognition, with better accuracy across head, mid, and tail categories and all acquisition modalities. Conclusion: The method establishes a solid foundation for bronze inscription recognition and downstream archaeological analysis. Abstract: Bronze inscriptions (BI), engraved on ritual vessels, constitute a crucial stage of early Chinese writing and provide indispensable evidence for archaeological and historical studies. However, automatic BI recognition remains difficult due to severe visual degradation, multi-domain variability across photographs, rubbings, and tracings, and an extremely long-tailed character distribution. To address these challenges, we curate a large-scale BI dataset comprising 22454 full-page images and 198598 annotated characters spanning 6658 unique categories, enabling robust cross-domain evaluation. Building on this resource, we develop a two-stage detection-recognition pipeline that first localizes inscriptions and then transcribes individual characters. To handle heterogeneous domains and rare classes, we equip the pipeline with LadderMoE, which augments a pretrained CLIP encoder with ladder-style MoE adapters, enabling dynamic expert specialization and stronger robustness. Comprehensive experiments on single-character and full-page recognition tasks demonstrate that our method substantially outperforms state-of-the-art scene text recognition baselines, achieving superior accuracy across head, mid, and tail categories as well as all acquisition modalities. These results establish a strong foundation for bronze inscription recognition and downstream archaeological analysis.

[115] VirDA: Reusing Backbone for Unsupervised Domain Adaptation with Visual Reprogramming

Duy Nguyen,Dat Nguyen

Main category: cs.CV

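TL;DR: Proposes VirDA, which performs visual prompting by adding a domain-specific visual reprogramming layer in front of the backbone, achieving efficient unsupervised domain adaptation that greatly reduces the number of trainable parameters while maintaining high performance.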

Details

Motivation: Existing UDA methods fine-tune the entire backbone for every new source-target domain pair, so parameter and storage costs grow linearly and trained backbone parameters cannot be reused. Inspired by the texture bias of backbones, the authors aim to exploit domain-specific texture bias for more efficient domain adaptation. Method: Proposes VirDA, which leaves the backbone untouched and prepends a domain-specific visual reprogramming layer that produces visual prompts adjusting the style of the input image; the layer is trained with multiple objective functions that optimize intra- and inter-domain distribution differences. Result: Reaches 92.8% mean accuracy on Office-31 with only 1.5M trainable parameters; improves on PDA by 1.6% accuracy with just 46% of its parameters; outperforms the full fine-tuning methods CDTrans and FixBi by 0.2% and 1.4% while needing only 1.7% and 2.8% of their parameters; relative to the strongest methods, PMTrans and TVT, it uses about 1.7% of their parameters at a cost of 2.2% and 1.1% accuracy, respectively. Conclusion: Through visual reprogramming, VirDA exploits domain-specific texture bias to deliver efficient, reusable unsupervised domain adaptation that stays competitive while sharply reducing the parameter count. Abstract: Existing UDA pipelines fine-tune already well-trained backbone parameters for every new source-and-target pair, resulting in the number of training parameters and storage memory growing linearly with each new pair, and also preventing the reuse of these well-trained backbone parameters. Inspired by recent implications that existing backbones have textural biases, we propose making use of domain-specific textural bias for domain adaptation via visual reprogramming, namely VirDA. Instead of fine-tuning the full backbone, VirDA prepends a domain-specific visual reprogramming layer to the backbone. This layer produces visual prompts that act as an added textural bias to the input image, adapting its "style" to a target domain. To optimize these visual reprogramming layers, we use multiple objective functions that optimize the intra- and inter-domain distribution differences when domain-adapting visual prompts are applied. This process does not require modifying the backbone parameters, allowing the same backbone to be reused across different domains. We evaluate VirDA on Office-31 and obtain 92.8% mean accuracy with only 1.5M trainable parameters. VirDA surpasses PDA, the state-of-the-art parameter-efficient UDA baseline, by +1.6% accuracy while using just 46% of its parameters. Compared with full-backbone fine-tuning, VirDA outperforms CDTrans and FixBi by +0.2% and +1.4%, respectively, while requiring only 1.7% and 2.8% of their trainable parameters. Relative to the strongest current methods (PMTrans and TVT), VirDA uses ~1.7% of their parameters and trades off only 2.2% and 1.1% accuracy, respectively.

[116] Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery

Minh Tran,Maksim Siniukov,Zhangyu Jin,Mohammad Soleymani

Main category: cs.CV

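TL;DR: This paper proposes Discrete Facial Encoding (DFE), an unsupervised, data-driven method that learns a compact and interpretable dictionary of facial expressions from 3D mesh sequences using a Residual Vector Quantized Variational Autoencoder (RVQ-VAE), and that outperforms FACS and other existing methods on psychological computing tasks.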

Details

Motivation: Existing facial expression coding systems such as FACS have limited coverage and depend on costly manual annotation, which makes them hard to apply at scale. Method: First extracts identity-invariant expression features with a 3D Morphable Model (3DMM), separating out factors such as head pose and facial geometry; then encodes these features with an RVQ-VAE, producing a sequence of discrete tokens from a shared codebook in which each token represents a reusable facial deformation pattern. Result: Experiments show that DFE captures facial behavior more precisely than FACS and other facial encoding methods; on three psychological tasks (stress detection, personality prediction, and depression detection), a Bag-of-Words model built on DFE outperforms FACS and strong baselines such as Masked Autoencoders. Conclusion: DFE provides a scalable and effective alternative to FACS with broad potential in psychological and affective computing. Abstract: Facial expression analysis is central to understanding human behavior, yet existing coding systems such as the Facial Action Coding System (FACS) are constrained by limited coverage and costly manual annotation. In this work, we introduce Discrete Facial Encoding (DFE), an unsupervised, data-driven alternative: a compact and interpretable dictionary of facial expressions learned from 3D mesh sequences through a Residual Vector Quantized Variational Autoencoder (RVQ-VAE). Our approach first extracts identity-invariant expression features from images using a 3D Morphable Model (3DMM), effectively disentangling factors such as head pose and facial geometry. We then encode these features using an RVQ-VAE, producing a sequence of discrete tokens from a shared codebook, where each token captures a specific, reusable facial deformation pattern that contributes to the overall expression. Through extensive experiments, we demonstrate that Discrete Facial Encoding captures more precise facial behaviors than FACS and other facial encoding alternatives. We evaluate the utility of our representation across three high-level psychological tasks: stress detection, personality prediction, and depression detection. Using a simple Bag-of-Words model built on top of the learned tokens, our system consistently outperforms both FACS-based pipelines and strong image and video representation learning models such as Masked Autoencoders. Further analysis reveals that our representation covers a wider variety of facial displays, highlighting its potential as a scalable and effective alternative to FACS for psychological and affective computing applications.
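
The quantization step at the heart of an RVQ-VAE can be sketched without any learning: each stage picks the codebook entry closest to what is still unexplained, and the chosen indices form the token sequence. The sketch below assumes a single codebook reused at every stage, which is one reading of the abstract's "shared codebook". The codebook entries and the input vector are made-up values, and the encoder and decoder networks are omitted.

```python
from collections import Counter


def nearest(codebook, v):
    """Index of the codebook entry with the smallest squared distance to v."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)),
    )


def rvq_encode(x, codebook, n_stages=3):
    """Greedy residual quantization: returns (tokens, final_residual)."""
    residual = list(x)
    tokens = []
    for _ in range(n_stages):
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


def rvq_decode(tokens, codebook):
    """Sum the selected codebook entries to reconstruct the vector."""
    out = [0.0] * len(codebook[0])
    for idx in tokens:
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebook mixing coarse and fine entries (hypothetical values).
codebook = [[1.0, 0.0], [0.0, 1.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.0]]
feature = [1.2, 0.3]
tokens, residual = rvq_encode(feature, codebook, n_stages=3)
reconstruction = rvq_decode(tokens, codebook)
token_histogram = Counter(tokens)  # Bag-of-Words view over the tokens
```

Each later stage corrects the error left by the earlier ones, so a short token sequence can describe a vector far more finely than a single lookup could, and counting tokens across frames yields the kind of Bag-of-Words feature the abstract mentions.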

[117] Non-Rigid Structure-from-Motion via Differential Geometry with Recoverable Conformal Scale

Yongbo Chen,Yanhao Zhang,Shaifali Parashar,Liang Zhao,Shoudong Huang

Main category: cs.CV

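TL;DR: Proposes Con-NRSfM, a new method for non-rigid structure-from-motion under conformal deformations, which performs point-wise reconstruction within a graph-based optimization framework and decouples the depth and conformal scale constraints, improving reconstruction accuracy and robustness.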

Details

Motivation: Existing NRSfM methods rely on strong assumptions (such as locally planar surfaces or locally linear deformations) and cannot recover the conformal scale, which limits reconstruction accuracy. Method: Uses a graph-based optimization framework that performs point-wise reconstruction from 2D image warps, decouples the depth and conformal scale constraints, and adds a self-supervised learning network to generate dense textured 3D point clouds. Result: Experiments on synthetic and real data show that the method surpasses existing approaches in reconstruction accuracy and robustness. Conclusion: Con-NRSfM effectively addresses NRSfM under conformal deformations, removes the restrictions of conventional assumptions, and yields more precise depth and scale estimates. Abstract: Non-rigid structure-from-motion (NRSfM), a promising technique for addressing the mapping challenges in monocular visual deformable simultaneous localization and mapping (SLAM), has attracted growing attention. We introduce a novel method, called Con-NRSfM, for NRSfM under conformal deformations, encompassing isometric deformations as a subset. Our approach performs point-wise reconstruction using 2D selected image warps optimized through a graph-based framework. Unlike existing methods that rely on strict assumptions, such as locally planar surfaces or locally linear deformations, and fail to recover the conformal scale, our method eliminates these constraints and accurately computes the local conformal scale. Additionally, our framework decouples constraints on depth and conformal scale, which are inseparable in other approaches, enabling more precise depth estimation. To address the sensitivity of the formulated problem, we employ a parallel separable iterative optimization strategy. Furthermore, a self-supervised learning framework, utilizing an encoder-decoder network, is incorporated to generate dense 3D point clouds with texture. Simulation and experimental results using both synthetic and real datasets demonstrate that our method surpasses existing approaches in terms of reconstruction accuracy and robustness. The code for the proposed method will be made publicly available on the project website: https://sites.google.com/view/con-nrsfm.

[118] UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction

Jin Cao,Hongrui Wu,Ziyong Feng,Hujun Bao,Xiaowei Zhou,Sida Peng

Main category: cs.CV

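TL;DR: This paper proposes UniVerse, a unified framework for robust 3D scene reconstruction from inconsistent multi-view images. It uses a video diffusion model to convert inconsistent images into consistent ones before 3D reconstruction, and shows strong generalization and performance.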

Details

Motivation: Existing methods depend on dense observations and struggle with diverse image inconsistencies, so a more robust and general reconstruction method is needed. Method: Decouples robust reconstruction into two subtasks, restoration and reconstruction; uses a video diffusion model to learn a general scene prior from large-scale data; converts inconsistent images into initial videos, restores them into consistent images, and then performs 3D reconstruction. Result: Experiments on synthetic and real-world datasets show superior performance and strong generalization in robust reconstruction, along with the ability to control the style of the reconstructed 3D scene. Conclusion: By introducing a unified diffusion-based framework, UniVerse addresses 3D reconstruction from inconsistent multi-view images and improves robustness and applicability under sparse observations. Abstract: This paper tackles the challenge of robust reconstruction, i.e., the task of reconstructing a 3D scene from a set of inconsistent multi-view images. Some recent works have attempted to simultaneously remove image inconsistencies and perform reconstruction by integrating image degradation modeling into neural 3D scene representations. However, these methods rely heavily on dense observations for robustly optimizing model parameters. To address this issue, we propose to decouple robust reconstruction into two subtasks: restoration and reconstruction, which naturally simplifies the optimization process. To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images, and finally reconstructs the 3D scenes from these restored images. Compared with case-by-case per-view degradation modeling, the diffusion model learns a general scene prior from large-scale data, making it applicable to diverse image inconsistencies. Extensive experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction. Moreover, UniVerse can control the style of the reconstructed 3D scene. Project page: https://jin-cao-tma.github.io/UniVerse.github.io/

[119] An Efficient Deep Template Matching and In-Plane Pose Estimation Method via Template-Aware Dynamic Convolution

Ke Jia,Ji Zhou,Hanxin Li,Zhigan Zhou,Haojie Chu,Xiaojie Li

Main category: cs.CV

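TL;DR: Proposes a lightweight end-to-end template matching framework that recasts matching as joint localization and geometric regression, outputting the target's center coordinates, rotation angle, and independent horizontal and vertical scales; template-aware dynamic convolution and a training strategy free of geometric annotations enable efficient, precise industrial inspection.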

Details

Motivation: Under compound transformations, conventional methods are inefficient because they enumerate angles and scales exhaustively, while existing deep learning methods lack explicit modeling of geometric pose and so fall short of the precision and efficiency that real industrial settings require. Method: Proposes a lightweight end-to-end network with a Template-Aware Dynamic Convolution Module (TDCM) that injects template features to improve generalization; uses depthwise separable convolutions and pixel shuffle for efficiency; designs a rotation-shear-based augmentation strategy that generates structure-aware pseudo labels, enabling training without geometric annotations; and adds a lightweight refinement module that improves angle and scale precision through local optimization. Result: With only 3.07M parameters, the model reaches 14ms inference under compound transformations while keeping high precision, and it is robust in small-template and multi-object scenarios. Conclusion: The method balances efficiency, precision, and practicality well, and is suited to deployment in real-time industrial applications. Abstract: In industrial inspection and component alignment tasks, template matching requires efficient estimation of a target's position and geometric state (rotation and scaling) under complex backgrounds to support precise downstream operations. Traditional methods rely on exhaustive enumeration of angles and scales, leading to low efficiency under compound transformations. Meanwhile, most deep learning-based approaches only estimate similarity scores without explicitly modeling geometric pose, making them inadequate for real-world deployment. To overcome these limitations, we propose a lightweight end-to-end framework that reformulates template matching as joint localization and geometric regression, outputting the center coordinates, rotation angle, and independent horizontal and vertical scales. A Template-Aware Dynamic Convolution Module (TDCM) dynamically injects template features at inference to guide generalizable matching. The compact network integrates depthwise separable convolutions and pixel shuffle for efficient matching. To enable geometric-annotation-free training, we introduce a rotation-shear-based augmentation strategy with structure-aware pseudo labels. A lightweight refinement module further improves angle and scale precision via local optimization. Experiments show our 3.07M model achieves high precision and 14ms inference under compound transformations. It also demonstrates strong robustness in small-template and multi-object scenarios, making it highly suitable for deployment in real-time industrial applications. The code is available at: https://github.com/ZhouJ6610/PoseMatch-TDCM.
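
To show what the regressed quantities (center, rotation angle, and separate horizontal and vertical scales) describe geometrically, the sketch below maps a template's four corners into the image. It assumes scaling is applied in the template frame before rotation about the center and that angles are in degrees. These conventions are assumptions, since the abstract does not fix them, and the numeric inputs are illustrative.

```python
import math


def template_corners(center, angle_deg, scale_x, scale_y, tpl_w, tpl_h):
    """Image-space corners of a template under scale, rotation, and translation."""
    cx, cy = center
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = []
    # Corner offsets as fractions of the template size, in drawing order.
    for fx, fy in [(-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)]:
        x = fx * tpl_w * scale_x  # scale first, in the template's own axes
        y = fy * tpl_h * scale_y
        # Then rotate about the origin and translate to the predicted center.
        corners.append((cx + cos_t * x - sin_t * y, cy + sin_t * x + cos_t * y))
    return corners


# Hypothetical predictions for a 4x2 template.
axis_aligned = template_corners((10.0, 10.0), 0.0, 1.0, 1.0, 4.0, 2.0)
rotated = template_corners((10.0, 10.0), 90.0, 1.0, 1.0, 4.0, 2.0)
stretched = template_corners((0.0, 0.0), 0.0, 2.0, 0.5, 4.0, 2.0)
```

Keeping the two scales independent lets the box become wider without becoming taller, something a single isotropic scale cannot express.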

[120] Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning

Xuchen Li,Xuzhao Li,Jiahui Gao,Renjie Pi,Shiyu Hu,Wentao Zhang

Main category: cs.CV

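TL;DR: Proposes an adaptive pixel reasoning framework that, through operation-aware supervised fine-tuning and rollout-guided reinforcement learning, lets vision-language models decide dynamically whether to use pixel-level operations based on query difficulty, improving performance while substantially reducing unnecessary computation.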

Details

Motivation: Existing vision-language models perform poorly on tasks that need fine-grained visual understanding, mainly because of information loss during image encoding or insufficient attention to critical regions; at the same time, adding pixel-level information often leads to overuse, causing inefficiency and distraction. Method: First applies operation-aware supervised fine-tuning to establish baseline competence in textual reasoning and visual operations, then designs a rollout-guided reinforcement learning framework, driven by feedback on the model's own responses, so the model can decide from the difficulty of the input query whether to run pixel-level operations. Result: Achieves strong results on multiple multimodal reasoning benchmarks, reaching 73.4% accuracy on HR-Bench 4K with a tool usage ratio of only 20.1%; compared with previous methods, it improves accuracy while cutting tool usage by 66.5%. Conclusion: The adaptive pixel reasoning framework balances performance and efficiency, letting vision-language models call on high-resolution visual detail only when needed and improving fine-grained visual understanding. Abstract: Vision-Language Models (VLMs) excel at many multimodal tasks, yet they frequently struggle with tasks requiring precise understanding and handling of fine-grained visual elements. This is mainly due to information loss during image encoding or insufficient attention to critical regions. Recent work has shown promise by incorporating pixel-level visual information into the reasoning process, enabling VLMs to access high-resolution visual details during their thought process. However, this pixel-level information is often overused, leading to inefficiency and distraction from irrelevant visual details. To address these challenges, we propose the first framework for adaptive pixel reasoning that dynamically determines necessary pixel-level operations based on the input query. Specifically, we first apply operation-aware supervised fine-tuning to establish baseline competence in textual reasoning and visual operations, then design a novel rollout-guided reinforcement learning framework relying on feedback of the model's own responses, which enables the VLM to determine when pixel operations should be invoked based on query difficulty. Experiments on extensive multimodal reasoning benchmarks show that our model achieves superior performance while significantly reducing unnecessary visual operations. Impressively, our model achieves 73.4% accuracy on HR-Bench 4K while maintaining a tool usage ratio of only 20.1%, improving accuracy and simultaneously reducing tool usage by 66.5% compared to the previous methods.

[121] Uncovering Overconfident Failures in CXR Models via Augmentation-Sensitivity Risk Scoring

Han-Jay Shu,Wei-Ning Chiu,Shun-Ting Chang,Meng-Ping Huang,Takeshi Tohyama,Ahram Han,Po-Chih Kuo

Main category: cs.CV

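TL;DR: Proposes an augmentation-sensitivity risk scoring (ASRS) framework that identifies error-prone chest radiographs by measuring embedding changes under clinically plausible rotations, improving the fairness and safety of medical AI.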

Details

Motivation: Deep learning models perform well at chest radiograph interpretation, but accuracy is uneven across patient subgroups; existing error detection methods struggle to catch subtle within-distribution errors, and image- and representation-level consistency methods remain underexplored. Method: Proposes the ASRS framework, which applies clinically plausible rotations of ±15°/±30°, measures the resulting shift in embedding space with the RAD-DINO encoder, and uses the sensitivity score to stratify samples into stability quartiles and flag high-risk cases. Result: Highly sensitive samples show substantially lower recall (-0.2 to -0.3) despite high AUROC and confidence; ASRS identifies cases the model is likely to get wrong and, needing no labels, can support selective prediction and clinical review. Conclusion: ASRS offers a label-free error detection method that can improve the fairness, reliability, and clinical usefulness of medical AI systems. Abstract: Deep learning models achieve strong performance in chest radiograph (CXR) interpretation, yet fairness and reliability concerns persist. Models often show uneven accuracy across patient subgroups, leading to hidden failures not reflected in aggregate metrics. Existing error detection approaches, based on confidence calibration or out-of-distribution (OOD) detection, struggle with subtle within-distribution errors, while image- and representation-level consistency-based methods remain underexplored in medical imaging. We propose an augmentation-sensitivity risk scoring (ASRS) framework to identify error-prone CXR cases. ASRS applies clinically plausible rotations (±15°/±30°) and measures embedding shifts with the RAD-DINO encoder. Sensitivity scores stratify samples into stability quartiles, where highly sensitive cases show substantially lower recall (-0.2 to -0.3) despite high AUROC and confidence. ASRS provides a label-free means for selective prediction and clinician review, improving fairness and safety in medical AI.
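
The scoring idea in the abstract (measure how far an image's embedding moves under small rotations, then bin the scores into quartiles) can be sketched as below. The cosine distance, the averaging over rotations, and the rank-based quartile rule are assumptions made for illustration, since the abstract does not give the exact formula. The embeddings are toy vectors, not RAD-DINO outputs.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the embedding direction is unchanged."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def sensitivity_score(base_embedding, rotated_embeddings):
    """Mean embedding shift across the rotated copies of one image."""
    shifts = [cosine_distance(base_embedding, e) for e in rotated_embeddings]
    return sum(shifts) / len(shifts)


def quartile_labels(scores):
    """Rank-based quartiles: 1 = most stable, 4 = most sensitive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, i in enumerate(order):
        labels[i] = min(4, rank * 4 // len(scores) + 1)
    return labels


# Toy example: one base embedding and two "rotated" embeddings.
toy_score = sensitivity_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# Hypothetical per-image scores for four radiographs.
labels = quartile_labels([0.01, 0.20, 0.05, 0.40])
```

No ground-truth label enters the computation, so the top quartile can be routed to clinician review at deployment time.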

[122] FreeViS: Training-free Video Stylization with Inconsistent References

Jiacong Xu,Yiqun Mei,Ke Zhang,Vishal M. Patel

Main category: cs.CV

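TL;DR: Proposes FreeViS, a training-free video stylization framework that integrates multiple stylized references into a pretrained image-to-video model to achieve high-fidelity, temporally consistent video stylization.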

Details

Motivation: Applying image stylization frame by frame causes temporal inconsistency and limits style richness, while training a dedicated video stylization model requires paired video data and is computationally expensive. An efficient, training-free method that preserves both style quality and temporal coherence is therefore needed. Method: FreeViS uses a pretrained image-to-video generation model together with multiple stylized reference images, constrains content layout and motion through high-frequency compensation, and adds flow-based motion cues to preserve style textures in low-saliency regions. Result: Experiments show that FreeViS outperforms recent baselines in stylization fidelity and temporal consistency and receives higher human preference scores, without any training. Conclusion: FreeViS offers a practical and economical route to high-quality video stylization, delivering rich style detail and strong temporal coherence without training. Abstract: Video stylization plays a key role in content creation, but it remains a challenging problem. Naïvely applying image stylization frame-by-frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references into a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works, without introducing flickers and stutters. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. Through extensive evaluations, FreeViS delivers higher stylization fidelity and superior temporal consistency, outperforming recent baselines and achieving strong human preference. Our training-free pipeline offers a practical and economic solution for high-quality, temporally coherent video stylization. The code and videos can be accessed via https://xujiacong.github.io/FreeViS/

[123] MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs

Jiyao Liu,Jinjie Wei,Wanying Qu,Chenglong Ma,Junzhi Ning,Yunheng Li,Ying Chen,Xinzhe Luo,Pengcheng Chen,Xin Gao,Ming Hu,Huihui Xu,Xin Wang,Shujian Gao,Dingkang Yang,Zhongying Deng,Jin Ye,Lihao Liu,Junjun He,Ningsheng Xu

Main category: cs.CV

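TL;DR: Introduces MedQ-Bench, a benchmark for language-based medical image quality assessment with multimodal large language models (MLLMs). It establishes a two-task perception-reasoning paradigm, validates a multi-dimensional judging protocol against clinicians, and shows that current MLLMs' abilities in medical image quality assessment are still unstable.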

Details

Motivation: Existing medical image quality assessment methods mostly rely on scalar score metrics, which cannot reflect the human reasoning process behind expert evaluation, and there is no systematic evaluation of models' perception and reasoning abilities. Method: MedQ-Bench comprises two tasks, MedQ-Perception (low-level perception) and MedQ-Reasoning (no-reference and comparison reasoning), spanning five imaging modalities and over forty quality attributes; a four-axis judging protocol is proposed, and agreement between human and AI judgement is validated. Result: An evaluation of 14 state-of-the-art MLLMs finds preliminary but unstable performance on the perception and reasoning tasks, with accuracy insufficient for reliable clinical use; the judging protocol is validated against radiologists. Conclusion: Current MLLMs leave substantial room for improvement in medical image quality assessment, and MedQ-Bench provides a tool and direction for advancing the field. Abstract: Medical Image Quality Assessment (IQA) serves as the first-mile safety gate for clinical AI, yet existing approaches remain constrained by scalar, score-based metrics and fail to reflect the descriptive, human-like reasoning process central to expert evaluation. To address this gap, we introduce MedQ-Bench, a comprehensive benchmark that establishes a perception-reasoning paradigm for language-based evaluation of medical image quality with Multi-modal Large Language Models (MLLMs). MedQ-Bench defines two complementary tasks: (1) MedQ-Perception, which probes low-level perceptual capability via human-curated questions on fundamental visual attributes; and (2) MedQ-Reasoning, encompassing both no-reference and comparison reasoning tasks, aligning model evaluation with human-like reasoning on image quality. The benchmark spans five imaging modalities and over forty quality attributes, totaling 2,600 perceptual queries and 708 reasoning assessments, covering diverse image sources including authentic clinical acquisitions, images with simulated degradations via physics-based reconstructions, and AI-generated images. To evaluate reasoning ability, we propose a multi-dimensional judging protocol that assesses model outputs along four complementary axes. We further conduct rigorous human-AI alignment validation by comparing LLM-based judgement with radiologists. Our evaluation of 14 state-of-the-art MLLMs demonstrates that models exhibit preliminary but unstable perceptual and reasoning skills, with insufficient accuracy for reliable clinical use. These findings highlight the need for targeted optimization of MLLMs in medical IQA. We hope that MedQ-Bench will catalyze further exploration and unlock the untapped potential of MLLMs for medical image quality evaluation.

[124] Holistic Order Prediction in Natural Scenes

Pierre Musacchio,Hyunmin Lee,Jaesik Park

Main category: cs.CV

TL;DR: Proposes InstaFormer, a network that predicts the occlusion and depth orderings of all instances in a scene from an RGB image in a single forward pass.

Details

Motivation: Existing methods rely on expensive input formats (category labels, binary segmentation masks) and incur high inference cost (a quadratic number of forward passes), which hinders instance-level geometric understanding. Method: InstaFormer jointly predicts instance occlusion and depth orderings through interactions between object queries and latent mask descriptors, in a single forward pass. Result: Comprehensive benchmarking and ablation studies support the method's efficiency and accuracy. Conclusion: InstaFormer predicts geometric relations between instances efficiently and accurately, substantially lowers input and compute costs, and advances visual models' understanding of instance-level geometry. Abstract: Even in controlled settings, understanding instance-wise geometries is a challenging task for a wide range of visual models. Although specialized systems exist, state-of-the-art methods rely on expensive input formats (category labels, binary segmentation masks) and inference costs (a quadratic number of forward passes). We mitigate these limitations by proposing InstaFormer, a network capable of holistic order prediction. That is, solely given an input RGB image, InstaFormer returns the full occlusion and depth orderings for all the instances in the scene in a single forward pass. At its core, InstaFormer relies on interactions between object queries and latent mask descriptors that semantically represent the same objects while carrying complementary information. We comprehensively benchmark and ablate our approach to highlight its effectiveness. Our code and models are open-source and available at this URL: https://github.com/SNU-VGILab/InstaOrder.

[125] PyramidStyler: Transformer-Based Neural Style Transfer with Pyramidal Positional Encoding and Reinforcement Learning

Raahul Krishna Durairaju,K. Saruladha

Main category: cs.CV

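TL;DR: Proposes PyramidStyler, a transformer framework with Pyramidal Positional Encoding (PPE) for efficient, high-quality neural style transfer, substantially lowering content and style losses with fast inference.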

Details

Motivation: Existing CNN and transformer models scale poorly to complex styles and high-resolution images and are computationally expensive, making real-time, high-quality style transfer difficult. Method: PyramidStyler uses Pyramidal Positional Encoding (PPE) for multi-scale feature modeling and adds reinforcement learning to optimize stylization dynamically, improving convergence speed and output quality. Result: Trained on COCO and WikiArt, the model reaches a content loss of 2.07 and a style loss of 0.86 after 4000 epochs with 1.39 s inference; adding reinforcement learning improves these to 2.03 (content) and 0.75 (style) at 1.40 s inference. Conclusion: PyramidStyler delivers efficient, scalable artistic style transfer that performs well on high-resolution inputs and complex styles, with potential applications in media and design. Abstract: Neural Style Transfer (NST) has evolved from Gatys et al.'s (2015) CNN-based algorithm, enabling AI-driven artistic image synthesis. However, existing CNN and transformer-based models struggle to scale efficiently to complex styles and high-resolution inputs. We introduce PyramidStyler, a transformer framework with Pyramidal Positional Encoding (PPE): a hierarchical, multi-scale encoding that captures both local details and global context while reducing computational load. We further incorporate reinforcement learning to dynamically optimize stylization, accelerating convergence. Trained on Microsoft COCO and WikiArt, PyramidStyler reduces content loss by 62.6% (to 2.07) and style loss by 57.4% (to 0.86) after 4000 epochs--achieving 1.39 s inference--and yields further improvements (content 2.03; style 0.75) with minimal speed penalty (1.40 s) when using RL. These results demonstrate real-time, high-quality artistic rendering, with broad applications in media and design.
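
The reported numbers can be cross-checked with a little arithmetic. The sketch below assumes that the quoted percentages are relative reductions from an unreported starting loss; the abstract does not state the baseline values, so the figures it implies are estimates only, and both function names are hypothetical.

```python
def implied_baseline(final_value, relative_reduction):
    """Starting value implied by a final value and a relative reduction."""
    return final_value / (1.0 - relative_reduction)

def relative_change(before, after):
    """Signed relative change from `before` to `after`."""
    return (after - before) / before

# Baselines implied by the abstract's figures (assumption: relative reductions).
content_start = implied_baseline(2.07, 0.626)  # roughly 5.53
style_start = implied_baseline(0.86, 0.574)    # roughly 2.02

# Effect of adding RL, computed from the reported values.
style_gain = relative_change(0.86, 0.75)       # about -12.8%
content_gain = relative_change(2.07, 2.03)     # about -1.9%
slowdown = relative_change(1.39, 1.40)         # about +0.7%
```

Read this way, the RL variant mostly improves the style loss, while the content loss and inference time change by only a few percent or less.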

[126] LOBE-GS: Load-Balanced and Efficient 3D Gaussian Splatting for Large-Scale Scene Reconstruction

Sheng-Hsiang Hung,Ting-Yu Yen,Wei-Fang Sun,Simon See,Shih-Hsuan Hung,Hung-Kuo Chu

Main category: cs.CV

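TL;DR: Proposes LoBE-GS, a load-balanced and efficient 3D Gaussian Splatting framework for large-scale scenes that improves training efficiency and scalability through depth-aware partitioning, optimization-based load balancing, and lightweight training techniques.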

Details

Motivation: Existing 3D Gaussian Splatting methods face high memory pressure, imbalanced partition loads, and inefficient coarse-to-fine pipelines on large, unbounded scenes, which makes efficient reconstruction difficult. Method: LoBE-GS combines a depth-aware scene partitioning method, an optimization-based strategy that balances visible Gaussians across blocks, and two lightweight techniques, visibility cropping and selective densification, to improve training efficiency and load balance. Result: On large-scale urban and outdoor datasets, LoBE-GS achieves up to 2x faster end-to-end training than state-of-the-art methods while maintaining reconstruction quality, and it scales to scenes that vanilla 3DGS cannot handle. Conclusion: LoBE-GS addresses the load imbalance and efficiency bottlenecks of large-scale 3DGS and offers a workable approach to real-time, high-fidelity reconstruction of city-scale scenes. Abstract: 3D Gaussian Splatting (3DGS) has established itself as an efficient representation for real-time, high-fidelity 3D scene reconstruction. However, scaling 3DGS to large and unbounded scenes such as city blocks remains difficult. Existing divide-and-conquer methods alleviate memory pressure by partitioning the scene into blocks, but introduce new bottlenecks: (i) partitions suffer from severe load imbalance since uniform or heuristic splits do not reflect actual computational demands, and (ii) coarse-to-fine pipelines fail to exploit the coarse stage efficiently, often reloading the entire model and incurring high overhead. In this work, we introduce LoBE-GS, a novel Load-Balanced and Efficient 3D Gaussian Splatting framework, that re-engineers the large-scale 3DGS pipeline. LoBE-GS introduces a depth-aware partitioning method that reduces preprocessing from hours to minutes, an optimization-based strategy that balances visible Gaussians -- a strong proxy for computational load -- across blocks, and two lightweight techniques, visibility cropping and selective densification, to further reduce training cost. Evaluations on large-scale urban and outdoor datasets show that LoBE-GS consistently achieves up to $2\times$ faster end-to-end training time than state-of-the-art baselines, while maintaining reconstruction quality and enabling scalability to scenes infeasible with vanilla 3DGS.
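
To illustrate why uniform splits can be imbalanced, here is a small sketch that places block boundaries by load instead of by position. It is a deliberately simplified one-dimensional stand-in, not the optimization-based strategy of LoBE-GS: the per-cell visible-Gaussian counts, the prefix-sum cut rule, and all function names are assumptions made for illustration.

```python
def balanced_cuts(cell_loads, num_blocks):
    """Pick cut indices along one axis so each block gets a similar load.

    cell_loads[i] is a hypothetical count of visible Gaussians in cell i.
    A cut at index c means cells [previous_cut, c) form one block.
    """
    total = sum(cell_loads)
    cuts, running, next_block = [], 0, 1
    for i, load in enumerate(cell_loads):
        running += load
        # Close the current block once the running load reaches its share.
        if next_block < num_blocks and running >= total * next_block / num_blocks:
            cuts.append(i + 1)
            next_block += 1
    return cuts

def block_loads(cell_loads, cuts):
    """Total load of each block defined by the cut indices."""
    bounds = [0] + list(cuts) + [len(cell_loads)]
    return [sum(cell_loads[a:b]) for a, b in zip(bounds, bounds[1:])]

def imbalance(loads):
    """Max block load divided by mean block load (1.0 = perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))
```

For the loads `[9, 1, 1, 1, 2, 2, 2, 2]` and two blocks, a uniform split after four cells gives blocks of 12 and 8 (imbalance 1.2), whereas the load-aware cut after two cells gives 10 and 10.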

[127] Pack and Force Your Memory: Long-form and Consistent Video Generation

Xiaofei Wu,Guozhen Zhang,Zhiyong Xu,Yuan Zhou,Qinglin Lu,Xuming He

Main category: cs.CV

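TL;DR: Proposes MemoryPack and Direct Forcing to address long-range dependency modeling and error accumulation in autoregressive decoding for long-form video generation, substantially improving the context consistency and reliability of the generated videos.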

Details

Motivation: Long-form video generation faces two main challenges: capturing long-range dependencies and preventing error accumulation during autoregressive decoding. Existing methods fall short on both, so more effective modeling mechanisms and better training-inference alignment are needed. Method: MemoryPack is a learnable context-retrieval mechanism that uses textual and image information as global guidance to jointly model short- and long-term dependencies; Direct Forcing is an efficient single-step approximation strategy that improves training-inference alignment and reduces error propagation during inference. Result: MemoryPack achieves minute-level temporal consistency with linear computational complexity and scales well with video length; Direct Forcing curbs error accumulation. Together they substantially improve the quality and stability of long-form video generation. Conclusion: MemoryPack and Direct Forcing jointly strengthen the context consistency and reliability of long-form video generation and make autoregressive video models more practical. Abstract: Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. To address these challenges, we make two contributions. First, for dynamic context modeling, we propose MemoryPack, a learnable context-retrieval mechanism that leverages both textual and image information as global guidance to jointly model short- and long-term dependencies, achieving minute-level temporal consistency. This design scales gracefully with video length, preserves computational efficiency, and maintains linear complexity. Second, to mitigate error accumulation, we introduce Direct Forcing, an efficient single-step approximating strategy that improves training-inference alignment and thereby curtails error propagation during inference. Together, MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation, advancing the practical usability of autoregressive video models.

[128] Calibrating the Full Predictive Class Distribution of 3D Object Detectors for Autonomous Driving

Cornelius Schröder,Marius-Raphael Schlüter,Markus Lienkamp

Main category: cs.CV

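TL;DR: Studies confidence calibration for the classification task of 3D object detectors, proposes two auxiliary regularizing loss terms that improve calibration during training, and evaluates them in combination with post-hoc methods across several models.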

Details

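Motivation: Precise object detection and uncertainty estimation are critical for autonomous driving systems, yet existing 3D object detectors are poorly calibrated in their classification confidence, and effective metrics and methods for the joint calibration of dominant and secondary class predictions are lacking. Method: Two auxiliary regularizing loss terms are proposed, one calibrating the dominant prediction and one calibrating the full prediction vector; combinations of train-time and post-hoc calibration methods are evaluated on CenterPoint, PillarNet, and DSVT-Pillar. Result: Combining the full-prediction calibration loss with isotonic regression yields the best calibration of CenterPoint and PillarNet for both dominant and secondary class predictions; the same combination cannot jointly calibrate the dominant and secondary predictions of DSVT-Pillar. Conclusion: Calibrating the full predictive distribution of 3D object detectors is both necessary and feasible, and the proposed method improves the confidence reliability of mainstream models, although architectures differ in how well they respond to joint calibration. Abstract: In autonomous systems, precise object detection and uncertainty estimation are critical for self-aware and safe operation. This work addresses confidence calibration for the classification task of 3D object detectors. We argue that it is necessary to regard the calibration of the full predictive confidence distribution over all classes and deduce a metric which captures the calibration of dominant and secondary class predictions. We propose two auxiliary regularizing loss terms which introduce either calibration of the dominant prediction or the full prediction vector as a training goal. We evaluate a range of post-hoc and train-time methods for CenterPoint, PillarNet and DSVT-Pillar and find that combining our loss term, which regularizes for calibration of the full class prediction, and isotonic regression leads to the best calibration of CenterPoint and PillarNet with respect to both dominant and secondary class predictions. We further find that DSVT-Pillar cannot be jointly calibrated for dominant and secondary predictions using the same method.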
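
The difference between calibrating only the dominant prediction and calibrating the full class vector can be made concrete with the standard expected calibration error (ECE). The sketch below is not the metric derived in the paper, which the abstract does not specify; it applies a plain binned ECE either to the top-1 confidences or to every class probability, and the helper names are hypothetical.

```python
def expected_calibration_error(confidences, hits, num_bins=10):
    """Binned ECE: weighted mean of |average confidence - accuracy| per bin."""
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, hits):
        idx = min(num_bins - 1, int(conf * num_bins))
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(h for _, h in members) / len(members)
            ece += len(members) / total * abs(avg_conf - accuracy)
    return ece

def dominant_pairs(probs, labels):
    """(confidence, correct?) for the highest-scoring class of each detection."""
    confs, hits = [], []
    for p, y in zip(probs, labels):
        k = max(range(len(p)), key=lambda j: p[j])
        confs.append(p[k])
        hits.append(1 if k == y else 0)
    return confs, hits

def full_vector_pairs(probs, labels):
    """(confidence, correct?) for every class, including secondary ones."""
    confs, hits = [], []
    for p, y in zip(probs, labels):
        for j, pj in enumerate(p):
            confs.append(pj)
            hits.append(1 if j == y else 0)
    return confs, hits
```

Passing the output of `full_vector_pairs` to `expected_calibration_error` penalizes miscalibrated secondary-class probabilities, which the top-1 variant ignores.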

[129] Leveraging Prior Knowledge of Diffusion Model for Person Search

Giyeol Kim,Sooyoung Yang,Jihyong Oh,Myungjoo Kang,Chanho Eom

Main category: cs.CV

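TL;DR: Proposes DiffPS, a framework that uses a pre-trained diffusion model to resolve the optimization conflict between detection and re-identification in person search; three specialized modules improve person localization, reduce shape bias, and provide semantic-adaptive feature aggregation, reaching state-of-the-art performance on CUHK-SYSU and PRW.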

Details

Motivation: Existing methods mostly use ImageNet pre-trained backbones and share features between detection and re-identification, which makes it hard to capture complex spatial context and fine-grained identity cues and limits performance because the optimization objectives conflict. Method: DiffPS exploits the prior knowledge of a pre-trained diffusion model through three modules: a Diffusion-Guided Region Proposal Network (DGRPN), a Multi-Scale Frequency Refinement Network (MSFRN), and a Semantic-Adaptive Feature Aggregation Network (SFAN), which respectively improve localization, mitigate shape bias, and leverage text-aligned diffusion features. Result: DiffPS sets a new state of the art on the two mainstream person search datasets, CUHK-SYSU and PRW. Conclusion: DiffPS makes effective use of diffusion priors, decouples feature learning for detection and re-identification, and improves overall person search performance through its three specialized modules. Abstract: Person search aims to jointly perform person detection and re-identification by localizing and identifying a query person within a gallery of uncropped scene images. Existing methods predominantly utilize ImageNet pre-trained backbones, which may be suboptimal for capturing the complex spatial context and fine-grained identity cues necessary for person search. Moreover, they rely on a shared backbone feature for both person detection and re-identification, leading to suboptimal features due to conflicting optimization objectives. In this paper, we propose DiffPS (Diffusion Prior Knowledge for Person Search), a novel framework that leverages a pre-trained diffusion model while eliminating the optimization conflict between two sub-tasks. We analyze key properties of diffusion priors and propose three specialized modules: (i) Diffusion-Guided Region Proposal Network (DGRPN) for enhanced person localization, (ii) Multi-Scale Frequency Refinement Network (MSFRN) to mitigate shape bias, and (iii) Semantic-Adaptive Feature Aggregation Network (SFAN) to leverage text-aligned diffusion features. DiffPS sets a new state-of-the-art on CUHK-SYSU and PRW.

[130] Flow-Matching Guided Deep Unfolding for Hyperspectral Image Reconstruction

Yi Ai,Yuanhao Cai,Yulun Zhang,Xiaokang Yang

Main category: cs.CV

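TL;DR: Proposes the Flow-Matching-guided Unfolding network (FMU), the first method to bring flow matching into hyperspectral image reconstruction; it combines a generative prior with a deep unfolding framework and significantly outperforms existing methods on simulated and real data.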

Details

Motivation: Hyperspectral imaging is costly because of hardware limitations and the difficulty of reconstructing three-dimensional data from compressed measurements; existing compressive sensing systems such as CASSI still suffer from severe degradation and loss of spectral detail during reconstruction. Method: The generative prior of flow matching is embedded in a deep unfolding network, and a mean velocity loss is introduced to enforce global consistency of the flow, improving the robustness and accuracy of reconstruction. Result: Experiments on multiple simulated and real datasets show that FMU significantly outperforms existing methods in reconstruction quality. Conclusion: By combining the interpretability of optimization-based methods with the generative capacity of flow matching, FMU provides an efficient and accurate approach to hyperspectral image reconstruction. Abstract: Hyperspectral imaging (HSI) provides rich spatial-spectral information but remains costly to acquire due to hardware limitations and the difficulty of reconstructing three-dimensional data from compressed measurements. Although compressive sensing systems such as CASSI improve efficiency, accurate reconstruction is still challenged by severe degradation and loss of fine spectral details. We propose the Flow-Matching-guided Unfolding network (FMU), which, to our knowledge, is the first to integrate flow matching into HSI reconstruction by embedding its generative prior within a deep unfolding framework. To further strengthen the learned dynamics, we introduce a mean velocity loss that enforces global consistency of the flow, leading to a more robust and accurate reconstruction. This hybrid design leverages the interpretability of optimization-based methods and the generative capacity of flow matching. Extensive experiments on both simulated and real datasets show that FMU significantly outperforms existing approaches in reconstruction quality. Code and models will be available at https://github.com/YiAi03/FMU.

[131] Automated Defect Detection for Mass-Produced Electronic Components Based on YOLO Object Detection Models

Wei-Lung Mao,Chun-Chi Wang,Po-Heng Chou,Yen-Ting Liu

Main category: cs.CV

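TL;DR: Proposes an automated defect detection system for DIP packages based on deep learning and ConSinGAN-generated data; combining YOLO models with a SCADA architecture, it achieves high accuracy (95.50%) and fast detection (285 ms) on surface and pin-leg defects.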

Details

Motivation: Conventional defect inspection of industrial components is time-consuming and labor-intensive, which burdens inspection staff and makes quality control difficult. An automated, efficient method is needed to improve detection accuracy and throughput. Method: Images are captured with a digital camera optical system; ConSinGAN generates defect data to compensate for scarce samples; four YOLO models (v3, v4, v7, v9) are compared with and without this augmentation; and the detector is integrated into an automated framework with a SCADA system. Result: YOLOv7 with ConSinGAN reaches 95.50% accuracy with a detection time of 285 ms, outperforming the other YOLO versions and threshold-based methods; the system copes with multiple defect types and limited data. Conclusion: The proposed YOLOv7 and ConSinGAN based detection system performs well on industrial DIP components and is practical and extensible to industrial settings that lack defect samples. Abstract: Since the defect detection of conventional industry components is time-consuming and labor-intensive, it leads to a significant burden on quality inspection personnel and makes it difficult to manage product quality. In this paper, we propose an automated defect detection system for the dual in-line package (DIP) that is widely used in industry, using digital camera optics and a deep learning (DL)-based model. The two most common defect categories of DIP are examined: (1) surface defects, and (2) pin-leg defects. However, the lack of defective component images leads to a challenge for detection tasks. To solve this problem, the ConSinGAN is used to generate a suitable-sized dataset for training and testing. Four varieties of the YOLO model are investigated (v3, v4, v7, and v9), both in isolation and with the ConSinGAN augmentation. The proposed YOLOv7 with ConSinGAN outperforms the other YOLO versions, with an accuracy of 95.50% and a detection time of 285 ms, and is far superior to threshold-based approaches. In addition, the supervisory control and data acquisition (SCADA) system is developed, and the associated sensor architecture is described. The proposed automated defect detection can be readily set up for settings with numerous types of defects or insufficient defect data.

[132] Foundation Visual Encoders Are Secretly Few-Shot Anomaly Detectors

Guangyao Zhai,Yue Zhou,Xinyan Deng,Lars Heckler,Nassir Navab,Benjamin Busam

Main category: cs.CV

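TL;DR: Proposes FoundAD, a few-shot anomaly detection method that exploits the distribution of normal images learned by large-scale pre-trained visual encoders; a nonlinear projection operator maps features onto the natural image manifold so that anomalous image regions can be identified.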

Details

Motivation: Accurately separating normal from abnormal features is difficult with few samples, especially without category priors, and existing methods are parameter-heavy and generalize poorly. Method: Building on a pre-trained visual encoder, a nonlinear projection operator is learned to characterize differences in image embeddings and to detect regions that deviate from the normal distribution. Result: The method is competitive on multi-class anomaly detection with substantially fewer parameters than prior methods and is compatible with multiple foundation encoders, including DINOv3. Conclusion: The approach offers a new perspective on using foundation model features for few-shot anomaly detection and improves detection efficiency and generality. Abstract: Few-shot anomaly detection streamlines and simplifies industrial safety inspection. However, limited samples make accurate differentiation between normal and abnormal features challenging, and even more so under category-agnostic conditions. Large-scale pre-training of foundation visual encoders has advanced many fields, as the enormous quantity of data helps to learn the general distribution of normal images. We observe that the amount of anomaly in an image directly correlates with the difference in the learnt embeddings and utilize this to design a few-shot anomaly detector termed FoundAD. This is done by learning a nonlinear projection operator onto the natural image manifold. The simple operator acts as an effective tool for anomaly detection to characterize and identify out-of-distribution regions in an image. Extensive experiments show that our approach supports multi-class detection and achieves competitive performance while using substantially fewer parameters than prior methods. Backed up by evaluations with multiple foundation encoders, including the recent DINOv3, we believe this idea broadens the perspective on foundation features and advances the field of few-shot anomaly detection.

[133] ClustViT: Clustering-based Token Merging for Semantic Segmentation

Fabio Montello,Ronja Güldenring,Lazaros Nalpantidis

Main category: cs.CV

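TL;DR: Proposes ClustViT, a Vision Transformer based semantic segmentation method that merges similar tokens with a learnable clustering module and restores detail with a regeneration module, reducing computation while preserving segmentation accuracy.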

Details

Motivation: Vision Transformers perform well, but their quadratic attention complexity limits their use on real robotic systems, particularly for dense prediction tasks such as semantic segmentation, where existing token merging methods are a poor fit. Method: ClustViT adds a trainable Cluster module that dynamically merges similar tokens inside the network, guided by pseudo-clusters derived from segmentation masks, and a Regenerator module that restores fine-grained detail for downstream dense prediction. Result: Up to 2.18x fewer GFLOPs and 1.64x faster inference on three datasets, with comparable segmentation accuracy. Conclusion: ClustViT balances computational efficiency and segmentation performance, making ViTs more usable on resource-constrained robotic systems. Abstract: Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited due to their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, where we expand upon the Vision Transformer (ViT) backbone and address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network guided by pseudo-clusters from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three different datasets, with comparable segmentation accuracy. Our code and models will be made publicly available.
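
The merge-then-restore idea can be sketched in a few lines. This is only a schematic stand-in: in ClustViT both the Cluster and Regenerator modules are trainable, whereas here the cluster ids are given as input, merging is a plain average, and regeneration simply copies each merged token back to its original positions. All names are hypothetical.

```python
def merge_tokens(tokens, cluster_ids):
    """Average tokens that share a cluster id.

    Returns the merged tokens and, for every original position,
    the index of the merged token it was folded into.
    """
    sums, counts = {}, {}
    for tok, cid in zip(tokens, cluster_ids):
        if cid not in sums:
            sums[cid] = [0.0] * len(tok)
            counts[cid] = 0
        sums[cid] = [s + t for s, t in zip(sums[cid], tok)]
        counts[cid] += 1
    order = sorted(sums)
    merged = [[s / counts[c] for s in sums[c]] for c in order]
    index = {c: i for i, c in enumerate(order)}
    return merged, [index[c] for c in cluster_ids]

def regenerate(merged, assignment):
    """Restore the original sequence length by copying merged tokens back."""
    return [list(merged[a]) for a in assignment]
```

Because self-attention cost grows with the square of the token count, folding four tokens into two cuts the number of pairwise interactions from 16 to 4, which is the kind of saving the reported GFLOPs reduction reflects.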

Yongyi Su,Haojie Zhang,Shijie Li,Nanqing Liu,Jingyi Liao,Junyi Pan,Yuan Liu,Xiaofen Xing,Chong Sun,Chen Li,Nancy F. Chen,Shuicheng Yan,Xulei Yang,Xun Xu

Main category: cs.CV

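TL;DR: Proposes Patch-as-Decodable Token (PaDT), a unified paradigm that lets multimodal large language models (MLLMs) directly generate text and diverse visual outputs; using Visual Reference Tokens (VRTs) and a lightweight decoder, it achieves state-of-the-art performance on detection, segmentation, and grounding tasks.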

Details

Motivation: Existing MLLMs rely on indirect representations for vision tasks (for example, emitting coordinates as text), which limits performance and makes dense prediction tasks such as segmentation hard to support. A method that directly generates diverse visual outputs is needed. Method: PaDT derives Visual Reference Tokens (VRTs) from image patch embeddings, interleaves them with the LLM's output text tokens, and uses a lightweight decoder to turn the LLM's outputs into detection, segmentation, and grounding predictions; VRTs are processed independently at each forward pass and the embedding table is expanded dynamically to improve localization and discrimination; training uses randomly selected VRTs for supervised fine-tuning and a robust per-token cross-entropy loss. Result: Experiments on four visual perception and understanding tasks show that PaDT consistently reaches state-of-the-art performance, even against substantially larger MLLMs. Conclusion: PaDT offers an efficient and scalable framework for unified text and visual generation in MLLMs and performs well across dense visual prediction tasks. Abstract: Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches for vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and prevents dense prediction tasks like segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images and interleaved seamlessly with LLM's output textual tokens. A lightweight decoder then transforms LLM's outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks suggest that PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLM models. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.

[135] TriAlignXA: An Explainable Trilemma Alignment Framework for Trustworthy Agri-product Grading

Jianfei Xie,Ziyang Li

Main category: cs.CV

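TL;DR: Proposes TriAlignXA, an explainable AI framework that uses three optimization engines and a pre-mapping mechanism to balance the "impossible triangle" of quality, timeliness, and economic viability in agri-product grading, improving consumer trust in online fruit and vegetable e-commerce.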

Details

Motivation: Online fresh-produce e-commerce suffers from a trust deficit because digital transactions cannot offer direct sensory perception of product quality. Method: The paper builds a "Trust Pyramid" model, proposes the "Triangular Trust Index" (TTI), and designs the explainable AI framework TriAlignXA with three engines (Bio-Adaptive, Timeliness Optimization, and Economic Optimization), plus a pre-mapping mechanism that encodes process data into QR codes for transparency. Result: Experiments show significantly higher grading accuracy than baseline models, and empirical and theoretical analysis support the framework's ability to balance the "impossible triangle" and to build trust. Conclusion: TriAlignXA provides theory-to-practice support for a trustworthy online agri-product trading ecosystem and establishes a pathway from algorithmic decision-making to consumer trust. Abstract: The 'trust deficit' in online fruit and vegetable e-commerce stems from the inability of digital transactions to provide direct sensory perception of product quality. This paper constructs a 'Trust Pyramid' model through 'dual-source verification' of consumer trust. Experiments confirm that quality is the cornerstone of trust. The study reveals an 'impossible triangle' in agricultural product grading, comprising biological characteristics, timeliness, and economic viability, highlighting the limitations of traditional absolute grading standards. To quantitatively assess this trade-off, we propose the 'Triangular Trust Index' (TTI). We redefine the role of algorithms from 'decision-makers' to 'providers of transparent decision-making bases', designing the explainable AI framework--TriAlignXA. This framework supports trustworthy online transactions within agricultural constraints through multi-objective optimization. Its core relies on three engines: the Bio-Adaptive Engine for granular quality description; the Timeliness Optimization Engine for processing efficiency; and the Economic Optimization Engine for cost control. Additionally, the "Pre-Mapping Mechanism" encodes process data into QR codes, transparently conveying quality information. Experiments on grading tasks demonstrate significantly higher accuracy than baseline models. Empirical evidence and theoretical analysis verify the framework's balancing capability in addressing the "impossible triangle". This research provides comprehensive support--from theory to practice--for building a trustworthy online produce ecosystem, establishing a critical pathway from algorithmic decision-making to consumer trust.

[136] 4DGS-Craft: Consistent and Interactive 4D Gaussian Splatting Editing

Lei Liu,Can Wang,Zhenghao Chen,Dong Xu

Main category: cs.CV

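TL;DR: Proposes 4DGS-Craft, a consistent and interactive 4D Gaussian Splatting editing framework that achieves view, temporal, and non-edited-region consistency through a 4D-aware InstructPix2Pix model, a multi-view grid module, and a Gaussian selection mechanism, and uses an LLM to interpret and decompose complex user instructions.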

Details

Motivation: Existing 4D Gaussian Splatting editing methods fall short on view, temporal, and non-edited-region consistency and struggle with complex text instructions, so a more consistent and interactive editing framework is needed. Method: A 4D-aware InstructPix2Pix model incorporates 4D VGGT geometry features and a multi-view grid module to strengthen view and temporal consistency; a Gaussian selection mechanism protects non-edited regions; and an LLM-based module parses user instructions and decomposes them into a sequence of atomic operations. Result: The framework produces 4D scene edits that are more consistent across views, time, and non-edited regions, and it interprets and executes complex text instructions, improving controllability and interactivity. Conclusion: 4DGS-Craft addresses the consistency and interactivity problems of 4D Gaussian Splatting editing and supports high-quality scene editing under complex instructions. Abstract: Recent advances in 4D Gaussian Splatting (4DGS) editing still face challenges with view, temporal, and non-editing region consistency, as well as with handling complex text instructions. To address these issues, we propose 4DGS-Craft, a consistent and interactive 4DGS editing framework. We first introduce a 4D-aware InstructPix2Pix model to ensure both view and temporal consistency. This model incorporates 4D VGGT geometry features extracted from the initial scene, enabling it to capture underlying 4D geometric structures during editing. We further enhance this model with a multi-view grid module that enforces consistency by iteratively refining multi-view input images while jointly optimizing the underlying 4D scene. Furthermore, we preserve the consistency of non-edited regions through a novel Gaussian selection mechanism, which identifies and optimizes only the Gaussians within the edited regions. Beyond consistency, facilitating user interaction is also crucial for effective 4DGS editing. Therefore, we design an LLM-based module for user intent understanding. This module employs a user instruction template to define atomic editing operations and leverages an LLM for reasoning. As a result, our framework can interpret user intent and decompose complex instructions into a logical sequence of atomic operations, enabling it to handle intricate user commands and further enhance editing performance. Compared to related works, our approach enables more consistent and controllable 4D scene editing. Our code will be made available upon acceptance.

[137] Pure-Pass: Fine-Grained, Adaptive Masking for Dynamic Token-Mixing Routing in Lightweight Image Super-Resolution

Junyu Wu,Jie Tang,Jie Liu,Gangshan Wu

Main category: cs.CV

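TL;DR: Proposes Pure-Pass (PP), a pixel-level masking mechanism for image super-resolution that identifies "pure pixels" and exempts them from expensive computation, substantially reducing compute while maintaining high performance.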

Details

Motivation: Existing lightweight image super-resolution methods are limited in adaptability, masking granularity, and spatial flexibility; methods such as CAMixer in particular cannot control the allocation of computation finely. Method: A pixel classification strategy based on fixed color center points enables fine-grained, spatially flexible pixel-level masking, which is integrated into the ATD-light model to form PP-ATD-light. Result: PP-ATD-light outperforms CAMixer-ATD-light in reconstruction quality and parameter efficiency while saving a similar amount of computation. Conclusion: Pure-Pass improves the computational efficiency and adaptability of super-resolution models and offers a better dynamic routing scheme for lightweight SR networks. Abstract: Image Super-Resolution (SR) aims to reconstruct high-resolution images from low-resolution counterparts, but the computational complexity of deep learning-based methods often hinders practical deployment. CAMixer is the pioneering work to integrate the advantages of existing lightweight SR methods and proposes a content-aware mixer to route token mixers of varied complexities according to the difficulty of content recovery. However, several limitations remain, such as poor adaptability, coarse-grained masking and spatial inflexibility, among others. We propose Pure-Pass (PP), a pixel-level masking mechanism that identifies pure pixels and exempts them from expensive computations. PP utilizes fixed color center points to classify pixels into distinct categories, enabling fine-grained, spatially flexible masking while maintaining adaptive flexibility. Integrated into the state-of-the-art ATD-light model, PP-ATD-light achieves superior SR performance with minimal overhead, outperforming CAMixer-ATD-light in reconstruction quality and parameter efficiency when saving a similar amount of computation.
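
One plausible reading of the masking step is sketched below on a grayscale image. The abstract only says that fixed color center points classify pixels into categories; the specific rule used here, that a pixel counts as "pure" when every pixel in its local window falls into the same category, is an assumption, as are the window radius, the example centers, and the function names.

```python
def nearest_center(value, centers):
    """Index of the fixed color center closest to a pixel value."""
    return min(range(len(centers)), key=lambda k: abs(value - centers[k]))

def pure_mask(image, centers, radius=1):
    """Mark pixels whose whole neighborhood shares their color category.

    image is a list of rows of grayscale values; pixels marked True would
    skip the expensive token mixer under the assumed rule.
    """
    height, width = len(image), len(image[0])
    labels = [[nearest_center(v, centers) for v in row] for row in image]
    mask = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            same = True
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and labels[ny][nx] != labels[y][x]:
                        same = False
            mask[y][x] = same
    return mask
```

Flat regions end up masked, while pixels next to an edge keep the full computation, which matches the intuition that smooth areas are easy to super-resolve.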

[138] Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using GPT-4o: Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework

Nanaka Hosokawa,Ryo Takahashi,Tomoya Kitano,Yukihiro Iida,Chisako Muramatsu,Tatsuro Hayashi,Yuta Seino,Xiangrong Zhou,Takeshi Hara,Akitoshi Katsumata,Hiroshi Fujita

Main category: cs.CV

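TL;DR: Proposes a Self-correction Loop with Structured Output (SLSO) framework that uses GPT-4o's multimodal capabilities to generate findings for jaw cysts on dental panoramic radiographs; compared with the conventional Chain-of-Thought (CoT) method it improves accuracy on several evaluation items, most notably tooth number, tooth movement, and root resorption.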

Details

Motivation: Improve the accuracy and reliability of AI-generated medical imaging findings, reduce hallucinations, and make structured output more consistent. Method: A 10-step SLSO framework covers image input and analysis, structured data generation, tooth number extraction and consistency checking, iterative regeneration when inconsistencies are detected, and final finding generation with verification; it is compared with the conventional CoT method. Result: SLSO shows improvement rates of 66.9%, 33.3%, and 28.6% for tooth number, tooth movement, and root resorption, respectively; consistent structured output is reached after at most five regenerations; hallucinations are suppressed and negative findings are described more reliably, but extensive lesions spanning multiple teeth remain hard to identify. Conclusion: SLSO improves the accuracy and consistency of GPT-4o's jaw cyst findings and shows clinical potential, though further refinement is needed for complex lesions and overall performance. Abstract: In this study, we utilized the multimodal capabilities of OpenAI GPT-4o to automatically generate jaw cyst findings on dental panoramic radiographs. To improve accuracy, we constructed a Self-correction Loop with Structured Output (SLSO) framework and verified its effectiveness. A 10-step process was implemented for 22 cases of jaw cysts, including image input and analysis, structured data generation, tooth number extraction and consistency checking, iterative regeneration when inconsistencies were detected, and finding generation with subsequent restructuring and consistency verification. A comparative experiment was conducted using the conventional Chain-of-Thought (CoT) method across seven evaluation items: transparency, internal structure, borders, root resorption, tooth movement, relationships with other structures, and tooth number. The results showed that the proposed SLSO framework improved output accuracy for many items, with 66.9%, 33.3%, and 28.6% improvement rates for tooth number, tooth movement, and root resorption, respectively. In the successful cases, a consistently structured output was achieved after up to five regenerations. Although statistical significance was not reached because of the small size of the dataset, the overall SLSO framework enforced negative finding descriptions, suppressed hallucinations, and improved tooth number identification accuracy. However, the accurate identification of extensive lesions spanning multiple teeth remains limited. Further refinement is required to enhance overall performance and move toward a practical finding generation system.
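
The consistency-check-and-regenerate part of the process can be sketched as a small loop. This is an illustration under stated assumptions rather than the authors' 10-step pipeline: `generate` is a hypothetical callable standing in for the GPT-4o request, the output is assumed to be a dict with `finding` and `tooth_numbers` keys, and tooth numbers are assumed to be two-digit FDI codes (11 to 48), which the abstract does not confirm.

```python
import re

def extract_tooth_numbers(text):
    """Collect two-digit tooth codes mentioned in free text (assumed FDI)."""
    return sorted({int(m) for m in re.findall(r"\b([1-4][1-8])\b", text)})

def self_correct(generate, max_regenerations=5):
    """Regenerate until the finding text and structured field agree.

    generate(attempt) is a hypothetical model call returning a dict with
    'finding' (str) and 'tooth_numbers' (list of int). Returns the first
    consistent output and the number of regenerations used, or
    (None, max_regenerations) if no attempt was consistent.
    """
    for attempt in range(max_regenerations + 1):
        output = generate(attempt)
        mentioned = extract_tooth_numbers(output["finding"])
        if mentioned == sorted(output["tooth_numbers"]):
            return output, attempt
    return None, max_regenerations
```

A regular expression this simple would also pick up unrelated two-digit numbers such as measurements, so a real system would need a stricter extractor; the cap of five regenerations mirrors the figure quoted in the abstract.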

[139] LiLa-Net: Lightweight Latent LiDAR Autoencoder for 3D Point Cloud Reconstruction

Mario Resino,Borja Pérez,Jaime Godoy,Abdulla Al-Kaff,Fernando García

Main category: cs.CV

TL;DR: Proposes LiLa-Net, a 3D autoencoder architecture that efficiently encodes features of real traffic environments using only LiDAR point clouds.

Details

Motivation: Design a 3D point cloud autoencoder that is light on resources yet performs well, suitable for perception systems in real traffic environments. Method: An autoencoder with fewer encoder layers and streamlined skip connections, trained end to end on LiDAR point clouds. Result: The model reconstructs point clouds with high quality, stays efficient, and generalizes well enough to reconstruct objects from outside the traffic environment. Conclusion: By balancing skip connections against the latent encoding, LiLa-Net achieves efficient, generalizable point cloud reconstruction with reduced resource consumption. Abstract: This work proposes a 3D autoencoder architecture, named LiLa-Net, which encodes efficient features from real traffic environments, employing only the LiDAR's point clouds. For this purpose, we use a real semi-autonomous vehicle equipped with a Velodyne LiDAR. The system leverages the skip-connection concept to improve performance without the extensive resources that state-of-the-art architectures require. Key changes include reducing the number of encoder layers and simplifying the skip connections, while still producing an efficient and representative latent space that allows accurate reconstruction of the original point cloud. Furthermore, an effective balance has been achieved between the information carried by the skip connections and the latent encoding, leading to improved reconstruction quality without compromising performance. Finally, the model demonstrates strong generalization capabilities, successfully reconstructing objects unrelated to the original traffic environment.

[140] kabr-tools: Automated Framework for Multi-Species Behavioral Monitoring

Jenna Kline,Maksim Kholiavchenko,Samuel Stevens,Nina van Tiel,Alison Zhong,Namrata Banerji,Alec Sheets,Sowbaranika Balasubramaniam,Isla Duporge,Matthew Thompson,Elizabeth Campolongo,Jackson Miliko,Neil Rosser,Tanya Berger-Wolf,Charles V. Stewart,Daniel I. Rubenstein

Main category: cs.CV

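TL;DR: This paper presents kabr-tools, an open-source toolkit that combines drone video with machine learning to automate multi-species wildlife behavioral monitoring, substantially improving the precision and scale of behavioral observation.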

Details

Motivation: Traditional field observation is limited in scope, time-consuming, and labor-intensive, and it struggles to capture complex behavioral patterns, so behavioral ecology needs a scalable technical approach. Method: The authors developed kabr-tools, an analysis framework that integrates drone video with machine learning systems. It includes object detection, tracking, and behavioral classification modules and extracts metrics such as time budgets, behavioral transitions, social interactions, habitat associations, and group dynamics. Result: Compared with ground-based observation, the drone-based approach reduces visibility loss by 15% and captures more behavioral transitions with higher accuracy and continuity. Three case studies covering 969 behavioral sequences show that vigilance decreases with herd size in both Grevy's and Plains zebras, that habitat affects the two species differently, and that both species show strong behavioral inertia and spatial segregation. Conclusion: kabr-tools enables automated behavioral monitoring at scale and provides a powerful tool for ecosystem-wide studies, biodiversity conservation, and ecological monitoring. Abstract: A comprehensive understanding of animal behavior ecology depends on scalable approaches to quantify and interpret complex, multidimensional behavioral patterns. Traditional field observations are often limited in scope, time-consuming, and labor-intensive, hindering the assessment of behavioral responses across landscapes. To address this, we present kabr-tools (Kenyan Animal Behavior Recognition Tools), an open-source package for automated multi-species behavioral monitoring. This framework integrates drone-based video with machine learning systems to extract behavioral, social, and spatial metrics from wildlife footage. Our pipeline leverages object detection, tracking, and behavioral classification systems to generate key metrics, including time budgets, behavioral transitions, social interactions, habitat associations, and group composition dynamics. Compared to ground-based methods, drone-based observations significantly improved behavioral granularity, reducing visibility loss by 15% and capturing more transitions with higher accuracy and continuity. We validate kabr-tools through three case studies, analyzing 969 behavioral sequences, surpassing the capacity of traditional methods for data capture and annotation. We found that, like Plains zebras, vigilance in Grevy's zebras decreases with herd size, but, unlike Plains zebras, habitat has a negligible impact. Plains and Grevy's zebras exhibit strong behavioral inertia, with rare transitions to alert behaviors and observed spatial segregation between Grevy's zebras, Plains zebras, and giraffes in mixed-species herds. 
By enabling automated behavioral monitoring at scale, kabr-tools offers a powerful tool for ecosystem-wide studies, advancing conservation, biodiversity research, and ecological monitoring.
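Two of the metrics named above, time budgets and behavioral transitions, can be derived from per-frame behavior labels. The sketch below is not the kabr-tools API; the function names and the label vocabulary are hypothetical, and it assumes one label per video frame for a single tracked animal.

```python
from collections import Counter
from itertools import groupby
from typing import Dict, List


def time_budget(labels: List[str]) -> Dict[str, float]:
    """Fraction of frames spent in each behavior."""
    total = len(labels)
    return {behavior: n / total for behavior, n in Counter(labels).items()}


def transitions(labels: List[str]) -> Counter:
    """Count changes between consecutive bouts (runs of the same label)."""
    bouts = [behavior for behavior, _ in groupby(labels)]
    return Counter(zip(bouts, bouts[1:]))


# Hypothetical per-frame labels for one animal.
frames = ["graze", "graze", "graze", "walk", "walk", "alert", "graze", "graze"]
budget = time_budget(frames)
moves = transitions(frames)
```

Collapsing frames into bouts before counting means a long run of one behavior contributes no transitions, so behavioral inertia shows up as a sparse transition table.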

[141] GaussianMorphing: Mesh-Guided 3D Gaussians for Semantic-Aware Object Morphing

Mengtian Li,Yunshu Bai,Yimin Chu,Yijun Shen,Zhongmei Li,Weifeng Ge,Zhifeng Xie,Chaofeng Chen

Main category: cs.CV

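TL;DR: Proposes GaussianMorphing, a new framework for semantic-aware 3D shape and texture morphing from multi-view images. It uses mesh-guided 3D Gaussian Splatting for high-fidelity geometry and appearance modeling and preserves local detail and global semantic coherence without labeled data.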

Details

Motivation: Existing methods rely on point clouds or require pre-defined homeomorphic mappings, handle only untextured data, and lack semantic consistency, so a morphing framework is needed that guarantees both geometric consistency and texture fidelity. Method: Geometry and appearance are modeled with mesh-guided 3D Gaussian Splatting (3DGS). 3D Gaussians are anchored to mesh patches, a unified deformation strategy is enforced through topology-aware constraints, the mesh topology serves as a geometric prior for unsupervised semantic correspondence, and physically plausible point trajectories maintain structural integrity. Result: On the proposed TexMorph benchmark the method substantially outperforms prior 2D/3D methods, reducing color consistency error (ΔE) by 22.2% and EI by 26.2%. Conclusion: By combining a mesh-guided 3D Gaussian representation with unsupervised semantic correspondence, GaussianMorphing produces high-quality, semantically consistent 3D morphing without labeled data while preserving detail and structural integrity. Abstract: We introduce GaussianMorphing, a novel framework for semantic-aware 3D shape and texture morphing from multi-view images. Previous approaches usually rely on point clouds or require pre-defined homeomorphic mappings for untextured data. Our method overcomes these limitations by leveraging mesh-guided 3D Gaussian Splatting (3DGS) for high-fidelity geometry and appearance modeling. The core of our framework is a unified deformation strategy that anchors 3D Gaussians to reconstructed mesh patches, ensuring geometrically consistent transformations while preserving texture fidelity through topology-aware constraints. In parallel, our framework establishes unsupervised semantic correspondence by using the mesh topology as a geometric prior and maintains structural integrity via physically plausible point trajectories. This integrated approach preserves both local detail and global semantic coherence throughout the morphing process without requiring labeled data. On our proposed TexMorph benchmark, GaussianMorphing substantially outperforms prior 2D/3D methods, reducing color consistency error ($\Delta E$) by 22.2% and EI by 26.2%. Project page: https://baiyunshu.github.io/GAUSSIANMORPHING.github.io/

[142] Zero-shot Human Pose Estimation using Diffusion-based Inverse solvers

Sahil Bhandary Karnoor,Romit Roy Choudhury

Main category: cs.CV

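TL;DR: This paper proposes InPose, a method that conditions a pre-trained diffusion model on rotational measurements alone and guides it with a likelihood term derived from location measurements, achieving zero-shot pose estimation without user-specific training.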

Details

Motivation: Location measurements vary with the user's body size, so existing pose estimators based on diffusion models conditioned on both location and rotation generalize poorly across users. Method: Pose estimation is formulated as an inverse problem. A pre-trained diffusion model is conditioned on rotational data only, and its prior is guided by a likelihood term derived from the location measurements. Result: With only a few on-body sensors, the method achieves zero-shot pose estimation for arbitrary users and shows good cross-user adaptability and accuracy. Conclusion: InPose estimates full-body pose accurately without training on user-specific data, which makes it markedly more applicable and robust in practical settings. Abstract: Pose estimation refers to tracking a human's full body posture, including their head, torso, arms, and legs. The problem is challenging in practical settings where the number of body sensors is limited. Past work has shown promising results using conditional diffusion models, where the pose prediction is conditioned on both location and rotation measurements from the sensors. Unfortunately, nearly all these approaches generalize poorly across users, primarily because location measurements are highly influenced by the body size of the user. In this paper, we formulate pose estimation as an inverse problem and design an algorithm capable of zero-shot generalization. Our idea utilizes a pre-trained diffusion model and conditions it on rotational measurements alone; the priors from this model are then guided by a likelihood term, derived from the measured locations. Thus, given any user, our proposed InPose method generatively estimates the highly likely sequence of poses that best explains the sparse on-body measurements.
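One common way to write this kind of guidance, used by diffusion-based inverse solvers in general, is to split the posterior score with Bayes' rule. The symbols are assumptions for illustration: $x_t$ is the noisy pose at diffusion step $t$, $r$ the rotational measurements, $y$ the measured locations, $f$ a hypothetical forward map from pose to sensor locations, $\hat{x}_0(x_t)$ the denoised pose estimate, and $\sigma$ an assumed Gaussian noise scale. The abstract does not state the exact likelihood InPose uses, so the second expression is only one plausible choice.

```latex
\nabla_{x_t} \log p(x_t \mid r, y)
  = \underbrace{\nabla_{x_t} \log p(x_t \mid r)}_{\text{pre-trained prior}}
  + \underbrace{\nabla_{x_t} \log p(y \mid x_t, r)}_{\text{likelihood guidance}},
\qquad
\log p(y \mid x_t, r) \approx
  -\frac{1}{2\sigma^2}\,\bigl\| y - f\bigl(\hat{x}_0(x_t)\bigr) \bigr\|_2^2 + \text{const}.
```

Under this reading, the prior term never sees location data, so body-size dependence would enter only through the forward map $f$ at inference time.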

[143] VGDM: Vision-Guided Diffusion Model for Brain Tumor Detection and Segmentation

Arman Behnam

Main category: cs.CV

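TL;DR: Proposes VGDM, a vision-transformer-based diffusion model for brain tumor detection and segmentation that combines global contextual reasoning with iterative denoising to improve segmentation accuracy and robustness.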

Details

Motivation: Conventional U-Nets have limited capacity to capture long-range dependencies and struggle to segment complex tumor structures accurately, so a stronger model is needed for medical image segmentation. Method: A vision transformer is embedded at the core of a diffusion model to form VGDM, a transformer-driven diffusion framework. The transformer models spatial relationships across whole-brain MRI volumes, and the diffusion process progressively refines the segmentation. Result: Experiments on brain tumor MRI datasets show that VGDM outperforms conventional methods on both Dice similarity and Hausdorff distance, improving segmentation accuracy and boundary detail. Conclusion: By combining the transformer's global modeling with the diffusion model's fine-grained refinement, VGDM offers a more robust and scalable approach to brain tumor segmentation. Abstract: Accurate detection and segmentation of brain tumors from magnetic resonance imaging (MRI) are essential for diagnosis, treatment planning, and clinical monitoring. While convolutional architectures such as U-Net have long been the backbone of medical image segmentation, their limited capacity to capture long-range dependencies constrains performance on complex tumor structures. Recent advances in diffusion models have demonstrated strong potential for generating high-fidelity medical images and refining segmentation boundaries. In this work, we propose VGDM (Vision-Guided Diffusion Model), a transformer-driven diffusion framework for brain tumor detection and segmentation. By embedding a vision transformer at the core of the diffusion process, the model leverages global contextual reasoning together with iterative denoising to enhance both volumetric accuracy and boundary precision. The transformer backbone enables more effective modeling of spatial relationships across entire MRI volumes, while diffusion refinement mitigates voxel-level errors and recovers fine-grained tumor details. This hybrid design provides a pathway toward improved robustness and scalability in neuro-oncology, moving beyond conventional U-Net baselines. Experimental validation on MRI brain tumor datasets demonstrates consistent gains in Dice similarity and Hausdorff distance, underscoring the potential of transformer-guided diffusion models to advance the state of the art in tumor segmentation.
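For readers unfamiliar with the two reported metrics, the following sketch computes the Dice similarity coefficient on flattened binary masks and the symmetric Hausdorff distance on boundary point sets. It uses the textbook definitions and toy inputs; it is not the paper's evaluation code, and real pipelines typically operate on 3D volumes.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def dice(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(src: Sequence[Point], dst: Sequence[Point]) -> float:
        return max(min(math.dist(s, d) for d in dst) for s in src)
    return max(directed(a, b), directed(b, a))


# Toy masks and boundary points.
dice_score = dice([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
hd = hausdorff([(0, 0), (1, 0)], [(0, 0), (3, 0)])
```

Dice rewards overlap and is higher when better, while Hausdorff distance measures the worst boundary deviation and is lower when better, which is why the two are usually reported together.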

[144] Mapping Historic Urban Footprints in France: Balancing Quality, Scalability and AI Techniques

Walid Rabehi,Marion Le Texier,Rémi Lemoy

Main category: cs.CV

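TL;DR: This study presents a scalable deep learning pipeline that uses a dual-pass U-Net to extract urban areas across France from the Scan Histo historical maps (1925-1950), producing the first open-access, national-scale urban footprint dataset for this pivotal period.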

Details

Motivation: Quantitative analysis of French urban expansion before the 1970s is limited by the lack of nationwide digital urban footprint data. Method: A dual-pass U-Net is used. The first pass produces a preliminary result and identifies areas of confusion (such as text and roads) to guide data augmentation. The second pass uses the refined dataset together with the binarized output of the first pass to reduce radiometric noise and false positives. The pipeline processes 941 high-resolution tiles on a high-performance computing cluster. Result: The resulting urban footprint mosaic covers all of metropolitan France with an overall accuracy of 73%, captures diverse urban patterns, and suppresses common artifacts such as labels and contour lines. Conclusion: The study fills a data gap for early urbanization, and the released code, training data, and nationwide urban raster support research on long-term urbanization dynamics. Abstract: Quantitative analysis of historical urban sprawl in France before the 1970s is hindered by the lack of nationwide digital urban footprint data. This study bridges this gap by developing a scalable deep learning pipeline to extract urban areas from the Scan Histo historical map series (1925-1950), which produces the first open-access, national-scale urban footprint dataset for this pivotal period. Our key innovation is a dual-pass U-Net approach designed to handle the high radiometric and stylistic complexity of historical maps. The first pass, trained on an initial dataset, generates a preliminary map that identifies areas of confusion, such as text and roads, to guide targeted data augmentation. The second pass uses a refined dataset and the binarized output of the first model to minimize radiometric noise, which significantly reduces false positives. Deployed on a high-performance computing cluster, our method processes 941 high-resolution tiles covering the entirety of metropolitan France. The final mosaic achieves an overall accuracy of 73%, effectively capturing diverse urban patterns while overcoming common artifacts like labels and contour lines. We openly release the code, training datasets, and the resulting nationwide urban raster to support future research in long-term urbanization dynamics.

[145] When Tracking Fails: Analyzing Failure Modes of SAM2 for Point-Based Tracking in Surgical Videos

Woowon Jang,Jiwon Im,Juseung Choi,Niki Rashidian,Wesley De Neve,Utku Ozbulak

Main category: cs.CV

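TL;DR: This study systematically analyzes the failure modes of point-based tracking in laparoscopic cholecystectomy videos. Point-based tracking works well for surgical tools but performs poorly on anatomical structures because of tissue similarity and ambiguous boundaries; the authors give practical recommendations for selecting and placing tracking points.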

Details

Motivation: Understand how reliable point-based tracking is in complex surgical environments and when it fails, in order to improve its use in surgical video analysis. Method: The performance of point-based tracking is compared with segmentation mask initialization on three surgical targets (gallbladder, grasper, and L-hook electrocautery), and the failure modes are analyzed systematically. Result: Point-based tracking is competitive for surgical tools but underperforms on anatomical targets, mainly because tissue similarity and ambiguous boundaries cause tracking to fail. Conclusion: Point-based tracking is suitable for surgical tools but should be used with care on anatomical structures; the study offers practical recommendations for selecting and placing tracking points. Abstract: Video object segmentation (VOS) models such as SAM2 offer promising zero-shot tracking capabilities for surgical videos using minimal user input. Among the available input types, point-based tracking offers an efficient and low-cost alternative, yet its reliability and failure cases in complex surgical environments are not well understood. In this work, we systematically analyze the failure modes of point-based tracking in laparoscopic cholecystectomy videos. Focusing on three surgical targets, the gallbladder, grasper, and L-hook electrocautery, we compare the performance of point-based tracking with segmentation mask initialization. Our results show that point-based tracking is competitive for surgical tools but consistently underperforms for anatomical targets, where tissue similarity and ambiguous boundaries lead to failure. Through qualitative analysis, we reveal key factors influencing tracking outcomes and provide several actionable recommendations for selecting and placing tracking points to improve performance in surgical video analysis.

[146] FRIEREN: Federated Learning with Vision-Language Regularization for Segmentation

Ding-Ruei Shen

Main category: cs.CV

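TL;DR: This paper introduces a new task, FFREEDG: federated semantic segmentation in which the source data can no longer be accessed and clients hold only unlabeled data. To address it, the authors design FRIEREN, a framework that combines a vision-language decoder guided by CLIP text embeddings with a weak-to-strong consistency learning strategy, and it performs well in cross-domain settings.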

Details

Motivation: Existing federated learning methods usually assume that clients have labeled data or fail to exploit vision foundation models, so they cope poorly with the domain shift and unlabeled data that are common in practice. A more realistic method is needed that uses unlabeled data effectively and takes advantage of VFMs. Method: The FRIEREN framework pretrains on the server, then uses a vision-language decoder guided by CLIP text embeddings to improve semantic disambiguation, and applies a consistency learning strategy between weak and strong augmentations for robust local training on pseudo-labels at the clients. Result: On synthetic-to-real and clear-to-adverse-weather cross-domain benchmarks, FRIEREN is competitive with established domain adaptation and generalization methods, confirming its effectiveness for federated semantic segmentation without labels. Conclusion: FRIEREN offers an effective solution for federated semantic segmentation with unlabeled data, demonstrates the potential of vision-language models in this area, and sets a strong baseline for future research. Abstract: Federated Learning (FL) offers a privacy-preserving solution for Semantic Segmentation (SS) tasks to adapt to new domains, but faces significant challenges from these domain shifts, particularly when client data is unlabeled. However, most existing FL methods unrealistically assume access to labeled data on remote clients or fail to leverage the power of modern Vision Foundation Models (VFMs). Here, we propose a novel and challenging task, FFREEDG, in which a model is pretrained on a server's labeled source dataset and subsequently trained across clients using only their unlabeled data, without ever re-accessing the source. To solve FFREEDG, we propose FRIEREN, a framework that leverages the knowledge of a VFM by integrating vision and language modalities. Our approach employs a Vision-Language decoder guided by CLIP-based text embeddings to improve semantic disambiguation and uses a weak-to-strong consistency learning strategy for robust local training on pseudo-labels. Our experiments on synthetic-to-real and clear-to-adverse-weather benchmarks demonstrate that our framework effectively tackles this new task, achieving competitive performance against established domain generalization and adaptation methods and setting a strong baseline for future research.

[147] Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting

Shu Zou,Xinyu Tian,Lukas Wesemann,Fabian Waschkowski,Zhaoyuan Yang,Jing Zhang

Main category: cs.CV

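TL;DR: Proposes ASK-Hint, a structured prompting framework built on action-centric knowledge that improves the performance of frozen vision-language models on video anomaly detection.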

Details

Motivation: Existing prompts are overly abstract and overlook the fine-grained human-object interactions and action semantics that define complex anomalies. Method: Prompts are organized into semantically coherent categories, and fine-grained guiding questions are designed to align model predictions with discriminative visual cues. Result: AUC improves on UCF-Crime and XD-Violence, surpassing both fine-tuned and training-free methods, with good generalization across datasets and models. Conclusion: Prompt granularity matters, and ASK-Hint is a new training-free, generalizable, and interpretable approach to video anomaly detection. Abstract: Prompting has emerged as a practical way to adapt frozen vision-language models (VLMs) for video anomaly detection (VAD). Yet, existing prompts are often overly abstract, overlooking the fine-grained human-object interactions or action semantics that define complex anomalies in surveillance videos. We propose ASK-Hint, a structured prompting framework that leverages action-centric knowledge to elicit more accurate and interpretable reasoning from frozen VLMs. Our approach organizes prompts into semantically coherent groups (e.g., violence, property crimes, public safety) and formulates fine-grained guiding questions that align model predictions with discriminative visual cues. Extensive experiments on UCF-Crime and XD-Violence show that ASK-Hint consistently improves AUC over prior baselines, achieving state-of-the-art performance compared to both fine-tuned and training-free methods. Beyond accuracy, our framework provides interpretable reasoning traces for detected anomalies and demonstrates strong generalization across datasets and VLM backbones. These results highlight the critical role of prompt granularity and establish ASK-Hint as a new training-free and generalizable solution for explainable video anomaly detection.
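The idea of grouping fine-grained questions under semantic categories can be sketched as a small prompt builder. The group names come from the abstract; the guiding questions, the output format, and the function name are hypothetical and are not taken from ASK-Hint itself.

```python
from typing import Dict, List

# Group names follow the abstract; the questions are hypothetical examples.
PROMPT_GROUPS: Dict[str, List[str]] = {
    "violence": ["Is anyone striking, kicking, or pushing another person?"],
    "property crimes": ["Is anyone forcing open a door, window, or vehicle?"],
    "public safety": ["Is there fire, smoke, or a collision in the scene?"],
}


def build_prompt(groups: Dict[str, List[str]]) -> str:
    """Assemble one structured prompt with a section per semantic group."""
    lines = ["Answer each question about the video clip with yes or no."]
    for group, questions in groups.items():
        lines.append(f"[{group}]")
        lines.extend(f"- {question}" for question in questions)
    lines.append("Finally, state whether the clip is anomalous and why.")
    return "\n".join(lines)


prompt = build_prompt(PROMPT_GROUPS)
```

Because the VLM stays frozen, all task adaptation lives in this text, which is what makes the approach training-free and the intermediate yes/no answers inspectable.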

[148] GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation

Weijia Dou,Xu Zhang,Yi Bin,Jian Liu,Bo Peng,Guoqing Wang,Yang Yang,Heng Tao Shen

Main category: cs.CV

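TL;DR: Proposes GeoPurify, which uses geometric priors to purify 3D point features generated by 2D VLMs and achieves efficient semantic segmentation with very little training data.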

Details

Motivation: Existing 2D-to-3D feature transfer methods trade off noisy predictions against costly training, mainly because the segmentation-and-matching paradigm struggles to reconcile 2D semantics with 3D structure. Method: A Student Affinity Network purifies the 3D features generated by a 2D VLM using geometric priors extracted from a self-supervised 3D teacher model, and a Geometry-Guided Pooling module is applied at inference to strengthen structural consistency. Result: The method matches or surpasses the best existing performance on major 3D benchmarks while using only about 1.5% of the training data. Conclusion: GeoPurify eases the tension between performance and efficiency in 2D-to-3D transfer and achieves highly data-efficient 3D semantic segmentation. Abstract: Recent attempts to transfer features from 2D Vision-Language Models (VLMs) to 3D semantic segmentation expose a persistent trade-off. Directly projecting 2D features into 3D yields noisy and fragmented predictions, whereas enforcing geometric coherence necessitates costly training pipelines and large-scale annotated 3D data. We argue that this limitation stems from the dominant segmentation-and-matching paradigm, which fails to reconcile 2D semantics with 3D geometric structure. The geometric cues are not eliminated during the 2D-to-3D transfer but remain latent within the noisy and view-aggregated features. To exploit this property, we propose GeoPurify, which applies a small Student Affinity Network to purify 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model. During inference, we devise a Geometry-Guided Pooling module to further denoise the point cloud and ensure semantic and structural consistency. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency. Extensive experiments on major 3D benchmarks demonstrate that GeoPurify achieves or surpasses state-of-the-art performance while utilizing only about 1.5% of the training data. Our code and checkpoints are available at [https://github.com/tj12323/GeoPurify](https://github.com/tj12323/GeoPurify).

[149] Cross-Breed Pig Identification Using Auricular Vein Pattern Recognition: A Machine Learning Approach for Small-Scale Farming Applications

Emmanuel Nsengiyumvaa,Leonard Niyitegekaa,Eric Umuhoza

Main category: cs.CV

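TL;DR: Proposes a noninvasive biometric method for identifying pigs from their auricular vein patterns. Ear images are taken with a smartphone and processed with computer vision and machine learning (SVM), reaching 98.12% identification accuracy with an average processing time of 8.3 seconds, which suits low-cost, stress-free individual identification on small-scale farms.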

Details

Motivation: Conventional pig identification methods such as ear tags and microchips are unreliable, costly, and poorly suited to crossbreeds and small-scale farmers, so a low-cost, reliable, noninvasive alternative is needed. Method: 800 ear images were collected from 20 mixed-breed pigs using a smartphone and back lighting. A multistage computer vision pipeline enhances vein visibility and extracts structural and spatial features, which are then classified with machine learning models including SVM. Result: The Support Vector Machine (SVM) reached 98.12% identification accuracy on the mixed-breed population with an average processing time of 8.3 seconds, supporting feasibility for real-time farm use. Conclusion: Auricular vein biometrics is a feasible and efficient way to identify individual pigs. It can replace fragile physical identifiers and gives resource-constrained farming communities a low-cost, stress-free route to digitized livestock management and precision farming. Abstract: Accurate livestock identification is a cornerstone of modern farming: it supports health monitoring, breeding programs, and productivity tracking. However, common pig identification methods, such as ear tags and microchips, are often unreliable, costly, target pure breeds, and thus impractical for small-scale farmers. To address this gap, we propose a noninvasive biometric identification approach that leverages the uniqueness of the auricular vein patterns. To this end, we have collected 800 ear images from 20 mixed-breed pigs (Landrace cross Pietrain and Duroc cross Pietrain), captured using a standard smartphone and simple back lighting. A multistage computer vision pipeline was developed to enhance vein visibility, extract structural and spatial features, and generate biometric signatures. These features were then classified using machine learning models. Support Vector Machines (SVM) achieved the highest accuracy: correctly identifying pigs with 98.12% precision across mixed-breed populations. The entire process from image processing to classification was completed in an average of 8.3 seconds, demonstrating feasibility for real-time farm deployment. We believe that by replacing fragile physical identifiers with permanent biological markers, this system provides farmers with a cost-effective and stress-free method of animal identification. More broadly, the findings confirm the practicality of auricular vein biometrics for digitizing livestock management, reinforcing its potential to extend the benefits of precision farming to resource-constrained agricultural communities.

[150] MMDEW: Multipurpose Multiclass Density Estimation in the Wild

Villanelle O'Reilly,Jonathan Cox,Georgios Leontidis,Marc Hanheide,Petra Bosilj,James Brown

Main category: cs.CV

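TL;DR: Proposes a multi-class density map estimation method built on a Twins pyramid vision transformer. A Category Focus Module suppresses inter-category interference, the method clearly outperforms existing approaches on the VisDrone and iSAID datasets, and it shows potential for biodiversity monitoring.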

Details

Motivation: Conventional detection methods cannot count accurately in dense, occluded scenes, so density map estimation is needed for more robust multi-class counting. Method: A Twins pyramid vision transformer serves as the backbone, combined with a dedicated multi-class counting head with multiscale decoding, and a segmentation-based Category Focus Module suppresses inter-category cross-talk during training. Result: On the VisDrone and iSAID benchmarks the method achieves lower MAE than prior approaches (reductions of 33%, 43%, and 64%) and outperforms YOLOv11, confirming its advantage in dense scenes; it also transfers to a biodiversity dataset. Conclusion: The method performs well on dense multi-class counting and can extend to new domains such as ecological monitoring, supporting scalable ecological research and conservation practice. Abstract: Density map estimation can be used to estimate object counts in dense and occluded scenes where discrete counting-by-detection methods fail. We propose a multicategory counting framework that leverages a Twins pyramid vision-transformer backbone and a specialised multi-class counting head built on a state-of-the-art multiscale decoding approach. A two-task design adds a segmentation-based Category Focus Module, suppressing inter-category cross-talk at training time. Training and evaluation on the VisDrone and iSAID benchmarks demonstrate superior performance versus prior multicategory crowd-counting approaches (33%, 43% and 64% reductions in MAE), and the comparison with YOLOv11 underscores the necessity of crowd counting methods in dense scenes. The method's regional loss opens up multi-class crowd counting to new domains, demonstrated through the application to a biodiversity monitoring dataset, highlighting its capacity to inform conservation efforts and enable scalable ecological insights.
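In density map estimation, the count for a class is read off by summing that class's density map, and counting quality is then scored with mean absolute error (MAE). The sketch below shows both steps with toy data; the dictionary layout and function names are assumptions, not the paper's code.

```python
from typing import Dict, List, Sequence

DensityMap = List[List[float]]


def class_counts(density_maps: Dict[str, DensityMap]) -> Dict[str, float]:
    """Integrate (sum) each per-class density map to get a count."""
    return {name: sum(map(sum, dmap)) for name, dmap in density_maps.items()}


def mae(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute error between predicted and true counts over images."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Toy 2x2 density maps for one image with two hypothetical classes.
maps = {
    "car": [[0.5, 0.5], [1.0, 0.0]],
    "person": [[0.25, 0.25], [0.25, 0.25]],
}
counts = class_counts(maps)

# Predicted vs. true car counts over two images.
car_mae = mae([2.0, 4.0], [3, 4])
```

Because the count is an integral over the map, an object split across several pixels still contributes one unit in total, which is why this approach tolerates occlusion better than counting discrete detections.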

[151] TempoControl: Temporal Attention Guidance for Text-to-Video Models

Shira Schiber,Ofir Lindenbaum,Idan Schwartz

Main category: cs.CV

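TL;DR: This paper proposes TempoControl, a method that controls the temporal alignment of visual concepts during text-to-video generation without retraining or additional supervision.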

Details

Motivation: Existing generative video models lack fine-grained control over when visual elements appear, so users cannot specify the moment at which particular content should show up in a video. Method: The cross-attention maps of text-to-video diffusion models are optimized according to three principles (correlation, energy, and entropy) to control the temporal distribution of visual concepts. Result: The method gives precise temporal control over single or multiple objects, actions, and even audio-aligned generation while maintaining high quality and diversity. Conclusion: TempoControl adds flexible, precise timing control to video generation and makes existing generative models more useful in complex applications. Abstract: Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal shape with a control signal (via correlation), amplifying it where visibility is needed (via energy), and maintaining spatial focus (via entropy). TempoControl allows precise control over timing while ensuring high video quality and diversity. We demonstrate its effectiveness across various video generation applications, including temporal reordering for single and multiple objects, as well as action and audio-aligned generation.
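The three principles can be made concrete with a toy objective over a cross-attention tensor. This is a sketch under stated assumptions, not the paper's loss: the exact formulas, the weights `w_energy` and `w_entropy`, and the layout `attn[t][i]` (attention of one text token at frame `t`, spatial position `i`) are all hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


def spatial_entropy(frame: Sequence[float]) -> float:
    """Shannon entropy of one frame's attention, normalized to sum to 1."""
    total = sum(frame)
    if total <= 0:
        return 0.0
    return -sum((a / total) * math.log(a / total) for a in frame if a > 0)


def tempo_loss(attn: List[List[float]], control: Sequence[float],
               w_energy: float = 1.0, w_entropy: float = 0.1) -> float:
    """Lower is better: aligned timing, strong attention when wanted, focused."""
    profile = [sum(frame) for frame in attn]       # attention mass per frame
    active = sum(control) or 1.0
    corr_term = 1.0 - pearson(profile, control)    # temporal shape (correlation)
    energy_term = -sum(p * c for p, c in zip(profile, control)) / active
    entropy_term = sum(c * spatial_entropy(f) for f, c in zip(attn, control)) / active
    return corr_term + w_energy * energy_term + w_entropy * entropy_term


control = [0.0, 0.0, 1.0, 1.0]  # concept should appear in the last two frames
on_time = [[0.05, 0.05], [0.05, 0.05], [0.9, 0.1], [0.9, 0.1]]
too_early = [[0.5, 0.5], [0.5, 0.5], [0.05, 0.05], [0.05, 0.05]]
```

In an inference-time setting, a quantity like `tempo_loss` would be differentiated with respect to the latent at each denoising step; that optimization loop is outside this sketch.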

[152] RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning

Sicheng Feng,Kaiwen Tuo,Song Wang,Lingdong Kong,Jianke Zhu,Huan Wang

Main category: cs.CV

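TL;DR: This paper proposes RewardMap, a multi-stage reinforcement learning framework that improves multimodal large language models on fine-grained visual reasoning, particularly spatial reasoning in complex settings such as transit maps.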

Details

Motivation: Existing MLLMs perform poorly on fine-grained visual reasoning such as spatial understanding of transit maps, and standard reinforcement learning is hard to train effectively because of sparse rewards and unstable optimization. Method: The authors build ReasonMap-Plus, an extended dataset with dense reward signals, and propose the RewardMap framework, which combines a difficulty-aware reward design with a multi-stage reinforcement learning strategy that moves gradually from perception to complex reasoning. Result: Experiments on ReasonMap and ReasonMap-Plus confirm that each component helps and that the combination works best; across 6 benchmarks the average improvement is 3.47%, with clear gains in visual understanding and reasoning. Conclusion: Through dense rewards and progressive training, RewardMap improves MLLM performance on fine-grained and spatial visual reasoning and has broad potential for application. Abstract: Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured and information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse rewards while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. 
Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.

[153] DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing

Zihan Zhou,Shilin Lu,Shuli Leng,Shaocong Zhang,Zhuming Lian,Xinlei Yu,Adams Wai-Kin Kong

Main category: cs.CV

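TL;DR: This paper presents DragFlow, the first framework to effectively harness FLUX's strong generative prior for drag-based image editing. It introduces a region-based editing paradigm, pretrained adapters, and multimodal large language models, and substantially improves editing quality.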

Details

Motivation: Earlier UNet-based diffusion models distort the target region in drag editing, mainly because their latent-space priors are too weak, while the newer DiT architectures have stronger generative priors that have not yet been applied effectively to drag editing. Method: A region-based editing paradigm uses affine transformations to provide more consistent feature supervision; a pretrained IP-Adapter strengthens subject consistency while gradient masks preserve background fidelity; and multimodal large language models resolve task ambiguities. Result: Experiments on DragBench-DR and the newly built ReD Bench show that DragFlow clearly outperforms point-based and region-based baselines and sets a new state of the art in drag-based editing. Conclusion: DragFlow successfully exploits the strong priors of DiT architectures and, through region-based supervision and coordinated modules, addresses the distortion and consistency problems of earlier drag editing. Abstract: Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models, such as Stable Diffusion, are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX's rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. 
Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.

[154] From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding

Guangyu Sun,Archit Singhal,Burak Uzkent,Mubarak Shah,Chen Chen,Garin Kessler

Main category: cs.CV

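TL;DR: This paper proposes F2C, a training-free video understanding method that selects temporally coherent key clips instead of isolated key frames and pairs this with an adaptive resolution strategy, improving long-form video understanding under a fixed computational budget.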

Details

Motivation: Video large language models produce so many visual tokens that the context window runs out, and selecting only sparse frames discards temporal dynamics, which hurts reasoning about motion and event continuity. Method: Frame selection is extended to short, temporally coherent segments (key clips), and an adaptive resolution strategy dynamically balances spatial resolution against clip length so that the token count per video stays constant. Result: On the Video-MME, LongVideoBench, and MLVU long-form video benchmarks, F2C outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3%, respectively. Conclusion: Preserving temporal coherence is essential for video understanding, and F2C offers a practical path to scaling video LLMs to real-world applications. Abstract: Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision language tasks, yet their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window. Existing solutions alleviate this issue by selecting a sparse set of frames, thereby reducing token count, but such frame-wise selection discards essential temporal dynamics, leading to suboptimal reasoning about motion and event continuity. In this work we systematically explore the impact of temporal information and demonstrate that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we propose an adaptive resolution strategy that dynamically balances spatial resolution and clip length, ensuring a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on Video-MME, LongVideoBench and MLVU benchmarks, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling Video LLMs to real world video understanding applications. Project webpage is available at https://guangyusun.com/f2c .
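The resolution-versus-clip-length trade-off under a constant token count comes down to simple arithmetic, sketched below. The assumptions are hypothetical and not from the paper: each frame is square and split into `patch` x `patch` pixel patches, one visual token per patch, with a patch size of 14.

```python
import math
from typing import Tuple


def frame_side_for_budget(token_budget: int, num_clips: int, clip_len: int,
                          patch: int = 14) -> Tuple[int, int]:
    """Return (frame side in pixels, tokens actually used) for a fixed budget.

    Longer clips mean more frames, so each frame gets fewer tokens and
    therefore a lower spatial resolution.
    """
    frames = num_clips * clip_len
    tokens_per_frame = token_budget // frames
    if tokens_per_frame < 1:
        raise ValueError("token budget too small for this many frames")
    grid = math.isqrt(tokens_per_frame)  # patches per side of a square frame
    return grid * patch, frames * grid * grid


# Same budget, two ways to spend it.
single_frames = frame_side_for_budget(8192, num_clips=8, clip_len=1)
short_clips = frame_side_for_budget(8192, num_clips=8, clip_len=4)
```

With these toy numbers, eight isolated frames can each be 448 pixels wide, while eight four-frame clips drop to 224 pixels per frame; both use the full 8192 tokens.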

[155] Paving the Way Towards Kinematic Assessment Using Monocular Video: A Preclinical Benchmark of State-of-the-Art Deep-Learning-Based 3D Human Pose Estimators Against Inertial Sensors in Daily Living Activities

Mario Medrano-Paredes,Carmen Fernández-González,Francisco-Javier Díaz-Pernas,Hichem Saoudi,Javier González-Alonso,Mario Martínez-Zarzuela

Main category: cs.CV

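TL;DR: This study compares monocular-video 3D human pose estimation models with inertial measurement units (IMUs) for kinematic assessment in real-world conditions, using the VIDIMU dataset of 13 daily activities. MotionAGFormer performs best, and both video and IMU technologies are viable, each with its own trade-offs.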

Details

Motivation: Accurate assessment of human movement outside the laboratory is essential for telemedicine, sports science, and rehabilitation, so emerging video models need to be compared with established IMU technology. Method: Using the VIDIMU dataset, joint angles produced by video-based deep learning models (MotionAGFormer, MotionBERT, and others) are compared with joint angles computed from IMU data via OpenSim inverse kinematics, and evaluated with RMSE, MAE, Pearson correlation, and R². Result: MotionAGFormer performs best, with an overall RMSE of 9.27° ± 4.80°, MAE of 7.86° ± 4.18°, Pearson correlation of 0.86 ± 0.15, and R² of 0.67 ± 0.28; video and IMU approaches trade off cost, accessibility, and precision. Conclusion: Off-the-shelf video models already show clinical promise in healthy adults but still lag behind IMUs in some respects; the study provides guidance for building reliable, low-cost, user-friendly remote health monitoring systems. Abstract: Advances in machine learning and wearable sensors offer new opportunities for capturing and analyzing human movement outside specialized laboratories. Accurate assessment of human movement under real-world conditions is essential for telemedicine, sports science, and rehabilitation. This preclinical benchmark compares monocular video-based 3D human pose estimation models with inertial measurement units (IMUs), leveraging the VIDIMU dataset containing a total of 13 clinically relevant daily activities which were captured using both commodity video cameras and five IMUs. During this initial study only healthy subjects were recorded, so results cannot be generalized to pathological cohorts. Joint angles derived from state-of-the-art deep learning frameworks (MotionAGFormer, MotionBERT, MMPose 2D-to-3D pose lifting, and NVIDIA BodyTrack) were evaluated against joint angles computed from IMU data using OpenSim inverse kinematics following the Human3.6M dataset format with 17 keypoints. Among them, MotionAGFormer demonstrated superior performance, achieving the lowest overall RMSE ($9.27^\circ \pm 4.80^\circ$) and MAE ($7.86^\circ \pm 4.18^\circ$), as well as the highest Pearson correlation ($0.86 \pm 0.15$) and the highest coefficient of determination $R^{2}$ ($0.67 \pm 0.28$). The results reveal that both technologies are viable for out-of-the-lab kinematic assessment. However, they also highlight key trade-offs between video- and sensor-based approaches including costs, accessibility, and precision. 
This study clarifies where off-the-shelf video models already provide clinically promising kinematics in healthy adults and where they lag behind IMU-based estimates while establishing valuable guidelines for researchers and clinicians seeking to develop robust, cost-effective, and user-friendly solutions for telehealth and remote patient monitoring.
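The four agreement metrics named in this abstract can be illustrated with a minimal sketch. It assumes two equal-length joint-angle series in degrees, and it assumes R² is computed as 1 - SS_res / SS_tot with the IMU-derived series as the reference; the abstract does not state which definition the authors used. The function name `agreement_metrics` is hypothetical.

```python
import math


def agreement_metrics(reference, estimate):
    """Return RMSE, MAE, Pearson r and R^2 between two angle series (degrees).

    `reference` stands in for the IMU-derived angles and `estimate` for the
    video-derived angles; both names are illustrative assumptions.
    """
    if len(reference) != len(estimate) or len(reference) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(reference)
    errors = [e - r for r, e in zip(reference, estimate)]

    rmse = math.sqrt(sum(d * d for d in errors) / n)
    mae = sum(abs(d) for d in errors) / n

    mean_ref = sum(reference) / n
    mean_est = sum(estimate) / n
    cov = sum((r - mean_ref) * (e - mean_est) for r, e in zip(reference, estimate))
    var_ref = sum((r - mean_ref) ** 2 for r in reference)
    var_est = sum((e - mean_est) ** 2 for e in estimate)
    # A constant series has zero variance and makes both quantities undefined.
    if var_ref == 0 or var_est == 0:
        raise ValueError("constant series: correlation and R^2 are undefined")
    pearson = cov / math.sqrt(var_ref * var_est)

    # Assumed definition: R^2 = 1 - SS_res / SS_tot against the reference.
    r2 = 1.0 - sum(d * d for d in errors) / var_ref
    return {"rmse": rmse, "mae": mae, "pearson": pearson, "r2": r2}
```

A constant offset between the two series leaves the Pearson correlation at 1 while RMSE, MAE, and this form of R² all register the error, which is one reason to report the metrics together.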

[156] NeuroSwift: A Lightweight Cross-Subject Framework for fMRI Visual Reconstruction of Complex Scenes

Shiyi Zhang,Dong Liang,Yihang Zhou

Main category: cs.CV

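TL;DR: Proposes NeuroSwift, which combines complementary adapters (AutoKL and CLIP) via diffusion for efficient cross-subject reconstruction of visual stimuli, reaching state-of-the-art performance while fine-tuning only a small fraction of parameters.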

Details

Motivation: Cross-subject reconstruction of visual information from fMRI data is hard to make both accurate and computationally efficient, because neural representations vary across individuals and the brain encodes semantic features abstractly. Method: NeuroSwift integrates an AutoKL adapter for low-level features and a CLIP adapter, trained on Stable Diffusion generated images paired with COCO captions, to emulate higher visual cortex encoding; it pretrains on one subject and then fine-tunes only 17% of parameters (the fully connected layers) for new subjects. Result: With only one hour of training per subject on lightweight GPUs (three RTX 4090), it achieves state-of-the-art reconstruction performance and outperforms existing methods. Conclusion: Through modular adapters and an efficient fine-tuning strategy, NeuroSwift delivers accurate, low-resource cross-subject visual decoding and makes brain-to-image generation more practical. Abstract: Reconstructing visual information from brain activity via computer vision technology provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and the brain's abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. NeuroSwift's CLIP Adapter is trained on Stable Diffusion generated images paired with COCO captions to emulate higher visual cortex encoding. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17 percent of parameters (fully connected layers) for new subjects, while freezing other components. This enables state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090), and it outperforms existing methods.

[157] microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification

Sathira Silva,Eman Ali,Chetan Arora,Muhammad Haris Khan

Main category: cs.CV

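TL;DR: Proposes microCLIP, a self-training framework for fine-grained image classification that introduces Saliency-Oriented Attention Pooling (SOAP) and a TokenFusion module, combining LLM-derived text priors with CLIP's visual representations to improve unsupervised adaptation.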

Details

Motivation: CLIP's reliance on coarse global features limits its fine-grained classification performance, and existing methods lack spatial precision, so an unsupervised adaptation method that captures subtle local cues is needed. Method: microCLIP centers on SOAP within the TokenFusion module, which builds a saliency-guided [FG] token and fuses it with the [CLS] token; a two-headed classifier (one frozen, one learnable) provides multi-view alignment and pseudo-labeling, and Dynamic Knowledge Aggregation iteratively refines the pseudo-labels. Result: Average accuracy improves by 2.90% across 13 fine-grained benchmarks, with only light adaptation. Conclusion: microCLIP uncovers latent fine-grained signals in CLIP and, by combining LLM priors with saliency-guided attention, improves unsupervised fine-grained image classification. Abstract: Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP $\texttt{[CLS]}$ token; however, this approach overlooks spatial precision. We propose $\textbf{microCLIP}$, a self-training framework that jointly refines CLIP's visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided $\texttt{[FG]}$ token from patch embeddings and fuses it with the global $\texttt{[CLS]}$ token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion's evolving logits to iteratively refine pseudo-labels. Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent $2.90\%$ average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation. Our code is available at https://github.com/sathiiii/microCLIP.
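The phrase "convexly combines" in this abstract can be made concrete with a small sketch. It only illustrates a convex combination of two logit vectors followed by an argmax pseudo-label; the mixing weight `alpha`, the function names, and the use of a single fixed weight are assumptions for illustration and are not taken from the microCLIP code.

```python
def aggregate_logits(prior_logits, evolving_logits, alpha):
    """Convex combination: alpha * prior + (1 - alpha) * evolving.

    `alpha` is a hypothetical mixing weight in [0, 1]; how the paper sets or
    schedules it is not stated in the abstract.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    if len(prior_logits) != len(evolving_logits):
        raise ValueError("logit vectors must have the same number of classes")
    return [alpha * p + (1.0 - alpha) * e
            for p, e in zip(prior_logits, evolving_logits)]


def pseudo_label(logits):
    """Index of the highest logit, used here as the pseudo-label."""
    return max(range(len(logits)), key=logits.__getitem__)


# Fixed prior favours class 0, the adapted head favours class 1.
prior = [2.0, 1.0, 0.0]
evolving = [0.0, 3.0, 1.0]
mixed = aggregate_logits(prior, evolving, alpha=0.5)
label = pseudo_label(mixed)
```

With `alpha = 1.0` the pseudo-label comes from the fixed prior alone, and lowering `alpha` gives the adapted head more influence.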

[158] VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL

Kyoungjun Park,Yifan Yang,Juheon Yi,Shicheng Zheng,Yifei Shen,Dongqi Han,Caihua Shan,Muhammad Muaz,Lili Qiu

Main category: cs.CV

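TL;DR: VidGuard-R1 is the first video authenticity detector that fine-tunes a multi-modal large language model with group relative policy optimization (GRPO); it reaches state-of-the-art accuracy (above 95%) in distinguishing real from AI-generated videos and provides interpretable rationales.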

Details

Motivation: As AI-generated video advances rapidly, effective detection tools are urgently needed to counter societal risks such as misinformation and reputational harm; models must also be accurate and interpretable to stay transparent for regulators and end users. Method: VidGuard-R1 fine-tunes the Qwen-VL multi-modal model with GRPO on a challenging dataset of 140k real and AI-generated videos, using two specialized reward models that target temporal artifacts and generation complexity. Result: VidGuard-R1 achieves state-of-the-art zero-shot detection on existing benchmarks, exceeds 95% accuracy after additional training, and produces precise, interpretable reasoning in case studies. Conclusion: VidGuard-R1 combines multi-modal large models with reinforcement learning, performing well on video authenticity detection while giving transparent, understandable rationales of practical and regulatory value. Abstract: With the rapid advancement of AI-generated videos, there is an urgent need for effective detection tools to mitigate societal risks such as misinformation and reputational harm. In addition to accurate classification, it is essential that detection models provide interpretable explanations to ensure transparency for regulators and end users. To address these challenges, we introduce VidGuard-R1, the first video authenticity detector that fine-tunes a multi-modal large language model (MLLM) using group relative policy optimization (GRPO). Our model delivers both highly accurate judgments and insightful reasoning. We curate a challenging dataset of 140k real and AI-generated videos produced by state-of-the-art generation models, carefully designing the generation process to maximize discrimination difficulty. We then fine-tune Qwen-VL using GRPO with two specialized reward models that target temporal artifacts and generation complexity. Extensive experiments demonstrate that VidGuard-R1 achieves state-of-the-art zero-shot performance on existing benchmarks, with additional training pushing accuracy above 95%. Case studies further show that VidGuard-R1 produces precise and interpretable rationales behind its predictions. The code is publicly available at https://VidGuard-R1.github.io.
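For readers unfamiliar with GRPO, the step that gives it its name can be sketched briefly: several responses are sampled for one prompt, and each response's reward is normalized against the mean and standard deviation of its own group. The sketch below shows only that normalization, as commonly described for GRPO; it is not VidGuard-R1's training code, and how the paper turns its two reward models into a scalar reward is not specified in the abstract. The function name and the `eps` stabilizer are assumptions.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    `eps` is a small assumed constant that avoids division by zero when all
    responses in the group receive the same reward.
    """
    n = len(rewards)
    if n == 0:
        raise ValueError("need at least one reward in the group")
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers for one video: two judged correct (1.0), two not (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group's average get a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.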

[159] Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Justin Cui,Jie Wu,Ming Li,Tao Yang,Xiaojie Li,Rui Wang,Andrew Bai,Yuanhao Ban,Cho-Jui Hsieh

Main category: cs.CV

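TL;DR: Proposes a method that needs neither supervision from long-video teacher models nor retraining on long-video datasets: the teacher's knowledge guides the student through segments sampled from self-generated long videos, mitigating quality degradation in long-horizon video generation.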

Details

Motivation: Diffusion models excel at image and video generation, but the high computational cost of transformer architectures limits their use for long videos; existing methods distill from short-horizon teachers, so student models accumulate errors and degrade in quality once they go beyond the training horizon. Method: Self-Forcing++ uses the teacher model to guide the student on segments sampled from the student's own self-generated long videos, enabling long-horizon generation without long-video teacher supervision, maintaining temporal consistency, and avoiding recomputation of overlapping frames. Result: The method scales video length by up to 20x beyond the teacher's capability and outperforms baselines on both standard and improved benchmarks; it generates videos of up to 4 minutes and 15 seconds, 99.9% of the maximum span supported by the base model's position embedding and more than 50x longer than the baseline model. Conclusion: The approach is simple and effective, substantially improving quality and consistency in long-horizon video generation without extra long-video training data or retraining. Abstract: Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond teacher's capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon videos demo can be found at https://self-forcing-plus-plus.github.io/

[160] Learning to Generate Object Interactions with Physics-Guided Video Diffusion

David Romero,Ariana Bermudez,Hao Li,Fabio Pizzati,Ivan Laptev

Main category: cs.CV

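TL;DR: Proposes KineMask, a physics-guided video generation method that combines low-level motion control with high-level text conditioning to generate videos with realistic rigid-body interactions without future-motion supervision.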

Details

Motivation: Current video generation models still struggle with physically plausible object interactions and lack physics-grounded control mechanisms, which limits their use in robotics and embodied decision making. Method: A two-stage training strategy uses object masks on synthetic scenes to gradually remove future-motion supervision while training video diffusion models; a specified object velocity is combined with predictive scene descriptions to fuse low-level motion control with high-level text conditioning. Result: Realism and controllability of object interactions improve markedly in real scenes; experiments show KineMask outperforms recent models of comparable size, and ablations confirm that low- and high-level conditioning are complementary. Conclusion: KineMask achieves high-quality physics-guided video generation and opens new possibilities for simulating complex dynamical phenomena and for practical applications such as robotics. Abstract: Recent models for video generation have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements of object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, leading to effective support for synthesis of complex dynamical phenomena. Extensive experiments show that KineMask achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs. Our code, model, and data will be made publicly available.

[161] MultiModal Action Conditioned Video Generation

Yichen Li,Antonio Torralba

Main category: cs.CV

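TL;DR: Introduces fine-grained multimodal actions to capture the real-time control robots need for delicate manipulation, fusing proprioception, kinesthesia, force haptics, and muscle activation to improve the accuracy and causality of action-conditioned simulation.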

Details

Motivation: Current video models lack fine-grained control and so cannot meet the real-time fine motor control that general-purpose household robots need for delicate tasks and urgent situations. Method: Multimodal actions covering proprioception, kinesthesia, force haptics, and muscle activation are introduced; a feature learning paradigm aligns these modalities while preserving the unique information in each, and a regularization scheme enhances the causality of action trajectory features. Result: Experiments show the method improves simulation accuracy and reduces temporal drift, and ablations and downstream applications confirm its effectiveness and practicality. Conclusion: Fine-grained action modeling with multimodal senses improves the control precision and dynamics representation of robot world models and suits complex interaction scenarios. Abstract: Current video models fail as world models because they lack fine-grained control. General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine-grained multimodal actions to capture such precise control. We consider senses of proprioception, kinesthesia, force haptics, and muscle activation. Such multimodal senses naturally enable fine-grained interactions that are difficult to simulate with text-conditioned generative models. To effectively simulate fine-grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further propose a regularization scheme to enhance causality of the action trajectory features in representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.

[162] VideoNSA: Native Sparse Attention Scales Video Understanding

Enxin Song,Wenhao Chai,Shusheng Yang,Ethan Armand,Xiaojun Shan,Haiyang Xu,Jianwen Xie,Zhuowen Tu

Main category: cs.CV

TL;DR: Proposes VideoNSA, which applies Native Sparse Attention (NSA) to video-language models to improve long-video understanding, temporal reasoning, and spatial tasks.

Details

Motivation: Existing video understanding models are limited by context length, so they miss key transition frames and struggle to stay coherent over long time scales. Method: Building on Qwen2.5-VL, a hardware-aware hybrid attention scheme keeps dense attention for text and uses NSA for video, trained end to end on a 216K video instruction dataset. Result: Compared with token-compression and training-free sparse baselines, VideoNSA performs better on long-video understanding, temporal reasoning, and spatial benchmarks; ablations show reliable scaling to 128K tokens, an optimal global-local attention allocation at a fixed budget, task-dependent branch usage patterns, and that learnable combined sparse attention helps induce dynamic attention sinks. Conclusion: VideoNSA effectively addresses the context-length limit in long-video understanding and offers a practical route to efficient video processing in multimodal models. Abstract: Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text, while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.

[163] NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

Ruozhen He,Moayed Haji-Ali,Ziyan Yang,Vicente Ordonez

Main category: cs.CV

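TL;DR: Proposes NoiseShift, a training-free method that recalibrates the denoiser's noise level according to image resolution, addressing the inconsistent generation quality of text-to-image diffusion models across resolutions.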

Details

Motivation: High-resolution text-to-image models perform poorly at low resolutions, mainly because noise schedulers have unequal perceptual effects across resolutions, which creates a train-test mismatch. Method: NoiseShift recalibrates the denoising process by adjusting the noise level for each resolution; it requires no changes to model architecture or sampling schedule and works with existing models. Result: Applied to Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, NoiseShift significantly improves low-resolution generation quality; FID improves on both LAION-COCO and CelebA, for example by 15.89% for SD3.5 on LAION-COCO. Conclusion: NoiseShift mitigates resolution-dependent artifacts in diffusion models and improves low-resolution image quality while remaining compatible and practical. Abstract: Text-to-image diffusion models trained on a fixed set of resolutions often fail to generalize, even when asked to generate images at lower resolutions than those seen during training. High-resolution text-to-image generators are currently unable to easily offer an out-of-the-box budget-efficient alternative to their users who might not need high-resolution images. We identify a key technical insight in diffusion models that when addressed can help tackle this limitation: Noise schedulers have unequal perceptual effects across resolutions. The same level of noise removes disproportionately more signal from lower-resolution images than from high-resolution images, leading to a train-test mismatch. We propose NoiseShift, a training-free method that recalibrates the noise level of the denoiser conditioned on resolution size. NoiseShift requires no changes to model architecture or sampling schedule and is compatible with existing models. When applied to Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, quality at low resolutions is significantly improved. On LAION-COCO, NoiseShift improves SD3.5 by 15.89%, SD3 by 8.56%, and Flux-Dev by 2.44% in FID on average. On CelebA, NoiseShift improves SD3.5 by 10.36%, SD3 by 5.19%, and Flux-Dev by 3.02% in FID on average. These results demonstrate the effectiveness of NoiseShift in mitigating resolution-dependent artifacts and enhancing the quality of low-resolution image generation.

[164] Inferring Dynamic Physical Properties from Video Foundation Models

Guanqi Zhan,Xianzheng Ma,Weidi Xie,Andrew Zisserman

Main category: cs.CV

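TL;DR: Studies the task of predicting dynamic physical properties (elasticity, viscosity, and dynamic friction) from video, contributing new datasets and three video-based inference approaches and comparing how different models perform.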

Details

Motivation: Inferring dynamic physical properties requires temporal information, and existing methods generalize poorly to real scenes, which motivates exploring approaches based on video foundation models and multi-modal large language models. Method: (1) Collect a new dataset with synthetic and real videos; (2) evaluate three approaches: an oracle method based on visual cues, a readout mechanism with trainable prompts on pre-trained video models, and prompting strategies for multi-modal large language models. Result: Video foundation models trained generatively or with self-supervision perform similarly but fall somewhat short of the oracle, while multi-modal large language models currently perform worse, although suitable prompting improves them. Conclusion: Video foundation models show promise for predicting physical properties, whereas multi-modal large language models still have room to improve and benefit from appropriate prompting strategies. Abstract: We study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, consisting of synthetic training and testing splits, as well as a real split for real world evaluation. (ii) We explore three ways to infer the physical property from videos: (a) an oracle method where we supply the visual cues that intrinsically reflect the property using classical computer vision techniques; (b) a simple read out mechanism using a visual prompt and trainable prompt vector for cross-attention on pre-trained video generative and self-supervised models; and (c) prompt strategies for Multi-modal Large Language Models (MLLMs). (iii) We show that video foundation models trained in a generative or self-supervised manner achieve a similar performance, though behind that of the oracle, and MLLMs are currently inferior to the other models, though their performance can be improved through suitable prompting.

[165] Clink! Chop! Thud! -- Learning Object Sounds from Real-World Interactions

Mengyu Yang,Yiming Chen,Haozheng Pei,Siddhant Agarwal,Arun Balajee Vasudevan,James Hays

Main category: cs.CV

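TL;DR: Introduces the sounding object detection task and a multimodal object-aware framework that learns from egocentric videos, combining automatic segmentation masks with slot attention to reach state-of-the-art performance on linking sounds to objects.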

Details

Motivation: The goal is for models to identify, as humans do, the objects involved in an interaction from the sound it makes, and to explore the link between sounds and objects. Method: A multimodal object-aware framework uses an automatic pipeline to compute segmentation masks of the interacting objects to guide training, and a slot attention visual encoder to reinforce object-centric representations. Result: The method achieves state-of-the-art performance on the new sounding object detection task and on existing multimodal action understanding tasks. Conclusion: The approach improves the model's understanding of how sounds relate to objects and demonstrates the value of object-centric learning for audio-visual tasks. Abstract: Can a model distinguish between the sound of a spoon hitting a hardwood floor versus a carpeted one? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model's ability to link these sounds to the objects directly involved. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. To encourage an object-centric approach, we first develop an automatic pipeline to compute segmentation masks of the objects involved to guide the model's focus during training towards the most informative regions of the interaction. A slot attention visual encoder is used to further enforce an object prior. We demonstrate state-of-the-art performance on our new task along with existing multimodal action understanding tasks.

[166] StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions

Bo-Hsu Ke,You-Zhe Xie,Yu-Lun Liu,Wei-Chen Chiu

Main category: cs.CV

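TL;DR: Proposes a density-guided, image-level poisoning attack on 3D Gaussian Splatting (3DGS) that injects Gaussian points into low-density regions and adds an adaptive noise strategy, achieving stealthy attacks on specific viewpoints while minimally affecting other views.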

Details

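Motivation: As 3D scene representations such as NeRF and 3DGS become widely used for novel view synthesis, their security matters more; this work assesses the robustness of 3DGS against image-level poisoning attacks and exposes its potential vulnerabilities. Method: A density-guided poisoning method uses Kernel Density Estimation (KDE) to identify low-density regions and injects Gaussian points there that carry viewpoint-dependent illusory objects; an adaptive noise strategy disrupts multi-view consistency to strengthen the attack, and a KDE-based evaluation protocol systematically measures attack difficulty. Result: Experiments show the method is more effective than state-of-the-art techniques, rendering the illusory object clearly from the target view while preserving rendering quality in other views, which confirms that 3DGS is vulnerable to stealthy poisoning. Conclusion: The study exposes security risks in 3DGS and provides an effective poisoning attack and an evaluation framework as a reference for future robustness research on 3D scene representations. Abstract: 3D scene representation methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have significantly advanced novel view synthesis. As these methods become prevalent, addressing their vulnerabilities becomes critical. We analyze 3DGS robustness against image-level poisoning attacks and propose a novel density-guided poisoning method. Our method strategically injects Gaussian points into low-density regions identified via Kernel Density Estimation (KDE), embedding viewpoint-dependent illusory objects clearly visible from poisoned views while minimally affecting innocent views. Additionally, we introduce an adaptive noise strategy to disrupt multi-view consistency, further enhancing attack effectiveness. We propose a KDE-based evaluation protocol to assess attack difficulty systematically, enabling objective benchmarking for future research. Extensive experiments demonstrate our method's superior performance compared to state-of-the-art techniques. Project page: https://hentci.github.io/stealthattack/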
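The idea of locating low-density regions with KDE can be sketched in a few lines. This is a generic Gaussian kernel density estimate over 3D points, ranking candidate positions from least to most dense; it is not the authors' pipeline, and the fixed isotropic `bandwidth`, the candidate list, and both function names are assumptions made for illustration.

```python
import math


def gaussian_kde_density(query, points, bandwidth):
    """Gaussian KDE value at `query` for a set of same-dimension points.

    Uses one isotropic bandwidth for all points (an assumed simplification).
    """
    if not points or bandwidth <= 0:
        raise ValueError("need at least one point and a positive bandwidth")
    dim = len(query)
    norm = len(points) * (2.0 * math.pi) ** (dim / 2.0) * bandwidth ** dim
    total = 0.0
    for p in points:
        sq_dist = sum((q - x) ** 2 for q, x in zip(query, p))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / norm


def lowest_density_candidates(candidates, points, bandwidth, k):
    """Return the k candidate positions with the smallest estimated density."""
    ranked = sorted(candidates,
                    key=lambda c: gaussian_kde_density(c, points, bandwidth))
    return ranked[:k]


# A tight cluster of scene points near the origin, and three candidate spots.
scene = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1)]
spots = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0), (1.0, 1.0, 1.0)]
sparse = lowest_density_candidates(spots, scene, bandwidth=0.5, k=1)
```

The candidate far from the cluster ranks as the sparsest. A practical implementation would use a spatial index or a library KDE rather than this quadratic loop.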

[167] Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity

Eric Tillmann Bill,Enis Simsar,Thomas Hofmann

Main category: cs.CV

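TL;DR: Presents the first theoretical framework for multi-subject text-to-image generation, using stochastic optimal control to steer sampling dynamics and improve subject fidelity.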

Details

Motivation: Existing T2I models suffer from attribute leakage, identity entanglement, and subject omissions on multi-subject prompts, and lack a systematic solution. Method: Flow matching (FM) is viewed through stochastic optimal control (SOC), yielding two algorithms: a training-free test-time controller and a lightweight fine-tuning rule, Adjoint Matching; FOCUS is introduced for subject disentanglement. Result: Evaluated on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, the methods consistently improve multi-subject alignment while preserving base-model style, run efficiently, and generalize to unseen prompts. Conclusion: The framework gives multi-subject generation an optimizable theoretical basis and effective algorithms, unifies prior attention heuristics, and provides the first fine-tuning route designed explicitly for multi-subject fidelity. Abstract: Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions, often showing attribute leakage, identity entanglement, and subject omissions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity. Viewing flow matching (FM) through stochastic optimal control (SOC), we formulate subject disentanglement as control over a trained FM sampler. This yields two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. The same formulation unifies prior attention heuristics, extends to diffusion models via a flow-diffusion correspondence, and provides the first fine-tuning route explicitly designed for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and fine-tuned controllers trained on limited prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.

Table of Contents

cs.CL [Back]

[1] Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset

[2] Towards Open-Ended Discovery for Low-Resource NLP

[3] Discourse vs emissions: Analysis of corporate narratives, symbolic practices, and mimicry through LLMs

[4] Context Matters: Comparison of commercial large language tools in veterinary medicine

[5] ClaimCheck: Real-Time Fact-Checking with Small Language Models

[6] EEFSUVA: A New Mathematical Olympiad Benchmark

[7] Who is In Charge? Dissecting Role Conflicts in Instruction Following

[8] Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision

[9] Geometric Structures and Patterns of Meaning: A PHATE Manifold Analysis of Chinese Character Embeddings

[10] Trustworthy Summarization via Uncertainty Quantification and Risk Awareness in Large Language Models

[11] Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks

[12] Computational Social Linguistics for Telugu Cultural Preservation: Novel Algorithms for Chandassu Metrical Pattern Recognition

[13] LLMRank: Understanding LLM Strengths for Model Routing

[14] GRPO++: Enhancing Dermatological Reasoning under Low Resource Settings

[15] Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation

[16] Silent Tokens, Loud Effects: Padding in LLMs

[17] CIFLEX: Contextual Instruction Flow for Sub-task Execution in Multi-Turn Interactions with a Single On-Device LLM

[18] SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation

[19] Redundancy-as-Masking: Formalizing the Artificial Age Score (AAS) to Model Memory Aging in Generative AI

[20] Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing

[21] Feasibility of Structuring Stress Documentation Using an Ontology-Guided Large Language Model

[22] SeMob: Semantic Synthesis for Dynamic Urban Mobility Prediction

[23] A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering

[24] Let's Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models' Understanding of Sports

[25] SSTAG: Structure-Aware Self-Supervised Learning Method for Text-Attributed Graphs

[26] LOCA: Logical Chain Augmentation for Scientific Corpus Cleaning

[27] GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages

[28] Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data

[29] GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models

[30] Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs

[31] Longitudinal Monitoring of LLM Content Moderation of Social Issues

[32] RJE: A Retrieval-Judgment-Exploration Framework for Efficient Knowledge Graph Question Answering with LLMs

[33] Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse

[34] In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b

[35] OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language

[36] AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees

[37] Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection

[38] TraceDet: Hallucination Detection from the Decoding Trace of Diffusion Large Language Models

[39] LLM Based Sentiment Classification From Bangladesh E-Commerce Reviews

[40] TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture

[41] Evaluation Sheet for Deep Research: A Use Case for Academic Survey Writing

[42] HiSpec: Hierarchical Speculative Decoding for LLMs

[43] TAG-EQA: Text-And-Graph for Event Question Answering via Structured Prompting Strategies

[44] A-VERT: Agnostic Verification with Embedding Ranking Targets

[45] One More Question is Enough, Expert Question Decomposition (EQD) Model for Domain Quantitative Reasoning

[46] ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning

[47] CLUE: Non-parametric Verification from Experience via Hidden-State Clustering

[48] A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation

[49] RAG-BioQA Retrieval-Augmented Generation for Long-Form Biomedical Question Answering

[50] Efficient Training of Robust Traditional Chinese LLaMA-1B on a Single Consumer GPU: Continual Pre-training, SFT, and DPO

[51] AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System

[52] NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT

[53] Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention

[54] SoK: Measuring What Matters for Closed-Loop Security Agents

[55] MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization

[56] FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol

[57] How Do Language Models Compose Functions?

[58] Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation

[59] What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?

[60] Machine-interpretable Engineering Design Standards for Valve Specification

[61] Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

[62] Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction

[63] Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network

[64] Syntactic Blind Spots: How Misalignment Leads to LLMs Mathematical Errors

[65] SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning

[66] Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models

[67] REPAIR: Robust Editing via Progressive Adaptive Intervention and Reintegration

[68] Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey

[69] Inverse Language Modeling towards Robust and Grounded LLMs

[70] Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning

[71] Taking a SEAT: Predicting Value Interpretations from Sentiment, Emotion, Argument, and Topic Annotations

[72] Exploring Database Normalization Effects on SQL Generation

[73] LLM-Based Multi-Task Bangla Hate Speech Detection: Type, Severity, and Target

[74] Style Over Story: A Process-Oriented Study of Authorial Creativity in Large Language Models

[75] Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage

[76] From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens

[77] Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems

[78] The Disparate Impacts of Speculative Decoding