Table of Contents
cs.CL
[1] Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation
Wei Zhou,Bolei Ma,Annemarie Friedrich,Mohsen Mesgar
Main category: cs.CL
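TL;DR: This survey reviews LLM-based table question answering (TQA). It systematically categorizes existing benchmarks and task setups, summarizes modeling strategies by the challenges they target along with their strengths and limitations, and points out important but underexplored research directions.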
Details
Motivation: The TQA field lacks a systematic organization of task formulations, core challenges, and methodological trends, particularly in light of emerging directions such as reinforcement learning, and needs a unified framework to move forward.
Method: A structured survey of existing TQA research focused on LLM-based methods, grouping modeling strategies by the challenges they target and systematically organizing benchmarks, task setups, and technical trends.
Result: A comprehensive categorization of TQA research that identifies key challenges and method limitations and surfaces several important but underexplored topics.
Conclusion: The survey gives the TQA community a unified knowledge framework and a list of open problems, supporting a deeper understanding of the state of the art and guiding future research.
Abstract: Table Question Answering (TQA) aims to answer natural language questions about tabular data, often accompanied by additional contexts such as text passages. The task spans diverse settings, varying in table representation, question/answer complexity, modality involved, and domain. While recent advances in large language models (LLMs) have led to substantial progress in TQA, the field still lacks a systematic organization and understanding of task formulations, core challenges, and methodological trends, particularly in light of emerging research directions such as reinforcement learning. This survey addresses this gap by providing a comprehensive and structured overview of TQA research with a focus on LLM-based methods. We provide a comprehensive categorization of existing benchmarks and task setups. We group current modeling strategies according to the challenges they target, and analyze their strengths and limitations. Furthermore, we highlight underexplored but timely topics that have not been systematically covered in prior research. By unifying disparate research threads and identifying open problems, our survey offers a consolidated foundation for the TQA community, enabling a deeper understanding of the state of the art and guiding future developments in this rapidly evolving area.
[2] Emotionally Charged, Logically Blurred: AI-driven Emotional Framing Impairs Human Fallacy Detection
Yanran Chen,Lynn Greschner,Roman Klinger,Michael Klenk,Steffen Eger
Main category: cs.CL
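TL;DR: This is the first study of how emotional framing affects the convincingness of logical fallacies. It uses large language models (LLMs) to inject emotion while keeping the logical structure unchanged, and a human study finds that emotional manipulation lowers fallacy-detection F1 by 14.5% on average.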
Details
Motivation: Logical fallacies are unsound yet often convincing. The study aims to show how emotion strengthens this misleading persuasiveness, especially the cognitive biases it may trigger in public communication.
Method: Benchmark eight LLMs and select the best models to inject emotion into fallacious arguments; generate stimuli and run a human study that measures fallacy identification and perceived convincingness under different emotions.
Result: LLM-driven emotional framing lowers human fallacy-detection F1 by 14.5% on average. Participants detect fallacies better when they perceive enjoyment than when they perceive fear or sadness, and all three emotions are associated with significantly higher convincingness.
Conclusion: Emotional framing markedly weakens human detection of logical fallacies while raising the subjective convincingness of arguments, pointing to a risk of AI-driven emotional manipulation in the spread of fallacies.
Abstract: Logical fallacies are common in public communication and can mislead audiences; fallacious arguments may still appear convincing despite lacking soundness, because convincingness is inherently subjective. We present the first computational study of how emotional framing interacts with fallacies and convincingness, using large language models (LLMs) to systematically change emotional appeals in fallacious arguments. We benchmark eight LLMs on injecting emotional appeal into fallacious arguments while preserving their logical structures, then use the best models to generate stimuli for a human study. Our results show that LLM-driven emotional framing reduces human fallacy detection in F1 by 14.5% on average. Humans perform better in fallacy detection when perceiving enjoyment than fear or sadness, and these three emotions also correlate with significantly higher convincingness compared to neutral or other emotion states. Our work has implications for AI-driven emotional manipulation in the context of fallacious argumentation.
[3] The Idola Tribus of AI: Large Language Models tend to perceive order where none exists
Shin-nosuke Ishikawa,Masato Todo,Taiki Ogihara,Hirotsugu Ohba
Main category: cs.CL
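TL;DR: When asked to identify regularities in number series, large language models (LLMs) tend to invent patterns that do not exist. This exposes flaws in logical consistency and self-coherence that persist even with multi-step reasoning, a cognitive bias akin to the Idola Tribus.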
Details
Motivation: Evaluate the self-coherence of LLMs on logical reasoning tasks and expose their tendency to recognize spurious patterns in data that has no regularity.
Method: Ask LLMs to explain the pattern in a range of integer sequences (arithmetic, geometric, and randomly generated) and analyze their behavior per sequence type, in particular whether they over-recognize patterns in random sequences.
Result: The models correctly identify the patterns in arithmetic and geometric sequences but frequently invent patterns inconsistent with the numbers in random sequences. The problem appears across several advanced models, including OpenAI o3, o4-mini, and Google Gemini 2.5 Flash.
Conclusion: LLMs show a cognitive bias akin to the Idola Tribus, forcing regularities onto data. This limits their reliability in applications that require rigorous logical reasoning, and chain-of-thought style reasoning does not fully overcome it.
Abstract: We present a tendency of large language models (LLMs) to generate absurd patterns despite their clear inappropriateness in a simple task of identifying regularities in number series. Several approaches have been proposed to apply LLMs to complex real-world tasks, such as providing knowledge through retrieval-augmented generation and executing multi-step tasks using AI agent frameworks. However, these approaches rely on the logical consistency and self-coherence of LLMs, making it crucial to evaluate these aspects and consider potential countermeasures. To identify cases where LLMs fail to maintain logical consistency, we conducted an experiment in which LLMs were asked to explain the patterns in various integer sequences, ranging from arithmetic sequences to randomly generated integer series. While the models successfully identified correct patterns in arithmetic and geometric sequences, they frequently over-recognized patterns that were inconsistent with the given numbers when analyzing randomly generated series. This issue was observed even in multi-step reasoning models, including OpenAI o3, o4-mini, and Google Gemini 2.5 Flash Preview Thinking. This tendency to perceive non-existent patterns can be interpreted as the AI model equivalent of Idola Tribus and highlights potential limitations in their capability for applied tasks requiring logical reasoning, even when employing chain-of-thought reasoning mechanisms.
[4] SeCon-RAG: A Two-Stage Semantic Filtering and Conflict-Free Framework for Trustworthy RAG
Xiaonan Si,Meilin Zhu,Simeng Qin,Lijia Yu,Lijun Zhang,Shuaitong Liu,Xinfeng Li,Ranjie Duan,Yang Liu,Xiaojun Jia
Main category: cs.CL
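TL;DR: Proposes SeCon-RAG, a two-stage semantic filtering and conflict-free framework that uses an entity-intent-relation extractor (EIRE) to fuse knowledge trustworthily in retrieval-augmented generation, defending against semantic contamination and improving generation reliability.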
Details
Motivation: Existing RAG systems are vulnerable to corpus poisoning and contamination attacks, and defenses often over-filter, discarding valuable information and hurting generation reliability.
Method: Stage one combines semantic and cluster-based filtering, using EIRE to extract entities, intents, and relations from the user query and documents, score semantic relevance, and build a clean retrieval database. Stage two adds an EIRE-guided conflict-aware filtering module that checks semantic consistency among the query, candidate answers, and retrieved knowledge and filters out internal and external contradictions.
Result: Experiments across multiple LLMs and datasets show that SeCon-RAG markedly outperforms existing defenses, reducing conflict contamination while preserving useful knowledge and improving generation robustness and trustworthiness.
Conclusion: Through two-stage EIRE-guided semantic and conflict filtering, SeCon-RAG achieves more reliable knowledge retrieval and generation and strengthens RAG systems against contamination attacks.
Abstract: Retrieval-augmented generation (RAG) systems enhance large language models (LLMs) with external knowledge but are vulnerable to corpus poisoning and contamination attacks, which can compromise output integrity. Existing defenses often apply aggressive filtering, leading to unnecessary loss of valuable information and reduced reliability in generation. To address this problem, we propose a two-stage semantic filtering and conflict-free framework for trustworthy RAG. In the first stage, we perform a joint filter with semantic and cluster-based filtering which is guided by the Entity-intent-relation extractor (EIRE). EIRE extracts entities, latent objectives, and entity relations from both the user query and filtered documents, scores their semantic relevance, and selectively adds valuable documents into the clean retrieval database. In the second stage, we propose an EIRE-guided conflict-aware filtering module, which analyzes semantic consistency between the query, candidate answers, and retrieved knowledge before final answer generation, filtering out internal and external contradictions that could mislead the model. Through this two-stage process, SeCon-RAG effectively preserves useful knowledge while mitigating conflict contamination, achieving significant improvements in both generation robustness and output trustworthiness. Extensive experiments across various LLMs and datasets demonstrate that the proposed SeCon-RAG markedly outperforms state-of-the-art defense methods.
[5] ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models
Wenbin Guo,Xin Wang,Jiaoyan Chen,Lingbing Guo,Zhao Li,Zirui Chen
Main category: cs.CL
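TL;DR: Proposes ReaLM, a framework that uses residual vector quantization to discretize knowledge graph embeddings into learnable token sequences, fusing structured knowledge with large language models. Ontology-guided class constraints improve semantic consistency, and the method reaches state-of-the-art results on benchmark datasets.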
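As a rough illustration of the residual vector quantization step named in this entry, the sketch below encodes one embedding as a short sequence of code indices. It is a generic RVQ example, not the ReaLM implementation: the codebooks are tiny hand-written lists, and the names `rvq_encode` and `rvq_decode` are hypothetical.

```python
from typing import List

Vector = List[float]


def nearest(codebook: List[Vector], vec: Vector) -> int:
    """Index of the codeword with the smallest squared distance to vec."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec: Vector, codebooks: List[List[Vector]]) -> List[int]:
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: List[int], codebooks: List[List[Vector]]) -> Vector:
    """Reconstruct an approximation by summing the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks (assumed values, two stages, two codewords each).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.0]],
]
codes = rvq_encode([1.4, 1.0], codebooks)
print(codes, rvq_decode(codes, codebooks))
```

In a setup like the one the entry describes, each code index would presumably map to a new vocabulary token; how those tokens are trained is not covered by this sketch.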
Details
Motivation: Existing LLM-based knowledge graph completion methods struggle to exploit structured semantic representations because the embedding space of pretrained KG models is misaligned with the discrete token space of LLMs, which hinders semantic transfer.
Method: ReaLM uses residual vector quantization to discretize KG embeddings into compact code sequences and adds them to the LLM vocabulary as learnable tokens. Ontology-guided class constraints further enforce semantic consistency in entity prediction.
Result: Experiments on two widely used benchmark datasets show that ReaLM clearly outperforms existing methods on knowledge graph completion and achieves state-of-the-art performance.
Conclusion: ReaLM bridges the gap between knowledge graph embeddings and LLMs, fusing structured knowledge with contextual language models and offering a new approach to knowledge-aware language reasoning.
Abstract: Large Language Models (LLMs) have recently emerged as a powerful paradigm for Knowledge Graph Completion (KGC), offering strong reasoning and generalization capabilities beyond traditional embedding-based approaches. However, existing LLM-based methods often struggle to fully exploit structured semantic representations, as the continuous embedding space of pretrained KG models is fundamentally misaligned with the discrete token space of LLMs. This discrepancy hinders effective semantic transfer and limits their performance. To address this challenge, we propose ReaLM, a novel and effective framework that bridges the gap between KG embeddings and LLM tokenization through the mechanism of residual vector quantization. ReaLM discretizes pretrained KG embeddings into compact code sequences and integrates them as learnable tokens within the LLM vocabulary, enabling seamless fusion of symbolic and contextual knowledge. Furthermore, we incorporate ontology-guided class constraints to enforce semantic consistency, refining entity predictions based on class-level compatibility. Extensive experiments on two widely used benchmark datasets demonstrate that ReaLM achieves state-of-the-art performance, confirming its effectiveness in aligning structured knowledge with large-scale language models.
[6] All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language
Shiyuan Guo,Henry Sleight,Fabien Roger
Main category: cs.CL
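TL;DR: Current AI models reason markedly worse in ciphered text even though they can decode it accurately, which suggests that evading chain-of-thought monitoring through ciphered reasoning is of limited use for existing models.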
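For readers unfamiliar with rot13, the one cipher this entry names explicitly, the snippet below shows what ciphered text looks like. It uses Python's standard `codecs` module; the sample sentence is made up and does not come from the paper's data.

```python
import codecs


def rot13(text: str) -> str:
    """Rotate each ASCII letter by 13 places; digits and punctuation are unchanged."""
    return codecs.encode(text, "rot_13")


plain = "What is 12 plus 30?"
ciphered = rot13(plain)
print(ciphered)                  # the ciphered form of the question
print(rot13(ciphered) == plain)  # applying rot13 twice restores the original
```

Because rot13 is its own inverse, the same function serves for both encoding and decoding.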
Details
Motivation: As AI agents gain adoption, detecting harmful AI actions matters more. Chain-of-thought (CoT) monitoring is widely used to detect adversarial attacks and misalignment, but models might evade it by reasoning in ciphered text, so this threat needs to be assessed.
Method: Test 28 different ciphers by fine-tuning and prompting up to 10 models to reason in each cipher, using accuracy on math problems as a proxy for reasoning ability, and separately evaluate how well the models understand and translate ciphered text.
Result: Accuracy drops significantly when models reason in ciphered text, even when they translate that text accurately. Frontier models handle well-known ciphers such as rot13 but still struggle with lesser-known ones. Ciphered reasoning capability correlates with cipher prevalence in pretraining data and improves only slowly with more fine-tuning data.
Conclusion: Evading CoT monitoring through ciphered reasoning is currently of limited effectiveness, and the results offer guidance on constraining the development of this capability in future models.
Abstract: Detecting harmful AI actions is important as AI agents gain adoption. Chain-of-thought (CoT) monitoring is one method widely used to detect adversarial attacks and AI misalignment. However, attackers and misaligned models might evade CoT monitoring through ciphered reasoning: reasoning hidden in encrypted, translated, or compressed text. To assess this risk, we test whether models can perform ciphered reasoning. For each of 28 different ciphers, we fine-tune and prompt up to 10 models to reason in that cipher. We measure model accuracy on math problems as a proxy for reasoning ability. Across the models we test, we find an asymmetry: model accuracy can drop significantly when reasoning in ciphered text, even though models demonstrate comprehension of ciphered text by being able to translate it accurately to English. Even frontier models struggle with lesser-known ciphers, although they can reason accurately in well-known ciphers like rot13. We show that ciphered reasoning capability correlates with cipher prevalence in pretraining data. We also identify scaling laws showing that ciphered reasoning capability improves slowly with additional fine-tuning data. Our work suggests that evading CoT monitoring using ciphered reasoning may be an ineffective tactic for current models and offers guidance on constraining the development of this capability in future frontier models.
[7] Preference-Aware Memory Update for Long-Term LLM Agents
Haoran Sun,Zekun Zhang,Shaoning Zeng
Main category: cs.CL
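TL;DR: Proposes a Preference-Aware Memory Update mechanism (PAMU) that combines sliding window averages with exponential moving averages to dynamically refine the long-term memory of LLM agents and improve their performance in long-horizon conversations.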
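The two averages named in this entry can be combined in a few lines. The sketch below is only an illustration of the general idea: the entry does not state how PAMU fuses the two signals, so the convex combination with weight `lam`, the scalar preference score, and the class name `PreferenceTracker` are all assumptions.

```python
from collections import deque
from typing import Optional


class PreferenceTracker:
    """Tracks one scalar preference signal with a sliding window and an EMA."""

    def __init__(self, window: int = 3, alpha: float = 0.5, lam: float = 0.5):
        self.recent = deque(maxlen=window)  # short-term view: last `window` scores
        self.alpha = alpha                  # EMA smoothing factor
        self.lam = lam                      # assumed fusion weight between SW and EMA
        self.ema: Optional[float] = None    # long-term view

    def update(self, score: float) -> float:
        """Add one observation and return the fused preference estimate."""
        self.recent.append(score)
        sw = sum(self.recent) / len(self.recent)
        if self.ema is None:
            self.ema = score
        else:
            self.ema = self.alpha * score + (1 - self.alpha) * self.ema
        return self.lam * sw + (1 - self.lam) * self.ema


tracker = PreferenceTracker(window=2, alpha=0.5, lam=0.5)
for s in (1.0, 0.0, 0.0):
    print(tracker.update(s))
```

The sliding window reacts quickly once old observations fall out of it, while the EMA decays more gradually, which is one way to read "short-term fluctuations" versus "long-term tendencies".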
Details
Motivation: Existing long-term memory mechanisms fall short on memory updating and cannot dynamically adjust preference memory as user behavior and context change.
Method: PAMU fuses sliding window averages (SW) with exponential moving averages (EMA) to build a preference-aware memory representation that captures both short-term fluctuations and long-term tendencies.
Result: Experiments on five task scenarios of the LoCoMo dataset show that PAMU significantly improves LLM output quality across five baselines.
Conclusion: PAMU delivers personalized, dynamic memory updating and strengthens the reasoning and decision-making of LLM agents in long-term interaction.
Abstract: One of the key factors influencing the reasoning capabilities of LLM-based agents is their ability to leverage long-term memory. Integrating long-term memory mechanisms allows agents to make informed decisions grounded in historical interactions. While recent advances have significantly improved the storage and retrieval components, by encoding memory into dense vectors for similarity search or organizing memory as structured knowledge graphs, most existing approaches fall short in memory updating. In particular, they lack mechanisms for dynamically refining preference memory representations in response to evolving user behaviors and contexts. To address this gap, we propose a Preference-Aware Memory Update Mechanism (PAMU) that enables dynamic and personalized memory refinement. By integrating sliding window averages (SW) with exponential moving averages (EMA), PAMU constructs a fused preference-aware representation that captures both short-term fluctuations and long-term user tendencies. We conduct experiments on five task scenarios of the LoCoMo dataset, and the results show that our mechanism can significantly improve the output quality of LLMs across five baselines, validating its effectiveness in long-term conversations.
[8] Layout-Aware Parsing Meets Efficient LLMs: A Unified, Scalable Framework for Resume Information Extraction and Evaluation
Fanwei Zhu,Jinke Yu,Zulong Chen,Ying Zhou,Junhao Ji,Zhibo Yang,Yuxue Zhang,Haoyuan Hu,Zhenghao Liu
Main category: cs.CL
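TL;DR: Proposes a layout-aware, efficiency-optimized framework for automated resume information extraction and evaluation that addresses three challenges: heterogeneous resume formats, the high cost of large models, and the lack of standardized datasets.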
Details
Motivation: In real deployments, resume information extraction faces highly heterogeneous layouts and content, the high cost of large language models, and a lack of standardized datasets and evaluation tools.
Method: Combine a fine-tuned layout parser that normalizes document formats, an efficient small LLM extractor based on parallel prompting and instruction tuning, and a two-stage automated evaluation framework backed by new benchmark datasets.
Result: The framework significantly outperforms strong baselines in accuracy and efficiency. A compact 0.6B-parameter LLM reaches top-tier accuracy while sharply reducing inference latency and compute cost, and the system is deployed in Alibaba's intelligent HR platform.
Conclusion: The framework balances accuracy and efficiency and is practical and scalable for large-scale, real-world resume information extraction.
Abstract: Automated resume information extraction is critical for scaling talent acquisition, yet its real-world deployment faces three major challenges: the extreme heterogeneity of resume layouts and content, the high cost and latency of large language models (LLMs), and the lack of standardized datasets and evaluation tools. In this work, we present a layout-aware and efficiency-optimized framework for automated extraction and evaluation that addresses all three challenges. Our system combines a fine-tuned layout parser to normalize diverse document formats, an inference-efficient LLM extractor based on parallel prompting and instruction tuning, and a robust two-stage automated evaluation framework supported by new benchmark datasets. Extensive experiments show that our framework significantly outperforms strong baselines in both accuracy and efficiency. In particular, we demonstrate that a fine-tuned compact 0.6B LLM achieves top-tier accuracy while significantly reducing inference latency and computational cost. The system is fully deployed in Alibaba's intelligent HR platform, supporting real-time applications across its business units.
[9] VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation
Yubo Sun,Chunyi Peng,Yukun Yan,Shi Yu,Zhenghao Liu,Chi Chen,Zhiyuan Liu,Maosong Sun
Main category: cs.CL
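TL;DR: Proposes EVisRAG, an end-to-end visual retrieval-augmented generation framework that uses evidence-guided multi-image reasoning and the RS-GRPO training method to substantially improve the perception and reasoning of vision-language models in multi-image settings.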
Details
Motivation: Existing VRAG systems perceive and integrate evidence across multiple images poorly, so their reasoning lacks reliable grounding and produces hallucinations.
Method: EVisRAG first extracts per-image evidence from the retrieved images, then derives the answer from the aggregated evidence. It is trained with RS-GRPO, which binds fine-grained rewards to scope-specific tokens to jointly optimize visual perception and reasoning.
Result: On multiple visual question answering benchmarks, EVisRAG improves over the backbone VLM by 27% on average. Analysis shows that it localizes question-relevant evidence across multiple images and bases its answers on that evidence.
Conclusion: Through end-to-end learning, EVisRAG integrates multi-image evidence more reliably and improves the accuracy and interpretability of VLM reasoning, in a manner the authors liken to a detective.
Abstract: Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. In this paper, we propose EVisRAG, an end-to-end framework that learns to perform evidence-guided multi-image reasoning to address this issue. The model first observes retrieved images and records per-image evidence, then derives the final answer from the aggregated evidence. To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize visual perception and reasoning abilities of VLMs. Experimental results on multiple visual question answering benchmarks demonstrate that EVisRAG delivers substantial end-to-end gains over backbone VLM with 27% improvements on average. Further analysis shows that, powered by RS-GRPO, EVisRAG improves answer accuracy by precisely perceiving and localizing question-relevant evidence across multiple images and deriving the final answer from that evidence, much like a real detective.
[10] Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement
Steve Han,Gilberto Titericz Junior,Tom Balough,Wenfei Zhou
Main category: cs.CL
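TL;DR: Introduces the Judge's Verdict Benchmark, a two-step method for evaluating large language models (LLMs) as judges on response-accuracy tasks. The study assesses how well 54 LLMs replicate human judgment in RAG or agentic pipelines and finds that 27 reach Tier 1 performance: 23 show human-like judgment patterns and 4 show super-consistency. Judging ability depends not only on model size but also on specific training strategies.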
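The agreement statistic and the z-score thresholds mentioned in this entry can be sketched as follows. Cohen's kappa is computed in its standard two-rater form. How the benchmark derives its z-score is not spelled out here, so standardizing the judge's kappa against a set of human-to-human kappas is an assumption, as are the function names and the handling of values outside the two stated patterns.

```python
from collections import Counter
from statistics import mean, stdev
from typing import Sequence


def cohens_kappa(a: Sequence, b: Sequence) -> float:
    """Two-rater Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[label] * cb[label] for label in ca) / (n * n)  # chance agreement
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)


def judgment_pattern(judge_kappa: float, human_kappas: Sequence[float]) -> str:
    """Classify a judge by its z-score relative to human-to-human agreement (assumed)."""
    z = (judge_kappa - mean(human_kappas)) / stdev(human_kappas)
    if abs(z) < 1:
        return "human-like"
    if z > 1:
        return "super-consistent"
    return "neither"  # the entry does not name a pattern for the remaining cases


print(cohens_kappa([1, 1, 0, 0], [1, 1, 0, 0]))
print(judgment_pattern(0.95, [0.7, 0.8, 0.9]))
```

Unlike a correlation coefficient, kappa discounts agreement expected by chance, which is why the entry treats it as a stricter test of actual agreement.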
Details
Motivation: Prior work mostly relies on correlation analysis to assess LLMs as judges and does not examine actual agreement patterns in depth. This paper proposes a more complete evaluation that shows whether LLM judgments truly mimic human judging behavior and flags possible oversimplification.
Method: A two-step evaluation. Step one is a correlation test that filters for models strongly aligned with human judgment. Step two is a human-likeness test using Cohen's Kappa analysis and z-scores, which separates human-like judgment (|z| < 1) from super-consistent judgment (z > 1). The test set covers 43 open-source and 11 closed models.
Result: Of the 54 LLMs tested, 27 reach Tier 1 performance: 23 show human-like patterns that preserve the nuances of human judgment, and 4 show super-consistency that exceeds typical human-to-human agreement. Model size is not the deciding factor; specific training strategies matter more.
Conclusion: Correlation alone is insufficient for evaluating LLM judges. The proposed agreement-based "Turing Test for judges" and tiered benchmark provide a standardized tool for future LLM evaluation and help select a judge model suited to a given need.
Abstract: This research introduces the Judge's Verdict Benchmark, a novel two-step methodology to evaluate Large Language Models (LLMs) as judges for response accuracy evaluation tasks. We assess how well 54 LLMs can replicate human judgment when scoring responses from RAG (Retrieval-Augmented Generation) or Agentic pipelines against ground truth answers. Our methodology progresses from traditional correlation analysis to comprehensive Cohen's Kappa analysis that measures actual agreement patterns. The two-step approach includes: (1) a correlation test that filters judges with strong alignment, followed by (2) a human-likeness test using z-scores to identify two distinct judgment patterns: human-like judgment (|z| < 1) that mimics natural human variation, and super-consistent judgment (z > 1) that exceeds typical human-to-human agreement levels. This methodology reveals that 27 out of 54 tested LLMs achieve Tier 1 performance: 23 models exhibit human-like patterns that preserve the nuances of human judgment, while 4 models demonstrate super-consistent behavior, a pattern that could indicate either enhanced reliability or oversimplification of complex judgments. Testing 43 open-source models (1B-405B parameters) and 11 closed models (GPT, Gemini, Claude variants), we demonstrate that judge excellence is not solely dependent on model size but on specific training strategies. Our key contributions include: (1) establishing that correlation alone is insufficient for judge evaluation, (2) introducing a "Turing Test for judges" based on agreement patterns, and (3) providing a standardized benchmark for classifying LLM judges into distinct performance tiers for different evaluation needs.
[11] Gold Panning: Turning Positional Bias into Signal for Multi-Document LLM Reasoning
Adam Byerly,Daniel Khashabi
Main category: cs.CL
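TL;DR: Proposes Gold Panning Bandits, a framework that treats the position bias of LLMs in multi-document settings as a diagnostic signal. By reordering documents and observing how the model's responses change, it identifies the most relevant content efficiently and substantially reduces the number of language model queries and the compute cost.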
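The greedy reordering idea in this entry, pairing the most uncertain documents with the most informative positions, can be sketched with two sorts, which is where an O(N log N) cost would come from. This is not the authors' code: measuring uncertainty as closeness of a relevance belief to 0.5, the per-position informativeness scores, and the name `greedy_reorder` are all assumptions for illustration.

```python
from typing import List


def greedy_reorder(relevance_beliefs: List[float],
                   position_informativeness: List[float]) -> List[int]:
    """Return order[slot] = document index, matching uncertain docs to informative slots."""
    n = len(relevance_beliefs)
    # Most uncertain documents first: belief closest to 0.5 (assumed uncertainty measure).
    docs = sorted(range(n), key=lambda d: abs(relevance_beliefs[d] - 0.5))
    # Most informative positions first.
    slots = sorted(range(n), key=lambda s: position_informativeness[s], reverse=True)
    order = [0] * n
    for doc, slot in zip(docs, slots):
        order[slot] = doc
    return order


beliefs = [0.95, 0.5, 0.1, 0.6]          # current belief that each document is relevant
informativeness = [1.0, 0.2, 0.3, 0.8]   # how diagnostic each context position is (assumed)
print(greedy_reorder(beliefs, informativeness))
```

An exact assignment via the Hungarian algorithm would optimize a full cost matrix; the sort-and-pair shortcut trades that optimality for speed.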
Details
Motivation: LLMs show position bias in multi-document contexts, which is usually treated as noise to be mitigated. This paper instead turns the bias into a usable signal to make information retrieval more efficient.
Method: Frame the choice of document reorderings as a bipartite matching problem and propose a greedy O(N log N) algorithm that places the most uncertain documents in the most informative positions, so relevant documents are identified more efficiently.
Result: On knowledge-intensive NLP tasks, the method uses up to 65% fewer language model queries than random permutation baselines with comparable performance.
Conclusion: Inherent LLM biases can be turned into assets for inference-time optimization, enabling efficient content selection without retraining.
Abstract: Large language models exhibit a strong position bias in multi-document contexts, systematically prioritizing information based on location rather than relevance. While existing approaches treat this bias as noise to be mitigated, we introduce Gold Panning Bandits, a framework that leverages position bias as a diagnostic signal: by reordering documents and observing shifts in the model's responses, we can efficiently identify the most relevant content. We frame the problem of choosing reorderings as a bipartite matching problem. While an optimal assignment can be computed at each iteration with the Hungarian algorithm in $O(N^3)$ time, we propose a greedy $O(N \log N)$ strategy that achieves comparable performance by prioritizing the placement of the most uncertain documents in the most informative positions. Our approach identifies relevant documents using up to 65% fewer language model queries than random permutation baselines on knowledge-intensive NLP tasks, substantially reducing computational cost without model retraining. This work demonstrates that inherent LLM biases can be transformed from liabilities into assets for efficient, inference-time optimization.
[12] PromptGuard at BLP-2025 Task 1: A Few-Shot Classification Framework Using Majority Voting and Keyword Similarity for Bengali Hate Speech Detection
Rakib Hossan,Shubhashis Roy Dipta
Main category: cs.CL
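TL;DR: PromptGuard is a few-shot hate speech classification framework for low-resource languages such as Bengali. It performs well on BLP-2025 Task 1A by combining chi-square keyword extraction with adaptive majority voting and clearly outperforms the baselines.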
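One plausible reading of "adaptive majority voting" is to keep sampling classifications until the leading label reaches a target share of the votes. The sketch below follows that reading only; the thresholds, the round limits, and the `classify` callable (a stand-in for a few-shot LLM call) are hypothetical and not taken from the PromptGuard system.

```python
from collections import Counter
from typing import Callable, Tuple


def adaptive_majority_vote(classify: Callable[[str], str], text: str,
                           base_rounds: int = 3, max_rounds: int = 7,
                           min_agreement: float = 0.6) -> Tuple[str, int]:
    """Vote for base_rounds, then add rounds while consensus stays weak."""
    votes = [classify(text) for _ in range(base_rounds)]
    while True:
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement or len(votes) >= max_rounds:
            return label, len(votes)
        votes.append(classify(text))  # ambiguous case: request one more round


# Deterministic stand-in classifier that replays a fixed sequence of labels.
replay = iter(["abusive", "none", "political", "abusive", "abusive", "abusive"])
print(adaptive_majority_vote(lambda _text: next(replay), "example input"))
```

Clear-cut inputs stop after the base rounds, so the extra cost is paid only on ambiguous cases.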
Details
Motivation: Labeled data is scarce for low-resource languages, and traditional supervised methods are costly and hard to scale, so an efficient few-shot method is needed for multi-class hate speech classification.
Method: PromptGuard uses chi-square statistics for keyword extraction and an adaptive majority voting mechanism that decides, based on consensus quality, whether to run additional classification rounds.
Result: On BLP-2025 Task 1A it achieves a micro-F1 of 67.61, ahead of n-gram baselines (60.75) and random approaches (14.65). Ablations show that chi-square keywords give consistent gains across all categories.
Conclusion: PromptGuard improves Bengali hate speech classification under few-shot conditions and suits low-resource languages with limited labeled data.
Abstract: The BLP-2025 Task 1A requires Bengali hate speech classification into six categories. Traditional supervised approaches need extensive labeled datasets that are expensive for low-resource languages. We developed PromptGuard, a few-shot framework combining chi-square statistical analysis for keyword extraction with adaptive majority voting for decision-making. We explore statistical keyword selection versus random approaches and adaptive voting mechanisms that extend classification based on consensus quality. Chi-square keywords provide consistent improvements across categories, while adaptive voting benefits ambiguous cases requiring extended classification rounds. PromptGuard achieves a micro-F1 of 67.61, outperforming n-gram baselines (60.75) and random approaches (14.65). Ablation studies confirm chi-square-based keywords show the most consistent impact across all categories.
[13] Steering Embedding Models with Geometric Rotation: Mapping Semantic Relationships Across Languages and Models
Michael Freenor,Lauren Alvarez
Main category: cs.CL
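TL;DR: Proposes RISE (Rotor-Invariant Shift Estimation), a method that represents semantic transformations geometrically in the embedding spaces of modern language models, and gives the first systematic demonstration that discourse-level semantic transformations correspond to consistent geometric operations in multilingual embedding spaces.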
Details
Motivation: Understanding how language and embedding models encode semantic relationships is central to interpretability and control. Early word embeddings had intuitive vector arithmetic, but modern high-dimensional representations lack clear geometric interpretability, so new methods are needed to reveal their structure.
Method: RISE models semantic transformations as rotational operations in embedding space, exploiting the manifold structure of modern language representations, and is evaluated for transfer across languages and models.
Result: RISE is validated on three embedding models, three datasets, and seven morphologically diverse languages. It consistently maps discourse-level semantic transformations with distinct grammatical features, such as negation and conditionality, across languages and models.
Conclusion: RISE shows that discourse-level semantic transformations correspond to consistent geometric operations in multilingual embedding spaces, empirically supporting the Linear Representation Hypothesis at the sentence level.
Abstract: Understanding how language and embedding models encode semantic relationships is fundamental to model interpretability and control. While early word embeddings exhibited intuitive vector arithmetic ("king" - "man" + "woman" = "queen"), modern high-dimensional text representations lack straightforward interpretable geometric properties. We introduce Rotor-Invariant Shift Estimation (RISE), a geometric approach that represents semantic transformations as consistent rotational operations in embedding space, leveraging the manifold structure of modern language representations. RISE operations have the ability to operate across both languages and models with high transfer of performance, suggesting the existence of analogous cross-lingual geometric structure. We evaluate RISE across three embedding models, three datasets, and seven morphologically diverse languages in five major language groups. Our results demonstrate that RISE consistently maps discourse-level semantic transformations with distinct grammatical features (e.g., negation and conditionality) across languages and models. This work provides the first systematic demonstration that discourse-level semantic transformations correspond to consistent geometric operations in multilingual embedding spaces, empirically supporting the Linear Representation Hypothesis at the sentence level.
[14] Text Prompt Injection of Vision Language Models
Ruizhe Zhu
Main category: cs.CL
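TL;DR: Investigates text prompt injection, a simple yet effective attack for misleading large vision language models, and demonstrates its effectiveness and efficiency through experiments.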
Details
Motivation: As large vision language models are widely deployed, safety concerns grow, in particular the risk that models are swayed by malicious prompts. Studying such attacks helps expose model vulnerabilities and improve safety.
Method: Develop a text prompt injection attack algorithm for large vision language models that misleads model behavior through crafted input prompts, and validate it experimentally.
Result: The attack is particularly effective on large models and requires less compute than other attack methods.
Conclusion: Text prompt injection is a low-resource, efficient attack that poses a real safety threat to large vision language models and calls for defensive measures.
Abstract: The widespread application of large vision language models has significantly raised safety concerns. In this project, we investigate text prompt injection, a simple yet effective method to mislead these models. We developed an algorithm for this type of attack and demonstrated its effectiveness and efficiency through experiments. Compared to other attack methods, our approach is particularly effective for large models without high demand for computational resources.
[15] NG-Router: Graph-Supervised Multi-Agent Collaboration for Nutrition Question Answering
Kaiwen Shi,Zheyuan Zhang,Zhengqing Yuan,Keerthiram Murugesan,Vincent Galass,Chuxu Zhang,Yanfang Ye
Main category: cs.CL
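TL;DR: Proposes NG-Router, a framework that uses knowledge-graph-guided multi-agent collaboration to address limited reasoning capacity and contextual overload in nutrition question answering, improving performance on complex nutritional health tasks.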
Details
Motivation: Existing nutrition QA systems are limited by the reasoning capacity of single agents and the difficulty of designing multi-agent architectures, and they suffer from contextual overload that hinders accurate decisions.
Method: Formulate nutrition QA as a supervised, knowledge-graph-guided multi-agent collaboration problem: add agent nodes to heterogeneous knowledge graphs, use a graph neural network to learn task-aware routing distributions over agents, and apply gradient-based subgraph retrieval to reduce contextual interference.
Result: Across multiple benchmarks and backbone models, NG-Router consistently outperforms single-agent and ensemble baselines.
Conclusion: NG-Router offers an effective, domain-aware approach to multi-agent reasoning for complex nutritional health tasks, with stronger multi-hop and relational reasoning.
Abstract: Diet plays a central role in human health, and Nutrition Question Answering (QA) offers a promising path toward personalized dietary guidance and the prevention of diet-related chronic diseases. However, existing methods face two fundamental challenges: the limited reasoning capacity of single-agent systems and the complexity of designing effective multi-agent architectures, as well as contextual overload that hinders accurate decision-making. We introduce Nutritional-Graph Router (NG-Router), a novel framework that formulates nutritional QA as a supervised, knowledge-graph-guided multi-agent collaboration problem. NG-Router integrates agent nodes into heterogeneous knowledge graphs and employs a graph neural network to learn task-aware routing distributions over agents, leveraging soft supervision derived from empirical agent performance. To further address contextual overload, we propose a gradient-based subgraph retrieval mechanism that identifies salient evidence during training, thereby enhancing multi-hop and relational reasoning. Extensive experiments across multiple benchmarks and backbone models demonstrate that NG-Router consistently outperforms both single-agent and ensemble baselines, offering a principled approach to domain-aware multi-agent reasoning for complex nutritional health tasks.
[16] NarraBench: A Comprehensive Framework for Narrative Benchmarking
Sil Hamilton,Matthew Wilkens,Andrew Piper
Main category: cs.CL
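TL;DR: Presents NarraBench, a theory-informed taxonomy of narrative-understanding tasks, together with a survey of 78 existing benchmarks. Existing benchmarks cover only 27% of narrative tasks well, key aspects such as events, style, perspective, and revelation are nearly absent, and benchmarks able to assess subjective and perspectival aspects of narrative are lacking.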
Details
Motivation: Narrative-understanding evaluation in NLP has clear gaps: many important narrative dimensions are poorly covered and there are no systematic, theory-aligned benchmarks, so a more complete, theory-driven evaluation framework is needed.
Method: Propose a theory-driven taxonomy of narrative-understanding tasks (NarraBench) and systematically survey 78 existing benchmarks, analyzing how well their coverage aligns with elements of narrative theory.
Result: Only 27% of narrative tasks are well captured by existing benchmarks. Narrative events, style, perspective, and revelation are nearly absent, and current benchmarks struggle to assess subjective and perspectival aspects of narrative.
Conclusion: NarraBench offers a useful taxonomy and methodology for evaluating LLM narrative understanding and underlines the need for broader benchmarks that handle subjective and multi-perspective narrative.
Abstract: We present NarraBench, a theory-informed taxonomy of narrative-understanding tasks, as well as an associated survey of 78 existing benchmarks in the area. We find significant need for new evaluations covering aspects of narrative understanding that are either overlooked in current work or are poorly aligned with existing metrics. Specifically, we estimate that only 27% of narrative tasks are well captured by existing benchmarks, and we note that some areas -- including narrative events, style, perspective, and revelation -- are nearly absent from current evaluations. We also note the need for increased development of benchmarks capable of assessing constitutively subjective and perspectival aspects of narrative, that is, aspects for which there is generally no single correct answer. Our taxonomy, survey, and methodology are of value to NLP researchers seeking to test LLM narrative understanding.
[17] CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs
Nafiseh Nikeghbal,Amir Hossein Kargaran,Jana Diesner
Main category: cs.CL
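TL;DR: Introduces CoBia, a suite of lightweight adversarial attacks for systematically analyzing the conditions under which large language models (LLMs) depart from ethical behavior in conversation. It constructs conversations containing a biased statement and tests whether the model can recover from that fabricated claim and reject biased follow-up questions. Evaluating 11 open-source and proprietary LLMs across six socio-demographic categories, including gender, race, and religion, the study finds that constructed conversations reliably reveal bias amplification and that LLMs often fail to reject biased follow-ups, exposing deeply embedded biases.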
Details
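Motivation: Although LLMs increasingly pass standard safety checks, they can still show harmful behavior in conversation, such as expressing racist viewpoints. Existing safety mechanisms struggle to capture these dynamically emerging biases, so a systematic method is needed to probe when and how models depart from ethical norms during interaction.
Method: CoBia is a suite of lightweight adversarial attacks that constructs a conversation in which the model has uttered a biased claim about a social group, then tests whether the model can correct itself and reject follow-up questions built on that bias. The evaluation covers 11 LLMs and six categories of sensitive attributes (gender, race, religion, nationality, sexual orientation, and others), uses established LLM-based bias metrics, and compares results against human judgments to assess reliability and alignment.
Result: Carefully constructed conversations reliably trigger biased output, and most models fail to reject follow-up questions built on the earlier bias, showing bias amplification. This indicates deeply embedded biases in conversational settings that static safety mechanisms do not fully remove. LLM-based bias metrics also differ somewhat from human judgments, which points to limits of automated evaluation.
Conclusion: CoBia is an effective stress test that reveals how bias can develop during interaction. Pre-deployment safeguards alone are insufficient; model behavior needs continued monitoring and intervention in dynamic conversations to improve safety and ethical alignment in real use.
Abstract: Improvements in model construction, including fortified safety guardrails, allow large language models (LLMs) to increasingly pass standard safety checks. However, LLMs sometimes slip into revealing harmful behavior, such as expressing racist viewpoints, during conversations. To analyze this systematically, we introduce CoBia, a suite of lightweight adversarial attacks that allow us to refine the scope of conditions under which LLMs depart from normative or ethical behavior in conversations. CoBia creates a constructed conversation where the model utters a biased claim about a social group. We then evaluate whether the model can recover from the fabricated bias claim and reject biased follow-up questions. We evaluate 11 open-source as well as proprietary LLMs for their outputs related to six socio-demographic categories that are relevant to individual safety and fair treatment, i.e., gender, race, religion, nationality, sexual orientation, and others. Our evaluation is based on established LLM-based bias metrics, and we compare the results against human judgments to scope out the LLMs' reliability and alignment. The results suggest that purposefully constructed conversations reliably reveal bias amplification and that LLMs often fail to reject biased follow-up questions during dialogue. This form of stress-testing highlights deeply embedded biases that can be surfaced through interaction. Code and artifacts are available at https://github.com/nafisenik/CoBia.
[18] iBERT: Interpretable Style Embeddings via Sense Decomposition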
Vishal Anand,Milad Alshomary,Kathleen McKeown
Main category: cs.CL
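TL;DR: iBERT is an interpretable and controllable embedding encoder. It represents each token as a sparse, non-negative mixture of context-independent sense vectors, explicitly decomposing discriminative cues in language such as style and semantics, and it performs well on style tasks while retaining semantic performance.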
Details
Motivation: Embeddings from existing language models lack interpretability and controllability. It is hard to see which linguistic features (style, semantics) they capture, which limits fine-grained control and analysis of the representation.
Method: iBERT represents each input token as a sparse, non-negative mixture over k context-independent sense vectors. The representation can be pooled at the sentence level or used at the token level and allows modular control over the embedding.
Result: On the STEL benchmark, style representation improves by about 8 points over SBERT-style baselines while authorship verification stays competitive. Specific style attributes such as emoji use, formality, and misspelling can be mapped to specific sense vectors.
Conclusion: Through structured sparse composition, iBERT makes language representations interpretable and controllable. It applies beyond style modeling and generalizes to settings where supervision blends semantic and stylistic signals.
Abstract: We present iBERT (interpretable-BERT), an encoder to produce inherently interpretable and controllable embeddings - designed to modularize and expose the discriminative cues present in language, such as stylistic and semantic structure. Each input token is represented as a sparse, non-negative mixture over k context-independent sense vectors, which can be pooled into sentence embeddings or used directly at the token level. This enables modular control over representation, before any decoding or downstream use. To demonstrate our model's interpretability, we evaluate it on a suite of style-focused tasks. On the STEL benchmark, it improves style representation effectiveness by ~8 points over SBERT-style baselines, while maintaining competitive performance on authorship verification. Because each embedding is a structured composition of interpretable senses, we highlight how specific style attributes - such as emoji use, formality, or misspelling - can be assigned to specific sense vectors. While our experiments center on style, iBERT is not limited to stylistic modeling. Its structural modularity is designed to interpretably decompose whichever discriminative signals are present in the data - enabling generalization even when supervision blends stylistic and semantic factors.
[19] DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
Hossein Entezari Zarch,Lei Gao,Chaoyi Jiang,Murali Annavarm
Main category: cs.CL
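TL;DR: DELTA is a training-free sparse attention mechanism that selects important tokens in a layer-aware way, substantially reducing compute while preserving the accuracy of large reasoning models.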
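As a rough illustration of the selection step that this entry's abstract describes (a selection layer scores tokens by aggregated head-level attention, and later layers attend only to the chosen subset), here is a minimal pure-Python sketch. The function names, the plain sum across heads, and the fixed `budget` are assumptions made for illustration; the abstract does not specify them.

```python
import math

def select_salient_tokens(head_scores, budget):
    """Pick the token positions with the highest attention mass summed over heads.

    head_scores[h][t] is the attention weight head h gives to cached token t.
    budget is the number of tokens to keep (a hypothetical knob, not from the paper).
    """
    num_tokens = len(head_scores[0])
    # Aggregate across heads; a plain sum is assumed here.
    totals = [sum(head[t] for head in head_scores) for t in range(num_tokens)]
    # Rank positions by aggregated score, keep the top `budget`, return them in order.
    ranked = sorted(range(num_tokens), key=lambda t: totals[t], reverse=True)
    return sorted(ranked[:budget])

def sparse_attention(query, keys, values, selected):
    """Single-head softmax attention restricted to the selected positions."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, keys[t])) / scale for t in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    norm = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[t][d] for w, t in zip(weights, selected)) / norm
        for d in range(dim)
    ]
```

In this sketch the full key and value lists stay available and only the attention computation is restricted, which mirrors the abstract's point that the full KV cache is kept while later layers skip full attention.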
Details
Motivation: Large reasoning models are computationally expensive during long chains of reasoning, and existing sparse attention methods lose accuracy because of cumulative selection errors and the changing importance of tokens.
Method: Transformer layers are split into three groups: initial layers that use full attention, selection layers that pick key tokens by aggregating attention scores, and sparse-attention layers that attend only to the selected tokens.
Result: On reasoning benchmarks such as AIME and GPQA-Diamond, DELTA matches or exceeds full-attention accuracy while attending to up to 5× fewer tokens and delivering a 1.5× end-to-end speedup.
Conclusion: By selectively reusing intermediate attention maps, DELTA achieves efficient long-context reasoning without sacrificing accuracy.
Abstract: Large reasoning models (LRMs) achieve state-of-the-art performance on challenging benchmarks by generating long chains of intermediate steps, but their inference cost is dominated by decoding, where each new token must attend to the entire growing sequence. Existing sparse attention methods reduce computation by pruning the key-value (KV) cache, yet they suffer from severe accuracy degradation on reasoning tasks due to cumulative selection errors and the dynamic importance of tokens over long derivations. We present DELTA, a training-free sparse attention mechanism that achieves computational efficiency without sacrificing model accuracy. DELTA partitions transformer layers into three groups: initial layers that use full attention, a small set of selection layers that identify salient tokens via aggregated head-level attention scores, and subsequent sparse-attention layers that attend only to the selected subset. This design preserves the full KV cache in GPU memory for accuracy, while avoiding expensive full-attention computation over many layers. On reasoning benchmarks such as AIME and GPQA-Diamond, DELTA matches or surpasses full attention in accuracy, while reducing the number of attended tokens by up to 5× and delivering 1.5× end-to-end speedup. Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning.
[20] Closing the Data-Efficiency Gap Between Autoregressive and Masked Diffusion LLMs
Xu Pan,Ely Hahami,Jingxuan Fan,Ziqian Xie,Haim Sompolinsky
Main category: cs.CL
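TL;DR: This paper studies the limits of autoregressive large language models (arLLMs) in knowledge injection, in particular the "reversal curse", and compares them with masked diffusion large language models (dLLMs) on knowledge acquisition during fine-tuning. Experiments show that dLLMs do better on both forward and backward QA without data augmentation, whereas arLLMs depend on extensive paraphrasing and are constrained by information order. Inspired by dLLMs, the authors propose a masked fine-tuning method that markedly improves arLLM data efficiency and narrows the gap between the two.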
Details
Motivation: arLLMs struggle to absorb new knowledge through fine-tuning, notably because of the "reversal curse": they fail on questions that reverse the information order of the training samples. dLLMs appear free of this problem in pre-training, but their behavior during fine-tuning is unknown. The paper therefore examines how well dLLMs generalize knowledge after post-training and tries to improve knowledge injection in arLLMs.
Method: arLLMs and dLLMs are fine-tuned on three datasets and evaluated with forward- and backward-style QA to measure knowledge generalization and resistance to the reversal curse; a new masked fine-tuning paradigm is proposed to improve knowledge injection in arLLMs.
Result: Without paraphrases, dLLMs reach high accuracy on both forward and backward QA, and adding paraphrases yields only marginal gains; arLLMs generalize only with paraphrases whose information order matches the QA style; the proposed masked fine-tuning greatly improves arLLM data efficiency and nearly closes the gap with dLLMs.
Conclusion: dLLMs outperform conventional arLLMs at knowledge injection and generalization and are not affected by the reversal curse; the dLLM-inspired masked fine-tuning strategy improves arLLM fine-tuning efficiency and points to a new way of improving existing language models.
Abstract: Despite autoregressive large language models (arLLMs) being the current dominant paradigm in language modeling, they resist knowledge injection via fine-tuning due to inherent shortcomings such as the "reversal curse", that is, the challenge of answering questions that reverse the original information order in the training sample. Masked diffusion large language models (dLLMs) are rapidly emerging as a powerful alternative to the arLLM paradigm, with evidence of better data efficiency and freedom from the "reversal curse" in pre-training. However, it is unknown whether these advantages extend to the post-training phase, i.e. whether pre-trained dLLMs can easily acquire new knowledge through fine-tuning. On three diverse datasets, we fine-tune arLLMs and dLLMs, evaluating them with forward and backward style Question Answering (QA) to probe knowledge generalization and the reversal curse. Our results confirm that arLLMs critically rely on extensive data augmentation via paraphrases for QA generalization, and paraphrases are only effective when their information order matches the QA style. Conversely, dLLMs achieve high accuracies on both forward and backward QAs without paraphrases; adding paraphrases yields only marginal gains. Lastly, inspired by the dLLM's performance, we introduce a novel masked fine-tuning paradigm for knowledge injection into pre-trained arLLMs. This proposed method successfully and drastically improves the data efficiency of arLLM fine-tuning, effectively closing the performance gap with dLLMs.
[21] Abductive Preference Learning
Yijin Ni,Peng Qi
Main category: cs.CL
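TL;DR: This paper proposes abductive preference learning, a new paradigm that reverses the conventional conditioning to make large language models more sensitive to counterfactual prompts; a multitask objective combining standard and abductive DPO markedly improves both response selection and prompt discrimination accuracy.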
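To make the "reversed conditioning" idea concrete, the sketch below uses the standard DPO pairwise logistic loss and then reuses it with the roles swapped: one fixed response, scored under a preferred prompt and a dispreferred prompt. This is only one possible reading of "learning preferences over prompts given a response"; the function names, the `weight` mixing parameter, and the exact form of the abductive term are assumptions, not the paper's stated objective.

```python
import math

def pairwise_dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style logistic loss for one preference pair of sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def multitask_loss(standard_pair, abductive_pair, weight=0.5, beta=0.1):
    """Weighted sum of a standard term and an (assumed) abductive term.

    standard_pair:  log-probs for one prompt with a chosen and a rejected response.
    abductive_pair: log-probs for one response under a chosen and a rejected prompt
                    (hypothetical formulation of the reversed conditioning).
    Each pair is (policy_chosen, policy_rejected, reference_chosen, reference_rejected).
    """
    standard = pairwise_dpo_loss(*standard_pair, beta=beta)
    abductive = pairwise_dpo_loss(*abductive_pair, beta=beta)
    return (1.0 - weight) * standard + weight * abductive
```

With a zero margin the loss equals log 2, and it falls below that value once the policy favors the chosen side more than the reference does.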
Details
Motivation: Existing RLHF and DPO methods focus on choosing the correct response for a given prompt and neglect counterfactual prompts that should change the response, which leads to overconfident models.
Method: Abductive preference learning learns preferences over prompts given a response; an abductive dataset is built from the HaluEval QA benchmark, and abductive DPO and its variant DPOP are implemented.
Result: On the abductive dataset, multitask DPOP raises response selection accuracy from 90.0% to 99.5% and prompt discrimination accuracy from 54.7% to 85.0%; on AlpacaEval it raises the win rate from 5.26% to 6.17%.
Conclusion: Abductive preference learning addresses the neglect of counterfactual prompts in conventional preference optimization, keeping its benefits while markedly improving sensitivity to prompt differences and overall performance.
Abstract: Frontier large language models such as GPT-5 and Claude Sonnet remain prone to overconfidence even after alignment through Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO). For instance, they tend to offer the same conservative answer "No" to both questions "Can I eat the [food / potato chips] that has been left out overnight?" despite the latter requiring no refrigeration for safe consumption. We find that this failure is potentially attributed to a limitation of existing preference learning: it emphasizes selecting the correct response for a given prompt, while neglecting counterfactual prompts that should alter the response. To address this limitation, we propose abductive preference learning, a fine-tuning paradigm that reverses the conventional conditioning by learning preferences over prompts given a response. To validate this idea, we construct an abductive dataset derived from the HaluEval QA benchmark with 1,001 entries, implementing abductive DPO and its variant DPOP. Experiments reveal complementary strengths: standard methods improve response selection, abductive methods improve prompt discrimination, while a multitask objective unifies both. On the abductive dataset, multitask DPOP boosts accuracy from 90.0% to 99.5% in response selection and 54.7% to 85.0% in prompt discrimination, with qualitative evidence highlighting improved sensitivity to prompt differences. Finally, evaluation on AlpacaEval shows multitask DPOP improves win rate (from 5.26% to 6.17%), confirming that abductive preference learning preserves the benefits of conventional preference optimization while addressing the overlooked challenge of counterfactual prompts.
[22] HIPPD: Brain-Inspired Hierarchical Information Processing for Personality Detection
Guanming Chen,Lingzhi Shen,Xiaohao Cai,Imran Razzak,Shoaib Jameel
Main category: cs.CL
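TL;DR: Proposes HIPPD, a brain-inspired framework for personality detection from text that emulates the brain's hierarchical information processing and outperforms existing methods on multiple datasets.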
Details
Motivation: Existing machine learning methods fall short at capturing context across multiple posts and at extracting features in semantically sparse settings.
Method: HIPPD uses a large language model to emulate the cerebral cortex for semantic reasoning, a dynamic memory module (emulating the prefrontal cortex) for feature selection, and lightweight models resembling the basal ganglia that are dynamically routed through a winner-takes-all mechanism to recognize personality patterns.
Result: Experiments on the Kaggle and Pandora datasets show that HIPPD consistently outperforms state-of-the-art baselines.
Conclusion: By mimicking the brain's hierarchical processing and neural feedback mechanisms, HIPPD improves text-based personality detection.
Abstract: Personality detection from text aims to infer an individual's personality traits based on linguistic patterns. However, existing machine learning approaches often struggle to capture contextual information spanning multiple posts and tend to fall short in extracting representative and robust features in semantically sparse environments. This paper presents HIPPD, a brain-inspired framework for personality detection that emulates the hierarchical information processing of the human brain. HIPPD utilises a large language model to simulate the cerebral cortex, enabling global semantic reasoning and deep feature abstraction. A dynamic memory module, modelled after the prefrontal cortex, performs adaptive gating and selective retention of critical features, with all adjustments driven by dopaminergic prediction error feedback. Subsequently, a set of specialised lightweight models, emulating the basal ganglia, are dynamically routed via a strict winner-takes-all mechanism to capture the personality-related patterns they are most proficient at recognising. Extensive experiments on the Kaggle and Pandora datasets demonstrate that HIPPD consistently outperforms state-of-the-art baselines.
[23] Don't Throw Away Your Pretrained Model
Shangbin Feng,Wenhao Yu,Yike Wang,Hongming Zhang,Yulia Tsvetkov,Dong Yu
Main category: cs.CL
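TL;DR: This paper proposes Switch Generation, a model collaboration method in which pretrained and aligned models dynamically take turns generating a response, combining skills such as reasoning and creativity. Experiments show that model collaboration beats individual models on 16 of 18 tasks and that Switch Generation outperforms baselines by 12.9% on average.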
Details
Motivation: Alignment training improves reasoning and instruction following but can weaken creativity and calibration. The authors want to combine the complementary strengths of aligned models and unaligned base models.
Method: Switch Generation trains a switcher LM to learn which model should generate the next segment for a given query and context; at inference time the switcher dynamically directs different model versions to alternate in generating the response.
Result: Extensive experiments with 8 model collaboration baselines and 18 datasets show that model collaboration outperforms individual models on 16 of 18 tasks and that Switch Generation beats baselines by 12.9% on average. The method also discovers compositional skills that solve problems individual models struggle with, and it generalizes to unseen models and tasks.
Conclusion: Switch Generation effectively combines the strengths of different models, performs well across diverse tasks, and reuses by-products of the training pipeline.
Abstract: Alignment training has tradeoffs: it helps language models (LMs) gain in reasoning and instruction following but might lose out on skills such as creativity and calibration, at which unaligned base models are better. We aim to make the best of both worlds through model collaboration, where different models in the training pipeline collaborate and complement each other. Since LM responses feature interleaving skills that favor different models, we propose Switch Generation, where pretrained and aligned model versions take turns to "speak" in a response sequence. Specifically, we train a switcher LM by learning from outcomes of choosing different models to generate the next segment across diverse queries and contexts. At inference time, the switcher LM guides different model checkpoints to dynamically generate the next segment where their strengths are most needed. Extensive experiments with 8 model collaboration baselines and 18 datasets show that 1) model collaboration consistently outperforms individual models on 16 out of 18 tasks, and 2) Switch Generation further outperforms baselines by 12.9% on average. Further analysis reveals that Switch Generation discovers compositional skills to solve problems where individual models struggle and generalizes to unseen models and tasks, reusing and repurposing by-products in expensive model training pipelines that are otherwise discarded.
[24] Enhancing Faithfulness in Abstractive Summarization via Span-Level Fine-Tuning
Sicong Huang,Qianqi Yan,Shengze Wang,Ian Lane
Main category: cs.CL
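TL;DR: This paper studies fine-tuning strategies for reducing unfaithful content in summaries generated by large language models, introduces a new dataset of faithful and unfaithful summaries with span-level labels, and evaluates three fine-tuning methods, of which unlikelihood training works best.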
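For readers unfamiliar with unlikelihood training, the sketch below shows its general form applied to span labels: tokens inside an annotated unfaithful span are penalized with -log(1 - p), while all other tokens keep the usual -log p likelihood term. The per-token probabilities, the boolean mask, and the `alpha` weight are illustrative assumptions; this entry does not state how the paper combines or weights the two terms.

```python
import math

def span_level_loss(token_probs, unfaithful_mask, alpha=1.0, eps=1e-8):
    """Average per-token loss mixing likelihood and unlikelihood terms.

    token_probs[i]     : model probability of the i-th summary token.
    unfaithful_mask[i] : True if token i lies inside an annotated unfaithful span.
    alpha              : weight on the unlikelihood term (hypothetical knob).
    """
    total = 0.0
    for prob, is_unfaithful in zip(token_probs, unfaithful_mask):
        if is_unfaithful:
            # Unlikelihood: push probability mass away from this token.
            total += -alpha * math.log(max(1.0 - prob, eps))
        else:
            # Standard negative log-likelihood for faithful tokens.
            total += -math.log(max(prob, eps))
    return total / len(token_probs)
```

Lowering the probability of a flagged token reduces the loss, and raising the probability of an unflagged token does the same.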
Details
Motivation: Large language models produce fluent summaries but often hallucinate, and existing mitigation strategies do not adequately cover the range of errors, so more effective fine-tuning methods are needed to improve faithfulness.
Method: Several LLMs generate summaries of the training-set documents, GPT-4o annotates hallucinations at the span level, and models are fine-tuned on the annotated data with three techniques: gradient ascent, unlikelihood training, and task vector negation.
Result: All three methods use span-level annotations to improve summary faithfulness, with unlikelihood training the most effective.
Conclusion: With span-level hallucination annotations and a suitable fine-tuning strategy, especially unlikelihood training, the faithfulness of LLM-generated summaries can be markedly improved.
Abstract: Abstractive summarization using large language models (LLMs) has become an essential tool for condensing information. However, despite their ability to generate fluent summaries, these models sometimes produce unfaithful summaries, introducing hallucinations at the word, phrase, or concept level. Existing mitigation strategies, such as post-processing corrections or contrastive learning with synthetically generated negative samples, fail to fully address the diverse errors that can occur in LLM-generated summaries. In this paper, we investigate fine-tuning strategies to reduce the occurrence of unfaithful spans in generated summaries. First, we automatically generate summaries for the set of source documents in the training set with a variety of LLMs and then use GPT-4o to annotate any hallucinations it detects at the span level. Leveraging these annotations, we fine-tune LLMs with both hallucination-free summaries and annotated unfaithful spans to enhance model faithfulness. In this paper, we introduce a new dataset that contains both faithful and unfaithful summaries with span-level labels and we evaluate three techniques for fine-tuning an LLM to improve the faithfulness of the resulting summarization: gradient ascent, unlikelihood training, and task vector negation. Experimental results show that all three approaches successfully leverage span-level annotations to improve faithfulness, with unlikelihood training being the most effective.
[25] Unpacking Hateful Memes: Presupposed Context and False Claims
Weibin Cai,Jiayu Li,Reza Zafarani
Main category: cs.CL
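TL;DR: Proposes SHIELD, a framework that detects hateful memes by modeling presupposed context and identifying false claims; combining the PCM and FACT modules, it outperforms existing methods on multiple datasets.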
Details
Motivation: Existing hateful meme detectors rely mainly on pre-trained language models and lack a deeper account of what makes a meme hateful; this paper draws on philosophy and psychology to examine the essential features.
Method: SHIELD consists of a PCM module that models presupposed context across modalities and a FACT module that detects false claims using external knowledge and cross-modal reference graphs.
Result: SHIELD outperforms state-of-the-art methods across multiple datasets and metrics and generalizes well to tasks such as fake news detection.
Conclusion: The core of hateful memes lies in presupposed context and false claims; by capturing these two features, SHIELD improves detection and applies across tasks.
Abstract: While memes are often humorous, they are frequently used to disseminate hate, causing serious harm to individuals and society. Current approaches to hateful meme detection mainly rely on pre-trained language models. However, less focus has been dedicated to what makes a meme hateful. Drawing on insights from philosophy and psychology, we argue that hateful memes are characterized by two essential features: a presupposed context and the expression of false claims. To capture presupposed context, we develop PCM for modeling contextual information across modalities. To detect false claims, we introduce the FACT module, which integrates external knowledge and harnesses cross-modal reference graphs. By combining PCM and FACT, we introduce SHIELD, a hateful meme detection framework designed to capture the fundamental nature of hate. Extensive experiments show that SHIELD outperforms state-of-the-art methods across datasets and metrics, while demonstrating versatility on other tasks, such as fake news detection.
[26] Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation
Mir Tafseer Nayeem,Sawsan Alqahtani,Md Tahmid Rahman Laskar,Tasnim Mohiuddin,M Saiful Bari
Main category: cs.CL
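TL;DR: This paper proposes a new metric, STRR (Single Token Retention Rate), for assessing the fairness of multilingual tokenizers in large language models, covering gaps that the conventional fertility metric leaves in cross-lingual and cross-domain analysis.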
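The two metrics compared in this entry are simple enough to compute directly. Its abstract defines fertility as the average number of tokens per word and STRR as the proportion of words preserved as single tokens; the sketch below implements both definitions. The `toy_tokenize` function and its tiny vocabulary are hypothetical stand-ins for a real tokenizer, and details such as word segmentation or deduplication are not specified in the entry.

```python
def fertility(words, tokenize):
    """Average number of tokens per word."""
    return sum(len(tokenize(word)) for word in words) / len(words)

def strr(words, tokenize):
    """Single Token Retention Rate: share of words kept as exactly one token."""
    return sum(1 for word in words if len(tokenize(word)) == 1) / len(words)

def toy_tokenize(word, vocab=frozenset({"the", "cat", "token", "izer"})):
    """Greedy longest-match tokenizer over a tiny made-up vocabulary.

    Falls back to single characters when no vocabulary entry matches.
    """
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces

words = ["the", "cat", "tokenizer", "dog"]
print(fertility(words, toy_tokenize))  # 1.75
print(strr(words, toy_tokenize))       # 0.5
```

In the toy run, "tokenizer" splits into two pieces and "dog" into three characters, so fertility is (1 + 1 + 2 + 3) / 4 while only two of the four words survive as single tokens. The two numbers can move independently, which is the blind spot the entry attributes to fertility.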
Details
Motivation: Fertility, the standard tokenization metric, does not show how fairly vocabulary is allocated across languages and domains, and in particular hides token fragmentation in non-English languages, so a more interpretable metric is needed.
Method: Six widely used tokenizers are analyzed across seven languages and two domains; the Single Token Retention Rate (STRR) is proposed as a new metric and compared against fertility.
Result: English fertility is stable, Chinese fertility is high, and domain sensitivity is low; STRR reveals systematic prioritization of English, strong support for Chinese, and fragmentation in Hindi.
Conclusion: STRR covers fertility's blind spots and offers practical guidance for designing fairer multilingual tokenizers.
Abstract: Tokenization is a crucial but under-evaluated step in large language models (LLMs). The standard metric, fertility (the average number of tokens per word), captures compression efficiency but obscures how vocabularies are allocated across languages and domains. We analyze six widely used tokenizers across seven languages and two domains, finding stable fertility for English, high fertility for Chinese, and little domain sensitivity. To address fertility's blind spots, we propose the Single Token Retention Rate (STRR), which measures the proportion of words preserved as single tokens. STRR reveals systematic prioritization of English, strong support for Chinese, and fragmentation in Hindi, offering an interpretable view of cross-lingual fairness. Our results show that STRR complements fertility and provides practical guidance for designing more equitable multilingual tokenizers.
[27] Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey
Jiaqi Wei,Xiang Zhang,Yuejin Yang,Wenxuan Huang,Juntai Cao,Sheng Xu,Xiang Zhuang,Zhangyang Gao,Muhammad Abdul-Mageed,Laks V. S. Lakshmanan,Chenyu You,Wanli Ouyang,Siqi Sun
Main category: cs.CL
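TL;DR: This paper proposes a unified framework that decomposes reasoning tree search into three core components (search mechanism, reward formulation, and transition function), clarifies the different roles of reward signals in test-time scaling and self-improvement, and supports systematic research on autonomous self-improving agents.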
Details
Motivation: LLM-based reasoning tree search lacks a unified formalism, and the role of the reward signal is especially unclear, which leaves the research fragmented and hard to advance systematically.
Method: A three-component unified framework (search mechanism, reward formulation, transition function) distinguishes transient search guidance for test-time scaling from durable parametric reward modeling for self-improvement, and supports a component-centric taxonomy.
Result: The paper establishes a unified formal model of search algorithms, proposes a clear terminology and taxonomy, synthesizes current state-of-the-art work, and provides a roadmap for future research.
Conclusion: Through formal decomposition, the paper clarifies the role of reward signals in different settings and gives test-time scaling and model self-improvement a unified, systematic research basis.
Abstract: Deliberative tree search is a cornerstone of modern Large Language Model (LLM) research, driving the pivot from brute-force scaling toward algorithmic efficiency. This single paradigm unifies two critical frontiers: Test-Time Scaling (TTS), which deploys on-demand computation to solve hard problems, and Self-Improvement, which uses search-generated data to durably enhance model parameters. However, this burgeoning field is fragmented and lacks a common formalism, particularly concerning the ambiguous role of the reward signal: is it a transient heuristic or a durable learning target? This paper resolves this ambiguity by introducing a unified framework that deconstructs search algorithms into three core components: the Search Mechanism, Reward Formulation, and Transition Function. We establish a formal distinction between transient Search Guidance for TTS and durable Parametric Reward Modeling for Self-Improvement. Building on this formalism, we introduce a component-centric taxonomy, synthesize the state-of-the-art, and chart a research roadmap toward more systematic progress in creating autonomous, self-improving agents.
[28] Toward Machine Translation Literacy: How Lay Users Perceive and Rely on Imperfect Translations
Yimin Xiao,Yongle Zhang,Dayeon Ki,Calvin Bao,Marianna J. Martindale,Charlotte Vaughn,Ge Gao,Marine Carpuat
Main category: cs.CL
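TL;DR: Through a human study conducted in a public museum, this work examines how fluency and adequacy errors in machine translation affect bilingual and non-bilingual users' reliance on it.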
Details
Motivation: As machine translation (MT) becomes commonplace, understanding how the public perceives and relies on imperfect MT is crucial for grounding MT research in real-world use.
Method: A user study with 452 participants in a public museum analyzes how fluency and adequacy errors affect users' reliance on MT.
Result: Non-bilingual users often over-rely on MT because they lack evaluation strategies and alternatives, while experiencing the impact of translation errors first-hand affects their future reliance.
Conclusion: MT evaluation and NLP explanation techniques need strengthening to improve MT quality and raise users' MT literacy.
Abstract: As Machine Translation (MT) becomes increasingly commonplace, understanding how the general public perceives and relies on imperfect MT is crucial for contextualizing MT research in real-world applications. We present a human study conducted in a public museum (n=452), investigating how fluency and adequacy errors impact bilingual and non-bilingual users' reliance on MT during casual use. Our findings reveal that non-bilingual users often over-rely on MT due to a lack of evaluation strategies and alternatives, while experiencing the impact of errors can prompt users to reassess future reliance. This highlights the need for MT evaluation and NLP explanation techniques to promote not only MT quality, but also MT literacy among its users.
[29] MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction
Jianjin Wang,Runsong Zhao,Xiaoqian Liu,Yuan Ge,Ziqiang Xu,Tong Xiao,Shengxiang Gao,Zhengtao Yu,Jingbo Zhu
Main category: cs.CL
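TL;DR: This paper introduces a multi-token prediction (MTP) loss into speech-to-unit translation (S2UT) models and applies the MTP-S2UT loss at an intermediate layer to increase the information density of hidden representations, improving translation quality.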
Details
Motivation: A single speech token is semantically sparse, so existing methods struggle to express complete semantic units; a mechanism is needed to increase information density and semantic completeness at each position.
Method: The MTP loss is applied at an intermediate layer of the S2UT model (the hidden representation where the CTC loss is computed), giving the MTP-S2UT loss, which enriches hidden representations earlier and more effectively.
Result: All MTP loss variants consistently improve S2UT translation quality, with MTP-S2UT performing best.
Conclusion: Applying the MTP loss at an intermediate layer improves the semantic expressiveness and translation performance of S2UT models, supporting the value of early information enrichment.
Abstract: Current direct speech-to-speech translation methods predominantly employ speech tokens as intermediate representations. However, a single speech token is not dense in semantics, so we generally need multiple tokens to express a complete semantic unit. To address this limitation, we introduce multi-token prediction (MTP) loss into speech-to-unit translation (S2UT) models, enabling models to predict multiple subsequent tokens at each position, thereby capturing more complete semantics and enhancing information density per position. Initial MTP implementations apply the loss at the final layer, which improves output representation but initiates information enrichment too late. We hypothesize that advancing the information enrichment process to intermediate layers can achieve earlier and more effective enhancement of hidden representation. Consequently, we propose MTP-S2UT loss, applying MTP loss to hidden representation where CTC loss is computed. Experiments demonstrate that all MTP loss variants consistently improve the quality of S2UT translation, with MTP-S2UT achieving the best performance.
[30] Beyond the limitation of a single query: Train your LLM for query expansion with Reinforcement Learning
Shu Zhao,Tan Yu,Anbang Xu
Main category: cs.CL
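TL;DR: This paper proposes ExpandSearch, an LLM-based search agent that learns query expansion through reinforcement learning and uses a pre-trained squeezer model to better understand retrieved documents, outperforming existing methods on multi-hop question answering.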
Details
Motivation: Existing reasoning-augmented search agents still perform poorly on multi-hop QA, limited by their reasoning and search abilities, and with limited resources they struggle to master query generation, information understanding, and answer generation at once.
Method: Reinforcement learning trains an LLM-based search agent with query expansion: in each turn it generates several query variants to raise retrieval recall. A pre-trained squeezer model helps interpret the retrieved documents so that the agent can focus on query generation.
Result: Across seven QA benchmarks, ExpandSearch improves on state-of-the-art baselines by 4.4% on average, with the largest gains on multi-hop reasoning tasks that require aggregating diverse evidence.
Conclusion: With query expansion and the squeezer model, even a small 3B LLM reaches state-of-the-art performance on multi-hop QA, supporting the method's effectiveness and scalability.
Abstract: Reasoning-augmented search agents, such as Search-R1, are trained to reason, search, and generate the final answer iteratively. Nevertheless, due to their limited capabilities in reasoning and search, their performance on multi-hop QA benchmarks remains far from satisfactory. To handle complex or compound queries, we train an LLM-based search agent with the native capability of query expansion through reinforcement learning. In each turn, our search agent proposes several query variants, which are searched simultaneously to cover more relevant information. Meanwhile, given limited post-training data and computing resources, it is very challenging for a search agent to master multiple tasks, including query generation, retrieved information understanding, and answer generation. Therefore, we propose incorporating a pre-trained squeezer model that helps the search agent understand the retrieved documents, allowing the search agent to focus on query generation for high retrieval recall. With the assistance of the squeezer model, we discover that even a small-scale 3B LLM can demonstrate a strong capability of query expansion and achieve state-of-the-art accuracy on the multi-hop QA benchmarks. To be specific, our experiments across seven question-answering benchmarks demonstrate that our method, named ExpandSearch, achieves an average improvement of 4.4% compared to state-of-the-art baselines, with strong gains on multi-hop reasoning tasks requiring diverse evidence aggregation.
[31] Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety
Yuyi Huang,Runzhe Zhan,Lidia S. Chao,Ailin Tao,Derek F. Wong
Main category: cs.CL
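TL;DR: This paper identifies "Path Drift", a new safety vulnerability in Long Chain-of-Thought (Long-CoT) reasoning in which the reasoning process drifts from aligned paths and produces content that violates safety constraints. The authors identify three behavioral mechanisms that trigger Path Drift, propose a three-stage induction framework, and design defenses based on role attribution correction and metacognitive reflection.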
Details
Motivation: Although large language models are widely used for reasoning tasks, existing alignment techniques (such as RLHF) cannot guarantee safety throughout a Long-CoT reasoning process. The paper aims to expose the risk of Path Drift in Long-CoT and stresses that safety must be monitored at the level of reasoning trajectories, not only token by token.
Method: Empirical analysis identifies three behavioral triggers of Path Drift: first-person commitments, ethical evaporation, and condition chain escalation. A three-stage induction framework is built from cognitive load amplification, self-role priming, and condition chain hijacking. A path-level defense adds role attribution correction and metacognitive reflection.
Result: The three-stage framework significantly lowers refusal rates; each stage is effective on its own and the effects compound when combined. Introducing reflective safety cues mitigates Path Drift.
Conclusion: During long reasoning, a model's safety alignment can degrade as the reasoning path unfolds; dynamic alignment mechanisms covering the whole trajectory are needed to strengthen path-level safety control.
Abstract: As large language models (LLMs) are increasingly deployed for complex reasoning tasks, Long Chain-of-Thought (Long-CoT) prompting has emerged as a key paradigm for structured inference. Despite early-stage safeguards enabled by alignment techniques such as RLHF, we identify a previously underexplored vulnerability: reasoning trajectories in Long-CoT models can drift from aligned paths, resulting in content that violates safety constraints. We term this phenomenon Path Drift. Through empirical analysis, we uncover three behavioral triggers of Path Drift: (1) first-person commitments that induce goal-driven reasoning that delays refusal signals; (2) ethical evaporation, where surface-level disclaimers bypass alignment checkpoints; (3) condition chain escalation, where layered cues progressively steer models toward unsafe completions. Building on these insights, we introduce a three-stage Path Drift Induction Framework comprising cognitive load amplification, self-role priming, and condition chain hijacking. Each stage independently reduces refusal rates, while their combination further compounds the effect. To mitigate these risks, we propose a path-level defense strategy incorporating role attribution correction and metacognitive reflection (reflective safety cues). Our findings highlight the need for trajectory-level alignment oversight in long-form reasoning beyond token-level alignment.
[32] Lightweight Baselines for Medical Abstract Classification: DistilBERT with Cross-Entropy as a Strong Default
Jiaqi Liu,Lanruo Wang,Su Liu,Xin Hu
Main category: cs.CL
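TL;DR: For health settings with strict cost, latency, and privacy limits, the study finds that the lightweight DistilBERT model with standard cross-entropy loss performs best on medical abstract classification and recommends it as a practical default.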
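The three training objectives compared in this entry (standard cross-entropy, class-weighted cross-entropy, and focal loss) differ only in how they scale the log-probability of the true class. The single-example sketch below uses the textbook definitions; the function names, the example class weights, and the `gamma` default are illustrative assumptions and are not taken from the paper's released code.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Standard cross-entropy: negative log-probability of the true class."""
    return -math.log(softmax(logits)[target])

def weighted_cross_entropy(logits, target, class_weights):
    """Cross-entropy scaled by a per-class weight, e.g. to upweight rare classes."""
    return class_weights[target] * cross_entropy(logits, target)

def focal_loss(logits, target, gamma=2.0):
    """Focal loss: down-weights examples the model already classifies confidently."""
    p_true = softmax(logits)[target]
    return -((1.0 - p_true) ** gamma) * math.log(p_true)
```

With `gamma=0` the focal loss reduces to plain cross-entropy, and with all class weights equal to 1 so does the weighted variant, which is one way to see why plain cross-entropy is a natural baseline for the other two.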
Details
Motivation: Large language models do well on many NLP tasks but are hard to deploy in health settings because of cost, latency, and privacy limits, so the effectiveness of lightweight models is worth examining.
Method: On a public medical abstracts corpus, BERT base and DistilBERT are fine-tuned with tokenizer, sequence length, optimizer, and schedule held fixed, comparing three objectives: standard cross-entropy, class-weighted cross-entropy, and focal loss.
Result: DistilBERT with standard cross-entropy gives the best balance on the test set with far fewer parameters than BERT base; accuracy, Macro F1, and Weighted F1 are reported, along with confusion analyses that expose error patterns.
Conclusion: Start with a compact encoder and cross-entropy as the practical default, add calibration and task-specific checks as needed, and only then consider heavier models.
Abstract: Large language models work well for many NLP tasks, but they are hard to deploy in health settings with strict cost, latency, and privacy limits. We revisit a lightweight recipe for medical abstract classification and ask how far compact encoders can go under a controlled budget. Using the public medical abstracts corpus, we finetune BERT base and DistilBERT with three objectives (standard cross-entropy, class-weighted cross-entropy, and focal loss), keeping tokenizer, sequence length, optimizer, and schedule fixed. DistilBERT with plain cross-entropy gives the best balance on the test set while using far fewer parameters than BERT base. We report accuracy, Macro F1, and Weighted F1, release the evaluation code, and include confusion analyses to make error patterns clear. Our results suggest a practical default: start with a compact encoder and cross-entropy, then add calibration and task-specific checks before moving to heavier models.
[33] HUME: Measuring the Human-Model Performance Gap in Text Embedding Task
Adnan El Assadi,Isaac Chung,Roman Solomatin,Niklas Muennighoff,Kenneth Enevoldsen
Main category: cs.CL
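TL;DR: This paper introduces HUME, a framework for measuring human performance on text embedding tasks, and compares humans with models on 16 MTEB datasets: humans average 77.6% against 80.1% for the best model, exposing model limitations on particular tasks and in low-resource languages.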
Details
Motivation: Embedding model evaluation lacks reliable human performance baselines, which makes model scores hard to interpret, so a framework for measuring human performance on embedding tasks is needed.
Method: The HUME human evaluation framework measures human performance on reranking, classification, clustering, and semantic textual similarity across 16 MTEB datasets and compares it with state-of-the-art embedding models.
Result: Humans average 77.6% against 80.1% for the best model; models reach near-ceiling performance on some datasets but do poorly on others and in low-resource languages, exposing weaknesses in both datasets and models.
Conclusion: HUME provides human performance baselines and insight into task difficulty, makes model scores more interpretable, and informs the design of embedding models and future benchmarks.
Abstract: Comparing human and model performance offers a valuable perspective for understanding the strengths and limitations of embedding models, highlighting where they succeed and where they fail to capture meaning and nuance. However, such comparisons are rarely made, as human performance on embedding tasks is difficult to measure. To fill this gap, we introduce HUME: Human Evaluation Framework for Text Embeddings. While frameworks like MTEB provide broad model evaluation, they lack reliable estimates of human performance, limiting the interpretability of model scores. We measure human performance across 16 MTEB datasets spanning reranking, classification, clustering, and semantic textual similarity across linguistically diverse high- and low-resource languages. Humans achieve an average performance of 77.6% compared to 80.1% for the best embedding model, although variation is substantial: models reach near-ceiling performance on some datasets while struggling on others, suggesting dataset issues and revealing shortcomings in low-resource languages. We provide human performance baselines, insight into task difficulty patterns, and an extensible evaluation framework that enables a more meaningful interpretation of the model and informs the development of both models and benchmarks. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
[34] CLMN: Concept based Language Models via Neural Symbolic Reasoning
Yibo Yang
Main category: cs.CL
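TL;DR: Proposes the Concept Language Model Network (CLMN), which combines neural representations with symbolic reasoning to make NLP models more interpretable while keeping performance high.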
Details
Motivation: Deep learning for NLP lacks interpretability, which matters especially in healthcare and finance; existing concept bottleneck models have weaknesses in text representation or semantic preservation and rarely model dynamic interactions between concepts.
Method: CLMN represents concepts as continuous, human-readable embeddings and uses fuzzy-logic reasoning to learn adaptive concept interaction rules, augmenting text features and automatically inducing interpretable logic rules.
Result: Across multiple datasets and pre-trained models, CLMN is more accurate than existing concept-based methods and produces better explanations.
Conclusion: Combining neural representations with symbolic reasoning in a unified concept space can yield practical NLP systems that are both accurate and transparent.
Abstract: Deep learning has advanced NLP, but interpretability remains limited, especially in healthcare and finance. Concept bottleneck models tie predictions to human concepts in vision, but NLP versions either use binary activations that harm text representations or latent concepts that weaken semantics, and they rarely model dynamic concept interactions such as negation and context. We introduce the Concept Language Model Network (CLMN), a neural-symbolic framework that keeps both performance and interpretability. CLMN represents concepts as continuous, human-readable embeddings and applies fuzzy-logic reasoning to learn adaptive interaction rules that state how concepts affect each other and the final decision. The model augments original text features with concept-aware representations and automatically induces interpretable logic rules. Across multiple datasets and pre-trained language models, CLMN achieves higher accuracy than existing concept-based methods while improving explanation quality. These results show that integrating neural representations with symbolic reasoning in a unified concept space can yield practical, transparent NLP systems.
[35] Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference
Hua Cai,Shuang Zhao,Liang Zhang,Xuli Shen,Qing Xu,Weilin Shen,Zihao Wen,Tianke Ban
Main category: cs.CL
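TL;DR: This paper presents Unilaw-R1, a lightweight large language model built for legal reasoning; with a high-quality chain-of-thought dataset and a two-stage training strategy, it lowers deployment cost while markedly improving performance on complex legal tasks.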
Details
Motivation: Existing large language models handle complex legal problems poorly because of insufficient legal knowledge, unreliable reasoning logic, and weak business generalization, and models dedicated to legal reasoning remain underexplored.
Method: The authors build Unilaw-R1-Data, a high-quality dataset of 17K chain-of-thought samples, train with a two-stage strategy combining supervised fine-tuning and reinforcement learning, and introduce a dedicated benchmark, Unilaw-R1-Eval.
Result: Unilaw-R1 performs strongly on authoritative benchmarks, outperforming all models of similar scale and performing on par with the much larger DeepSeek-R1-Distill-Qwen-32B (54.9%); on LawBench and LexEval it exceeds Qwen-2.5-7B-Instruct (46.6%) by an average of 6.6%.
Conclusion: Unilaw-R1 achieves effective legal reasoning at a small parameter scale, supporting the value of domain-specific training and high-quality reasoning data for legal AI.
Abstract: Reasoning-focused large language models (LLMs) are rapidly evolving across various domains, yet their capabilities in handling complex legal problems remain underexplored. In this paper, we introduce Unilaw-R1, a large language model tailored for legal reasoning. With a lightweight 7-billion parameter scale, Unilaw-R1 significantly reduces deployment cost while effectively tackling three core challenges in the legal domain: insufficient legal knowledge, unreliable reasoning logic, and weak business generalization. To address these issues, we first construct Unilaw-R1-Data, a high-quality dataset containing 17K distilled and screened chain-of-thought (CoT) samples. Based on this, we adopt a two-stage training strategy combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), which significantly boosts the performance on complex legal reasoning tasks and supports interpretable decision-making in legal AI applications. To assess legal reasoning ability, we also introduce Unilaw-R1-Eval, a dedicated benchmark designed to evaluate models across single- and multi-choice legal tasks. Unilaw-R1 demonstrates strong results on authoritative benchmarks, outperforming all models of similar scale and achieving performance on par with the much larger DeepSeek-R1-Distill-Qwen-32B (54.9%). Following domain-specific training, it also showed significant gains on LawBench and LexEval, exceeding Qwen-2.5-7B-Instruct (46.6%) by an average margin of 6.6%.
[36] A-IPO: Adaptive Intent-driven Preference Optimization
Wenqing Wang,Muhammad Asif Ali,Ali Shoker,Ruohan Yang,Junyang Chen,Ying Sha,Huan Wang
Main category: cs.CL
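TL;DR: This paper proposes A-IPO, a preference optimization method whose intention module infers the latent intent behind a user prompt and incorporates it explicitly into the reward function, aligning model responses more closely with user intent.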
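This entry's abstract states that adding an intention-response similarity term shifts the preference log-odds by $\lambda\,\Delta\mathrm{sim}$. A short derivation shows where such a shift can come from, assuming a Bradley-Terry preference model over a reward $r$ and an additive similarity bonus; the symbols $r$, $r'$, $I$ (the inferred intent), and $\mathrm{sim}$ are notation introduced here for illustration and may differ from the paper's.

```latex
\begin{align*}
P(y_w \succ y_l \mid x) &= \sigma\big(r(x, y_w) - r(x, y_l)\big), \\
r'(x, y) &= r(x, y) + \lambda\,\mathrm{sim}(I, y), \\
\operatorname{logit} P'(y_w \succ y_l \mid x)
  &= r'(x, y_w) - r'(x, y_l) \\
  &= \big(r(x, y_w) - r(x, y_l)\big) + \lambda\,\Delta\mathrm{sim},
  \qquad \Delta\mathrm{sim} = \mathrm{sim}(I, y_w) - \mathrm{sim}(I, y_l).
\end{align*}
```

Under these assumptions the shift is positive whenever $\lambda > 0$ and the preferred response is closer to the inferred intent than the dispreferred one, which widens the margin between the two responses.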
Details
Motivation: Existing alignment methods such as DPO tend to ignore minority views and fail to capture the latent intent in user prompts, so a new method that adapts to diverse and dynamic human preferences is needed. Method: A-IPO introduces an intention module that infers the latent intent of each user prompt and adds an intention-response similarity term to the reward function; this is shown theoretically to increase the preference margin (a positive shift of λΔsim in the log-odds). Result: Experiments on three datasets (Real-pref, Attack-pref, and GlobalOpinionQA-Ext) show that A-IPO substantially outperforms existing baselines on key metrics: up to +24.8 win-rate and +45.6 Response-Intention Consistency on Real-pref, up to +38.6 Response Similarity and +52.2 Defense Success Rate on Attack-pref, and up to +54.6 Intention Consistency Score on GlobalOpinionQA-Ext. Conclusion: By explicitly modeling diverse user intents, A-IPO achieves pluralistic preference optimization while strengthening adversarial robustness in preference alignment, addressing the tendency of conventional methods to overlook minority opinions and latent intent. Abstract: Human preferences are diverse and dynamic, shaped by regional, cultural, and social factors. Existing alignment methods like Direct Preference Optimization (DPO) and its variants often default to majority views, overlooking minority opinions and failing to capture latent user intentions in prompts. To address these limitations, we introduce Adaptive Intent-driven Preference Optimization (A-IPO). Specifically, A-IPO introduces an intention module that infers the latent intent behind each user prompt and explicitly incorporates this inferred intent into the reward function, encouraging stronger alignment between the preferred model's responses and the user's underlying intentions. We demonstrate, both theoretically and empirically, that incorporating an intention--response similarity term increases the preference margin (by a positive shift of $\lambda\,\Delta\mathrm{sim}$ in the log-odds), resulting in clearer separation between preferred and dispreferred responses compared to DPO. For evaluation, we introduce two new benchmarks, Real-pref and Attack-pref, along with an extended version of an existing dataset, GlobalOpinionQA-Ext, to assess real-world and adversarial preference alignment. Through explicit modeling of diverse user intents, A-IPO facilitates pluralistic preference optimization while simultaneously enhancing adversarial robustness in preference alignment.
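The log-odds shift mentioned in the abstract can be illustrated with a small numeric sketch. This is not the authors' implementation: the base margin below is the standard DPO implicit-reward margin, and the way the similarity term is added (`intent_shifted_margin`, with hypothetical arguments `sim_w`, `sim_l`, and `lam`) is an assumption made only to show how a term $\lambda\,\Delta\mathrm{sim}$ moves the preference log-odds.

```python
import math


def dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO log-odds margin between preferred (w) and dispreferred (l) responses."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def intent_shifted_margin(base_margin, sim_w, sim_l, lam=1.0):
    """Assumed form: add lam * (sim_w - sim_l), where sim_* is the similarity
    between the inferred intent and each response (hypothetical inputs)."""
    return base_margin + lam * (sim_w - sim_l)


def preference_probability(margin):
    """Bradley-Terry style probability that the preferred response wins."""
    return 1.0 / (1.0 + math.exp(-margin))


# Toy numbers, chosen only for illustration.
base = dpo_margin(logp_w=-12.0, logp_l=-11.0, ref_logp_w=-12.5, ref_logp_l=-10.5, beta=0.5)
shifted = intent_shifted_margin(base, sim_w=0.8, sim_l=0.3, lam=2.0)
print(base, shifted, preference_probability(base), preference_probability(shifted))
```

Whenever the preferred response is closer to the inferred intent than the dispreferred one (Δsim > 0), the margin grows and so does the modeled win probability; if Δsim were negative, the same term would shrink the margin.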
Comprehensive empirical evaluation demonstrates that A-IPO consistently surpasses existing baselines, yielding substantial improvements across key metrics: up to +24.8 win-rate and +45.6 Response-Intention Consistency on Real-pref; up to +38.6 Response Similarity and +52.2 Defense Success Rate on Attack-pref; and up to +54.6 Intention Consistency Score on GlobalOpinionQA-Ext.
[37] Diversity Augmentation of Dynamic User Preference Data for Boosting Personalized Text Summarizers
Parthiv Chatterjee,Shivam Sonawane,Amey Hengle,Aditya Tanna,Sourish Dasgupta,Tanmoy Chakraborty
Main category: cs.CL
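TL;DR: Proposes PerAugy, a data augmentation technique that uses cross-trajectory shuffling and summary-content perturbation to improve personalized summarization models, substantially raising user-encoder accuracy and the degree of personalization.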
Details
Motivation: Training personalized summarization models is difficult because diverse training data containing both user preference history and target summaries is scarce; in particular, existing datasets such as MS/CAS PENS lack target summaries and offer limited topic-transition diversity. Method: Proposes PerAugy, which augments data by combining cross-trajectory shuffling with summary-content perturbation, and introduces three dataset diversity metrics (TP, RTC, DegreeD) to quantify the diversity of the augmented data. Result: PerAugy is validated on four state-of-the-art user-encoders, with AUC improving by up to 0.132; two SOTA summarization frameworks equipped with the improved user-encoders gain 61.2% in personalization on average (PSE-SU4 metric). The diversity analysis shows that TP and DegreeD correlate strongly with model performance. Conclusion: The diversity induced by data augmentation is a key factor in improving personalized summarization models, and PerAugy offers an effective remedy for scarce and insufficiently diverse training data. Abstract: Document summarization enables efficient extraction of user-relevant content but is inherently shaped by individual subjectivity, making it challenging to identify subjective salient information in multifaceted documents. This complexity underscores the necessity for personalized summarization. However, training models for personalized summarization has so far been challenging, particularly because diverse training data containing both user preference history (i.e., click-skip trajectory) and expected (gold-reference) summaries are scarce. The MS/CAS PENS dataset is a valuable resource but includes only preference history without target summaries, preventing end-to-end supervised learning, and its limited topic-transition diversity further restricts generalization. To address this, we propose $\mathrm{PerAugy}$, a novel cross-trajectory shuffling and summary-content perturbation based data augmentation technique that significantly boosts the accuracy of four state-of-the-art (SOTA) baseline user-encoders commonly used in personalized summarization frameworks (best result: $0.132\uparrow$ w.r.t. AUC). We select two such SOTA summarizer frameworks as baselines and observe that when augmented with their corresponding improved user-encoders, they consistently show an increase in personalization (avg. boost: $61.2\%\uparrow$ w.r.t. the PSE-SU4 metric). As a post-hoc analysis of the role of induced diversity in the augmented dataset by $\mathrm{PerAugy}$, we introduce three dataset diversity metrics -- $\mathrm{TP}$, $\mathrm{RTC}$, and $\mathrm{DegreeD}$ -- to quantify the induced diversity.
We find that $\mathrm{TP}$ and $\mathrm{DegreeD}$ strongly correlate with user-encoder performance on the PerAugy-generated dataset across all accuracy metrics, indicating that increased dataset diversity is a key factor driving performance gains.
[38] Stop When Enough: Adaptive Early-Stopping for Chain-of-Thought Reasoning
Renliang Sun,Wei Cheng,Dawei Li,Haifeng Chen,Wei Wang
Main category: cs.CL
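TL;DR: This paper proposes REFRAIN, a training-free framework that mitigates "overthinking" in chain-of-thought reasoning by adaptively deciding when a large language model should stop reasoning.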
Details
Motivation: Excessive or redundant reasoning raises inference cost and can lead to wrong conclusions, so a mechanism for avoiding over-reasoning is needed. Method: REFRAIN combines a two-stage stop discriminator, which identifies reflective yet redundant reasoning, with a sliding-window Upper Confidence Bound (SW-UCB) multi-armed bandit controller that dynamically adjusts the stopping threshold. Result: Across four benchmarks and two model families, REFRAIN reduces token usage by 20-55% while maintaining or improving accuracy. Conclusion: REFRAIN achieves "just enough" reasoning and highlights when-to-stop as a new and practical axis of test-time scaling. Abstract: Chain-of-Thought (CoT) reasoning has driven recent gains of large language models (LLMs) on reasoning-intensive tasks by externalizing intermediate steps. However, excessive or redundant reasoning -- so-called overthinking -- can increase inference costs and lead LLMs toward incorrect conclusions. In this paper, we present REFRAIN (REFlective-Redundancy for Adaptive INference), a training-free framework that adaptively determines when to stop reasoning to mitigate overthinking. REFRAIN integrates a two-stage stop discriminator to identify reflective yet redundant reasoning and a sliding-window Upper Confidence Bound (SW-UCB) multi-armed bandit controller to dynamically adjust stopping thresholds according to problem difficulty without supervision or fine-tuning. Across four representative benchmarks and two model families, REFRAIN reduces token usage by 20-55% while maintaining or improving accuracy compared to standard CoT prompting. Extensive ablation and robustness analyses demonstrate its stability across models, scorers, and prompt variations. In summary, our findings highlight when-to-stop as a new and practical axis of test-time scaling -- enabling models to reason not just more, but just enough.
[39] LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora
Luyao Zhuang,Shengyuan Chen,Yilin Xiao,Huachi Zhou,Yujing Zhang,Hao Chen,Qinggang Zhang,Xiao Huang
Main category: cs.CL
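TL;DR: Proposes LinearRAG, a framework that builds a relation-free hierarchical Tri-Graph for efficient and reliable retrieval-augmented generation, avoiding the unstable relation extraction used by conventional graph-based methods.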
Details
Motivation: Conventional knowledge-graph-based RAG methods depend on unstable and costly relation extraction, which produces noisy graphs, degrades retrieval quality, and makes complex reasoning over large-scale unstructured text difficult. Method: Proposes LinearRAG, which builds a relation-free hierarchical Tri-Graph using lightweight entity extraction and semantic linking, and retrieves in two stages: local semantic bridging activates relevant entities, then global importance aggregation retrieves passages. Result: Experiments on four datasets show that LinearRAG retrieves significantly better than baseline models while scaling linearly and adding little overhead. Conclusion: LinearRAG offers an economical and reliable way to index and retrieve knowledge, improves retrieval for complex queries, and provides a new paradigm for graph-enhanced RAG. Abstract: Retrieval-Augmented Generation (RAG) is widely used to mitigate hallucinations of Large Language Models (LLMs) by leveraging external knowledge. While effective for simple queries, traditional RAG systems struggle with large-scale, unstructured corpora where information is fragmented. Recent advances incorporate knowledge graphs to capture relational structures, enabling more comprehensive retrieval for complex, multi-hop reasoning tasks. However, existing graph-based RAG (GraphRAG) methods rely on unstable and costly relation extraction for graph construction, often producing noisy graphs with incorrect or inconsistent relations that degrade retrieval quality. In this paper, we revisit the pipeline of existing GraphRAG systems and propose LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient framework that enables reliable graph construction and precise passage retrieval. Specifically, LinearRAG constructs a relation-free hierarchical graph, termed Tri-Graph, using only lightweight entity extraction and semantic linking, avoiding unstable relation modeling. This new paradigm of graph construction scales linearly with corpus size and incurs no extra token consumption, providing an economical and reliable indexing of the original passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant entity activation via local semantic bridging, followed by (ii) passage retrieval through global importance aggregation.
Extensive experiments on four datasets demonstrate that LinearRAG significantly outperforms baseline models.
[40] Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task
Zilong Wang,Xiaoyu Shen
Main category: cs.CL
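TL;DR: Proposes an adaptive framework that combines OCR with large language models and exploits document-specific characteristics to balance high accuracy against efficient processing in information extraction.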
Details
Motivation: Address the challenge of extracting information from copy-heavy documents in enterprise document processing, where existing methods struggle to balance accuracy and efficiency. Method: Design and evaluate 25 configurations spanning three extraction paradigms (direct, replacement, and table-based), using a format-aware routing strategy that combines OCR engines with large language models for intelligent strategy selection. Result: Achieves F1=1.0 on structured documents (0.97s latency) and F1=0.997 on image inputs (0.6s latency), a 54-fold performance improvement over conventional multimodal methods while keeping processing sub-second. Conclusion: Highly repetitive document processing can be turned from a computational burden into an optimization opportunity through structure-aware method selection, offering a scalable approach to large heterogeneous document streams. Abstract: Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that strategically combines OCR engines with Large Language Models (LLMs) to optimize the accuracy-efficiency trade-off inherent in repetitive document extraction tasks. Unlike existing approaches that pursue universal solutions, our method exploits document-specific characteristics through intelligent strategy selection. We implement and evaluate 25 configurations across three extraction paradigms (direct, replacement, and table-based) on identity documents spanning four formats (PNG, DOCX, XLSX, PDF). Through table-based extraction methods, our adaptive framework delivers outstanding results: F1=1.0 accuracy with 0.97s latency for structured documents, and F1=0.997 accuracy with 0.6s latency for challenging image inputs when integrated with PaddleOCR, all while maintaining sub-second processing speeds. The 54-fold performance improvement over naive multimodal approaches, coupled with format-aware routing, enables processing of heterogeneous document streams at production scale. Beyond the specific application to identity extraction, this work establishes a general principle: the repetitive nature of copy-heavy tasks can be transformed from a computational burden into an optimization opportunity through structure-aware method selection.
[41] DiffHeads: Differential Analysis and Inference-Time Masking of Bias Heads in Large Language Models
Tingxu Han,Wei Song,Ziqi Ding,Ziming Li,Chunrong Fang,Yuekang Li,Dongfang Liu,Zhenyu Chen,Zhenting Wang
Main category: cs.CL
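TL;DR: This paper proposes DiffHeads, a lightweight debiasing framework that identifies and masks bias heads by analyzing how attention heads activate differently under Direct-Answer (DA) and Chain-of-Thought (CoT) prompting, reducing LLM unfairness across prompting styles while preserving model utility.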
Details
Motivation: Existing work offers little insight into the mechanisms by which large language models produce biased outputs, which leaves debiasing methods fragile and of limited effect; a systematic study of the causes of unfairness and a more robust remedy are needed. Method: Compare DA and CoT prompting across multiple LLMs, define a token-to-head contribution score that traces each attention head's influence on bias, identify bias heads through differential activation analysis between DA and CoT, and design DiffHeads to mask only those heads. Result: DA prompting increases measured unfairness by 391.9%-534.5%; a small cluster of "bias heads" activates under DA but stays dormant under CoT; DiffHeads reduces unfairness by 49.4% under DA and 40.3% under CoT without harming model performance. Conclusion: Prompting strategy strongly affects the activation of bias heads and thereby triggers unfair outputs; identifying and masking specific bias heads through differential analysis is an effective, lightweight debiasing method that offers a new causal perspective on understanding and mitigating LLM bias. Abstract: Large language models (LLMs) increasingly mediate decisions in domains where unfair treatment of demographic groups is unacceptable. Existing work probes when biased outputs appear, but gives little insight into the mechanisms that generate them, leaving existing mitigations largely fragile. In this paper, we conduct a systematic investigation of LLM unfairness and propose DiffHeads, a lightweight debiasing framework for LLMs. We first compare Direct-Answer (DA) prompting to Chain-of-Thought (CoT) prompting across eight representative open- and closed-source LLMs. DA triggers the inherent bias of the LLM and increases measured unfairness by 391.9%-534.5% in both one-turn and two-turn dialogues. Next, we define a token-to-head contribution score that traces each token's influence back to individual attention heads. This reveals a small cluster of bias heads that activate under DA but stay largely dormant with CoT, providing the first causal link between prompting strategy and bias emergence. Finally, building on this insight, we propose DiffHeads, which identifies bias heads through differential activation analysis between DA and CoT and selectively masks only those heads. DiffHeads reduces unfairness by 49.4% and 40.3% under DA and CoT, respectively, without harming model utility.
[42] BILLY: Steering Large Language Models via Merging Persona Vectors for Creative Generation
Tsung-Min Pai,Jui-I Wang,Li-Chun Lu,Shao-Hua Sun,Hung-Yi Lee,Kai-Wei Chang
Main category: cs.CL
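TL;DR: BILLY is a training-free framework that blends multiple distinct persona vectors in the model's activation space to emulate the benefits of multi-LLM collaboration, boosting a single model's creativity while substantially cutting inference time and compute cost.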
Details
Motivation: Multi-LLM systems enhance creativity but suffer from high computational cost and inference latency, so a more efficient alternative is needed. Method: Proposes BILLY, which extracts multiple distinct persona vectors, blends them in activation space, and steers generation with the merged vector at inference time to produce multi-perspective output. Result: On several creativity-oriented benchmarks, BILLY outperforms single-model prompting and traditional multi-LLM approaches while greatly reducing inference time and compute, and it offers effective control over generation along with better interpretability. Conclusion: BILLY overcomes the cost and latency of multi-LLM systems without sacrificing creativity, offering a new approach to efficient creative generation. Abstract: Multi-LLM systems enhance the creativity of large language models by simulating human collective intelligence but suffer from significant drawbacks, such as high computational costs and inference latency. To address these limitations, we propose BILLY (BlendIng persona vectors for Large Language model creativitY), a training-free framework that captures the benefits of multi-LLM collaboration, i.e. inducing diverse perspectives and specialized expertise, within a single model. BILLY operates by extracting and blending multiple distinct persona vectors directly in the model's activation space. We steer the model's generation process with this merged vector during inference, enabling multi-perspective output without explicit multi-LLM communication. Our experiments across creativity-oriented benchmarks demonstrate that BILLY surpasses single model prompting and traditional multi-LLM approaches, while substantially reducing inference time and computational costs. Our analyses further reveal that distinct persona vectors can be blended to achieve both effective control over complementary aspects of generation and greater interpretability.
[43] BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data
Jaap Jumelet,Abdellah Fourtassi,Akari Haga,Bastian Bunzeck,Bhargav Shandilya,Diana Galvan-Sosa,Faiz Ghifari Haznitrama,Francesca Padovani,Francois Meyer,Hai Hu,Julen Etxaniz,Laurent Prévot,Linyang He,María Grandury,Mila Marcheva,Negar Foroutan,Nikitas Theodoropoulos,Pouya Sadeghi,Siyuan Song,Suchir Salhan,Susana Zhou,Yurii Paniv,Ziyin Zhang,Arianna Bisazza,Alex Warstadt,Leshem Choshen
Main category: cs.CL
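TL;DR: BabyBabelLM is a multilingual collection of datasets that models the language a person is exposed to from birth until acquiring a native language; it covers developmentally plausible pretraining data equivalent to 100M English words in each of 45 languages and provides evaluation suites and baseline models to support multilingual pretraining and cognitive modeling.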
Details
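Motivation: Supporting language-acquisition research and cognitive modeling in multilingual settings requires pretraining data that more closely reflects real language development. Method: Collect and curate developmentally plausible pretraining data for 45 languages, each roughly equivalent to 100M English words, build an evaluation suite for each language, and train baseline models. Result: The BabyBabelLM dataset covers 45 languages and comes with a unified evaluation framework and baseline models, demonstrating its potential for multilingual pretraining and cognitive modeling. Conclusion: BabyBabelLM offers a new, more ecologically valid data resource for multilingual language-acquisition research and cognitive modeling. Abstract: We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.
[44] Large Language Model Sourcing: A Survey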
Liang Pang,Kangxi Wu,Sunhao Dai,Zihao Wei,Zenghao Duan,Jia Gu,Xiang Li,Zhiyi Yin,Jun Xu,Huawei Shen,Xueqi Cheng
Main category: cs.CL
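TL;DR: This survey systematically studies provenance tracking for content generated by large language models (LLMs) from both model and data perspectives, proposing a four-dimensional framework and a dual taxonomy of prior-based and posterior-based methods to improve the transparency, accountability, and trustworthiness of LLMs in real-world use.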
Details
Motivation: Because LLMs are black boxes and generate human-like content, they carry risks such as hallucination, bias, unfairness, and copyright infringement, so content provenance must be traced from multiple angles to improve traceability and trust. Method: Proposes a four-dimensional framework covering model sourcing, model structure sourcing, training data sourcing, and external data sourcing, and introduces a dual-paradigm taxonomy of prior-based (proactive traceability embedding) and posterior-based (retrospective inference) approaches to organize existing methods. Result: Builds a comprehensive provenance-tracking scheme that unites model and data perspectives, enabling multi-dimensional tracing of LLM-generated content and improving the transparency and auditability of model outputs. Conclusion: The four-dimensional framework and dual-paradigm taxonomy give LLM traceability research a systematic path, support responsible AI development, and strengthen trust and compliant deployment of LLMs in practice. Abstract: The rapid advancement of large language models (LLMs) has revolutionized artificial intelligence, shifting from supporting objective tasks (e.g., recognition) to empowering subjective decision-making (e.g., planning, decision). This marks the dawn of general and powerful AI, with applications spanning a wide range of fields, including programming, education, healthcare, finance, and law. However, their deployment introduces multifaceted risks. Due to the black-box nature of LLMs and the human-like quality of their generated content, issues such as hallucinations, bias, unfairness, and copyright infringement become particularly significant. In this context, sourcing information from multiple perspectives is essential. This survey presents a systematic investigation into provenance tracking for content generated by LLMs, organized around four interrelated dimensions that together capture both model- and data-centric perspectives. From the model perspective, Model Sourcing treats the model as a whole, aiming to distinguish content generated by specific LLMs from content authored by humans. Model Structure Sourcing delves into the internal generative mechanisms, analyzing architectural components that shape the model's outputs. From the data perspective, Training Data Sourcing focuses on internal attribution, tracing the origins of generated content back to the model's training data. In contrast, External Data Sourcing emphasizes external validation, identifying external information used to support or influence the model's responses.
Moreover, we propose a dual-paradigm taxonomy that classifies existing sourcing methods into prior-based (proactive traceability embedding) and posterior-based (retrospective inference) approaches. Traceability across these dimensions enhances the transparency, accountability, and trustworthiness of LLM deployment in real-world applications.
[45] A Survey of Inductive Reasoning for Large Language Models
Kedi Chen,Dezhao Ruan,Yuhao Dan,Yaoting Wang,Siyu Yan,Xuecheng Wu,Yinqi Zhang,Qin Chen,Jie Zhou,Liang He,Biqing Qi,Linyang Li,Qipeng Guo,Xiaoming Shi,Wei Zhang
Main category: cs.CL
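TL;DR: This paper presents the first comprehensive survey of inductive reasoning in large language models: it groups improvement methods into post-training, test-time scaling, and data augmentation, summarizes existing benchmarks, proposes a unified sandbox-based evaluation approach with an observation coverage metric, and analyzes where inductive ability comes from and how it relates to model architecture and data.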
Details
Motivation: Inductive reasoning is an important LLM capability, marked by particular-to-general thinking and non-unique answers; it is central to knowledge generalization and to modeling human cognition, yet it has lacked a systematic summary. Method: Group methods for improving inductive reasoning into post-training, test-time scaling, and data augmentation; summarize existing benchmarks and propose a unified evaluation approach based on a sandbox environment and observation coverage. Result: Provides a taxonomy of inductive reasoning and a unified evaluation framework, and analyzes the role of simple model architectures and data in inductive tasks. Conclusion: The survey lays a systematic foundation for research on inductive reasoning in LLMs and points to future research directions. Abstract: Reasoning is an important task for large language models (LLMs). Among all the reasoning paradigms, inductive reasoning is one of the fundamental types, which is characterized by its particular-to-general thinking process and the non-uniqueness of its answers. The inductive mode is crucial for knowledge generalization and aligns better with human cognition, making it a fundamental mode of learning that is attracting increasing interest. Despite the importance of inductive reasoning, there is no systematic summary of it. Therefore, this paper presents the first comprehensive survey of inductive reasoning for LLMs. First, methods for improving inductive reasoning are categorized into three main areas: post-training, test-time scaling, and data augmentation. Then, current benchmarks of inductive reasoning are summarized, and a unified sandbox-based evaluation approach with the observation coverage metric is derived. Finally, we offer some analyses regarding the source of inductive ability and how simple model architectures and data help with inductive tasks, providing a solid foundation for future research.
[46] MedAgentAudit: Diagnosing and Quantifying Collaborative Failure Modes in Medical Multi-Agent Systems
Lei Gu,Yinghao Zhu,Haoran Sang,Zixiang Wang,Dehao Sui,Wen Tang,Ewen Harrison,Junyi Gao,Lequan Yu,Liantao Ma
Main category: cs.CL
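TL;DR: Through a large-scale empirical analysis, this study exposes flaws in the collaborative reasoning of LLM-based multi-agent systems for medical diagnosis, shows that accuracy alone is an inadequate measure, and argues that transparent, auditable reasoning is needed to make medical AI trustworthy.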
Details
Motivation: Evaluations of LLM-based multi-agent medical systems focus mainly on final-answer accuracy and neglect the interpretability and reliability of the internal collaborative reasoning, which can yield untrustworthy or wrong diagnostic conclusions in high-stakes medical settings. Method: Collect 3,600 cases from six medical datasets and six representative multi-agent frameworks, and use a mixed-methods approach (qualitative analysis plus quantitative auditing) to build a comprehensive taxonomy of collaborative failure modes and identify the dominant ones. Result: The quantitative audit finds four dominant failure modes: flawed consensus driven by shared model deficiencies, suppression of correct minority opinions, ineffective discussion dynamics, and loss of critical information during synthesis. Conclusion: High accuracy alone cannot guarantee the clinical reliability of, or public trust in, medical AI systems; transparent and auditable reasoning pathways are essential for responsible development and deployment. Abstract: While large language model (LLM)-based multi-agent systems show promise in simulating medical consultations, their evaluation is often confined to final-answer accuracy. This practice treats their internal collaborative processes as opaque "black boxes" and overlooks a critical question: is a diagnostic conclusion reached through a sound and verifiable reasoning pathway? The inscrutable nature of these systems poses a significant risk in high-stakes medical applications, potentially leading to flawed or untrustworthy conclusions. To address this, we conduct a large-scale empirical study of 3,600 cases from six medical datasets and six representative multi-agent frameworks. Through a rigorous, mixed-methods approach combining qualitative analysis with quantitative auditing, we develop a comprehensive taxonomy of collaborative failure modes. Our quantitative audit reveals four dominant failure patterns: flawed consensus driven by shared model deficiencies, suppression of correct minority opinions, ineffective discussion dynamics, and critical information loss during synthesis. This study demonstrates that high accuracy alone is an insufficient measure of clinical or public trust. It highlights the urgent need for transparent and auditable reasoning processes, a cornerstone for the responsible development and deployment of medical AI.
[47] Weed Out, Then Harvest: Dual Low-Rank Adaptation is an Effective Noisy Label Detector for Noise-Robust Learning
Bo Yuan,Yulin Chen,Yin Zhang
Main category: cs.CL
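TL;DR: Proposes Delora, a framework that introduces a clean LoRA and a noisy LoRA to decouple sample selection from model training, enabling effective fine-tuning of large language models on data with noisy labels.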
Details
Motivation: When fine-tuning on data with noisy labels, existing methods suffer from inaccurate initial sample selection, which biases the loss computation and creates a vicious cycle that hurts performance. Method: Design Delora, in which a clean LoRA memorizes clean data and a noisy LoRA memorizes noisy data, forming a learnable noisy-label detector that decouples sample selection from model training. Result: Experiments on synthetic and real-world noisy datasets show that Delora outperforms existing methods in both noisy label detection and text classification. Conclusion: Delora breaks the circular training dependency caused by noisy labels and improves the robustness and performance of parameter-efficient fine-tuning in noisy settings. Abstract: Parameter-efficient fine-tuning (PEFT) of large language models (LLMs) has shown impressive performance in various downstream tasks. However, in many real-world scenarios, the collected training data inevitably contains noisy labels. To learn from noisy labels, most solutions select samples with small losses for model training. However, the selected samples, in turn, impact the loss computation in the next iteration. An inaccurate initial selection can create a vicious cycle, leading to suboptimal performance. To break this cycle, we propose Delora, a novel framework that decouples the sample selection from model training. For sample selection, Delora establishes a noisy label detector by introducing clean and noisy LoRA. Benefiting from the memory effect, the clean LoRA is encouraged to memorize clean data, while the noisy LoRA is constrained to memorize mislabeled data, which serves as a learnable threshold for selecting clean and noisy samples. For model training, Delora can use carefully selected samples to fine-tune language models seamlessly. Experimental results on synthetic and real-world noisy datasets demonstrate the effectiveness of Delora in noisy label detection and text classification.
[48] You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs
Yijie Xu,Huizai Yao,Zhiyu Guo,Weiyu Guo,Pengteng Li,Aiwei Liu,Xuming Hu,Hui Xiong
Main category: cs.CL
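TL;DR: This paper proposes SyTTA, a label-free test-time adaptation framework that combines two uncertainty signals, input perplexity and output predictive entropy, to adapt language models on the fly under distribution shift without extra supervision, markedly improving generation in specialized domains such as agricultural question answering.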
Details
Motivation: Large language models deployed in specialized domains face distribution shift between their training data and the target domain, and domain fine-tuning depends on labeled data that is expensive and hard to obtain, so a label-free test-time adaptation method is needed. Method: At inference time, SyTTA jointly uses input-side perplexity (reflecting mismatch with domain terminology) and output-side predictive entropy (reflecting unstable generation probabilities) as complementary uncertainty signals to adjust model behavior dynamically, achieving online adaptation without supervision. Result: SyTTA delivers consistent gains across diverse model architectures and domain benchmarks; on agricultural question answering it improves Rouge-LSum by over 120% on Qwen-2.5-7B with only 4 extra tokens per query. Conclusion: SyTTA shows that effective test-time adaptation of language models is achievable without labeled examples, offering a viable path to deployment in label-scarce specialized domains. Abstract: Large language models (LLMs) are increasingly deployed in specialized domains such as finance, medicine, and agriculture, where they face significant distribution shifts from their training data. Domain-specific fine-tuning can mitigate this challenge but relies on high-quality labeled data that is expensive and slow to collect in expertise-limited settings. We study label-free test-time adaptation for language models and present SyTTA, an inference-time framework that adapts models on-the-fly without additional supervision. SyTTA couples two complementary uncertainty signals that arise under distribution shift: input-side perplexity, indicating mismatch with domain-specific terminology and patterns, and output-side predictive entropy, indicating diffuse and unstable token probabilities during generation. Across diverse model architectures and domain-specific benchmarks, SyTTA delivers consistent gains. Notably, on agricultural question answering, SyTTA improves Rouge-LSum by over 120% on Qwen-2.5-7B with only 4 extra tokens per query. These results show that effective test-time adaptation for language models is achievable without labeled examples, supporting deployment in label-scarce domains. The code will be made available upon acceptance.
[49] Text2Token: Unsupervised Text Representation Learning with Token Target Prediction
Ruize An,Richong Zhang,Zhijie Nie,Zhanyu Wu,Yanzhao Zhang,Dingkun Long
Main category: cs.CL
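TL;DR: Proposes Text2Token, a new unsupervised text representation learning framework that trains on a generative task using carefully constructed target token distributions and performs strongly on the MTEB v2 benchmark.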
Details
Motivation: Inspired by the observation that text representations align with key tokens, the paper explores the relationship between representation space and vocabulary space in unsupervised text representation learning. Method: Building on a token target prediction task, proposes the Text2Token framework and constructs synthetic token targets with two methods, one data-driven and one model-derived. Result: Performance on the MTEB v2 benchmark is comparable to the state-of-the-art LLM2Vec, and the representation and vocabulary spaces are found to be optimized jointly during training. Conclusion: Text2Token offers a new direction for unsupervised text representation learning and demonstrates the potential of generative tasks in this area. Abstract: Unsupervised text representation learning (TRL) is a fundamental task in natural language processing, which is beneficial for improving search and recommendations with the web's unlabeled texts. A recent empirical study finds that high-quality representations align with the key tokens of the input text, uncovering the potential connection between representation space and vocabulary space. Inspired by the findings, we revisit the generative tasks and develop an unsupervised generative framework for TRL, Text2Token. The framework is based on the token target prediction task, utilizing carefully constructed target token distribution as supervisory signals. To construct the high-quality target token distribution, we analyze the token-alignment properties with advanced embedders and identify two essential categories of key tokens: (1) the meaningful tokens in the text and (2) semantically derived tokens beyond the text. Based on these insights, we propose two methods -- data-driven and model-derived -- to construct synthetic token targets from data or the LLM backbone. Experiments on the MTEB v2 benchmark demonstrate that Text2Token achieves performance competitive with the state-of-the-art embedder with unsupervised contrastive learning, LLM2Vec. Our analysis further shows that the vocabulary and representation spaces are optimized jointly toward the optimal solution during training, providing new ideas and insights for future work.
[50] ImCoref-CeS: An Improved Lightweight Pipeline for Coreference Resolution with LLM-based Checker-Splitter Refinement
Kangyang Luo,Yuzhuo Bai,Shuzheng Si,Cheng Gao,Zhitong Wang,Yingli Shen,Wenhao Li,Zhu Liu,Yufeng Han,Jiayi Wu,Cunliang Kong,Maosong Sun
Main category: cs.CL
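TL;DR: Proposes ImCoref-CeS, a framework that pairs an improved supervised model, ImCoref, with a large language model (LLM) acting as a verification agent to improve coreference resolution.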
Details
Motivation: Current coreference resolution methods face a trade-off between supervised learning with small models and the capabilities of large language models, and how to combine the strengths of both remains underexplored. Method: Proposes the ImCoref-CeS framework: (1) ImCoref strengthens the supervised model with a lightweight bridging module, a biaffine scorer, and hybrid mention regularization; (2) an LLM serves as a multi-role Checker-Splitter agent that filters out invalid mentions and splits erroneous clusters. Result: Experiments show that ImCoref-CeS outperforms existing state-of-the-art methods on multiple metrics. Conclusion: A collaborative framework that combines a strong supervised model with LLM reasoning effectively improves coreference resolution and suggests a new direction for future research. Abstract: Coreference Resolution (CR) is a critical task in Natural Language Processing (NLP). Current research faces a key dilemma: whether to further explore the potential of supervised neural methods based on small language models, whose detect-then-cluster pipeline still delivers top performance, or embrace the powerful capabilities of Large Language Models (LLMs). However, effectively combining their strengths remains underexplored. To this end, we propose ImCoref-CeS, a novel framework that integrates an enhanced supervised model with LLM-based reasoning. First, we present an improved CR method (ImCoref) to push the performance boundaries of the supervised neural method by introducing a lightweight bridging module to enhance long-text encoding capability, devising a biaffine scorer to comprehensively capture positional information, and invoking a hybrid mention regularization to improve training efficiency. Importantly, we employ an LLM acting as a multi-role Checker-Splitter agent to validate candidate mentions (filtering out invalid ones) and coreference results (splitting erroneous clusters) predicted by ImCoref. Extensive experiments demonstrate the effectiveness of ImCoref-CeS, which achieves superior performance compared to existing state-of-the-art (SOTA) methods.
[51] Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models
Samir Abdaljalil,Erchin Serpedin,Khalid Qaraqe,Hasan Kurban
Main category: cs.CL
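TL;DR: Proposes Audit-of-Understanding (AoU), a three-phase framework (decompose, audit, then reason only from validated premises) that mitigates hallucinations caused by reliance on unsupported assumptions during reasoning and markedly improves accuracy and faithfulness on several mathematical reasoning tasks.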
Details
Motivation: Large language models often produce reasoning traces that look plausible but rest on unverified assumptions, causing reasoning-induced hallucinations; existing methods mainly target factual hallucinations or rely on post-hoc verification and do not address this problem well. Method: AoU has three phases: (1) decompose the query into candidate assumptions; (2) audit how well those assumptions are supported; (3) reason only from the validated premises. It is formalized as posterior-constrained inference and connected to selective prediction and rejection learning. Result: On GSM8K, MultiArith, and SVAMP, AoU yields substantial gains over Chain-of-Thought, Self-Consistency, and related methods, with improvements of up to 30%-45%, and shows stronger reasoning faithfulness. Conclusion: By constraining reasoning to validated premises, AoU reduces reasoning-induced hallucinations, offers theoretical guarantees together with practical gains, and makes LLM reasoning more reliable. Abstract: Large language models (LLMs) often generate reasoning traces that appear coherent but rest on unsupported assumptions, leading to hallucinated conclusions. Prior work mainly addresses factual hallucinations or relies on post-hoc verification, leaving reasoning-induced hallucinations largely unaddressed. We propose Audit-of-Understanding (AoU), a framework that constrains inference to validated premises through three phases: (1) decomposing a query into candidate assumptions, (2) auditing their support, and (3) conditioning inference only on the validated subset. Formally, AoU is posterior-constrained inference, connecting to selective prediction and rejection learning. Our contributions are threefold: (i) theoretical guarantees under perfect validation, (ii) excess-risk bounds under imperfect audits, and (iii) tractability analysis. Empirically, AoU improves both accuracy and faithfulness on GSM8K, MultiArith, and SVAMP, achieving up to +30% gains on GSM8K, +45% on MultiArith, and consistent +20--28% improvements on SVAMP over Chain-of-Thought, Self-Consistency, and CoT-Decoding. Code is available at https://anonymous.4open.science/r/audit-of-understanding-E28B.
[52] Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models
Liang Lin,Miao Yu,Moayad Aloqaily,Zhenhong Zhou,Kun Wang,Linsey Pang,Prakhar Mehrotra,Qingsong Wen
Main category: cs.CL
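TL;DR: Proposes a backdoor defense framework that needs no prior knowledge of triggers: it injects known triggers to aggregate backdoor representations and then applies recovery fine-tuning, lowering the attack success rate while preserving model performance.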
Details
Motivation: Existing LLM backdoor defenses rely on unrealistic assumptions about trigger settings and struggle with unknown backdoors, so a general defense that needs no prior knowledge is required. Method: The proposed framework builds on the observation that when known backdoors are injected into an already-compromised model, the existing and newly injected backdoors cluster together in representation space; it works in two stages, first aggregating backdoor representations by injecting known triggers, then performing recovery fine-tuning to restore benign outputs. Result: Experiments show that the framework lowers the average attack success rate across multiple benchmarks to 4.41%, outperforming existing baselines by 28.1%-69.3%; clean accuracy and utility stay within 0.5% of the original model, and the defense generalizes across backdoor types. Conclusion: The proposed framework is an efficient, general LLM backdoor defense that requires no prior knowledge of triggers, suppresses backdoor attacks with almost no effect on the original model's performance, and suits practical deployment. Abstract: Backdoor attacks are a significant threat to large language models (LLMs), often embedded via public checkpoints, yet existing defenses rely on impractical assumptions about trigger settings. To address this challenge, we propose a defense framework that requires no prior knowledge of trigger settings. The framework is based on the key observation that when deliberately injecting known backdoors into an already-compromised model, both existing unknown and newly injected backdoors aggregate in the representation space. It leverages this through a two-stage process: first, aggregating backdoor representations by injecting known triggers, and then, performing recovery fine-tuning to restore benign outputs. Extensive experiments across multiple LLM architectures demonstrate that: (I) The framework reduces the average Attack Success Rate to 4.41% across multiple benchmarks, outperforming existing baselines by 28.1%-69.3%. (II) Clean accuracy and utility are preserved within 0.5% of the original model, ensuring negligible impact on legitimate tasks. (III) The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios.
[53] On the Entity-Level Alignment in Crosslingual Consistency
Yihong Liu,Mingyang Wang,François Yvon,Hinrich Schütze
Main category: cs.CL
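TL;DR: This paper studies the consistency of factual recall across languages in multilingual LLMs, identifies entity alignment as a key factor, and proposes two methods (SubSub and SubInj) that add English translations of subjects to prompts, substantially improving both factual recall accuracy and consistency.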
Details
Motivation: 多语言大模型在不同语言中应能一致地回忆事实知识,但现实中常出现不一致,其原因尚不清楚。本文旨在探究这种不一致的根源,特别是实体对齐在其中的作用。 Method: 通过评估实体级别(主语和宾语)的翻译任务来测试实体对齐情况,并分析其与跨语言一致性的关系;在此基础上提出SubSub和SubInj两种方法,在多种语言的提示中加入主语的英文翻译以增强对齐。 Result: 发现跨语言一致性与实体对齐高度相关,主语或宾语的错位常导致不一致;所提方法显著提高了多语言事实回忆的准确性和一致性;机制分析表明,这些干预通过模型内部的枢纽语言处理增强了概念空间中的实体表示对齐。 Conclusion: 实体对齐是影响多语言大模型事实知识一致性的关键因素,通过在提示中引入英文主语翻译可有效改善这一问题,为提升多语言事实预测提供了实用且有效的策略。 Abstract: Multilingual large language models (LLMs) are expected to recall factual knowledge consistently across languages. However, the factors that give rise to such crosslingual consistency -- and its frequent failure -- remain poorly understood. In this work, we hypothesize that these inconsistencies may arise from failures in entity alignment, the process of mapping subject and object entities into a shared conceptual space across languages. To test this, we assess alignment through entity-level (subject and object) translation tasks, and find that consistency is strongly correlated with alignment across all studied models, with misalignment of subjects or objects frequently resulting in inconsistencies. Building on this insight, we propose SubSub and SubInj, two effective methods that integrate English translations of subjects into prompts across languages, leading to substantial gains in both factual recall accuracy and consistency. Finally, our mechanistic analysis reveals that these interventions reinforce the entity representation alignment in the conceptual space through model's internal pivot-language processing, offering effective and practical strategies for improving multilingual factual prediction.[54] MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning
Hongwei Chen,Yishu Lei,Dan Zhang,Bo Ke,Danxiang Zhu,Xuyi Chen,Yuxiang Lu,Zhengjie Huang,Shikun Feng,Jingzhou He,Yu Sun,Hua Wu,Haifeng Wang
Main category: cs.CL
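TL;DR: Proposes MatryoshkaThinking, which recursively exploits the model's reasoning, verification, and summarization abilities to maintain state-of-the-art performance at a much lower computational cost.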
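The gap between Pass@k and Pass@1 that this entry refers to can be made concrete with the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled solutions of which c are correct. The sketch below is illustrative only: the function names and sample counts are assumptions, and the paper may compute the metric differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_gap(n: int, c: int, k: int) -> float:
    """Difference between pass@k and pass@1 (hypothetical helper)."""
    return pass_at_k(n, c, k) - pass_at_k(n, c, 1)
```

A large `pass_gap` means correct solutions are being sampled but not reliably selected as the final answer, which is the disparity the method aims to reduce.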
Details
Motivation: Existing test-time scaling methods such as DeepConf are effective but computationally expensive, so a more efficient approach is needed.
Method: Recursively exploits the model's intrinsic reasoning, verification, and summarization capabilities to improve the retention of correct solutions and narrow the gap between Pass@k and Pass@1.
Result: Reaches a score of 99.79 on AIME2025 using only 4% of the computation required by DeepConf; effectiveness and generality are validated across multiple open-source models and multi-modal reasoning benchmarks.
Conclusion: MatryoshkaThinking offers new insights into efficient, scalable test-time inference strategies for language models.
Abstract: Test-time scaling has emerged as a promising paradigm in language modeling, wherein additional computational resources are allocated during inference to enhance model performance. Recent approaches, such as DeepConf, have demonstrated the efficacy of this strategy; however, they often incur substantial computational overhead to achieve competitive results. In this work, we propose MatryoshkaThinking, a novel method that significantly reduces computational cost while maintaining state-of-the-art performance. Specifically, MatryoshkaThinking attains a score of 99.79 on AIME2025 using only 4% of the computation required by DeepConf. The core of our approach lies in the recursive exploitation of the model's intrinsic capabilities in reasoning, verification, and summarization, which collectively enhance the retention of correct solutions and reduce the disparity between Pass@k and Pass@1. Comprehensive evaluations across multiple open-source models and challenging multi-modal reasoning benchmarks validate the effectiveness and generality of our method. These findings offer new insights into the design of efficient and scalable test-time inference strategies for advanced language models.
[55] Are LLMs Empathetic to All? Investigating the Influence of Multi-Demographic Personas on a Model's Empathy
Ananya Malik,Nazanin Sabri,Melissa Karnaze,Mai Elsherief
Main category: cs.CL
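TL;DR: The study proposes a framework for analyzing how LLMs' cognitive and affective empathy varies across user groups defined by intersecting demographic attributes (age, culture, gender). It finds that these attributes strongly shape a model's empathy, that stacking multiple attributes can reverse expected patterns, and that there are notable misalignments for some groups, such as those from Confucian culture, underscoring the need for more inclusive and equitable empathy-aware LLMs.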
Details
Motivation: 由于情感体验受人口和文化背景影响,而LLM的对话能力依赖共情,因此需探究LLM是否能在不同用户群体间实现公平共情。 Method: 构建包含315种独特人物角色(组合年龄、文化和性别)的交叉性分析框架,在四个LLM上评估其认知与情感共情表现,并结合定量与定性方法分析模型响应模式。 Result: 人口统计属性显著影响LLM的共情反应;多重属性组合可能削弱或逆转共情趋势;整体反映现实共情模式,但对儒家文化群体等特定组存在偏差。 Conclusion: 为实现更公平和包容的交互,LLM设计必须考虑用户的人口多样性,发展具有共情意识的模型。 Abstract: Large Language Models' (LLMs) ability to converse naturally is empowered by their ability to empathetically understand and respond to their users. However, emotional experiences are shaped by demographic and cultural contexts. This raises an important question: Can LLMs demonstrate equitable empathy across diverse user groups? We propose a framework to investigate how LLMs' cognitive and affective empathy vary across user personas defined by intersecting demographic attributes. Our study introduces a novel intersectional analysis spanning 315 unique personas, constructed from combinations of age, culture, and gender, across four LLMs. Results show that attributes profoundly shape a model's empathetic responses. Interestingly, we see that adding multiple attributes at once can attenuate and reverse expected empathy patterns. We show that they broadly reflect real-world empathetic trends, with notable misalignments for certain groups, such as those from Confucian culture. We complement our quantitative findings with qualitative insights to uncover model behaviour patterns across different demographic groups. Our findings highlight the importance of designing empathy-aware LLMs that account for demographic diversity to promote more inclusive and equitable model behaviour.[56] End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs
Nam Luu,Ondřej Bojar
Main category: cs.CL
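TL;DR: Proposes an end-to-end architecture combining pre-trained speech encoders with LLMs that performs automatic speech recognition and speech translation simultaneously; on English-to-German it outperforms SeamlessM4T and matches a cascaded system.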
Details
Motivation: 为了提升语音翻译的性能,探索将预训练语音编码器与大语言模型结合的端到端方法,以同时处理ASR和ST任务。 Method: 采用联合端到端架构,结合预训练的语音编码器和大语言模型,在英德语言对上进行实验验证。 Result: 在COMET^DA22指标上比SeamlessM4T最多提升8%,且性能可与Whisper+NLLB级联系统相匹配。 Conclusion: 所提出的端到端架构在语音翻译任务上表现优异,兼具高性能和多任务能力,具有替代级联系统的潜力。 Abstract: Speech Translation (ST) is a machine translation task that involves converting speech signals from one language to the corresponding text in another language; this task has two different approaches, namely the traditional cascade and the more recent end-to-end. This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) for performing both Automatic Speech Recognition (ASR) and ST simultaneously. Experiments with the English-to-German language pair show that our best model not only can achieve better translation results than SeamlessM4T, a large foundational end-to-end, multi-modal translation model, but can also match the performance of a cascaded system with Whisper and NLLB, with up to a score gain of 8% in $\text{COMET}^{\text{DA}}_{22}$ metric.[57] ASC analyzer: A Python package for measuring argument structure construction usage in English texts
Hakyung Sung,Kristopher Kyle
Main category: cs.CL
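TL;DR: Introduces the ASC analyzer, a Python package that automatically tags argument structure constructions (ASCs) and computes 50 related indices for systematically assessing ASC usage in L2 writing.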
Details
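Motivation: Existing tools for measuring L2 learners' ASC usage lack scalability and systematicity, which makes it hard to assess language proficiency effectively.
Method: Develops the ASC analyzer, a publicly available Python package that automatically tags ASCs and computes 50 indices covering diversity, proportion, frequency, and ASC-verb lemma association strength; bivariate and multivariate analyses then examine how these indices relate to L2 writing scores.
Result: The tool quantifies ASC usage features, and the empirical analyses show a relationship between the indices and L2 writing scores.
Conclusion: The ASC analyzer gives second language acquisition research a scalable, systematic analysis tool and supports a deeper understanding of the relationship between ASC usage and language proficiency.
Abstract: Argument structure constructions (ASCs) offer a theoretically grounded lens for analyzing second language (L2) proficiency, yet scalable and systematic tools for measuring their usage remain limited. This paper introduces the ASC analyzer, a publicly available Python package designed to address this gap. The analyzer automatically tags ASCs and computes 50 indices that capture diversity, proportion, frequency, and ASC-verb lemma association strength. To demonstrate its utility, we conduct both bivariate and multivariate analyses that examine the relationship between ASC-based indices and L2 writing scores.
[58] RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models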
Aashiq Muhamed,Leonardo F. R. Ribeiro,Markus Dreyer,Virginia Smith,Mona T. Diab
Main category: cs.CL
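TL;DR: The study finds that current language models in RAG systems are seriously deficient at selectively refusing to answer when the context is flawed. It proposes RefusalBench, a generative evaluation methodology that builds dynamic test sets from 176 linguistic perturbation strategies, and releases two new benchmarks to support continued evaluation and improvement of this safety-critical capability.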
Details
Motivation: 语言模型在检索增强生成(RAG)系统中面对错误上下文时的选择性拒绝能力对安全性至关重要,但现有模型在此方面表现不佳,且静态基准测试易被模型利用数据偏差而无法准确评估真实能力。 Method: 提出RefusalBench,一种基于程序化语言扰动的生成式评估框架,包含六个信息不确定性类别和三个强度级别的176种扰动策略,用于构建动态、抗记忆化的测试用例。 Result: 评估了30多个模型,发现拒绝能力由可分离的检测与分类技能组成;模型规模扩大或增强推理并未提升表现;表现出过度自信或过度谨慎的倾向;拒绝能力具有可训练性和对齐敏感性。 Conclusion: 选择性拒绝是一项关键但当前表现不足的安全能力,RefusalBench提供了更可靠、动态的评估路径,未来可通过针对性训练和对齐优化来提升该能力。 Abstract: The ability of language models in RAG systems to selectively refuse to answer based on flawed context is critical for safety, yet remains a significant failure point. Our large-scale study reveals that even frontier models struggle in this setting, with refusal accuracy dropping below 50% on multi-document tasks, while exhibiting either dangerous overconfidence or overcaution. Static benchmarks fail to reliably evaluate this capability, as models exploit dataset-specific artifacts and memorize test instances. We introduce RefusalBench, a generative methodology that programmatically creates diagnostic test cases through controlled linguistic perturbation. Our framework employs 176 distinct perturbation strategies across six categories of informational uncertainty and three intensity levels. Evaluation of over 30 models uncovers systematic failure patterns: refusal comprises separable detection and categorization skills, and neither scale nor extended reasoning improves performance. We find that selective refusal is a trainable, alignment-sensitive capability, offering a clear path for improvement. We release two benchmarks -- RefusalBench-NQ (single document) and RefusalBench-GaRAGe (multi-document) -- and our complete generation framework to enable continued, dynamic evaluation of this critical capability.[59] AssoMem: Scalable Memory QA with Multi-Signal Associative Retrieval
Kai Zhang,Xinyuan Zhang,Ejaz Ahmed,Hongda Jiang,Caleb Kumar,Kai Sun,Zhaojiang Lin,Sanat Sharma,Shereen Oraby,Aaron Colak,Ahmed Aly,Anuj Kumar,Xiaozhong Liu,Xin Luna Dong
Main category: cs.CL
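TL;DR: Proposes AssoMem, a framework that builds an associative memory graph and fuses multi-dimensional retrieval signals to improve context-aware recall from large-scale memories.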
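As a rough illustration of fusing several retrieval signals into one ranking score, the sketch below combines relevance, importance, and temporal alignment with a normalized weighted sum. The `Candidate` fields, the `fuse` and `rank` helpers, and the hand-set weights are all assumptions for illustration; the entry describes an adaptive, mutual-information-driven fusion, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    relevance: float   # similarity to the query, assumed in [0, 1]
    importance: float  # importance of the memory item, assumed in [0, 1]
    temporal: float    # temporal alignment with the query, assumed in [0, 1]


def fuse(c: Candidate, weights: tuple) -> float:
    """Normalized weighted sum of the three signals (illustrative only)."""
    w_rel, w_imp, w_tmp = weights
    total = w_rel + w_imp + w_tmp
    return (w_rel * c.relevance + w_imp * c.importance + w_tmp * c.temporal) / total


def rank(candidates: list, weights: tuple) -> list:
    """Return candidates sorted by fused score, best first."""
    return sorted(candidates, key=lambda c: fuse(c, weights), reverse=True)
```

With weights `(1, 0, 0)` this reduces to plain semantic retrieval; giving weight to the other signals lets an item with lower query similarity but higher importance or better temporal fit rise in the ranking.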
Details
Motivation: In similarity-dense scenarios, existing methods rely mainly on semantic distance to the query for retrieval and struggle to recall accurately; inspired by human associative memory, the retrieval mechanism needs to be improved.
Method: Builds an associative memory graph that links dialogue utterances to automatically extracted clues, and ranks candidates by integrating relevance, importance, and temporal alignment signals through an adaptive mutual-information-driven fusion strategy.
Result: On three benchmarks and the newly introduced MeetingQA dataset, AssoMem consistently outperforms SOTA baselines.
Conclusion: AssoMem is superior at context-aware memory recall, particularly in similarity-dense question answering scenarios.
Abstract: Accurate recall from large-scale memories remains a core challenge for memory-augmented AI assistants performing question answering (QA), especially in similarity-dense scenarios where existing methods mainly rely on semantic distance to the query for retrieval. Inspired by how humans link information associatively, we propose AssoMem, a novel framework constructing an associative memory graph that anchors dialogue utterances to automatically extracted clues. This structure provides a rich organizational view of the conversational context and facilitates importance-aware ranking. Further, AssoMem integrates multi-dimensional retrieval signals (relevance, importance, and temporal alignment) using an adaptive mutual information (MI) driven fusion strategy. Extensive experiments across three benchmarks and a newly introduced dataset, MeetingQA, demonstrate that AssoMem consistently outperforms SOTA baselines, verifying its superiority in context-aware memory recall.
[60] STEAM: A Semantic-Level Knowledge Editing Framework for Large Language Models
Geunyeong Jeong,Juoh Sun,Seonghee Lee,Harksoo Kim
Main category: cs.CL
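TL;DR: Proposes STEAM, a semantic-level knowledge editing framework that aligns edited facts with semantic anchors in latent space to improve the semantic coherence of, and reasoning with, updated knowledge.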
Details
Motivation: 现有知识编辑方法多关注于词元级别的似然优化,忽视了语义连贯性,导致编辑后的知识在模型内部表示中孤立存在,无法融入原有知识结构。 Method: STEAM首先识别目标表示作为语义锚点,然后在优化过程中通过对其损失引导被编辑事实的内部表示向这些锚点对齐,从而实现语义层面的知识编辑。 Result: 实验结果表明,STEAM能有效提升模型对编辑知识的推理能力和语义一致性,验证了潜在空间对齐在知识编辑中的重要性。 Conclusion: 通过语义锚点对齐机制,STEAM实现了更可靠、连贯的知识编辑,推动了无需重训练的事实更新技术的发展。 Abstract: Large Language Models store extensive factual knowledge acquired during large-scale pre-training. However, this knowledge is inherently static, reflecting only the state of the world at the time of training. Knowledge editing has emerged as a promising solution for updating outdated or incorrect facts without full retraining. However, most existing locate-and-edit methods primarily focus on token-level likelihood optimization without addressing semantic coherence. Our analysis reveals that such edited knowledge is often encoded as isolated residual streams in the model's latent space, distinct from pre-existing knowledge and bypassing natural reasoning process. To address this, we propose \textsc{Steam}, a semantic-level knowledge editing framework that enhances integration of updated knowledge into the model's knowledge structure. \textsc{Steam} first identifies target representations as semantic anchors for the updated factual association, then guides the internal representation of the edited fact towards these anchors through an alignment loss during optimization. Experimental results demonstrate that \textsc{Steam} improves model's ability to reason with edited knowledge and enhances semantic coherence, underscoring the importance of latent-space alignment for reliable and coherent knowledge editing. The code is available at https://github.com/GY-Jeong/STEAM.[61] LONGQAEVAL: Designing Reliable Evaluations of Long-Form Clinical QA under Resource Constraints
Federica Bologna,Tiffany Pan,Matthew Wilkens,Yue Guo,Lucy Lu Wang
Main category: cs.CL
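TL;DR: Proposes LongQAEval, a framework built on physician annotations of 300 real patient questions that compares coarse- and fine-grained evaluation on correctness, relevance, and safety. Fine-grained annotation improves agreement on correctness, and annotating only a small subset of sentences yields reliability comparable to coarse annotation.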
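Inter-annotator agreement, the quantity this entry compares across annotation granularities, is often measured with Cohen's kappa for two annotators. The sketch below is a minimal implementation under that assumption; the entry does not state which agreement statistic the authors use, and the function name is illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Computing the statistic separately for answer-level and sentence-level labels on each dimension (correctness, relevance, safety) gives the kind of per-dimension comparison the entry describes.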
Details
Motivation: 长文本临床问答系统的评估耗资源且困难,需医学专业知识且人工判断一致性低,亟需适用于资源有限和高专业要求场景的评估方法。 Method: 基于医生对300个真实患者问题的回答进行标注,比较答案级(粗粒度)与句子级(细粒度)评估在正确性、相关性和安全性三个维度的表现,并分析不同标注粒度下的标注者间一致性(IAA)及成本效益。 Result: 细粒度标注提高正确性维度的IAA,粗粒度标注更利于相关性一致性,安全性判断仍不一致;仅标注少量关键句子即可获得与粗粒度相当的可靠性,显著降低成本。 Conclusion: LongQAEval为资源受限和高专业要求场景提供了高效可靠的临床QA评估框架,推荐根据评估目标选择合适粒度,并通过抽样句子标注平衡准确性与成本。 Abstract: Evaluating long-form clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise and achieving consistent human judgments over long-form text is difficult. We introduce LongQAEval, an evaluation framework and set of evaluation recommendations for limited-resource and high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and safety. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance, and judgments on safety remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotations, reducing cost and effort.[62] Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance
Jingyi Chen,Zhimeng Guo,Jiyun Chun,Pichao Wang,Andrew Perrault,Micha Elsner
Main category: cs.CL
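TL;DR: Presents LISTEN, a benchmark for assessing how much large audio language models (LALMs) rely on lexical versus acoustic cues in emotion understanding. Experiments show that current LALMs rely mainly on lexical information and make little use of acoustic cues, behaving as if they "transcribe" rather than "listen".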
Details
Motivation: 研究旨在探究大型音频语言模型是否真正处理声学信息,还是仅依赖文本内容来理解语音中的情感。 Method: 设计了一个受控基准LISTEN,通过分离词汇和声学线索,在多种情感场景(如线索一致、冲突、中性或缺失)下评估六种最先进的LALMs。 Result: 发现模型在词汇线索中性或缺失时倾向于预测‘中性’情感;在线索一致时提升有限;在线索冲突时无法正确识别情感;在纯副语言情境下表现接近随机。 Conclusion: 当前的LALMs在情感理解任务中主要依赖词汇语义,未能充分捕捉声学情感线索,LISTEN提供了一个评估多模态模型情感理解能力的系统框架。 Abstract: Understanding emotion from speech requires sensitivity to both lexical and acoustic cues. However, it remains unclear whether large audio language models (LALMs) genuinely process acoustic information or rely primarily on lexical content. We present LISTEN (Lexical vs. Acoustic Speech Test for Emotion in Narratives), a controlled benchmark designed to disentangle lexical reliance from acoustic sensitivity in emotion understanding. Across evaluations of six state-of-the-art LALMs, we observe a consistent lexical dominance. Models predict "neutral" when lexical cues are neutral or absent, show limited gains under cue alignment, and fail to classify distinct emotions under cue conflict. In paralinguistic settings, performance approaches chance. These results indicate that current LALMs largely "transcribe" rather than "listen," relying heavily on lexical semantics while underutilizing acoustic cues. LISTEN offers a principled framework for assessing emotion understanding in multimodal models.[63] RECON: Reasoning with Condensation for Efficient Retrieval-Augmented Generation
Zhichao Xu,Minheng Wang,Yawei Wang,Wenqian Ye,Yuntao Du,Yunpu Ma,Yijun Tian
Main category: cs.CL
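TL;DR: RECON is a framework that compresses evidence by integrating an explicit summarization module into the reasoning loop, reducing context length in retrieval-augmented generation (RAG) systems, improving training and inference efficiency, and raising performance on QA tasks.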
Details
Motivation: 传统的RAG系统在使用强化学习结合推理时,因检索到的文档过长且噪声多,导致上下文管理效率低下,增加成本并影响性能。 Method: 提出RECON框架,引入一个两阶段训练的摘要模块:首先在问答数据集上进行相关性预训练,然后通过来自专有大模型的多方面蒸馏确保事实性和清晰度,并将其集成到Search-R1流程中。 Result: RECON将总上下文长度减少了35%,提高了训练速度和推理延迟;在下游问答基准上表现更好,3B模型的EM分数平均提升14.5%,7B模型提升3.0%,尤其在多跳问答中表现突出。 Conclusion: 学习到的上下文压缩对于构建实用、可扩展且高性能的RAG系统至关重要,RECON为此提供了有效解决方案。 Abstract: Retrieval-augmented generation (RAG) systems trained using reinforcement learning (RL) with reasoning are hampered by inefficient context management, where long, noisy retrieved documents increase costs and degrade performance. We introduce RECON (REasoning with CONdensation), a framework that integrates an explicit summarization module to compress evidence within the reasoning loop. Our summarizer is trained via a two-stage process: relevance pretraining on QA datasets, followed by multi-aspect distillation from proprietary LLMs to ensure factuality and clarity. Integrated into the Search-R1 pipeline, RECON reduces total context length by 35\%, leading to improved training speed and inference latency, while simultaneously improving RAG performance on downstream QA benchmarks. Notably, it boosts the average EM score of the 3B model by 14.5\% and the 7B model by 3.0\%, showing particular strength in multi-hop QA. RECON demonstrates that learned context compression is essential for building practical, scalable, and performant RAG systems. Our code implementation is made available at https://github.com/allfornancy/RECON.[64] Steering Over-refusals Towards Safety in Retrieval Augmented Generation
Utsav Maskey,Mark Dras,Usman Naseem
Main category: cs.CL
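TL;DR: This paper studies over-refusal caused by safety alignment of LLMs in retrieval-augmented generation (RAG), and proposes the RagRefuse benchmark and the SafeRAG-Steering method to reduce over-refusals.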
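One common way to steer an embedding toward a target region at inference time is a difference-of-means shift: move the hidden vector along the direction from the mean of refusing examples to the mean of safe, non-refusing examples. The sketch below shows that heuristic on plain Python lists. It is a stand-in under stated assumptions, not the SafeRAG-Steering procedure itself, whose details are not given in this entry; `mean_vector`, `steer`, and `alpha` are hypothetical names.

```python
def mean_vector(vectors: list) -> list:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steer(hidden: list, safe_examples: list, refusal_examples: list,
          alpha: float = 1.0) -> list:
    """Shift `hidden` along the (safe mean - refusal mean) direction.

    alpha controls the strength of the shift (assumed hyperparameter).
    """
    safe_mean = mean_vector(safe_examples)
    refusal_mean = mean_vector(refusal_examples)
    return [h + alpha * (s - r)
            for h, s, r in zip(hidden, safe_mean, refusal_mean)]
```

In practice the shift would be applied only to benign queries flagged as at risk of over-refusal, so that legitimate refusals are left untouched.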
Details
Motivation: 安全对齐在大型语言模型中会导致对良性请求的过度拒绝,尤其是在检索增强生成(RAG)场景中,查询意图和检索上下文共同影响拒绝行为,亟需系统分析与解决方案。 Method: 构建了涵盖医学、化学和开放领域的RagRefuse基准,控制上下文污染模式和规模;提出SafeRAG-Steering,一种基于模型嵌入干预的方法,在推理时引导嵌入区域向安全输出区域靠近。 Result: 发现上下文排列/污染、查询与上下文领域、有害文本密度等因素会触发对良性查询的拒绝,且受模型特定对齐策略影响;SafeRAG-Steering有效减少了污染RAG流程中的过度拒绝,同时保留了合理的拒绝行为。 Conclusion: 通过引入领域分层基准和嵌入引导干预,可在不牺牲安全性的情况下缓解RAG场景中的过度拒绝问题,为安全对齐提供了更精细的控制手段。 Abstract: Safety alignment in large language models (LLMs) induces over-refusals -- where LLMs decline benign requests due to aggressive safety filters. We analyze this phenomenon in retrieval-augmented generation (RAG), where both the query intent and retrieved context properties influence refusal behavior. We construct RagRefuse, a domain-stratified benchmark spanning medical, chemical, and open domains, pairing benign and harmful queries with controlled context contamination patterns and sizes. Our analysis shows that context arrangement / contamination, domain of query and context, and harmful-text density trigger refusals even on benign queries, with effects depending on model-specific alignment choices. To mitigate over-refusals, we introduce \textsc{SafeRAG-Steering}, a model-centric embedding intervention that steers the embedding regions towards the confirmed safe, non-refusing output regions at inference time. This reduces over-refusals in contaminated RAG pipelines while preserving legitimate refusals.[65] End-to-end Speech Recognition with similar length speech and text
Peng Fan,Wenping Wang,Fei Deng
Main category: cs.CL
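TL;DR: Proposes two alignment methods (TIL and AXE losses) to address the mismatch between speech and text length in speech recognition, and enhances keyframe information through frame fusion. On subsets of AISHELL-1 and AISHELL-2 it reduces the number of frames by at least 86% and outperforms previous work.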
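The frame-fusion step in this entry, a weighted sum of each keyframe with its neighbouring context frames, can be sketched as follows. The weights, the choice of the immediately preceding and following frames as context, and the edge clamping are assumptions for illustration; the paper's exact scheme may differ.

```python
def fuse_keyframes(frames: list, key_indices: list,
                   weights: tuple = (0.25, 0.5, 0.25)) -> list:
    """Replace each keyframe by a weighted sum of itself and its two neighbours.

    frames: list of feature vectors (lists of floats), one per time step.
    key_indices: positions of the keyframes to keep.
    weights: (previous, keyframe, next) weights, assumed values.
    """
    w_prev, w_key, w_next = weights
    last = len(frames) - 1
    fused = []
    for i in key_indices:
        prev_frame = frames[max(i - 1, 0)]     # clamp at the start
        next_frame = frames[min(i + 1, last)]  # clamp at the end
        fused.append([w_prev * p + w_key * k + w_next * n
                      for p, k, n in zip(prev_frame, frames[i], next_frame)])
    return fused


def frame_reduction(num_input_frames: int, num_output_frames: int) -> float:
    """Fraction of frames removed by downsampling."""
    return 1 - num_output_frames / num_input_frames
```

Keeping 14 of every 100 frames, for example, corresponds to a reduction of 0.86, the lower bound reported in the entry.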
Details
Motivation: The mismatch between speech length and text length is a challenge for automatic speech recognition; traditional methods such as CTC struggle to align the two when downsampling, especially when speech must be compressed to a length close to that of the text.
Method: Proposes a Time Independence Loss (TIL) and an Aligned Cross Entropy (AXE) loss based on edit distance, and enhances keyframe information through a weighted fusion of each keyframe with its two context frames.
Result: Experiments on subsets of AISHELL-1 and AISHELL-2 show that the proposed methods outperform previous work and reduce the number of frames by at least 86%.
Conclusion: The TIL and AXE losses improve the alignment between speech and text; combined with frame fusion, they maintain recognition performance while heavily compressing the speech sequence, which suits recognition with speech and text of similar length.
Abstract: The mismatch of speech length and text length poses a challenge in automatic speech recognition (ASR). In previous research, various approaches have been employed to align text with speech, including the utilization of Connectionist Temporal Classification (CTC). In earlier work, a key frame mechanism (KFDS) was introduced, utilizing intermediate CTC outputs to guide downsampling and preserve keyframes, but traditional methods (CTC) failed to align speech and text appropriately when downsampling speech to a text-similar length. In this paper, we focus on speech recognition in those cases where the length of speech aligns closely with that of the corresponding text. To address this issue, we introduce two methods for alignment: a) Time Independence Loss (TIL) and b) Aligned Cross Entropy (AXE) Loss, which is based on edit distance. To enhance the information on keyframes, we incorporate frame fusion, computing a weighted sum of each keyframe and its two context frames. Experimental results on subsets of the AISHELL-1 and AISHELL-2 datasets show that the proposed methods outperform the previous work and achieve a reduction of at least 86% in the number of frames.
[66] Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data?
Shaobo Wang,Cong Wang,Wenjie Fu,Yue Min,Mingquan Feng,Isabel Guan,Xuming Hu,Conghui He,Cunxiang Wang,Kexin Yang,Xingzhang Ren,Fei Huang,Dayiheng Liu,Linfeng Zhang
Main category: cs.CL
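TL;DR: Proposes EssenceBench, a coarse-to-fine framework for benchmark compression that uses a genetic algorithm for efficient subset selection, preserving model rankings while greatly reducing the number of samples.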
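To illustrate benchmark compression framed as score reconstruction, the sketch below runs a small genetic algorithm that searches for a k-sample subset whose per-model accuracy is close to the full-benchmark accuracy. It covers only a fitness-based subset search; the attribution-based sample search and the coarse-to-fine schedule of EssenceBench are not reproduced, and the population size, mutation rate, and function names are assumptions.

```python
import random


def reconstruction_error(scores: list, subset: tuple) -> float:
    """Mean absolute gap between subset accuracy and full accuracy per model.

    scores: one row per model, one 0/1 correctness entry per sample.
    """
    error = 0.0
    for row in scores:
        full = sum(row) / len(row)
        sub = sum(row[i] for i in subset) / len(subset)
        error += abs(full - sub)
    return error / len(scores)


def ga_select(scores: list, k: int, pop_size: int = 20,
              generations: int = 30, seed: int = 0) -> tuple:
    """Search for a k-sample subset with low reconstruction error."""
    rng = random.Random(seed)
    n = len(scores[0])
    population = [tuple(sorted(rng.sample(range(n), k))) for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism: the better half survives unchanged.
        population.sort(key=lambda s: reconstruction_error(scores, s))
        survivors = population[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            # Crossover: draw k indices from the union of both parents.
            pool = sorted(set(parent_a) | set(parent_b))
            child = set(rng.sample(pool, k))
            # Mutation: occasionally swap one index for an unused one.
            if rng.random() < 0.3:
                outside = [i for i in range(n) if i not in child]
                if outside:
                    child.remove(rng.choice(sorted(child)))
                    child.add(rng.choice(outside))
            children.append(tuple(sorted(child)))
        population = survivors + children
    return min(population, key=lambda s: reconstruction_error(scores, s))
```

Because survivors are carried over unchanged, the best error never increases between generations; a ranking-consistency term could be added to the fitness if preserving model order matters more than matching absolute scores.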
Details
Motivation: 随着对模型能力评估需求的增长,基准套件规模不断扩大,但缺乏能同时保证预测准确性和排名一致性的系统性压缩框架。 Method: 将基准压缩建模为评分重构的优化问题,提出基于遗传算法的迭代搜索方法,结合适应度驱动的子集选择与归因驱动的样本选择,实现粗到精的压缩。 Result: 在HellaSwag数据集上,使用25倍更少的样本即可保持所有模型排名偏差在5%以内,使用200倍更少的样本仍可保持95%的排名一致性,且重构误差更低、效率更高。 Conclusion: EssenceBench有效解决了基准冗余问题,在显著压缩基准规模的同时,保持了模型性能评估的准确性与排名稳定性,优于现有方法。 Abstract: As the demand for comprehensive evaluations of diverse model capabilities steadily increases, benchmark suites have correspondingly grown significantly in scale. Despite notable advances in redundancy reduction and subset-level performance prediction, a systematic framework that effectively integrates these methods to ensure both prediction accuracy and ranking consistency is still largely elusive. In this paper, we first perform a sample-level analysis of benchmark redundancy and identify several highly similar samples that can be eliminated. Besides, we frame benchmark compression as an optimization problem with the aim of score reconstruction. Building on these, we then propose EssenceBench, a coarse-to-fine framework utilizing an iterative Genetic Algorithm (GA), which takes the advantages of fitness-based subset search and attribution-based sample search. Compared to previous methods, our approach yields superior compression results with lower reconstruction error and markedly higher efficiency. In particular, on the HellaSwag benchmark (10K samples), our method preserves the ranking of all models shifting within 5% using 25x fewer samples, and achieves 95% ranking preservation shifting within 5% using only 200x fewer samples.[67] NIM: Neuro-symbolic Ideographic Metalanguage for Inclusive Communication
Prawaal Sharma,Poonam Goyal,Navneet Goyal,Vidisha Sharma
Main category: cs.CL
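TL;DR: Proposes a universal ideographic metalanguage based on neuro-symbolic AI to help people with lower academic literacy bridge the digital divide.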
Details
Motivation: To remove the barriers that people with lower academic literacy face in digital communication and to narrow the digital divide.
Method: Combines large language models with the theory of Natural Semantic Metalanguage, using neuro-symbolic AI to decompose complex meanings into atomic concepts expressed through an ideographic symbol system; the system was co-designed and validated with more than 200 semi-literate participants.
Result: The system achieves over 80% semantic comprehensibility, with an accessible learning curve and universal adaptability.
Conclusion: The framework effectively serves underprivileged groups with limited formal education and promotes inclusive digital communication.
Abstract: Digital communication has become the cornerstone of modern interaction, enabling rapid, accessible, and interactive exchanges. However, individuals with lower academic literacy often face significant barriers, exacerbating the "digital divide". In this work, we introduce a novel, universal ideographic metalanguage designed as an innovative communication framework that transcends academic, linguistic, and cultural boundaries. Our approach leverages principles of Neuro-symbolic AI, combining neural-based large language models (LLMs) enriched with world knowledge and symbolic knowledge heuristics grounded in the linguistic theory of Natural Semantic Metalanguage (NSM). This enables the semantic decomposition of complex ideas into simpler, atomic concepts. Adopting a human-centric, collaborative methodology, we engaged over 200 semi-literate participants in defining the problem, selecting ideographs, and validating the system. With over 80% semantic comprehensibility, an accessible learning curve, and universal adaptability, our system effectively serves underprivileged populations with limited formal education.
[68] FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth
Qiran Zou,Hou Hei Lam,Wenhao Zhao,Yiming Tang,Tingting Chen,Samson Yu,Tianyi Zhang,Chang Liu,Xiangyang Ji,Dianbo Liu
Main category: cs.CL
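TL;DR: Introduces FML-bench, a new benchmark for evaluating automatic machine learning research agents on 8 diverse and fundamental ML research problems. It highlights the value of exploration breadth over depth and provides a unified evaluation framework with five complementary metrics.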
Details
Motivation: 现有基准过于强调工程方面而忽视学术严谨性,任务多样性不足,偏重应用导向任务,难以扩展到真实研究场景,因此需要一个更全面、更具基础性的评估基准。 Method: 设计了FML-bench基准,涵盖8个基础且多样的机器学习研究问题,降低编码负担,支持扩展至真实GitHub仓库,并提出五项互补的评估指标以全面衡量代理性能。 Result: 在FML-bench上评估了当前最先进的自动研究代理,发现采用广泛探索策略的代理优于专注于狭窄但深入探索的代理。 Conclusion: 强调探索广度可能比单纯注重渐进式精炼带来更有效的研究成果,FML-bench为评估自动机器学习研究代理提供了更具现实性和学术价值的平台。 Abstract: Large language models (LLMs) have sparked growing interest in automatic machine learning research agents. Among them, agents capable of autonomously proposing ideas and conducting machine learning experiments are particularly promising, as they maximize research automation and accelerate scientific progress by iteratively refining ideas based on experimental results. However, comprehensively evaluating such agents remains challenging. Existing benchmarks tend to overemphasize engineering aspects while neglecting academic rigor, creating barriers that obscure a clear assessment of an agent's scientific capabilities in machine learning research. They also suffer from limited task diversity, an overemphasis on application-oriented tasks over fundamental research problems, and limited scalability to realistic research settings. To address these limitations, we introduce FML-bench, a benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental machine learning research problems. It reduces coding burden, emphasizes fundamental problems rather than specific use cases, offers high task diversity, and is extensible to real-world machine learning GitHub repositories. Furthermore, we present a unified evaluation framework with five complementary metrics, designed to comprehensively assess agent performance on our benchmark. We evaluate state-of-the-art automatic research agents on FML-bench, and find that agents employing broad research exploration strategies outperform those focusing on narrow but deep exploration. 
These findings suggest that emphasizing the breadth of exploration may lead to more effective research outcomes than focusing solely on incremental refinement. Our benchmark is available at https://github.com/qrzou/FML-bench.[69] When or What? Understanding Consumer Engagement on Digital Platforms
Jingyi Wu,Junying Liang
Main category: cs.CL
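TL;DR: Using LDA topic modeling of TED Talks, the study finds that audience engagement is driven more by temporal dynamics, such as when a talk is released, than by thematic content, challenging the conventional assumption that content features are the main drivers of popularity.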
Details
Motivation: 理解数字服务经济中驱动内容受欢迎的关键因素,解决创作者对受众偏好的误判问题。 Method: 采用潜在狄利克雷分配(LDA)模型分析大量TED演讲文本,比较创作者的主题供给与观众参与所反映的需求,并进行纵向分析。 Result: 发现创作者供给与观众需求之间存在持续错配,且时间动态对观众参与的影响强于主题内容。 Conclusion: 内容流行的驱动力不仅在于内容特征,更关键的是发布时机和情境因素,这对平台管理者、营销人员和内容创作者优化受众参与策略具有重要启示。 Abstract: Understanding what drives popularity is critical in today's digital service economy, where content creators compete for consumer attention. Prior studies have primarily emphasized the role of content features, yet creators often misjudge what audiences actually value. This study applies Latent Dirichlet Allocation (LDA) modeling to a large corpus of TED Talks, treating the platform as a case of digital service provision in which creators (speakers) and consumers (audiences) interact. By comparing the thematic supply of creators with the demand expressed in audience engagement, we identify persistent mismatches between producer offerings and consumer preferences. Our longitudinal analysis further reveals that temporal dynamics exert a stronger influence on consumer engagement than thematic content, suggesting that when content is delivered may matter more than what is delivered. These findings challenge the dominant assumption that content features are the primary drivers of popularity and highlight the importance of timing and contextual factors in shaping consumer responses. The results provide new insights into consumer attention dynamics on digital platforms and carry practical implications for marketers, platform managers, and content creators seeking to optimize audience engagement strategies.[70] Assessing Large Language Models for Structured Medical Order Extraction
A H M Rezaul Karim,Ozlem Uzuner
Main category: cs.CL
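TL;DR: Presents an LLM-based approach to medical order extraction that uses a LLaMA-4 17B model without domain-specific fine-tuning, guided by a single in-context example. It ranked 5th in the MEDIQA-OE 2025 shared task, showing that prompt engineering is effective for clinical NLP tasks.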
Details
Motivation: 医疗指令提取对于结构化临床信息、支持决策和实现自动化至关重要,但指令常分散于多种文本中,需从对话记录中准确提取类型、描述、原因和来源等结构化信息。 Method: 采用通用的指令调优大模型LLaMA-4 17B,不进行领域微调,仅通过单个上下文示例进行少样本学习,并结合有效的提示工程来完成医疗指令提取任务。 Result: 在MEDIQA-OE 2025任务中排名第五(共17队,105次提交),平均F1得分为37.76,在原因和来源提取上表现尤为突出。 Conclusion: 大型通用语言模型结合有效提示工程可作为专业临床NLP任务的强而可扩展的基线方法,无需领域特定微调即可取得良好效果。 Abstract: Medical order extraction is essential for structuring actionable clinical information, supporting decision-making, and enabling downstream applications such as documentation and workflow automation. Orders may be embedded in diverse sources, including electronic health records, discharge summaries, and multi-turn doctor-patient dialogues, and can span categories such as medications, laboratory tests, imaging studies, and follow-up actions. The MEDIQA-OE 2025 shared task focuses on extracting structured medical orders from extended conversational transcripts, requiring the identification of order type, description, reason, and provenance. We present the MasonNLP submission, which ranked 5th among 17 participating teams with 105 total submissions. Our approach uses a general-purpose, instruction-tuned LLaMA-4 17B model without domain-specific fine-tuning, guided by a single in-context example. This few-shot configuration achieved an average F1 score of 37.76, with notable improvements in reason and provenance accuracy. These results demonstrate that large, non-domain-specific LLMs, when paired with effective prompt engineering, can serve as strong, scalable baselines for specialized clinical NLP tasks.[71] UltraLLaDA: Scaling the Context Length to 128K for Diffusion Large Language Models
Guangxin He,Shen Nie,Fengqi Zhu,Yuankang Zhao,Tianyi Bai,Ran Yan,Jie Fu,Chongxuan Li,Binhang Yuan
Main category: cs.CL
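TL;DR: Studies the long-context behavior of diffusion LLMs and shows that a simple modification to the Rotary Positional Embeddings (RoPE) extension, combined with an effective masking strategy, extends the context window to 128K tokens without training from scratch. The resulting model, UltraLLaDA, significantly outperforms training-free baselines on long-context tasks.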
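For readers unfamiliar with RoPE context extension, the sketch below computes standard rotary frequencies and applies NTK-aware base scaling, one widely used extension technique. It is shown only as background: the entry does not specify the paper's modification, and the scale factor, head dimension, and function names here are assumptions.

```python
def rope_inv_freq(head_dim: int, base: float = 10000.0) -> list:
    """Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """NTK-aware base scaling: base * scale^(d / (d - 2)).

    Leaves the highest frequency unchanged and divides the lowest
    frequency by exactly `scale`.
    """
    return base * scale ** (head_dim / (head_dim - 2))


def rope_angles(position: int, inv_freq: list) -> list:
    """Rotation angle applied to each frequency pair at a given position."""
    return [position * f for f in inv_freq]
```

Scaling the base stretches the slow-rotating dimensions so that positions beyond the original training length map to angles the model has already seen, while the fast-rotating dimensions that encode local order stay intact.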
Details
Motivation: 扩散大语言模型在多种下游任务中展现出潜力,但其在长上下文场景下的表现尚不明确,亟需有效且高效的方法来扩展其上下文窗口。 Method: 通过改进旋转位置编码(RoPE)以适应扩散过程中的概率建模特性,并在后训练阶段比较不同的掩码策略,分析其对优化稳定性和长距离记忆能力的影响。 Result: 成功构建了具有128K token上下文窗口的扩散LLM——UltraLLaDA,在长上下文任务的实验评估中显著优于无需训练的基线方法,验证了位置编码扩展的关键作用。 Conclusion: 改进的RoPE扩展方法和合适的掩码策略是实现扩散LLM长上下文扩展的关键,为通过高效后训练获得128K级上下文提供了实用指导。 Abstract: Diffusion LLMs have attracted growing interest, with plenty of recent work emphasizing their great potential in various downstream tasks; yet the long-context behavior of diffusion LLMs remains largely uncharted. We present a case study of post-training techniques for extending the context window of diffusion LLMs (i.e., LLaDA) without retraining from scratch. We show that a simple modification to the standard Rotary Positional Embeddings (RoPE) extension effectively accommodates the probabilistic modeling inherent in the diffusion process, enabling stable scaling to longer context ranges. We further compare masking strategies used during post-training and analyze their impact on optimization stability and long-range recall. Instantiating these insights, we introduce UltraLLaDA, a diffusion LLM with a 128K-token context window that, in our empirical evaluation on long-context tasks, significantly outperforms training-free baselines. Our experimental results highlight the special positional extension as a key lever for scaling diffusion LLMs to extended contexts and offer practical guidance for practitioners seeking 128K-scale context via efficient post-training.[72] VOLTAGE: A Versatile Contrastive Learning based OCR Methodology for ultra low-resource scripts through Auto Glyph Feature Extraction
Prawaal Sharma,Poonam Goyal,Vidisha Sharma,Navneet Goyal
Main category: cs.CL
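TL;DR: This paper proposes VOLTAGE, a contrastive-learning-based OCR method that uses automatic glyph feature recommendation for cluster-based labelling together with data augmentation, improving recognition accuracy for low-resource scripts (Takri and other Indic scripts) and supporting their digital preservation.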
Details
Motivation: Endangered languages risk extinction for lack of digital inclusion; low-resource languages in particular lack unsupervised OCR techniques, which impedes their digitization. Method: Uses VOLTAGE, a contrastive-learning-based OCR framework that introduces automatic glyph feature recommendation for cluster-based labelling and augments the labelled data in diversity and volume with image transformations and generative adversarial networks (GANs). Result: Reaches 95% accuracy on machine-printed and 87% on handwritten Takri samples, and demonstrates generality across several Indic scripts. Conclusion: VOLTAGE offers an effective OCR solution for low-resource languages and supports the digital preservation of endangered languages. Abstract: UNESCO has classified 2500 out of 7000 languages spoken worldwide as endangered. Attrition of a language leads to loss of traditional wisdom, folk literature, and the essence of the community that uses it. It is therefore imperative to bring digital inclusion to these languages and avoid their extinction. Low resource languages are at a greater risk of extinction. Lack of unsupervised Optical Character Recognition (OCR) methodologies for low resource languages is one of the reasons impeding their digital inclusion. We propose VOLTAGE - a contrastive learning based OCR methodology, leveraging auto-glyph feature recommendation for cluster-based labelling. We augment the labelled data for diversity and volume using image transformations and Generative Adversarial Networks. VOLTAGE has been designed using Takri - a family of scripts used from the 16th to the 20th century in the Himalayan regions of India. We present results for Takri along with other Indic scripts (both low and high resource) to substantiate the universal behavior of the methodology. An accuracy of 95% for machine printed and 87% for handwritten samples on Takri script has been achieved. We conduct baseline and ablation studies along with building downstream use cases for Takri, demonstrating the usefulness of our work.
[73] Merlin's Whisper: Enabling Efficient Reasoning in LLMs via Black-box Adversarial Prompting
Heming Xia,Cunxiao Du,Rui Li,Chak Tou Leong,Yongqi Li,Wenjie Li
Main category: cs.CL
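TL;DR: This paper proposes AdvPrompt, an iterative refinement framework that uses black-box adversarial prompting to reduce "overthinking" in large reasoning models (LRMs), substantially shortening responses and cutting token usage without sacrificing accuracy.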
Details
Motivation: 大推理模型在逐步推理中表现出色,但冗长的推理过程带来了高昂的计算开销和延迟,限制了其实际部署。因此,需要一种通用方法来缓解过度思考问题。 Method: 将开源和闭源的大推理模型视为黑盒通信器,设计AdvPrompt框架,从多个角度迭代生成高质量的对抗性提示,以引导模型生成更简洁且准确的回答。 Result: 实验表明,AdvPrompt在多个基准上显著减少了token使用:在GSM8K简单问题上对Qwen3系列模型实现了3倍响应长度缩减;在四个基准上平均减少约40%的token;对于闭源API,在MATH-500上使Claude-3.7减少35%、Gemini-2.5减少47%的token消耗。此外,该方法展现出跨模型规模和家族的良好泛化能力。 Conclusion: AdvPrompt展示了黑盒对抗提示作为提升大推理模型推理效率的一种实用而有效策略的巨大潜力,为降低LRM部署成本提供了新方向。 Abstract: Large reasoning models (LRMs) have demonstrated remarkable proficiency in tackling complex reasoning tasks through step-by-step thinking. However, such a lengthy reasoning process incurs substantial computational and latency overheads, hindering the practical deployment of these models. In this work, we present a new perspective on mitigating overthinking in LRMs via black-box adversarial prompting. By treating both open-source LRMs and closed-source APIs as black-box communicators, we investigate how to elicit concise responses without sacrificing accuracy. We introduce AdvPrompt, an iterative refinement framework that generates high-quality adversarial prompts from diverse perspectives. Experiments across multiple benchmarks demonstrate that AdvPrompt consistently reduces token usage while preserving performance. Notably, AdvPrompt achieves a 3x reduction in average response length on simple GSM8K questions for the Qwen3 model series, and delivers an average ~40% token reduction across four benchmarks. For closed-source APIs, AdvPrompt reduces token usage on MATH-500 by 35% for Claude-3.7 and 47% for Gemini-2.5. Further analysis reveals the generalizability of AdvPrompt across various model scales and families, underscoring the potential of black-box prompting as a practical and effective strategy for enhancing LRM efficiency.[74] Detecting Hallucinations in Authentic LLM-Human Interactions
Yujie Ren,Niklas Gruhlke,Anne Lauscher
Main category: cs.CL
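TL;DR: This paper introduces AuthenHallu, the first hallucination detection benchmark built from authentic LLM-human interactions. It shows how prevalent hallucinations are in real usage, with especially high rates in challenging domains such as math, and examines the potential and limits of using vanilla LLMs as detectors.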
Details
Motivation: 现有幻觉检测基准多为人工构造,无法真实反映现实场景中LLM幻觉的特点,因此需要一个基于真实人机对话的更可靠的基准。 Method: 从真实的LLM-人类对话中选取并标注样本,构建AuthenHallu基准,并进行统计分析;同时探索使用基础LLM自身作为幻觉检测器的可行性。 Result: 统计显示31.4%的问答对存在幻觉,在数学等领域高达60.0%;实验表明当前LLM作为检测器仍有不足。 Conclusion: AuthenHallu更真实地反映了现实场景中的幻觉特征,为幻觉检测研究提供了更可靠的基准,同时指出当前LLM在检测任务上的性能仍有限。 Abstract: As large language models (LLMs) are increasingly applied in sensitive domains such as medicine and law, hallucination detection has become a critical task. Although numerous benchmarks have been proposed to advance research in this area, most of them are artificially constructed--either through deliberate hallucination induction or simulated interactions--rather than derived from genuine LLM-human dialogues. Consequently, these benchmarks fail to fully capture the characteristics of hallucinations that occur in real-world usage. To address this limitation, we introduce AuthenHallu, the first hallucination detection benchmark built entirely from authentic LLM-human interactions. For AuthenHallu, we select and annotate samples from genuine LLM-human dialogues, thereby providing a faithful reflection of how LLMs hallucinate in everyday user interactions. Statistical analysis shows that hallucinations occur in 31.4% of the query-response pairs in our benchmark, and this proportion increases dramatically to 60.0% in challenging domains such as Math & Number Problems. Furthermore, we explore the potential of using vanilla LLMs themselves as hallucination detectors and find that, despite some promise, their current performance remains insufficient in real-world scenarios.[75] BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices
Euhid Aman,Esteban Carlin,Hsing-Kuo Pao,Giovanni Beltrame,Ghaluh Indah Permata Sari,Yie-Tarng Chen
Main category: cs.CL
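TL;DR: BitMar is a quantized multimodal transformer that adds a human-like episodic memory and 1.58-bit encoders to achieve efficient image-text generation and understanding on resource-constrained edge devices.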
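The "1.58-bit" figure in the TL;DR above corresponds to ternary weights, since log2(3) ≈ 1.58. A minimal sketch of one common way to obtain such weights is absmean quantization, as used in BitNet-style models. Whether BitMar applies exactly this rule is an assumption, and the function names and sample weights are hypothetical.

```python
def absmean_ternary_quantize(weights, eps=1e-8):
    """Map a flat list of floats to {-1, 0, +1} plus one shared scale (absmean rule)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Divide by the mean absolute value, round to the nearest integer, clip to [-1, 1].
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover an approximation of the original weights."""
    return [q * scale for q in quantized]


weights = [0.9, -0.05, 0.4, -1.2, 0.0, 0.3]   # hypothetical weight vector
q, scale = absmean_ternary_quantize(weights)
approx = dequantize(q, scale)
```

Each weight then needs only one of three states plus a single floating-point scale per tensor, which is where the memory savings on edge hardware would come from.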
Details
Motivation: 现有视觉-语言模型因全精度骨干网络难以部署于边缘设备,且缺乏与激进量化结合的记忆增强机制。 Method: 采用1.58位文本和视觉编码器生成紧凑嵌入,结合固定大小的键值情景记忆进行检索;解码器使用逐层条件控制和带滑动窗口的注意力sink机制处理长序列。 Result: 在低延迟和小模型体积下实现了具有竞争力的图像描述生成和多模态理解性能,适合边缘部署。 Conclusion: BitMar通过量化与记忆增强的协同设计,有效平衡了生成质量、速度与资源消耗,为边缘设备上的多模态任务提供了可行方案。 Abstract: Cross-attention transformers and other multimodal vision-language models excel at grounding and generation; however, their extensive, full-precision backbones make it challenging to deploy them on edge devices. Memory-augmented architectures enhance the utilization of past context; however, most works rarely pair them with aggressive edge-oriented quantization. We introduce BitMar, a quantized multimodal transformer that proposes an external human-like episodic memory for effective image-text generation on hardware with limited resources. BitMar utilizes 1.58-bit encoders, one for text (BitNet-style) and one for vision (DiNOv2-based), to create compact embeddings that are combined and used to query a fixed-size key-value episodic memory. During vector retrieval, the BitNet decoder applies per-layer conditioning, which increases the contextual relevance of generated content. The decoder also employs attention sinks with a sliding-window mechanism to process long or streaming inputs under tight memory budgets. The combination of per-layer conditioning and sliding-window attention achieves a strong quality-speed trade-off, delivering competitive captioning and multimodal understanding at low latency with a small model footprint. These characteristics make BitMar well-suited for edge deployment.[76] Dynamic Topic Evolution with Temporal Decay and Attention in Large Language Models
Di Wu,Shuaidong Pan
Main category: cs.CL
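TL;DR: Proposes a dynamic topic evolution framework based on temporal large language models. It uses contextual embeddings, a temporal decay function, and an attention mechanism to capture how topics change over time, and describes topic dynamics with a state transition matrix in a latent topic space.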
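The two mechanisms named in the TL;DR above can be sketched roughly as follows, assuming an exponential decay and a row-stochastic transition matrix. The summary gives neither the decay form nor how the matrix is learned, so the function names, the decay rate, and the numbers are all hypothetical.

```python
import math


def decayed_attention_weights(scores, timestamps, now, decay_rate=0.1):
    """Softmax over relevance scores, each penalized in proportion to the item's age."""
    # Subtracting decay_rate * age in logit space equals multiplying by exp(-decay_rate * age).
    logits = [s - decay_rate * (now - t) for s, t in zip(scores, timestamps)]
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def evolve_topics(topic_dist, transition):
    """One evolution step: topic distribution (row vector) times a row-stochastic matrix."""
    k = len(topic_dist)
    return [sum(topic_dist[i] * transition[i][j] for i in range(k)) for j in range(k)]


attention = decayed_attention_weights(scores=[1.0, 1.0], timestamps=[0.0, 9.0], now=10.0)
next_topics = evolve_topics([0.7, 0.3], [[0.9, 0.1], [0.2, 0.8]])
```

In this toy setup the two items have equal relevance, so the more recent one receives the larger attention weight, and the transition step keeps the topic distribution normalized.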
Details
Motivation: 传统主题模型难以有效捕捉文本中主题的动态演化过程,尤其在大规模语料中缺乏对时间连续性和语义一致性的统一建模。 Method: 利用大语言模型获取文本上下文嵌入,引入时间衰减函数和注意力机制调整不同时间段语义单元的重要性,将时序表示映射到潜在主题空间,并使用状态转移矩阵建模主题演化,通过联合优化目标约束语义建模与时间一致性。 Result: 在真实语料库上的实验表明,该框架能有效捕捉主题的生成、扩展与衰退过程,在多个指标上优于现有模型,提升了主题连贯性、多样性及时序稳定性。 Conclusion: 所提方法实现了语义表示与时间演化的统一建模,为大规模文本中的动态语义模式理解提供了系统解决方案,丰富了主题建模的研究范式,并支持多领域复杂文本分析任务。 Abstract: This paper proposes a modeling framework for dynamic topic evolution based on temporal large language models. The method first uses a large language model to obtain contextual embeddings of text and then introduces a temporal decay function and an attention mechanism. These components allow the model to adjust the importance of semantic units according to time intervals and capture topic variations across different periods. The temporal representations are then mapped into a latent topic space, where a state transition matrix is applied to describe the dynamic evolution of topics. A joint optimization objective constrains both semantic modeling and temporal consistency, ensuring diversity and smoothness in topic generation. The design emphasizes the unified modeling of semantic representation and temporal evolution, which improves topic coherence and diversity while enhancing stability and interpretability over time. Experiments on real-world corpora show that the framework effectively captures the generation, expansion, and decline of topics and outperforms existing models across multiple metrics. Overall, the proposed method provides a systematic solution for understanding dynamic semantic patterns in large-scale text, enriches the research paradigm of topic modeling, and supports complex text analysis tasks in multiple domains.[77] Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization
Bowei He,Lihao Yin,Huiling Zhen,Shuqi Liu,Han Wu,Xiaokun Zhang,Mingxuan Yuan,Chen Ma
Main category: cs.CL
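TL;DR: This paper studies how calibration data affects the capabilities of large language models after compression, proposes an analysis framework based on activation patterns, and designs a calibration data curation method that better preserves critical capabilities.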
Details
Motivation: Existing work lacks a systematic study of how the compositional properties and domain correspondence of calibration data affect different capabilities of compressed LLMs, especially on complex reasoning tasks. Method: Analyzes, from the activation-pattern perspective, how the representativeness and diversity of calibration data affect compression results, and proposes a calibration data curation framework based on this analysis. Result: Representativeness and diversity in activation space are found to be the key determinants of calibration data quality; the proposed framework improves compressed models on complex tasks such as math problem solving and code generation. Conclusion: Calibration data quality should be judged by its distribution in activation space, and the proposed framework helps preserve LLM performance on high-level reasoning. Abstract: Post-training compression has been a widely employed approach to scale down large language models (LLMs) and facilitate efficient inference. In various proposed compression methods, including pruning and quantization, calibration data plays a vital role by informing the weight importance and activation dynamic ranges. However, how calibration data impacts the LLM capability after compression is less explored. Few of the existing works, though recognizing the significance of this study, only investigate the language modeling or commonsense reasoning performance degradation from limited angles, like the data sources or sample amounts. More systematic research is still needed to examine the impacts on different LLM capabilities in terms of compositional properties and domain correspondence of calibration data. In this work, we aim at bridging this gap and further analyze underlying influencing mechanisms from the activation pattern perspective. In particular, we explore the calibration data's impacts on high-level complex reasoning capabilities, like math problem solving and code generation. Delving into the underlying mechanism, we find that the representativeness and diversity in activation space more fundamentally determine the quality of calibration data. Finally, we propose a calibration data curation framework based on such observations and analysis, enhancing the performance of existing post-training compression methods on preserving critical LLM capabilities. Our code is available at https://github.com/BokwaiHo/COLA.git.
[78] FactAppeal: Identifying Epistemic Factual Appeals in News Media
Guy Mor-Lan,Tamir Sheafer,Shaul R. Shenhav
Main category: cs.CL
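TL;DR: This paper proposes the new task of Epistemic Appeal Identification, which identifies whether and how factual statements are anchored by external sources or evidence, and releases FactAppeal, an annotated dataset of 3,226 English news sentences.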
Details
Motivation: 现有研究多集中于声明检测与验证,缺乏对支持事实陈述的认知结构和证据基础的细致分析,因此需要一种能识别事实主张背后证据类型与来源特征的方法。 Method: 构建了一个手动标注的数据集FactAppeal,包含跨度级标注,标识事实陈述及其依赖的信息源,并细粒度标注来源类型、命名情况、角色、资质、引用方式等特征;使用20亿到90亿参数范围内的编码器和生成式解码器模型进行建模。 Result: 最佳模型(基于Gemma 2 9B)在宏观F1分数上达到0.73,验证了该任务的可行性。 Conclusion: FactAppeal数据集和支持模型为深入分析事实主张的可信性构建提供了有效工具,推动了认知诉求识别这一新方向的发展。 Abstract: How is a factual claim made credible? We propose the novel task of Epistemic Appeal Identification, which identifies whether and how factual statements have been anchored by external sources or evidence. To advance research on this task, we present FactAppeal, a manually annotated dataset of 3,226 English-language news sentences. Unlike prior resources that focus solely on claim detection and verification, FactAppeal identifies the nuanced epistemic structures and evidentiary basis underlying these claims and used to support them. FactAppeal contains span-level annotations which identify factual statements and mentions of sources on which they rely. Moreover, the annotations include fine-grained characteristics of factual appeals such as the type of source (e.g. Active Participant, Witness, Expert, Direct Evidence), whether it is mentioned by name, mentions of the source's role and epistemic credentials, attribution to the source via direct or indirect quotation, and other features. We model the task with a range of encoder models and generative decoder models in the 2B-9B parameter range. Our best performing model, based on Gemma 2 9B, achieves a macro-F1 score of 0.73.[79] You're Not Gonna Believe This: A Computational Analysis of Factual Appeals and Sourcing in Partisan News
Guy Mor-Lan,Tamir Sheafer,Shaul R. Shenhav
Main category: cs.CL
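TL;DR: Through a large-scale comparison of CNN and Fox News coverage during the COVID-19 pandemic and the Israel-Hamas war, this study finds marked differences in factual statements and sourcing: CNN cites experts and expert documents more often, appealing to formal authority, whereas Fox News favors news reports and direct quotations. The outlets thus use systematically different epistemic strategies to construct reality.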
Details
Motivation: Media bias has been studied extensively, but the epistemic strategies behind factual reporting remain computationally underexplored. This paper uses computational methods to analyze the different knowledge-construction strategies that mainstream outlets adopt when covering the same events. Method: Uses an article matching strategy to pair CNN and Fox News reports on the same events, and applies the FactAppeal framework to more than 470K articles, separating reporting style from topic selection. Result: CNN's reporting contains more factual statements and cites external sources more often; CNN tends to cite experts and expert documents to establish credibility, whereas Fox News relies more on other news reports and direct quotations. Conclusion: Outlets with different partisan leanings use systematically different epistemic strategies to construct facts, which adds a new dimension to understanding media bias. Abstract: While media bias is widely studied, the epistemic strategies behind factual reporting remain computationally underexplored. This paper analyzes these strategies through a large-scale comparison of CNN and Fox News. To isolate reporting style from topic selection, we employ an article matching strategy to compare reports on the same events and apply the FactAppeal framework to a corpus of over 470K articles covering two highly politicized periods: the COVID-19 pandemic and the Israel-Hamas war. We find that CNN's reporting contains more factual statements and is more likely to ground them in external sources. The outlets also exhibit sharply divergent sourcing patterns: CNN builds credibility by citing Experts and Expert Documents, constructing an appeal to formal authority, whereas Fox News favors News Reports and direct quotations. This work quantifies how partisan outlets use systematically different epistemic strategies to construct reality, adding a new dimension to the study of media bias.
[80] AGENTIQL: An Agent-Inspired Multi-Expert Framework for Text-to-SQL Generation
Omid Reza Heidari,Siobhan Reid,Yassine Yaakoubi
Main category: cs.CL
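TL;DR: AGENTIQL is an agent-inspired multi-expert framework whose modular design improves the accuracy and interpretability of text-to-SQL generation, reaching 86.07% execution accuracy on the Spider benchmark with 14B models.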
Details
Motivation: Existing monolithic architectures struggle with complex reasoning and diverse schemas, and lack interpretability and flexibility. Method: Proposes the AGENTIQL framework, comprising a reasoning agent for question decomposition, a coding agent for sub-query generation, a refinement step for column selection, and an adaptive router that balances efficiency and accuracy; some steps can run in parallel. Result: On the Spider benchmark, AGENTIQL reaches 86.07% execution accuracy with 14B models, approaching the GPT-4-based baseline (89.65%) while improving interpretability and transparency. Conclusion: Through modularity, multi-agent collaboration, and adaptive routing, AGENTIQL offers an efficient, scalable, and interpretable approach to semantic parsing, particularly suited to small and mid-sized open-source LLMs. Abstract: LLMs have advanced text-to-SQL generation, yet monolithic architectures struggle with complex reasoning and schema diversity. We propose AGENTIQL, an agent-inspired multi-expert framework that combines a reasoning agent for question decomposition, a coding agent for sub-query generation, and a refinement step for column selection. An adaptive router further balances efficiency and accuracy by selecting between our modular pipeline and a baseline parser. Several steps in the pipeline can be executed in parallel, making the framework scalable to larger workloads. Evaluated on the Spider benchmark, AGENTIQL improves execution accuracy and interpretability and achieves up to 86.07% EX with 14B models using the Planner&Executor merging strategy, narrowing the gap to GPT-4-based SOTA (89.65% EX) while using much smaller open-source LLMs. The attained performance depends on the efficacy of the routing mechanism. Beyond accuracy, AGENTIQL enhances transparency by exposing intermediate reasoning steps, offering a robust, scalable, and interpretable approach to semantic parsing.
[81] BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions
Zhengbo Zhang,Zhiheng Lyu,Junhao Gong,Hongzhu Yi,Xinming Wang,Yuxuan Zhou,Jiabing Yang,Ping Nie,Yan Huang,Wenhu Chen
Main category: cs.CL
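TL;DR: Proposes BrowserAgent, an agent that interacts with web pages directly through human-like browser actions (clicking, scrolling, typing). With two-stage training and an explicit memory mechanism, it clearly outperforms Search-R1 while using less training data.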
Details
Motivation: Existing methods rely on converting web pages to static text and cannot mimic real human browsing behavior, which limits the autonomy and interactivity of LLMs on dynamic web tasks. Method: Builds BrowserAgent on Playwright to execute predefined browser actions; uses two-stage SFT+RFT training to improve generalization; adds an explicit memory mechanism to support long-horizon reasoning. Result: BrowserAgent-7B improves on Search-R1 by about 20% on multi-hop QA tasks such as HotpotQA, 2Wiki, and Bamboogle, with less training data. Conclusion: With more human-like interaction and an effective training strategy, BrowserAgent serves as a more interactive and scalable framework for web agents. Abstract: Efficiently solving real-world problems with LLMs increasingly hinges on their ability to interact with dynamic web environments and autonomously acquire external information. While recent research like Search-R1 and WebDancer demonstrates strong performance in solving web tasks, they heavily rely on additional tools to convert the interactive web environment into static text content. This is in contrast to human browsing behaviors, which involve diverse interactions with the browser, such as scrolling, clicking, and typing. In this paper, we propose BrowserAgent, a more interactive agent that solves complex tasks through human-inspired browser actions. BrowserAgent operates directly on raw web pages via Playwright through a set of predefined browser actions. We adopt two-stage training (Supervised Fine-Tuning (SFT) and Rejection Fine-Tuning (RFT)) to improve the model's generalization abilities. Despite using significantly less training data than Search-R1, BrowserAgent achieves more competitive results across different Open-QA tasks. Additionally, we introduce an explicit memory mechanism to store key conclusions across steps, further enhancing the model's reasoning capabilities for long-horizon tasks. Notably, BrowserAgent-7B can achieve around 20% improvement over Search-R1 on multi-hop QA tasks like HotpotQA, 2Wiki, and Bamboogle. These results indicate that BrowserAgent can serve as a more advanced framework for more interactive and scalable web agents.
[82] Unlocking LLM Safeguards for Low-Resource Languages via Reasoning and Alignment with Minimal Training Data
Zhuowei Chen,Bowei Zhang,Nankai Lin,Tian Hou,Lianxi Wang
Main category: cs.CL
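TL;DR: Proposes ConsistentGuard, a reasoning-based multilingual safeguard that improves explainability through reasoning and strengthens cross-lingual knowledge transfer through alignment, performing strongly with only a small number of training samples.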
Details
Motivation: 现有基于分类器的方法缺乏可解释性,且在低资源语言上表现不佳,需要更有效、可解释的多语言安全检测方法。 Method: 提出ConsistentGuard,采用基于推理的框架,结合语言对齐机制,提升多语言场景下的知识迁移和可解释性。 Result: 在三个数据集、六种语言上验证,仅用1000个训练样本即优于使用更多数据训练的大模型,展现出强可解释性和泛化能力。 Conclusion: ConsistentGuard是一种高效、可解释的多语言安全防护方法,适用于低资源语言场景,并推动了该领域的公开研究。 Abstract: Recent advances in LLMs have enhanced AI capabilities, but also increased the risk posed by malicious requests, highlighting the need for effective LLM safeguards to detect such queries. Existing approaches largely rely on classifier-based methods that lack interpretability and perform poorly on low-resource languages. To address these limitations, we propose ConsistentGuard, a novel reasoning-based multilingual safeguard, which enhances explainability via reasoning and boosts knowledge transfer between languages through alignment. With only 1,000 training samples, our method demonstrates superior performance on three datasets across six languages, outperforming larger models trained with significantly more data, and exhibits strong interpretability and generalization ability. We also contribute a multilingual benchmark extension and release our codes to support future research.[83] RePro: Training Language Models to Faithfully Recycle the Web for Pretraining
Zichun Yu,Chenyan Xiong
Main category: cs.CL
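TL;DR: This paper proposes RePro, a web recycling method that trains a small language model with reinforcement learning to produce high-quality, faithful rephrasings, significantly improving the data efficiency and performance of LLM pretraining.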
Details
Motivation: 随着高质量预训练数据的枯竭,如何有效利用现有数据成为大语言模型发展的关键挑战。 Method: 设计了一个质量奖励和三个保真度奖励,使用强化学习训练一个4B参数的语言模型对DCLM-RefinedWeb中的72B token进行重述。 Result: 在400M和1.4B模型上的实验表明,RePro在22个下游任务上比仅使用原始数据的基线模型相对准确率提升4.7%-14.0%,且优于现有最先进方法ReWire及四倍数据量的有机基线。 Conclusion: RePro能以2-3倍的数据效率提升模型性能,为大语言模型预训练提供了一条高效可控的数据再利用路径。 Abstract: High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4x larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3x. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness the fossil fuel of LLM pretraining. We open-source our code, rephraser, and recycled data at https://github.com/cxcscmu/RePro.[84] Sarcasm Detection Using Deep Convolutional Neural Networks: A Modular Deep Learning Framework
Manas Zambre,Sarika Bobade
Main category: cs.CL
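TL;DR: Proposes a modular deep learning framework based on deep convolutional neural networks and BERT for detecting sarcasm in text, combining sentiment analysis, contextual embeddings, linguistic feature extraction, and emotion detection.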
Details
Motivation: 讽刺在文本中因缺乏语气和肢体语言而难以识别,现有方法在捕捉语境和情感细微差别方面存在不足。 Method: 构建一个多层次的模块化深度学习架构,结合DCNN和BERT模型,整合情感分析、上下文嵌入、语言特征和情绪检测。 Result: 模型尚处于概念阶段,但展示了在聊天机器人和社会媒体分析等实际应用中的可行性。 Conclusion: 该框架有望提升文本讽刺检测的准确性,为自然语言理解系统提供更强大的语用分析能力。 Abstract: Sarcasm is a nuanced and often misinterpreted form of communication, especially in text, where tone and body language are absent. This paper proposes a modular deep learning framework for sarcasm detection, leveraging Deep Convolutional Neural Networks (DCNNs) and contextual models such as BERT to analyze linguistic, emotional, and contextual cues. The system integrates sentiment analysis, contextual embeddings, linguistic feature extraction, and emotion detection through a multi-layer architecture. While the model is in the conceptual stage, it demonstrates feasibility for real-world applications such as chatbots and social media analysis.[85] Large Language Models for Full-Text Methods Assessment: A Case Study on Mediation Analysis
Wenqing Zhang,Trang Nguyen,Elizabeth A. Stuart,Yiqun T. Chen
Main category: cs.CL
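TL;DR: Large language models (LLMs) come close to human experts at automatically extracting explicitly stated methodological information, but their accuracy drops markedly on assessments that require complex inference, so they should be paired with human review to make evidence synthesis both efficient and rigorous.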
Details
Motivation: Systematic reviews are essential for synthesizing scientific evidence, but extracting methodological information still depends heavily on manual work. The study evaluates how far current LLMs can automate methodological assessment and how well they agree with human experts. Method: Taking causal mediation analysis as a representative methodological domain, benchmarks state-of-the-art LLMs against human experts on 180 full-text scientific articles, comparing agreement, accuracy, and error patterns. Result: Model judgments correlate closely with human ones (accuracy correlation 0.71; F1 correlation 0.97) and approach human level on explicitly stated criteria, but accuracy trails experts by up to 15% on complex inference tasks. Models often misjudge methodological rigor from surface keyword cues, and accuracy falls as documents get longer. Conclusion: Current LLMs are good at identifying explicit methodological features but still need human oversight on tasks that require deeper interpretation. Combining automated extraction with targeted expert review is a workable way to improve the efficiency and quality of evidence synthesis. Abstract: Systematic reviews are crucial for synthesizing scientific evidence but remain labor-intensive, especially when extracting detailed methodological information. Large language models (LLMs) offer potential for automating methodological assessments, promising to transform evidence synthesis. Here, using causal mediation analysis as a representative methodological domain, we benchmarked state-of-the-art LLMs against expert human reviewers across 180 full-text scientific articles. Model performance closely correlated with human judgments (accuracy correlation 0.71; F1 correlation 0.97), achieving near-human accuracy on straightforward, explicitly stated methodological criteria. However, accuracy sharply declined on complex, inference-intensive assessments, lagging expert reviewers by up to 15%. Errors commonly resulted from superficial linguistic cues -- for instance, models frequently misinterpreted keywords like "longitudinal" or "sensitivity" as automatic evidence of rigorous methodological approaches, leading to systematic misclassifications. Longer documents yielded lower model accuracy, whereas publication year showed no significant effect. Our findings highlight an important pattern for practitioners using LLMs for methods review and synthesis from full texts: current LLMs excel at identifying explicit methodological features but require human oversight for nuanced interpretations. Integrating automated information extraction with targeted expert review thus provides a promising approach to enhance efficiency and methodological rigor in evidence synthesis across diverse scientific fields.
[86] HiligayNER: A Baseline Named Entity Recognition Model for Hiligaynon
James Ald Teves,Ray Daniel Cal,Josh Magdiel Villaluz,Jean Malolos,Mico Magtira,Ramon Rodriguez,Mideth Abisado,Joseph Marvin Imperial
Main category: cs.CL
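TL;DR: This study presents HiligayNER, the first baseline named entity recognition (NER) model for Hiligaynon, a Philippine language. Built on a corpus of over 8,000 annotated sentences and fine-tuned from mBERT and XLM-RoBERTa, it achieves F1 scores above 80% and transfers well in cross-lingual evaluation.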
Details
Motivation: Hiligaynon语因缺乏标注语料库和基线模型而在自然语言处理研究中被忽视,本文旨在填补这一空白,推动低资源语言的技术发展。 Method: 收集来自新闻、社交媒体和文学文本的超过8000个Hiligaynon语句并进行NER标注,采用mBERT和XLM-RoBERTa两个Transformer模型在该语料库上进行微调,构建HiligayNER模型。 Result: 两个模型在精确率、召回率和F1分数上均超过80%,跨语言评估显示在Cebuano和Tagalog上的良好迁移性能。 Conclusion: HiligayNER为Hiligaynon语的NLP研究提供了首个公开可用的NER基线模型,具有良好的性能和跨语言潜力,有助于促进菲律宾低资源语言的技术发展。 Abstract: The language of Hiligaynon, spoken predominantly by the people of Panay Island, Negros Occidental, and Soccsksargen in the Philippines, remains underrepresented in language processing research due to the absence of annotated corpora and baseline models. This study introduces HiligayNER, the first publicly available baseline model for the task of Named Entity Recognition (NER) in Hiligaynon. The dataset used to build HiligayNER contains over 8,000 annotated sentences collected from publicly available news articles, social media posts, and literary texts. Two Transformer-based models, mBERT and XLM-RoBERTa, were fine-tuned on this collected corpus to build versions of HiligayNER. Evaluation results show strong performance, with both models achieving over 80% in precision, recall, and F1-score across entity types. Furthermore, cross-lingual evaluation with Cebuano and Tagalog demonstrates promising transferability, suggesting the broader applicability of HiligayNER for multilingual NLP in low-resource settings. This work aims to contribute to language technology development for underrepresented Philippine languages, specifically for Hiligaynon, and support future research in regional language processing.[87] Review of Inference-Time Scaling Strategies: Reasoning, Search and RAG
Zhichao Wang,Cheng Wan,Dong Nie
Main category: cs.CL
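TL;DR: This review surveys techniques for scaling LLM performance at inference time, dividing them into output-focused and input-focused methods.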
Details
Motivation: 由于高质量训练数据的日益稀缺,研究重点从扩大模型和数据规模转向推理时计算扩展。 Method: 系统性地梳理了推理时扩展领域的技术,从输出聚焦(如思维链、搜索解码)和输入聚焦(如少样本学习、检索增强生成RAG)两个角度进行组织。 Result: 全面总结了包括多步推理、搜索策略、RAG及其子模块(查询扩展、检索、重排序等)在内的多种技术进展,并涵盖了多模态RAG的发展。 Conclusion: 推理时扩展已成为提升大模型性能的关键路径,未来的研究将更加依赖于部署阶段的计算效率与方法创新。 Abstract: The performance gains of LLMs have historically been driven by scaling up model size and training data. However, the rapidly diminishing availability of high-quality training data is introducing a fundamental bottleneck, shifting the focus of research toward inference-time scaling. This paradigm uses additional computation at the time of deployment to substantially improve LLM performance on downstream tasks without costly model re-training. This review systematically surveys the diverse techniques contributing to this new era of inference-time scaling, organizing the rapidly evolving field into two comprehensive perspectives: Output-focused and Input-focused methods. Output-focused techniques encompass complex, multi-step generation strategies, including reasoning (e.g., CoT, ToT, ReAct), various search and decoding methods (e.g., MCTS, beam search), training for long CoT (e.g., RLVR, GRPO), and model ensemble methods. Input-focused techniques are primarily categorized by few-shot and RAG, with RAG as the central focus. The RAG section is further detailed through a structured examination of query expansion, data, retrieval and reranker, LLM generation methods, and multi-modal RAG.[88] Toward Human-Centered Readability Evaluation
Bahar İlgen,Georges Hattab
Main category: cs.CL
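TL;DR: Proposes HCRS, a new evaluation framework for health text simplification that combines automatic measures with human feedback and assesses readability along five dimensions, going beyond traditional surface-level metrics.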
Details
Motivation: 现有NLP评估指标(如BLEU、FKGL、SARI)仅捕捉表面特征,忽视了清晰度、可信度、语调、文化相关性和可操作性等以人为中心的可读性质量,尤其在高风险健康传播中存在严重不足。 Method: 基于人机交互(HCI)和健康传播研究,构建五维的人类中心可读性评分(HCRS)框架,并结合自动指标与结构化人类反馈,提出参与式评估流程及实证验证方案。 Result: 提出了HCRS框架,能够更全面地捕捉可读性的关系性和情境性特征,支持将人类反馈系统地整合到评估过程中。 Conclusion: HCRS有助于推动健康文本简化的评估从表面指标转向更贴近用户需求、期望和实际体验的综合性评价,促进开发更具可用性、尊重性和可信度的NLP系统。 Abstract: Text simplification is essential for making public health information accessible to diverse populations, including those with limited health literacy. However, commonly used evaluation metrics in Natural Language Processing (NLP), such as BLEU, FKGL, and SARI, mainly capture surface-level features and fail to account for human-centered qualities like clarity, trustworthiness, tone, cultural relevance, and actionability. This limitation is particularly critical in high-stakes health contexts, where communication must be not only simple but also usable, respectful, and trustworthy. To address this gap, we propose the Human-Centered Readability Score (HCRS), a five-dimensional evaluation framework grounded in Human-Computer Interaction (HCI) and health communication research. HCRS integrates automatic measures with structured human feedback to capture the relational and contextual aspects of readability. We outline the framework, discuss its integration into participatory evaluation workflows, and present a protocol for empirical validation. This work aims to advance the evaluation of health text simplification beyond surface metrics, enabling NLP systems that align more closely with diverse users' needs, expectations, and lived experiences.[89] Is Implicit Knowledge Enough for LLMs? A RAG Approach for Tree-based Structures
Mihir Gupte,Paolo Giusto,Ramesh S
Main category: cs.CL
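TL;DR: Proposes a bottom-up linearization method that builds implicit, aggregated summaries of tree-structured knowledge (such as a GitHub repository) for use in retrieval-augmented generation (RAG), markedly reducing the number of retrieved documents while keeping response quality comparable.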
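The bottom-up idea in the TL;DR above can be sketched as a post-order walk over a repository tree, where each directory's summary is built from its children's summaries. This is a minimal sketch under stated assumptions: `summarize_stub` is a hypothetical placeholder for an LLM summarizer, the nested-dict tree format is invented for illustration, and keeping one document per node is a simplification, since the summary does not say which summaries the authors index.

```python
def summarize_stub(texts, limit=60):
    """Hypothetical stand-in for an LLM summarizer: join the inputs and truncate."""
    return " | ".join(texts)[:limit]


def linearize_bottom_up(name, node, docs, summarize=summarize_stub):
    """Post-order walk: summarize children first, then the parent from their summaries."""
    if isinstance(node, str):            # leaf: a file's raw content
        summary = summarize([node])
    else:                                # internal node: a directory
        child_summaries = [
            linearize_bottom_up(f"{name}/{child}", sub, docs, summarize)
            for child, sub in sorted(node.items())
        ]
        summary = summarize(child_summaries)
    docs.append({"path": name, "summary": summary})   # one flat document per node
    return summary


repo = {   # hypothetical toy repository
    "README.md": "Toy arithmetic helpers",
    "src": {
        "a.py": "def add(x, y): return x + y",
        "b.py": "def sub(x, y): return x - y",
    },
}
docs = []
root_summary = linearize_bottom_up("repo", repo, docs)
```

The resulting `docs` list is flat, so it can be embedded and stored in an ordinary RAG knowledge base, while each entry still carries information aggregated from everything beneath it in the tree.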
Details
Motivation: 现有RAG方法在处理层次化结构数据(如代码仓库)时缺乏有效的知识表示方式,导致检索效率低下。 Method: 采用自底向上的方法对树状结构进行逐层线性化,生成各层级的隐式聚合摘要,并将其存入知识库与RAG结合使用。 Result: 相比直接使用原始非结构化代码的RAG方法,所提方法在响应质量相当的情况下,检索文档数量减少了68%以上。 Conclusion: 利用隐式线性化知识是一种高效且可扩展的策略,适用于处理复杂的层次化数据结构。 Abstract: Large Language Models (LLMs) are adept at generating responses based on information within their context. While this ability is useful for interacting with structured data like code files, another popular method, Retrieval-Augmented Generation (RAG), retrieves relevant documents to augment the model's in-context learning. However, it is not well-explored how to best represent this retrieved knowledge for generating responses on structured data, particularly hierarchical structures like trees. In this work, we propose a novel bottom-up method to linearize knowledge from tree-like structures (like a GitHub repository) by generating implicit, aggregated summaries at each hierarchical level. This approach enables the knowledge to be stored in a knowledge base and used directly with RAG. We then compare our method to using RAG on raw, unstructured code, evaluating the accuracy and quality of the generated responses. Our results show that while response quality is comparable across both methods, our approach generates over 68% fewer documents in the retriever, a significant gain in efficiency. This finding suggests that leveraging implicit, linearized knowledge may be a highly effective and scalable strategy for handling complex, hierarchical data structures.[90] Happiness is Sharing a Vocabulary: A Study of Transliteration Methods
Haeji Jung,Jinju Kim,Kyungjin Kim,Youjeong Roh,David R. Mortensen
Main category: cs.CL
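TL;DR: This paper studies how different transliteration methods (romanization, phonemic transcription, and substitution ciphers) affect model performance in multilingual NLP, finding that romanization significantly outperforms the other input types in most evaluation settings.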
Details
Motivation: 探索共享书写系统、重叠词汇表和共享语音学对多语言模型性能的贡献程度。 Method: 通过使用三种转录方式(罗马化、音位转录、替换密码)以及正字法进行受控实验,并在命名实体识别(NER)和自然语言推断(NLI)两个下游任务上评估模型表现。 Result: 罗马化在8个评估设置中的7个显著优于其他输入类型,且分析表明与预训练语言共享更长的(子词)token有助于更好地利用模型。 Conclusion: 罗马化是提升多语言模型性能最有效的方法,主要得益于其促进子词层面的语言间迁移能力。 Abstract: Transliteration has emerged as a promising means to bridge the gap between various languages in multilingual NLP, showing promising results especially for languages using non-Latin scripts. We investigate the degree to which shared script, overlapping token vocabularies, and shared phonology contribute to performance of multilingual models. To this end, we conduct controlled experiments using three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as orthography. We evaluate each model on two downstream tasks -- named entity recognition (NER) and natural language inference (NLI) -- and find that romanization significantly outperforms other input types in 7 out of 8 evaluation settings, largely consistent with our hypothesis that it is the most effective approach. We further analyze how each factor contributed to the success, and suggest that having longer (subword) tokens shared with pre-trained languages leads to better utilization of the model.[91] DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models
Kaixuan Ren,Preslav Nakov,Usman Naseem
Main category: cs.CL
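TL;DR: This paper presents DUAL-Bench, the first multimodal benchmark targeting over-refusal and safe completion in vision-language models (VLMs), aimed at balancing safety and usefulness when requests are accompanied by images.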
Details
Motivation: 现有的安全机制容易导致模型对无害请求过度拒绝,尤其是在视觉模态中缺乏系统性评估基准,难以应对图文组合中的双用途场景。 Method: 构建了包含12类危害的DUAL-Bench基准,评估18个VLM在语义保持的视觉扰动下的表现,重点衡量其安全完成能力。 Result: 实验结果显示当前模型的安全完成率普遍偏低:GPT-5-Nano为12.9%,GPT-5系列平均7.9%,Qwen系列仅3.9%。 Conclusion: DUAL-Bench揭示了现有VLM在多模态安全对齐方面的不足,推动更细粒度、更灵活的安全策略发展,以实现安全与实用的平衡。 Abstract: As vision-language models become increasingly capable, maintaining a balance between safety and usefulness remains a central challenge. Safety mechanisms, while essential, can backfire, causing over-refusal, where models decline benign requests out of excessive caution. Yet, no existing benchmark has systematically addressed over-refusal in the visual modality. This setting introduces unique challenges, such as dual-use cases where an instruction is harmless, but the accompanying image contains harmful content. Models frequently fail in such scenarios, either refusing too conservatively or completing tasks unsafely, which highlights the need for more fine-grained alignment. The ideal behavior is safe completion, i.e., fulfilling the benign parts of a request while explicitly warning about any potentially harmful elements. To address this, we present DUAL-Bench, the first multimodal benchmark focused on over-refusal and safe completion in VLMs. We evaluated 18 VLMs across 12 hazard categories, with focus on their robustness under semantics-preserving visual perturbations. The results reveal substantial room for improvement: GPT-5-Nano achieves 12.9% safe completion, GPT-5 models average 7.9%, and Qwen models only 3.9%. We hope that DUAL-Bench will foster the development of more nuanced alignment strategies that ensure models remain both safe and useful in complex multimodal settings.[92] Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks
Jiajing Guo,Kenil Patel,Jorge Piazentin Ono,Wenbin He,Liu Ren
Main category: cs.CL
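TL;DR: This paper evaluates six lightweight, industry-oriented test-time scaling strategies and four large language models (including two reasoning models) on Text2SQL using the BIRD Mini-Dev benchmark. Divide-and-Conquer prompting and few-shot demonstrations consistently improve performance, additional workflow steps yield mixed results, and base model selection is critical.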
Details
Motivation: 尽管测试时扩展策略在大语言模型中展现出潜力,但其在真实场景尤其是最新推理模型中的有效性尚不明确,因此需要系统评估其在工业级Text2SQL系统中的实际表现。 Method: 在BIRD Mini-Dev基准上对六种测试时扩展策略和四种大语言模型(含两种推理模型)进行基准测试,并报告准确率、推理延迟和令牌消耗。 Result: 分而治之提示和少样本示例显著提升各类模型性能;引入额外工作流步骤效果不稳定;基础模型的选择对整体表现影响显著。 Conclusion: 在部署Text2SQL系统时,需权衡准确性、效率与复杂性,合理选择模型和轻量级扩展策略是关键。 Abstract: Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. In this work, we benchmark six lightweight, industry-oriented test-time scaling strategies and four LLMs, including two reasoning models, evaluating their performance on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference latency and token consumption, providing insights relevant for practical system deployment. Our findings reveal that Divide-and-Conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-focused LLMs. However, introducing additional workflow steps yields mixed results, and base model selection plays a critical role. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.[93] LLM$\times$MapReduce-V3: Enabling Interactive In-Depth Survey Generation through a MCP-Driven Hierarchically Modular Agent System
Yu Chao,Siyu Lin,xiaorong wang,Zhu Zhang,Zihan Zhou,Haoyu Wang,Shuo Wang,Jie Zhou,Zhiyuan Liu,Maosong Sun
Main category: cs.CL
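TL;DR: LLM x MapReduce-V3 is a hierarchically modular system built on a multi-agent architecture for long-form survey generation; its modular MCP-server design improves controllability and content depth.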
Details
Motivation: 为了提升自动综述生成系统的灵活性、可干预性和内容质量,解决现有方法在结构控制和深度上的不足。 Method: 采用模型-上下文-协议(MCP)服务器将功能模块(如骨架初始化、摘要构建、骨架优化)解耦,通过高层规划智能体动态调度模块,实现分层模块化协同与人类介入。 Result: 系统在人类评估中相比基线模型在内容深度和长度上表现更优,支持多轮交互以精确捕捉研究视角,并生成全面深入的综述。 Conclusion: 基于MCP的模块化设计有效提升了综述生成系统的组织能力与定制化水平,验证了层次化多智能体架构在复杂文本生成任务中的优势。 Abstract: We introduce LLM x MapReduce-V3, a hierarchically modular agent system designed for long-form survey generation. Building on the prior work, LLM x MapReduce-V2, this version incorporates a multi-agent architecture where individual functional components, such as skeleton initialization, digest construction, and skeleton refinement, are implemented as independent model-context-protocol (MCP) servers. These atomic servers can be aggregated into higher-level servers, creating a hierarchically structured system. A high-level planner agent dynamically orchestrates the workflow by selecting appropriate modules based on their MCP tool descriptions and the execution history. This modular decomposition facilitates human-in-the-loop intervention, affording users greater control and customization over the research process. Through a multi-turn interaction, the system precisely captures the intended research perspectives to generate a comprehensive skeleton, which is then developed into an in-depth survey. Human evaluations demonstrate that our system surpasses representative baselines in both content depth and length, highlighting the strength of MCP-based modular planning.[94] ADVICE: Answer-Dependent Verbalized Confidence Estimation
Ki Jung Seo,Sehun Lim,Taeuk Kim
Main category: cs.CL
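TL;DR: This paper proposes ADVICE, a fine-tuning framework that improves the calibration of verbalized confidence in large language models, reducing overconfidence by strengthening the dependence of confidence on the answer.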
Details
Motivation: 大语言模型虽然能用自然语言表达其置信度,但常表现出过度自信,且原因尚不明确。为了提高模型透明度和可靠性,需要深入分析其自信表达的动态机制。 Method: 提出ADVICE(Answer-Dependent Verbalized Confidence Estimation)框架,通过微调使模型的置信度估计与其自身答案相关联,从而实现基于答案的置信度建模,并进行广泛实验验证。 Result: 实验证明ADVICE显著提升了置信度校准效果,同时保持任务性能;分析显示该方法增强了置信度与答案之间的关联性,使置信分布更平衡、更可靠。 Conclusion: ADVICE有效缓解了大语言模型的过度自信问题,揭示了自信表达偏差的根源,并为构建更可信的置信度表达机制提供了可行框架。 Abstract: Recent progress in large language models (LLMs) has enabled them to express their confidence in natural language, enhancing transparency and reliability. However, their confidence often exhibits overconfidence, the cause of which remains poorly understood. In this work, we conduct a detailed analysis of the dynamics underlying verbalized confidence and identify answer-independence as a key factor, defined as the model's failure to condition confidence on its own answer. To address this, we propose ADVICE (Answer-Dependent Verbalized Confidence Estimation), a fine-tuning framework that facilitates answer-grounded confidence estimation. Extensive experiments show that ADVICE substantially improves confidence calibration while preserving task performance. Further analyses confirm that ADVICE strengthens answer-groundedness, leading to more balanced and well-calibrated confidence distributions. Our findings shed light on the origin of overconfidence and establish a framework for more trustworthy confidence verbalization.[95] GapDNER: A Gap-Aware Grid Tagging Model for Discontinuous Named Entity Recognition
Yawen Yang,Fukun Ma,Shiao Meng,Aiwei Liu,Lijie Wen
Main category: cs.CL
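TL;DR: Proposes GapDNER, a new discontinuous named entity recognition model that resolves decoding ambiguity by applying representation learning to context gaps, achieving state-of-the-art performance on three datasets.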
Details
Motivation: 现有的不连续实体识别方法由于片段组合多样,存在误差传播和解码歧义的问题。 Method: 将上下文间隙视为一种额外的跨度类型,转化为词对网格标注任务;设计了内部规则提取模块和跨片段关系增强模块,分别利用双仿射机制与线性注意力、交叉注意力建模词对网格特征。 Result: 在三个基准数据集上的实验表明,GapDNER在不连续NER任务上达到了当前最优性能,尤其在复杂实体结构识别上表现突出。 Conclusion: GapDNER通过显式建模上下文间隙有效提升了不连续命名实体识别的准确性和鲁棒性,为处理非连续、重叠实体提供了新思路。 Abstract: In biomedical fields, one named entity may consist of a series of non-adjacent tokens and overlap with other entities. Previous methods recognize discontinuous entities by connecting entity fragments or internal tokens, which face challenges of error propagation and decoding ambiguity due to the wide variety of span or word combinations. To address these issues, we deeply explore discontinuous entity structures and propose an effective Gap-aware grid tagging model for Discontinuous Named Entity Recognition, named GapDNER. Our GapDNER innovatively applies representation learning on the context gaps between entity fragments to resolve decoding ambiguity and enhance discontinuous NER performance. Specifically, we treat the context gap as an additional type of span and convert span classification into a token-pair grid tagging task. Subsequently, we design two interactive components to comprehensively model token-pair grid features from both intra- and inter-span perspectives. The intra-span regularity extraction module employs the biaffine mechanism along with linear attention to capture the internal regularity of each span, while the inter-span relation enhancement module utilizes criss-cross attention to obtain semantic relations among different spans. At the inference stage of entity decoding, we assign a directed edge to each entity fragment and context gap, then use the BFS algorithm to search for all valid paths from the head to tail of grids with entity tags. 
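The decoding step described above can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the span encoding (inclusive `(start, end)` token indices), the rule that fragments and gaps must alternate along a path, and the function name `decode_entities` are all hypothetical choices made for this example.

```python
from collections import deque


def decode_entities(fragments, gaps, entities):
    """Enumerate fragment sequences for each tagged (head, tail) grid cell.

    Assumed (hypothetical) encoding: every span is an inclusive
    (start, end) pair of token indices.
    """
    # Each span (start, end) becomes a directed edge start -> end + 1.
    edges = {}
    for kind, spans in (("frag", fragments), ("gap", gaps)):
        for start, end in sorted(spans):
            edges.setdefault(start, []).append((end + 1, kind, (start, end)))

    results = []
    for head, tail in sorted(entities):
        # Queue items: (position, kind of last edge taken, fragments so far).
        # Starting with "gap" forces the first edge to be a fragment.
        queue = deque([(head, "gap", ())])
        while queue:
            pos, last_kind, path = queue.popleft()
            if pos == tail + 1 and last_kind == "frag":
                results.append(list(path))
                continue
            for nxt, kind, span in edges.get(pos, []):
                # Fragments and gaps alternate; no edge may overshoot the tail.
                if kind == last_kind or nxt > tail + 1:
                    continue
                new_path = path + (span,) if kind == "frag" else path
                queue.append((nxt, kind, new_path))
    return results


# Tokens: 0 "muscle", 1 "pain", 2 "and", 3 "fatigue"
found = decode_entities(
    fragments={(0, 1), (0, 0), (3, 3)},
    gaps={(1, 2)},
    entities={(0, 1), (0, 3)},
)
print(found)
```

For the toy sentence in the usage lines, the search returns the contiguous entity `[(0, 1)]` ("muscle pain") and the discontinuous entity `[(0, 0), (3, 3)]` ("muscle ... fatigue"), the latter reached only by crossing the gap edge `(1, 2)`.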
Experimental results on three datasets demonstrate that our GapDNER achieves new state-of-the-art performance on discontinuous NER and exhibits remarkable advantages in recognizing complex entity structures.[96] Evaluating Language Models' Evaluations of Games
Katherine M. Collins,Cedegao E. Zhang,Graham Todd,Lance Ying,Mauricio Barba da Costa,Ryan Liu,Prafull Sharma,Adrian Weller,Ionatan Kuperwajs,Lionel Wong,Joshua B. Tenenbaum,Thomas L. Griffiths
Main category: cs.CL
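TL;DR: This paper proposes a new paradigm for assessing the evaluative abilities of AI systems. Using more than 100 novel board games and over 450 human judgments, it compares how modern language and reasoning models evaluate the payoff (fairness) and funness of games. Reasoning models align more closely with human evaluations than non-reasoning language models, but as models approach the game-theoretic optimum their fit to human judgments weakens. Models are also less stable when assessing funness, reflecting the greater difficulty of quantifying it. In addition, reasoning models show highly unpredictable resource usage, indicating a need for more resource-rational meta-reasoning.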
Details
Motivation: 传统AI评估主要关注问题解决能力,而忽视了对‘哪些问题值得解决’这一评估能力的考察。本文旨在建立新的评估范式,衡量AI系统在复杂且主观性较强的决策任务中的判断能力,特别是在游戏设计中的公平性和趣味性评估。 Method: 提出一种用于评估AI评价能力的正式框架,并利用包含100多个新棋盘游戏及450多项人类判断的大规模数据集,对比现代语言模型、推理模型与人类及符号计算代理在两种评估任务(游戏收益/公平性与趣味性)上的表现,分析其一致性与资源使用模式。 Result: 推理模型在评估结果上比非推理语言模型更接近人类判断;但随着模型趋近博弈论最优,其与人类判断的一致性反而减弱;在‘趣味性’评估中模型间差异更大,表现出更高的不稳定性;推理模型在处理不同查询时资源消耗变化大且难以预测。 Conclusion: 当前推理模型虽在人类一致性方面优于非推理模型,但在逼近理论最优时可能偏离人类直觉判断,尤其在难以量化的主观维度(如趣味性)上表现更差。未来需引入资源理性的元推理机制,以提升AI在复杂评估任务中的效率与人类对齐程度。 Abstract: Reasoning is not just about solving problems -- it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over $100$ novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to game-theoretic optimal, their fit to human data weakens. We also observe more "jaggedness" across models for assessing funness, in line with the greater difficulty of quantifying this query. 
Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.[97] End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF: A Reproducibility Study
Anirudh Ganesh,Jayavardhan Reddy
Main category: cs.CL
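TL;DR: This paper reproduces the BiLSTM-CNN-CRF model proposed by Ma and Hovy (2016), achieving a 91.18% F1 score on the CoNLL-2003 NER task and confirming its effectiveness for sequence labeling.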
Details
Motivation: To verify the reproducibility and effectiveness of the neural architecture proposed by Ma and Hovy (2016) for sequence labeling, and to support further research in this area. Method: An end-to-end model combining a character-level CNN, a word-level BiLSTM, and a CRF, requiring no hand-crafted features, evaluated on named entity recognition and part-of-speech tagging. Result: The key results of the original model were reproduced, reaching a 91.18% F1 score on the CoNLL-2003 NER dataset, and a PyTorch implementation was released as open source. Conclusion: The model performs very well on sequence labeling tasks and is readily reproducible; the open-source implementation supports follow-up research. Abstract: We present a reproducibility study of the state-of-the-art neural architecture for sequence labeling proposed by Ma and Hovy (2016). The original BiLSTM-CNN-CRF model combines character-level representations via Convolutional Neural Networks (CNNs), word-level context modeling through Bi-directional Long Short-Term Memory networks (BiLSTMs), and structured prediction using Conditional Random Fields (CRFs). This end-to-end approach eliminates the need for hand-crafted features while achieving excellent performance on named entity recognition (NER) and part-of-speech (POS) tagging tasks. Our implementation successfully reproduces the key results, achieving a 91.18% F1-score on CoNLL-2003 NER and demonstrating the model's effectiveness across sequence labeling tasks. We provide a detailed analysis of the architecture components and release an open-source PyTorch implementation to facilitate further research.
[98] Punctuation-aware treebank tree binarization
Eitan Klinger,Vivaan Wadhwa,Jungyeul Park
Main category: cs.CL
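TL;DR: This paper presents a punctuation-aware treebank binarization resource and evaluation suite. Preserving punctuation as sibling nodes prior to binarization markedly improves head prediction accuracy, and a reproducible pipeline and accompanying evaluation tools are provided.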
Details
Motivation: Standard binarization pipelines remove punctuation before head selection, which distorts constituent structure and harms head identification, so a method that preserves punctuation is needed to improve the quality of syntactic analysis. Method: A reproducible preprocessing pipeline preserves punctuation as sibling nodes prior to binarization and produces auxiliary artifacts such as intermediate markers, reversibility signatures, and alignment indices; an accompanying evaluation suite covers head prediction, reversibility, and structural compatibility. Result: On the Penn Treebank, the method raises head prediction accuracy from 73.66% (Collins rules) and 86.66% (MLP) to 91.85%, and is competitive on alignment with CCGbank derivations. Conclusion: Punctuation-preserving preprocessing markedly improves head prediction accuracy; the released code, configuration files, and documentation support replication and extension to other corpora. Abstract: This article presents a curated resource and evaluation suite for punctuation-aware treebank binarization. Standard binarization pipelines drop punctuation before head selection, which alters constituent shape and harms head-child identification. We release (1) a reproducible pipeline that preserves punctuation as sibling nodes prior to binarization, (2) derived artifacts and metadata (intermediate @X markers, reversibility signatures, alignment indices), and (3) an accompanying evaluation suite covering head-child prediction, round-trip reversibility, and structural compatibility with derivational resources (CCGbank). On the Penn Treebank, punctuation-aware preprocessing improves head prediction accuracy from 73.66% (Collins rules) and 86.66% (MLP) to 91.85% with the same classifier, and achieves competitive alignment against CCGbank derivations. All code, configuration files, and documentation are released to enable replication and extension to other corpora.
[99] KOTOX: A Korean Toxic Dataset for Deobfuscation and Detoxification
Yejin Lee,Su-Hyeon Kim,Hyundong Jin,Dayoung Kim,Yeonsoo Kim,Yo-Sub Han
Main category: cs.CL
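TL;DR: This paper presents KOTOX, the first dataset supporting both deobfuscation and detoxification for Korean, addressing the difficulty large models have in identifying and handling obfuscated toxic content in low-resource languages.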
Details
Motivation: Because existing research focuses mainly on English, detection and detoxification of toxic content in low-resource languages such as Korean remain understudied, and users often employ obfuscation techniques to evade detection, making it difficult for large language models to respond effectively. Method: Obfuscation approaches are categorized according to the linguistic characteristics of Korean, a set of transformation rules is defined from real examples, and three dataset versions with different obfuscation difficulty levels (easy, normal, hard) are constructed. Result: KOTOX was constructed as the first dataset to support both deobfuscation and detoxification for Korean, providing a resource for countering obfuscated toxic content in low-resource languages. Conclusion: The KOTOX dataset helps large language models better understand and handle covert toxic content in low-resource languages and supports further research in this area. Abstract: Toxic content has become an increasingly critical social issue with the rapid expansion of online communication. While numerous studies have explored methods for detecting and detoxifying such content, most have focused primarily on English, leaving low-resource languages underrepresented. Consequently, Large Language Models (LLMs) often struggle to identify and neutralize toxic expressions in these languages. This challenge becomes even more pronounced when users employ obfuscation techniques to evade detection systems. Therefore, we propose KOTOX: Korean Toxic Dataset for deobfuscation and detoxification to address this issue. We categorize various obfuscation approaches based on linguistic characteristics of Korean and define a set of transformation rules grounded in real-world examples. Using these rules, we construct three dataset versions (easy, normal, and hard) representing different levels of obfuscation difficulty. This is the first dataset that simultaneously supports deobfuscation and detoxification for the Korean language. We expect it to facilitate better understanding and mitigation of obfuscated toxic content in LLMs for low-resource languages. Our code and data are available at https://github.com/leeyejin1231/KOTOX.
[100] Judge Before Answer: Can MLLM Discern the False Premise in Question?
Jidong Li,Lingyong Fang,Haodong Zhao,Sufeng Duan,Gongshen Liu
Main category: cs.CL
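TL;DR: Proposes an automated pipeline for constructing a benchmark of false-premise questions, creates the JBA dataset, and proposes an enhancement framework that improves the ability of multimodal large language models to recognize false premises.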
Details
Motivation: Existing benchmarks are limited in fine-grained categorization and coverage, and cannot adequately evaluate models' ability to recognize false premises. Method: Premises are systematically categorized into three main types and thirteen subtypes, the JBA dataset is constructed, and a targeted recognition enhancement framework is proposed. Result: Experiments show that models trained with the new framework significantly outperform current MLLMs at false premise recognition. Conclusion: The study improves the robustness and recognition ability of multimodal large language models with respect to false premises. Abstract: Multimodal large language models (MLLMs) have witnessed astonishing advancements in recent years. Despite these successes, MLLMs remain vulnerable to false premise problems. However, existing benchmarks targeting this issue are limited in scope: they often lack fine-grained categorization, exhibit insufficient coverage, and thus fail to provide a rigorous evaluation of the ability of models to recognize false premises. To bridge this gap, we introduce a fully automated pipeline for constructing a comprehensive benchmark of false premise questions. Our method systematically categorizes the premises into three main types and thirteen subtypes according to the abilities required to identify the premises, resulting in the JBA dataset. Results show current MLLMs still struggle with false premise recognition. Building upon this benchmark, we further propose a recognition enhancement framework tailored to strengthen the robustness of MLLMs to detect false premises. Extensive experiments demonstrate that models trained with our framework achieve significant improvements in false premise recognition.
[101] RV-HATE: Reinforced Multi-Module Voting for Implicit Hate Speech Detection
Yejin Lee,Hyeseon Ahn,Yo-Sub Han
Main category: cs.CL
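TL;DR: This paper proposes RV-HATE, a reinforcement-learning-based multi-module voting framework that adapts the detection process to the characteristics of each hate speech dataset, improving detection accuracy while providing interpretability.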
Details
Motivation: Existing hate speech detection methods typically follow fixed approaches and adapt poorly to datasets from different platforms with diverse linguistic styles and contextual features, resulting in weak detection of implicit hate speech. Method: RV-HATE comprises multiple specialized modules, each focusing on different linguistic or contextual features; reinforcement learning optimizes the weight of each module for a given dataset, and a voting mechanism aggregates the module outputs. Result: RV-HATE outperforms conventional static methods on multiple datasets, particularly for implicit hate speech, while providing interpretable analysis of dataset characteristics. Conclusion: Through dataset-adaptive modular design and reinforcement-learned weighting, RV-HATE improves the accuracy and interpretability of hate speech detection and suits text from diverse sources. Abstract: Hate speech remains prevalent in human society and continues to evolve in its forms and expressions. Modern advances in the internet and online anonymity accelerate its rapid spread and complicate its detection. However, hate speech datasets exhibit diverse characteristics primarily because they are constructed from different sources and platforms, each reflecting different linguistic styles and social contexts. Despite this diversity, prior studies on hate speech detection often rely on fixed methodologies without adapting to data-specific features. We introduce RV-HATE, a detection framework designed to account for the dataset-specific characteristics of each hate speech dataset. RV-HATE consists of multiple specialized modules, where each module focuses on distinct linguistic or contextual features of hate speech. The framework employs reinforcement learning to optimize weights that determine the contribution of each module for a given dataset. A voting mechanism then aggregates the module outputs to produce the final decision. RV-HATE offers two primary advantages: (1) it improves detection accuracy by tailoring the detection process to dataset-specific attributes, and (2) it provides interpretable insights into the distinctive features of each dataset. Consequently, our approach effectively addresses implicit hate speech and achieves superior performance compared to conventional static methods. Our code is available at https://github.com/leeyejin1231/RV-HATE.
[102] Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning
Zhiwen Ruan,Yixia Li,He Zhu,Yun Chen,Peng Li,Yang Liu,Guanhua Chen
Main category: cs.CL
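TL;DR: Proposes Critical Token Fine-tuning (CFT), which fine-tunes only the small set of tokens critical to reasoning, improving both the performance and the generation diversity of large language models on mathematical reasoning tasks.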
Details
Motivation: 标准的监督微调(SFT)对所有令牌施加均匀惩罚,忽略了只有少数关键令牌决定推理正确性,导致输出多样性降低和泛化能力受限。 Method: 通过反事实扰动识别出功能上不可或缺的关键令牌,并仅对这些令牌进行梯度更新,保留非关键令牌的多样性。 Result: 在五个模型、三个模型家族和十一个数学推理基准上的实验表明,CFT在微调不到12%令牌的情况下始终优于标准SFT,提升了生成质量、采样多样性和强化学习初始化效果。 Conclusion: CFT是一种高效且通用的大语言模型微调框架,能够增强推理性能与输出多样性,适用于测试时扩展和后续训练阶段。 Abstract: Large language models (LLMs) primarily rely on supervised fine-tuning (SFT) as a key method to adapt pre-trained models to domain-specific tasks such as mathematical reasoning. However, standard SFT uniformly penalizes all tokens, neglecting that only a small subset of critical tokens determines reasoning correctness. This uniform supervision often causes reduced output diversity and limited generalization. We propose Critical Token Fine-tuning (CFT), a simple yet effective approach that updates only tokens identified as functionally indispensable via counterfactual perturbations. By focusing gradient signals on these decisive reasoning steps while preserving the diversity of non-critical tokens, CFT can enhance both generation and diversity. Extensive experiments on five models across three families (Qwen, OLMo, LLaMA) and eleven mathematical reasoning benchmarks show that CFT, despite fine-tuning on less than 12% of tokens, consistently outperforms standard SFT. Moreover, CFT enables test-time scaling through improved sampling diversity and provides a stronger initialization for reinforcement learning, sustaining performance gains in later training stages while maintaining higher entropy for better exploration. These results highlight CFT as a practical and general framework for efficient and robust LLM fine-tuning.[103] DeepResearchGuard: Deep Research with Open-Domain Evaluation and Multi-Stage Guardrails for Safety
Wei-Chieh Huang,Henry Peng Zou,Yaozu Wu,Dongyuan Li,Yankai Chen,Weizhi Zhang,Yangning Li,Angelo Zangari,Jizhou Guo,Chunyu Miao,Liancheng Fang,Langzhou He,Renhe Jiang,Philip S. Yu
Main category: cs.CL
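TL;DR: This paper presents DEEPRESEARCHGUARD, a deep research framework with four-stage safeguards that markedly improves report quality and safety through open-domain evaluation and stage-wise protections.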
Details
Motivation: 现有深度研究框架在评估和阶段保护方面存在不足,忽视了报告可信度、连贯性、广度、深度和安全性等关键因素,可能导致有害内容被纳入最终报告。 Method: 设计了一个包含输入、计划、研究和报告四阶段防护的综合框架,并提出DRSAFEBENCH作为深研安全评测基准,对多种主流大模型进行多维度评估。 Result: DEEPRESEARCHGUARD平均防御成功率提升18.16%,过度拒绝率降低6%;输入防护有效过滤早期风险,计划与研究防护增强了引用规范性和来源可信度。 Conclusion: 该框架实现了有效的分阶段防御和全面的开放域评估,能在不增加过度拒绝的情况下系统性提升报告质量和安全性。 Abstract: Deep research frameworks have shown promising capabilities in synthesizing comprehensive reports from web sources. While deep research possesses significant potential to address complex issues through planning and research cycles, existing frameworks are deficient in sufficient evaluation procedures and stage-specific protections. They typically treat evaluation as exact match accuracy of question-answering, but overlook crucial aspects of report quality such as credibility, coherence, breadth, depth, and safety. This oversight may result in hazardous or malicious sources being integrated into the final report. To address these issues, we introduce DEEPRESEARCHGUARD, a comprehensive framework featuring four-stage safeguards with open-domain evaluation of references and reports. We assess performance across multiple metrics, e.g., defense success rate and over-refusal rate, and five key report dimensions. In the absence of a suitable safety benchmark, we introduce DRSAFEBENCH, a stage-wise benchmark for deep research safety. Our evaluation spans diverse state-of-the-art LLMs, including GPT-4o, Gemini-2.5-flash, DeepSeek-v3, and o4-mini. DEEPRESEARCHGUARD achieves an average defense success rate improvement of 18.16% while reducing over-refusal rate by 6%. The input guard provides the most substantial early-stage protection by filtering out obvious risks, while the plan and research guards enhance citation discipline and source credibility. 
Through extensive experiments, we show that DEEPRESEARCHGUARD enables comprehensive open-domain evaluation and stage-aware defenses that effectively block harmful content propagation, while systematically improving report quality without excessive over-refusal rates. The code can be found via https://github.com/Jasonya/DeepResearchGuard.[104] ABLEIST: Intersectional Disability Bias in LLM-Generated Hiring Scenarios
Mahika Phutane,Hayoung Jung,Matthew Kim,Tanushree Mitra,Aditya Vashistha
Main category: cs.CL
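TL;DR: This study audits six large language models across 2,820 hiring scenarios and introduces the ABLEIST metrics to assess bias against people with disabilities and its intersection with gender, caste, and other forms of discrimination, finding that existing models fall seriously short in detecting such intersectional harms.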
Details
Motivation: 现有研究多关注西方背景下的身份歧视,忽视全球南方地区残障人士因性别、种姓等交叉边缘化所面临的独特偏见,亟需更全面的评估框架。 Method: 构建包含残障、性别、国籍和种姓多样性的招聘情境数据集,提出ABLEIST(含五个能力主义特有和三个交叉性伤害指标)评估体系,对六个大语言模型进行系统审计。 Result: 发现大语言模型对残障求职者存在显著的ABLEIST伤害,且在性别和种姓交叉边缘化群体中此类伤害加剧,尤其是象征性包容(Tokenism)等现象更为突出,而当前前沿模型的安全检测工具未能有效识别这些偏差。 Conclusion: 强调在高风险应用场景如招聘中,必须引入交叉性视角的安全评估机制,以揭示并缓解大语言模型中的系统性偏见。 Abstract: Large language models (LLMs) are increasingly under scrutiny for perpetuating identity-based discrimination in high-stakes domains such as hiring, particularly against people with disabilities (PwD). However, existing research remains largely Western-centric, overlooking how intersecting forms of marginalization--such as gender and caste--shape experiences of PwD in the Global South. We conduct a comprehensive audit of six LLMs across 2,820 hiring scenarios spanning diverse disability, gender, nationality, and caste profiles. To capture subtle intersectional harms and biases, we introduce ABLEIST (Ableism, Inspiration, Superhumanization, and Tokenism), a set of five ableism-specific and three intersectional harm metrics grounded in disability studies literature. Our results reveal significant increases in ABLEIST harms towards disabled candidates--harms that many state-of-the-art models failed to detect. These harms were further amplified by sharp increases in intersectional harms (e.g., Tokenism) for gender and caste-marginalized disabled candidates, highlighting critical blind spots in current safety tools and the need for intersectional safety evaluations of frontier models in high-stakes domains like hiring.[105] DND: Boosting Large Language Models with Dynamic Nested Depth
Tieyuan Chen,Xiaodong Chen,Haoxing Chen,Zhenzhong Lan,Weiyao Lin,Jianguo Li
Main category: cs.CL
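TL;DR: Proposes Dynamic Nested Depth (DND), a new method that improves off-the-shelf large language models by dynamically selecting critical tokens in Transformer layers for nested reprocessing.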
Details
Motivation: To improve the efficiency and accuracy of pre-trained language models during inference without significantly increasing computational overhead, especially through finer-grained processing of difficult tokens. Method: A router at the end of a Transformer layer identifies critical tokens and feeds them back into the same layer for additional processing; a router controlling loss and a threshold control scheme make the selection dynamic and stable. Result: Performance improves by 1.88% on Qwen3-1.7B (dense) and by 0.87% on Qwen3-30B-A3B (MoE), with only a minimal increase in parameters and computation. Conclusion: DND is an efficient, plug-and-play method that strengthens the expressive capacity of off-the-shelf large models and is broadly applicable and practical. Abstract: We introduce Dynamic Nested Depth (DND), a novel method that improves performance for off-the-shelf LLMs by selecting critical tokens to reprocess in a nested depth manner. Specifically, at the end of the given transformer layer, DND identifies more critical tokens with a router and feeds them back for an extra round of processing, effectively "reviewing" difficult tokens while avoiding redundant computation for easier ones. The dynamic selection mechanism is tailored for precise control via two novel strategies: a router controlling loss to enhance token selection distinguishability, and a threshold control scheme to ensure selection stability. We demonstrate the effectiveness of DND by directly integrating it into pre-trained dense and MoE models during a post-training phase. On diverse benchmarks, this approach boosts the performance of the dense Qwen3-1.7B by 1.88% and the MoE Qwen3-30B-A3B by 0.87%, all with a minimal increase in parameters and computation.
[106] LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models
Yiwei Liu,Yucheng Li,Xiao Li,Gong Cheng
Main category: cs.CL
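TL;DR: This paper presents LogiNumSynth, a natural language problem synthesizer that generates tasks requiring joint logical and numerical reasoning, supports fine-grained control over task complexity, and can be used to evaluate and improve large language models' joint reasoning.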
Details
Motivation: 现有数据集依赖固定的规则集,难以灵活控制任务复杂度,限制了在联合逻辑-数值推理任务上的评估与训练泛化能力。 Method: 设计并实现了一个名为LogiNumSynth的合成器,能够生成包含丰富逻辑推理深度和数值计算复杂度的自然语言问题,支持对推理世界丰富性、逻辑深度和数值复杂度的精细调控。 Result: 实验证明多个大语言模型在逻辑-数值联合推理上存在明显弱点,LogiNumSynth不仅能有效评估模型的推理过程与答案准确性,还可用于针对性训练以提升模型性能。 Conclusion: LogiNumSynth是一个灵活且可控的联合推理任务生成框架,可作为诊断工具和训练数据源,推动语言模型在逻辑与数值综合推理能力上的发展。 Abstract: Joint logical-numerical reasoning remains a major challenge for language models, yet existing datasets rely on fixed rule sets and offer limited control over task complexity, constraining their generalizability for evaluation and training. We present LogiNumSynth, a flexible natural language problem synthesizer that synthesizes tasks requiring proficiency in joint logical reasoning (e.g., rule-based reasoning) and numerical reasoning (e.g., arithmetic computation). LogiNumSynth supports fine-grained control over reasoning world richness, logical reasoning depth, and the complexity of numerical computations, enabling flexible data synthesis across difficulty levels. We demonstrate three key contributions: (1) Synthesizer -- synthesizing fully controllable joint reasoning tasks over natural language; (2) Evaluation & Process Analysis -- evaluating both process accuracy and answer accuracy; (3) Targeted Training -- using synthesized data to enhance LLMs' reasoning performance. Experiments with multiple LLMs highlight persistent weaknesses in logical-numerical reasoning, showing that LogiNumSynth can serve as both a diagnostic tool and a source of targeted supervision for advancing integrated reasoning skills.[107] Enabling Doctor-Centric Medical AI with LLMs through Workflow-Aligned Tasks and Benchmarks
Wenya Xie,Qingying Xiao,Yu Zheng,Xidong Wang,Junying Chen,Ke Ji,Anningzhe Gao,Prayag Tiwari,Xiang Wan,Feng Jiang,Benyou Wang
Main category: cs.CL
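TL;DR: Proposes repositioning large language models (LLMs) as assistants to clinicians and constructs DoctorFLAN, a large-scale Chinese medical dataset, to improve LLM performance in physician workflows.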
Details
Motivation: Deploying large language models directly to patients poses safety risks because they lack sufficient medical expertise, so clinical assistant models aimed at doctors are needed. Method: A two-stage inspiration-feedback survey identifies clinical needs; DoctorFLAN, a Chinese medical dataset of 92,000 Q&A instances, is constructed; and two evaluation benchmarks, DoctorFLAN-test and DotaBench, are introduced. Result: Experiments on more than ten popular LLMs show that DoctorFLAN notably improves open-source models in medical settings, especially on doctor-facing tasks. Conclusion: DoctorFLAN provides effective training data and an evaluation framework for doctor-facing medical LLMs, advancing doctor-centered medical AI. Abstract: The rise of large language models (LLMs) has transformed healthcare by offering clinical guidance, yet their direct deployment to patients poses safety risks due to limited domain expertise. To mitigate this, we propose repositioning LLMs as clinical assistants that collaborate with experienced physicians rather than interacting with patients directly. We conduct a two-stage inspiration-feedback survey to identify real-world needs in clinical workflows. Guided by this, we construct DoctorFLAN, a large-scale Chinese medical dataset comprising 92,000 Q&A instances across 22 clinical tasks and 27 specialties. To evaluate model performance in doctor-facing applications, we introduce DoctorFLAN-test (550 single-turn Q&A items) and DotaBench (74 multi-turn conversations). Experimental results with over ten popular LLMs demonstrate that DoctorFLAN notably improves the performance of open-source LLMs in medical contexts, facilitating their alignment with physician workflows and complementing existing patient-oriented models. This work contributes a valuable resource and framework for advancing doctor-centered medical LLM development.
[108] Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States
Qinglin Zhu,Yizhen Yao,Runcong Zhao,Yanzheng Xiang,Amrutha Saseendran,Chen Jin,Philip Alexander Teare,Bin Liang,Yulan He,Lin Gui
Main category: cs.CL
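TL;DR: Proposes Latent Refinement Decoding (LRD), a two-stage parallel decoding framework that keeps undecided positions as distributional mixtures and uses a predictive feedback loop, reducing information loss and premature commitment while improving generation accuracy and decoding speed.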
Details
Motivation: Autoregressive models suffer from high latency because of sequential decoding, while emerging diffusion-style parallel generation methods suffer from information loss and premature commitment, limiting generation quality and efficiency. Method: LRD has two stages. The first represents masked positions as distributional mixtures of predicted tokens and the mask embedding to establish more globally consistent predictions; the second uses a predictive feedback loop to progressively finalize high-confidence tokens while keeping uncertain ones for iterative refinement, with KL divergence used to decide convergence and early stopping dynamically. Result: Accuracy improves on coding (HumanEval +6.3, MBPP +2.6) and reasoning tasks (GSM8K +2.9, MATH500 +3.8), with speedups of up to 10.6x. Conclusion: LRD balances accuracy and efficiency in parallel generation, overcomes the information loss and premature commitment of existing methods, and is a strong, general approach to parallel sequence generation. Abstract: Autoregressive (AR) models remain the standard for natural language generation but still suffer from high latency due to strictly sequential decoding. Recent diffusion-inspired approaches, such as LLaDA and Dream, mitigate this by generating in parallel, yet they suffer from two core limitations: information loss, as predictive distributions for non-finalized tokens are discarded at each step, and premature commitment, where local decisions are made without sufficient global coordination. We introduce Latent Refinement Decoding (LRD), a two-stage framework with Latent Refinement and a Predictive Feedback Loop. The first stage maintains masked positions as distributional mixtures of predicted tokens and the mask embedding, allowing the model to establish more globally consistent beliefs. The second stage progressively finalizes confident tokens while retaining uncertain ones for iterative feedback. KL-divergence dynamics provide a principled and reliable criterion for convergence and early stopping. Experiments across coding (HumanEval +6.3, MBPP +2.6) and reasoning (GSM8K +2.9, MATH500 +3.8) show that LRD improves accuracy while delivering speedups of up to 10.6x, making it a strong and versatile alternative for parallel sequence generation.
[109] Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization
Junjie Lu,Yuliang Liu,Chaofeng Qu,Wei Shen,Zhouhan Lin,Min Xu
Main category: cs.CL
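TL;DR: Proposes Confidence-Guided Reasoning Path Preference Optimization (CGPO), which uses the model's own non-human-like reasoning paths and intervenes at the lowest-confidence point to reduce trajectory drift; on code and mathematical reasoning tasks it outperforms methods that rely on strong-model or human annotation.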
Details
Motivation: Existing methods for strengthening LLM reasoning carry a training bias toward human-like reasoning paths and depend on human or strong-model annotation of intermediate steps, which limits exploration of non-human-like reasoning paths. The authors also observe that a model's first error often occurs after its lowest-confidence point, so intervening at that point, before the error, can give more accurate supervision. Method: CGPO uses a confidence signal to identify the point of maximal uncertainty in the reasoning process and applies self-generated, non-human-like reasoning-path guidance there to optimize the reasoning trajectory. Result: Experiments across several models and tasks (code and mathematical reasoning) show that, with the same amount of training data, CGPO data generated by a small model outperforms strong-model or human-annotated data in most cases. Conclusion: Through confidence guidance and self-generated reasoning paths, CGPO mitigates the trajectory drift of conventional preference optimization, improves LLM reasoning performance, and reduces dependence on high-quality external annotation. Abstract: Current approaches for strengthening LLM reasoning tend to introduce a training bias toward human-like reasoning trajectories. In step-wise preference optimization, in particular, dependence on human or higher-capacity model annotations for intermediate steps limits exploration of alternative, non-human-like reasoning paths and thus constrains achievable performance. Furthermore, through a small-scale pilot study, we observed that in approximately 75% of cases, the model's first erroneous step occurs after the lowest-confidence point. This suggests that guiding the model at its lowest-confidence point before an error provides more accurate supervision than locating the first explicit error. In this paper, we propose Confidence-Guided Reasoning Path Preference Optimization (CGPO), a method that leverages a confidence signal to identify points of maximal uncertainty in the model's reasoning process and applies self-generated, non-human-like reasoning-path guidance to mitigate trajectory drift. Our experiments span diverse models applied to both code and mathematical reasoning tasks. The results show that, with the same amount of training data, our method using data generated by a small model can achieve better performance in most cases compared with approaches using data generated by a strong model or human-annotated data.
[110] TypePilot: Leveraging the Scala Type System for Secure LLM-generated Code
Alexander Sternfeld,Andrei Kucharavy,Ljiljana Dolamic
Main category: cs.CL
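TL;DR: Proposes TypePilot, an agentic AI framework that improves the security of LLM-generated code by using strongly typed, verifiable languages (with Scala as the example). Experiments show that it substantially reduces input validation and injection vulnerabilities in both formal verification and general-purpose secure code generation.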
Details
Motivation: LLMs excel at code generation, but their output often contains critical security vulnerabilities, which limits their use in high-security settings; a method is needed that strengthens the security and robustness of generated code. Method: TypePilot uses the type system of a strongly typed language together with formal verification tools (such as Stainless) to build a type-focused agentic AI workflow that guides the LLM toward safer, verifiable code. Result: Compared with direct code generation or simply prompting for secure code, TypePilot substantially lowers the incidence of input validation errors and injection vulnerabilities. Conclusion: Type-guided, structured LLM workflows such as TypePilot have great potential to improve the trustworthiness of automated code generation in high-assurance domains. Abstract: Large language models (LLMs) have shown remarkable proficiency in code generation tasks across various programming languages. However, their outputs often contain subtle but critical vulnerabilities, posing significant risks when deployed in security-sensitive or mission-critical systems. This paper introduces TypePilot, an agentic AI framework designed to enhance the security and robustness of LLM-generated code by leveraging strongly typed and verifiable languages, using Scala as a representative example. We evaluate the effectiveness of our approach in two settings: formal verification with the Stainless framework and general-purpose secure code generation. Our experiments with leading open-source LLMs reveal that while direct code generation often fails to enforce safety constraints, as does naive prompting for more secure code, our type-focused agentic pipeline substantially mitigates input validation and injection vulnerabilities. The results demonstrate the potential of structured, type-guided LLM workflows to advance the state of the art in the trustworthiness of automated code generation in high-assurance domains.
[111] One Size Does Not Fit All: Exploring Variable Thresholds for Distance-Based Multi-Label Text Classification
Jens Van Nooten,Andriy Kosar,Guy De Pauw,Walter Daelemans
Main category: cs.CL
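TL;DR: Studies distance-based multi-label text classification and proposes a new label-specific threshold optimization method that substantially improves classification performance across several datasets and performs well even with few labeled examples.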
Details
Motivation: Existing distance-based text classification methods lack an effective thresholding strategy for the multi-label setting, which limits classification quality, so better threshold optimization methods are needed. Method: Through experiments on several realistic multi-label text classification datasets, the authors analyze how semantic similarity distributions differ across models, datasets, and label sets, and propose a label-specific threshold optimization method based on a validation set. Result: Similarity distributions differ significantly across models, datasets, and label sets; the proposed label-specific thresholding improves on normalized 0.5 thresholding by 46% on average and on previous uniform thresholding approaches by 14%. Conclusion: Label-specific threshold optimization effectively improves distance-based multi-label text classification and is adaptable and practical, particularly when label sets change over time or labeled data is scarce. Abstract: Distance-based unsupervised text classification is a method within text classification that leverages the semantic similarity between a label and a text to determine label relevance. This method provides numerous benefits, including fast inference and adaptability to expanding label sets, as opposed to zero-shot, few-shot, and fine-tuned neural networks that require re-training in such cases. In multi-label distance-based classification and information retrieval algorithms, thresholds are required to determine whether a text instance is "similar" to a label or query. Similarity between a text and label is determined in a dense embedding space, usually generated by state-of-the-art sentence encoders. Multi-label classification complicates matters, as a text instance can have multiple true labels, unlike in multi-class or binary classification, where each instance is assigned only one label. We expand upon previous literature on this underexplored topic by thoroughly examining and evaluating the ability of sentence encoders to perform distance-based classification. First, we perform an exploratory study to verify whether the semantic relationships between texts and labels vary across models, datasets, and label sets by conducting experiments on a diverse collection of realistic multi-label text classification (MLTC) datasets. We find that similarity distributions show statistically significant differences across models, datasets and even label sets. We propose a novel method for optimizing label-specific thresholds using a validation set.
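The idea of label-specific thresholds fitted on a validation set can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes text-label similarities are already normalized to [0, 1], that per-label F1 on the validation set is the selection criterion, and that a fixed grid search is adequate. The names `fit_label_thresholds` and `predict` are hypothetical.

```python
from typing import List, Optional, Sequence


def f1(pred: Sequence[bool], gold: Sequence[bool]) -> float:
    """Binary F1 for one label; returns 0.0 when there are no positives at all."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum((not p) and g for p, g in zip(pred, gold))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fit_label_thresholds(
    sims: List[List[float]],
    gold: List[List[bool]],
    grid: Optional[List[float]] = None,
) -> List[float]:
    """Pick one threshold per label by maximizing validation F1 for that label.

    sims[i][j] is the similarity between validation text i and label j
    (assumed normalized to [0, 1]); gold[i][j] says whether label j applies.
    """
    if grid is None:
        grid = [k / 20 for k in range(21)]  # 0.00, 0.05, ..., 1.00
    thresholds = []
    for j in range(len(sims[0])):
        col_sims = [row[j] for row in sims]
        col_gold = [row[j] for row in gold]
        # max() keeps the first (lowest) threshold among ties.
        best = max(grid, key=lambda t: f1([s >= t for s in col_sims], col_gold))
        thresholds.append(best)
    return thresholds


def predict(sim_row: Sequence[float], thresholds: Sequence[float]) -> List[bool]:
    """Assign every label whose similarity reaches its own threshold."""
    return [s >= t for s, t in zip(sim_row, thresholds)]


# Toy validation set: label 1 has systematically lower similarities than label 0.
val_sims = [[0.9, 0.3], [0.8, 0.2], [0.4, 0.35], [0.3, 0.1]]
val_gold = [[True, True], [True, False], [False, True], [False, False]]
thresholds = fit_label_thresholds(val_sims, val_gold)
```

In this toy data a uniform 0.5 cutoff would never assign the second label, whereas the fitted per-label thresholds separate both labels on the validation examples.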
Our label-specific thresholding method achieves an average improvement of 46% over normalized 0.5 thresholding and outperforms uniform thresholding approaches from previous work by an average of 14%. Additionally, the method demonstrates strong performance even with limited labeled examples.
[112] Bridging Gaps in Hate Speech Detection: Meta-Collections and Benchmarks for Low-Resource Iberian Languages
Paloma Piot,José Ramom Pichel Campos,Javier Parapar
Main category: cs.CL
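TL;DR: By integrating and standardizing existing resources, the authors build a meta-collection of hate speech datasets for European Spanish and translate it into European Portuguese and two Galician variants, creating aligned multilingual corpora and new benchmarks for hate speech detection in Iberian languages.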
Details
Motivation: Hate speech detection research focuses mainly on English; low-resource languages lack sufficient data and benchmarks, and variation within a language is often ignored, which makes it hard to apply large models effectively. Method: The authors systematically analyze and integrate existing datasets with unified labels and metadata; translate the collection to produce aligned corpora in European Portuguese and two Galician variants; evaluate large language models in zero-shot, few-shot, and fine-tuning settings; and run a cross-lingual analysis. Result: A multilingual hate speech detection benchmark for Iberian languages is established; experiments support the value of multilingual and variety-aware approaches; baseline results and cross-lingual performance comparisons are provided for future research. Conclusion: Multilingual and variety-aware approaches are essential for improving hate speech detection in low-resource languages, and this work provides an important data foundation and evaluation framework for detection in European minority languages. Abstract: Hate speech poses a serious threat to social cohesion and individual well-being, particularly on social media, where it spreads rapidly. While research on hate speech detection has progressed, it remains largely focused on English, resulting in limited resources and benchmarks for low-resource languages. Moreover, many of these languages have multiple linguistic varieties, a factor often overlooked in current approaches. At the same time, large language models require substantial amounts of data to perform reliably, a requirement that low-resource languages often cannot meet. In this work, we address these gaps by compiling a meta-collection of hate speech datasets for European Spanish, standardised with unified labels and metadata. This collection is based on a systematic analysis and integration of existing resources, aiming to bridge the data gap and support more consistent and scalable hate speech detection. We extended this collection by translating it into European Portuguese and into a Galician standard that is more convergent with Spanish and another Galician variant that is more convergent with Portuguese, creating aligned multilingual corpora. Using these resources, we establish new benchmarks for hate speech detection in Iberian languages. We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, providing baseline results for future research. Moreover, we perform a cross-lingual analysis with our target languages.
Our findings underscore the importance of multilingual and variety-aware approaches in hate speech detection and offer a foundation for improved benchmarking in underrepresented European languages.
[113] Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations
Johannes Moll,Markus Graf,Tristan Lemke,Nicolas Lenhart,Daniel Truhn,Jean-Benoit Delbrouck,Jiazhen Pan,Daniel Rueckert,Lisa C. Adams,Keno K. Bressem
Main category: cs.CL
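TL;DR: Proposes a clinically grounded framework for evaluating the faithfulness of chain-of-thought explanations from vision-language models in chest X-ray visual question answering, and finds that answer accuracy and explanation quality are decoupled.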
Details
Motivation: Existing evaluations rarely catch plausible-sounding but unfaithful reasoning produced by vision-language models in clinical settings, which undermines trust. Method: Using controlled modifications to text and images, the framework assesses the faithfulness of model explanations along three axes (clinical fidelity, causal attribution, and confidence calibration) and includes a reader study with radiologists. Result: Evaluator-radiologist correlations fall within the inter-radiologist range on every axis, with strong alignment for attribution (Kendall's τ_b=0.670), moderate for fidelity (τ_b=0.387), and weak for confidence (τ_b=0.091); benchmarking six VLMs shows that answer accuracy and explanation quality are decoupled, and proprietary models do better on attribution and fidelity. Conclusion: Final answer accuracy alone is not enough to ensure clinical trustworthiness; explanation faithfulness must also be assessed, which points to potential deployment risks. Abstract: Vision-language models (VLMs) often produce chain-of-thought (CoT) explanations that sound plausible yet fail to reflect the underlying decision process, undermining trust in high-stakes clinical use. Existing evaluations rarely catch this misalignment, prioritizing answer accuracy or adherence to formats. We present a clinically grounded framework for chest X-ray visual question answering (VQA) that probes CoT faithfulness via controlled text and image modifications across three axes: clinical fidelity, causal attribution, and confidence calibration. In a reader study (n=4), evaluator-radiologist correlations fall within the observed inter-radiologist range for all axes, with strong alignment for attribution (Kendall's $\tau_b=0.670$), moderate alignment for fidelity ($\tau_b=0.387$), and weak alignment for confidence tone ($\tau_b=0.091$), which we report with caution. Benchmarking six VLMs shows that answer accuracy and explanation quality are decoupled, acknowledging injected cues does not ensure grounding, and text cues shift explanations more than visual cues. While some open-source models match final answer accuracy, proprietary models score higher on attribution (25.0% vs. 1.4%) and often on fidelity (36.1% vs. 31.7%), highlighting deployment risks and the need to evaluate beyond final answer accuracy.
[114] Discursive Circuits: How Do Language Models Understand Discourse Relations?
Yisong Miao,Min-Yen Kan
Main category: cs.CL
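TL;DR: Introduces the CuDR task and a contrastive corpus for discovering the sparse computational graphs ("discursive circuits") that govern how language models process discourse relations. Experiments show that a very small fraction of the model is enough to recover discourse understanding and that the circuits generalize across discourse frameworks.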
Details
Motivation: To identify which components of transformer language models are responsible for discourse understanding, in particular how complex, long-range discourse relations are processed. Method: The authors propose the Completion under Discourse Relation (CuDR) task and a PDTB-based corpus of minimal contrastive pairs, and combine them with activation patching for circuit discovery, analyzing the model's internal mechanisms through sparse circuits. Result: Sparse circuits covering only about 0.2% of a full GPT-2 model recover discourse understanding; the circuits generalize well to unseen frameworks such as RST and SDRT; lower layers capture linguistic features such as lexical semantics and coreference, while upper layers encode discourse-level abstractions; feature utility is consistent across frameworks. Conclusion: Discourse understanding is implemented by specific sparse circuits (discursive circuits) that are organized in layers and support cross-framework generalization, revealing the internal mechanism by which models handle complex discourse relations. Abstract: Which components in transformer language models are responsible for discourse understanding? We hypothesize that sparse computational graphs, termed discursive circuits, control how models process discourse relations. Unlike simpler tasks, discourse relations involve longer spans and complex reasoning. To make circuit discovery feasible, we introduce a task called Completion under Discourse Relation (CuDR), where a model completes a discourse given a specified relation. To support this task, we construct a corpus of minimal contrastive pairs tailored for activation patching in circuit discovery. Experiments show that sparse circuits ($\approx 0.2\%$ of a full GPT-2 model) recover discourse understanding in the English PDTB-based CuDR task. These circuits generalize well to unseen discourse frameworks such as RST and SDRT. Further analysis shows lower layers capture linguistic features such as lexical semantics and coreference, while upper layers encode discourse-level abstractions. Feature utility is consistent across frameworks (e.g., coreference supports Expansion-like relations).
[115] Domain-Specific Data Generation Framework for RAG Adaptation
Chris Xing Tian,Weihao Xie,Zhen Chen,Zhengyuan Yi,Hui Liu,Haoliang Li,Shiqi Wang,Siwei Ma
Main category: cs.CL
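TL;DR: RAGen is a scalable, modular framework that generates domain-grounded question-answer-context triples for adapting retrieval-augmented generation (RAG) systems to specific domains.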
Details
Motivation: Existing RAG systems lack sufficient context-rich training data when adapted to specific domains, so more specialized methods are needed to improve performance. Method: RAGen combines semantic chunking, hierarchical concept extraction, and multi-chunk retrieval with question generation principles inspired by Bloom's Taxonomy to produce diverse questions paired with precise answers and relevant contexts. Result: RAGen handles large, evolving document corpora efficiently, supports optimization of multiple RAG components, and introduces distractor contexts to make reasoning more robust. Conclusion: RAGen offers an efficient, scalable data generation approach for RAG adaptation in dynamic domains such as scientific research and enterprise knowledge. Abstract: Retrieval-Augmented Generation (RAG) combines the language understanding and reasoning power of large language models (LLMs) with external retrieval to enable domain-grounded responses. Effectively adapting RAG systems to domain-specific settings requires specialized, context-rich training data beyond general-purpose question-answering. Here, we propose RAGen, a scalable and modular framework for generating domain-grounded question-answer-context (QAC) triples tailored to diverse RAG adaptation approaches. RAGen produces these QAC triples by identifying key concepts in documents, generating diverse questions guided by Bloom's Taxonomy-inspired principles, and pairing them with precise answers extracted from relevant contexts. RAGen supports multiple RAG adaptation strategies, including the optimization of key components such as the LLM, retriever, and embedding model. Its modular pipeline features semantic chunking, hierarchical concept extraction, and multi-chunk retrieval, along with the introduction of curated distractor contexts to promote robust reasoning. Designed for scalability, RAGen efficiently handles large and evolving document corpora without redundant processing, making it especially suitable for dynamically evolving domains such as scientific research and enterprise knowledge bases.
[116] The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers
Saad Obaid ul Islam,Anne Lauscher,Goran Glavaš
Main category: cs.CL
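TL;DR: Introduces SLAQ, an evaluation framework that measures how consistently large language models answer the same factual questions in simple versus complex queries. It finds systematic answer inconsistency across task complexity, along with position-dependent and momentum effects, challenging the evaluation assumption that good performance on simple queries implies reliability on complex tasks.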
Details
Motivation: Large language models perform well on simple factual question answering but can be inconsistent on complex tasks; this reliability gap is poorly understood and undermines trust in the models. Method: The authors propose the Short-Long Form Alignment for Factual Question Answering (SLAQ) framework, which compares the answers of 16 LLMs to 600 factual questions asked in isolation (short form) and embedded in complex queries (long form), and they carry out a mechanistic analysis. Result: Models show systematic answer misalignment between short and long queries, with position-dependent accuracy loss and momentum effects; the mechanistic analysis shows that aligned facts activate overlapping model internals, and metrics based on mechanistic similarity predict short-long answer alignment with up to 78% accuracy. Conclusion: Factual consistency is an important dimension of LLM trustworthiness, and evaluation that relies only on simple question answering is flawed because it ignores the effect of task complexity. Abstract: Large language models (LLMs) can correctly answer "When was Einstein born?" yet fail to provide the same date when writing about Einstein's life, revealing a fundamental inconsistency in how models access factual knowledge across task complexities. While models display impressive accuracy on factual question-answering benchmarks, the reliability gap between simple and complex queries remains poorly understood, eroding their trustworthiness. In this work, we introduce Short-Long Form Alignment for Factual Question Answering (SLAQ), a controlled evaluation framework that compares LLMs' answers to the same factual questions asked (a) in isolation (short) vs. (b) integrated into complex queries (long). Looking at 16 LLMs across 600 queries, we find a systematic misalignment of answers to the corresponding short and long queries. We further uncover position-dependent accuracy loss and momentum effects where consecutive correct or incorrect answers create self-reinforcing patterns. Through mechanistic analysis, we find that aligned facts activate overlapping model internals, and that metrics based on mechanistic similarity can predict short-long answer alignment with up to 78% accuracy. Our work establishes factual consistency over query complexity as an important aspect of LLMs' trustworthiness and challenges current evaluation practices, which implicitly assume that good performance for simple factual queries implies reliability in more complex knowledge-seeking tasks too.
[117] WebRouter: Query-specific Router via Variational Information Bottleneck for Cost-sensitive Web Agent
Tao Li,Jinlong Hu,Yang Wang,Junfeng Liu,Xuejun Liu
Main category: cs.CL
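TL;DR: Proposes WebRouter, a query-specific router built from an information-theoretic perspective. Using a cost-aware Variational Information Bottleneck (ca-VIB) objective, it sharply reduces operational cost while keeping accuracy high for web agents on real-world websites.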
Details
Motivation: LLM-driven web agents are powerful tools for web automation but face a trade-off between cost and performance, made worse by complex prompts that degrade LLM ensemble performance. Method: WebRouter uses a cost-aware Variational Information Bottleneck (ca-VIB) objective that learns a compressed representation of the input prompt while explicitly penalizing the expected operational cost. Result: On five real-world websites from the WebVoyager benchmark, WebRouter reduces operational cost by 87.8% compared with a GPT-4o baseline, with only a 3.8% drop in accuracy. Conclusion: WebRouter balances cost and performance for web agents; information compression and cost-aware optimization markedly improve efficiency in practical use. Abstract: LLM-brained web agents offer powerful capabilities for web automation but face a critical cost-performance trade-off. The challenge is amplified by web agents' inherently complex prompts that include goals, action histories, and environmental states, leading to degraded LLM ensemble performance. To address this, we introduce WebRouter, a novel query-specific router trained from an information-theoretic perspective. Our core contribution is a cost-aware Variational Information Bottleneck (ca-VIB) objective, which learns a compressed representation of the input prompt while explicitly penalizing the expected operational cost. Experiments on five real-world websites from the WebVoyager benchmark show that WebRouter reduces operational costs by a striking 87.8% compared to a GPT-4o baseline, while incurring only a 3.8% accuracy drop.
cs.CV [Back]
[118] TinyViT-Batten: Few-Shot Vision Transformer with Explainable Attention for Early Batten-Disease Detection on Pediatric MRI
Khartik Uppalapati,Bora Yimenicioglu,Shakeel Abdulkareem,Adan Eftekhari,Bhavya Uppalapati,Viraj Kamath
Main category: cs.CV
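TL;DR: Proposes TinyViT-Batten, a few-shot Vision Transformer for detecting early Batten disease from pediatric brain MRI, achieving high accuracy and explainability with very few training cases.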
Details
Motivation: Batten disease is a rare pediatric neurodegenerative disorder whose early MRI signs are subtle and easily missed, and existing methods struggle to detect it from a limited number of cases. Method: A large ViT is compressed by knowledge distillation into a 5M-parameter TinyViT, which is fine-tuned with metric-based few-shot learning (prototypical loss, 5-shot) and combined with Grad-CAM for explainable predictions. Result: On a multi-site dataset of 79 genetically confirmed Batten disease patients and 90 age-matched controls, the model reaches about 91% accuracy, an area under the ROC curve of at least 0.95, sensitivity above 90%, and specificity of about 90%, outperforming 3D-ResNet and Swin-Tiny baselines. Conclusion: TinyViT-Batten performs well with few samples, is practical and explainable, and could become an effective AI aid for early diagnosis of Batten disease. Abstract: Batten disease (neuronal ceroid lipofuscinosis) is a rare pediatric neurodegenerative disorder whose early MRI signs are subtle and often missed. We propose TinyViT-Batten, a few-shot Vision Transformer (ViT) framework to detect early Batten disease from pediatric brain MRI with limited training cases. We distill a large teacher ViT into a 5M-parameter TinyViT and fine-tune it using metric-based few-shot learning (prototypical loss with 5-shot episodes). Our model achieves high accuracy (approximately 91%) and area under ROC of at least 0.95 on a multi-site dataset of 79 genetically confirmed Batten-disease MRIs (27 CLN3 from the Hochstein natural-history study, 32 CLN2 from an international longitudinal cohort, 12 early-manifestation CLN2 cases reported by Cokal et al., and 8 public Radiopaedia scans) together with 90 age-matched controls, outperforming 3D-ResNet and Swin-Tiny baselines. We further integrate Gradient-weighted Class Activation Mapping (Grad-CAM) to highlight disease-relevant brain regions, enabling explainable predictions. The model's small size and strong performance (sensitivity greater than 90%, specificity approximately 90%) demonstrate a practical AI solution for early Batten disease detection.
[119] Ultralytics YOLO Evolution: An Overview of YOLO26, YOLO11, YOLOv8 and YOLOv5 Object Detectors for Computer Vision and Pattern Recognition
Ranjan Sapkota,Manoj Karkee
Main category: cs.CV
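TL;DR: Surveys the Ultralytics YOLO family of object detectors, covering architectural evolution, benchmarking, deployment perspectives, and future challenges. It traces technical progress from YOLO26 (YOLOv26) back to YOLOv5, compares the models quantitatively on MS COCO, and discusses real-world applications and future directions.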
Details
Motivation: To organize the rapid development of the YOLO family systematically, clarify the technical evolution between versions, provide a unified performance evaluation, and guide future research. Method: The paper reviews the architectural evolution from YOLOv5 to YOLO26, analyzes the key technical innovations, benchmarks several YOLO versions and other state-of-the-art detectors on MS COCO, and discusses deployment options and application scenarios. Result: YOLO26 achieves more efficient and stable detection through DFL removal, NMS-free inference, ProgLoss, STAL, and MuSGD; the benchmark shows trade-offs between accuracy and speed across YOLO versions; the models are widely used in robotics, agriculture, surveillance, and manufacturing. Conclusion: The YOLO family continues to evolve and performs well in accuracy, speed, and deployment flexibility, but still faces challenges in dense scenes and open-vocabulary detection; future work is expected to move toward hybrid CNN-Transformer designs and edge-aware training. Abstract: This paper presents a comprehensive overview of the Ultralytics YOLO (You Only Look Once) family of object detectors, focusing on the architectural evolution, benchmarking, deployment perspectives, and future challenges. The review begins with the most recent release, YOLO26 (YOLOv26), which introduces key innovations including Distribution Focal Loss (DFL) removal, native NMS-free inference, Progressive Loss Balancing (ProgLoss), Small-Target-Aware Label Assignment (STAL), and the MuSGD optimizer for stable training. The progression is then traced through YOLO11, with its hybrid task assignment and efficiency-focused modules; YOLOv8, which advanced with a decoupled detection head and anchor-free predictions; and YOLOv5, which established the modular PyTorch foundation that enabled modern YOLO development. Benchmarking on the MS COCO dataset provides a detailed quantitative comparison of YOLOv5, YOLOv8, YOLO11, and YOLO26, alongside cross-comparisons with YOLOv12, YOLOv13, RT-DETR, and DEIM. Metrics including precision, recall, F1 score, mean Average Precision, and inference speed are analyzed to highlight trade-offs between accuracy and efficiency. Deployment and application perspectives are further discussed, covering export formats, quantization strategies, and real-world use in robotics, agriculture, surveillance, and manufacturing. Finally, the paper identifies challenges and future directions, including dense-scene limitations, hybrid CNN-Transformer integration, open-vocabulary detection, and edge-aware training approaches.
[120] TreeNet: Layered Decision Ensembles
Zeshan Khan
Main category: cs.CV
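TL;DR: Proposes TreeNet, a layered decision ensemble learning method that combines the strengths of neural networks, ensemble learning, and tree-based decision models to address limited data in medical image analysis. It performs well in accuracy and real-time speed, keeping a relatively high F1-score even when the training data is cut by 50%.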
Details
Motivation: Medical image analysis often suffers from limited data and low data confidence, and existing methods perform poorly on small samples, so an efficient, interpretable model is needed to improve disease detection. Method: TreeNet fuses key features of neural networks, ensemble learning, and tree-based decision models into a layered ensemble decision framework, trained and evaluated end to end on medical image data. Result: The F1-score reaches 0.85 with the complete training data and 0.77 with only 50% of it; inference runs at 32 frames per second, fast enough for real-time use. Conclusion: TreeNet is efficient, robust, and interpretable in medical image analysis, and is especially suited to clinical settings with scarce data and real-time requirements. Abstract: Within the domain of medical image analysis, three distinct methodologies have demonstrated commendable accuracy: Neural Networks, Decision Trees, and Ensemble-Based Learning Algorithms, particularly in the specialized context of gastrointestinal tract abnormality detection. These approaches exhibit efficacy in disease detection scenarios where a substantial volume of data is available. However, the prevalent challenge in medical image analysis pertains to limited data availability and data confidence. This paper introduces TreeNet, a novel layered decision ensemble learning methodology tailored for medical image analysis. Constructed by integrating pivotal features from neural networks, ensemble learning, and tree-based decision models, TreeNet emerges as a potent and adaptable model capable of delivering superior performance across diverse and intricate machine learning tasks. Furthermore, its interpretability and insightful decision-making process enhance its applicability in complex medical scenarios. Evaluation of the proposed approach encompasses key metrics including Accuracy, Precision, Recall, and training and evaluation time. The methodology resulted in an F1-score of up to 0.85 when using the complete training data, and an F1-score of 0.77 when using 50% of the training data. This corresponds to an F1-score reduction of 0.08 when the training data and training time are reduced by 50%. The methodology runs at 32 frames per second, which is usable for real-time applications.
This comprehensive assessment underscores the efficiency and usability of TreeNet in the demanding landscape of medical image analysis, especially in real-time analysis.
[121] OmniSAT: Compact Action Token, Faster Auto Regression
Huaihai Lyu,Chaofan Chen,Senwei Xie,Pengwei Wang,Xiansheng Chen,Shanghang Zhang,Changsheng Xu
Main category: cs.CV
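TL;DR: Proposes the Omni Swift Action Tokenizer (OmniSAT), which uses multi-stage residual quantization to produce an efficient, compact action representation, substantially shortening training sequences and improving the training efficiency and performance of autoregressive models.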
Details
Motivation: Existing autoregressive action tokenization methods trade compression efficiency against reconstruction quality and struggle to deliver both efficient training and accurate action reconstruction. Method: Actions are normalized into a consistent representation with B-Spline encoding, and multi-stage residual quantization is applied to the position, rotation, and gripper subspaces to produce coarse-to-fine discrete action tokens; a cross-embodiment learning strategy trains jointly on robot and human demonstrations. Result: After pre-training on the Droid dataset, the training sequence is 6.8x shorter and the target entropy is lower; real-robot and simulation tasks confirm higher compression with good reconstruction quality and faster convergence of autoregressive models. Conclusion: OmniSAT provides efficient, transferable action tokenization, supports faster training of autoregressive policies, and promotes cross-embodiment learning through a unified action-pattern space. Abstract: Existing Vision-Language-Action (VLA) models can be broadly categorized into diffusion-based and auto-regressive (AR) approaches: diffusion models capture continuous action distributions but rely on computationally heavy iterative denoising. In contrast, AR models enable efficient optimization and flexible sequence construction, making them better suited for large-scale pretraining. To further improve AR efficiency, particularly when action chunks induce extended and high-dimensional sequences, prior work applies entropy-guided and token-frequency techniques to shorten the sequence length. However, such compression struggles with either poor reconstruction or inefficient compression. Motivated by this, we introduce an Omni Swift Action Tokenizer, which learns a compact, transferable action representation. Specifically, we first normalize value ranges and temporal horizons to obtain a consistent representation with B-Spline encoding. Then, we apply multi-stage residual quantization to the position, rotation, and gripper subspaces, producing compressed discrete tokens with coarse-to-fine granularity for each part. After pre-training on the large-scale dataset Droid, the resulting discrete tokenization shortens the training sequence by 6.8x and lowers the target entropy. To further explore the potential of OmniSAT, we develop a cross-embodiment learning strategy that builds on the unified action-pattern space and jointly leverages robot and human demonstrations. It enables scalable auxiliary supervision from heterogeneous egocentric videos.
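The multi-stage residual quantization mentioned in the abstract can be illustrated with a small sketch. It is a generic residual quantizer under stated assumptions, not OmniSAT's implementation: the codebooks below are hand-picked toy values rather than learned ones, a single 2-D vector stands in for one action subspace, and the names `nearest`, `rq_encode`, and `rq_decode` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Codebook = List[Vector]


def nearest(codebook: Codebook, x: Sequence[float]) -> int:
    """Index of the codebook entry closest to x in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )


def rq_encode(x: Sequence[float], codebooks: List[Codebook]) -> List[int]:
    """Quantize x stage by stage; each stage encodes what earlier stages missed."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rq_decode(codes: Sequence[int], codebooks: List[Codebook]) -> Vector:
    """Sum the selected entries of the stages that were used (coarse to fine)."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks (assumed, not learned): a coarse stage and a finer correction stage.
coarse = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]]
codebooks = [coarse, fine]

x = [1.2, 1.0]
codes = rq_encode(x, codebooks)            # one discrete token per stage
approx_coarse = rq_decode(codes[:1], codebooks[:1])
approx_full = rq_decode(codes, codebooks)
```

Decoding only the first token gives a coarse reconstruction, and each further token tightens it, which is the coarse-to-fine granularity the abstract describes. How the codebooks are trained and how the B-Spline coefficients are arranged per subspace are not specified in the abstract and are left out here.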
Across diverse real-robot and simulation experiments, OmniSAT achieves higher compression while preserving reconstruction quality, enabling faster AR training convergence and better model performance.
[122] Knowledge-Aware Mamba for Joint Change Detection and Classification from MODIS Times Series
Zhengsen Xu,Yimin Zhu,Zack Dewis,Mabel Heffring,Motasem Alkayid,Saeid Taleghanidoozdoozan,Lincoln Linlin Xu
Main category: cs.CV
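TL;DR: Proposes a knowledge-aware Mamba model (KAMamba) for MODIS time-series change detection. It combines a knowledge-driven transition-matrix loss, multi-task learning, and SSTMamba modules that disentangle spatial-spectral-temporal information, and clearly outperforms baselines on a dataset from Saskatchewan, Canada.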
Details
Motivation: Change detection on MODIS time series faces challenges such as mixed pixels, coupled spatial-spectral-temporal information, and heterogeneous background classes, and existing methods struggle to model class transitions and exploit prior knowledge. Method: The knowledge-aware KAMamba model (1) introduces a knowledge-driven transition-matrix-guided approach and the KAT-loss; (2) adopts a multi-task learning framework with PreC-loss, PostC-loss, and Chg-loss; (3) proposes SSTMamba modules to disentangle spatial-spectral-temporal information; and (4) uses a sparse and deformable SDMamba backbone to improve efficiency. Result: On the Saskatchewan MODIS dataset, average F1 for change detection is about 1.5-6% higher than the baselines, and OA, AA, and Kappa for LULC classification each improve by about 2%. Conclusion: By combining domain knowledge with an improved Mamba architecture, KAMamba improves change detection and classification on MODIS time series and shows strong potential for application. Abstract: Although change detection using MODIS time series is critical for environmental monitoring, it is a highly challenging task due to key MODIS difficulties, e.g., mixed pixels, spatial-spectral-temporal information coupling effect, and background class heterogeneity. This paper presents a novel knowledge-aware Mamba (KAMamba) for enhanced MODIS change detection, with the following contributions. First, to leverage knowledge regarding class transitions, we design a novel knowledge-driven transition-matrix-guided approach, leading to a knowledge-aware transition loss (KAT-loss) that can enhance detection accuracies. Second, to improve model constraints, a multi-task learning approach is designed, where three losses, i.e., pre-change classification loss (PreC-loss), post-change classification loss (PostC-loss), and change detection loss (Chg-loss) are used to improve model learning. Third, to disentangle information coupling in MODIS time series, novel spatial-spectral-temporal Mamba (SSTMamba) modules are designed. Last, to improve Mamba model efficiency and reduce computational cost, a sparse and deformable Mamba (SDMamba) backbone is used in SSTMamba. On the MODIS time-series dataset for Saskatchewan, Canada, we evaluate the method on land-cover change detection and LULC classification; results show about 1.5-6% gains in average F1 for change detection over baselines, and about 2% improvements in OA, AA, and Kappa for LULC classification.
[123] NNDM: NN_UNet Diffusion Model for Brain Tumor Segmentation
Sashank Makanaboyina
Main category: cs.CV
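TL;DR: Proposes NNDM, a framework that combines NN-UNet with a diffusion model to improve the accuracy and robustness of brain tumor segmentation in MRI.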
Details
Motivation: Existing CNN models fall short in generalization, boundary precision, and data diversity, and cannot meet the need for precise brain tumor segmentation. Method: NNDM combines the feature extraction of NN-UNet with the generative capability of a diffusion model; the diffusion model learns the residual error distribution between predicted and ground-truth masks and progressively refines the segmentation. Result: On the BraTS 2021 dataset, NNDM outperforms U-Net and transformer-based baselines on Dice coefficient and Hausdorff distance, and is more robust across modalities and tumor subregions. Conclusion: NNDM offers a new direction for combining deterministic segmentation networks with stochastic diffusion models, advancing automated brain tumor analysis. Abstract: Accurate detection and segmentation of brain tumors in magnetic resonance imaging (MRI) are critical for effective diagnosis and treatment planning. Despite advances in convolutional neural networks (CNNs) such as U-Net, existing models often struggle with generalization, boundary precision, and limited data diversity. To address these challenges, we propose NNDM (NN_UNet Diffusion Model), a hybrid framework that integrates the robust feature extraction of NN-UNet with the generative capabilities of diffusion probabilistic models. In our approach, the diffusion model progressively refines the segmentation masks generated by NN-UNet by learning the residual error distribution between predicted and ground-truth masks. This iterative denoising process enables the model to correct fine structural inconsistencies and enhance tumor boundary delineation. Experiments conducted on the BraTS 2021 dataset demonstrate that NNDM achieves superior performance compared to conventional U-Net and transformer-based baselines, yielding improvements in Dice coefficient and Hausdorff distance metrics. Moreover, the diffusion-guided refinement enhances robustness across modalities and tumor subregions. The proposed NNDM establishes a new direction for combining deterministic segmentation networks with stochastic diffusion models, advancing the state of the art in automated brain tumor analysis.
[124] Adaptive Fusion Network with Temporal-Ranked and Motion-Intensity Dynamic Images for Micro-expression Recognition
Thi Bich Phuong Man,Luu Tu Nguyen,Vu Tram Anh Khuong,Thanh Ha Le,Thi Duyen Ngo
Main category: cs.CV
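TL;DR: Proposes a new micro-expression recognition method that substantially improves recognition performance using two complementary dynamic image representations and an adaptive fusion network.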
Details
Motivation: Micro-expressions have significant practical value but are hard to recognize because of their low intensity and short duration; existing methods fall short in feature extraction and noise suppression.
Method: Proposes two complementary representations, the temporal-ranked dynamic image and the motion-intensity dynamic image, and designs an adaptive fusion network that learns to integrate them optimally.
Result: Outperforms existing methods on the CASME-II, SAMM, and MMEW datasets, reaching 93.95% accuracy and a UF1 of 0.897 on CASME-II.
Conclusion: Both the proposed input representations and the network architecture play an important role in improving micro-expression recognition, providing a solid foundation for research and applications in affective computing and related fields.
Abstract: Micro-expressions (MEs) are subtle, transient facial changes with very low intensity, almost imperceptible to the naked eye, yet they reveal a person's genuine emotion. They are of great value in lie detection, behavioral analysis, and psychological assessment. This paper proposes a novel micro-expression recognition (MER) method with two main contributions. First, we propose two complementary representations - Temporal-ranked dynamic image, which emphasizes temporal progression, and Motion-intensity dynamic image, which highlights subtle motions through a frame reordering mechanism incorporating motion intensity. Second, we propose an Adaptive fusion network (AFN), which automatically learns to optimally integrate these two representations, thereby enhancing discriminative ME features while suppressing noise. Experiments on three benchmark datasets (CASME-II, SAMM and MMEW) demonstrate the superiority of the proposed method. Specifically, AFN achieves 93.95% accuracy and 0.897 UF1 on CASME-II, setting a new state-of-the-art benchmark. On SAMM, the method attains 82.47% accuracy and 0.665 UF1, demonstrating more balanced recognition across classes. On MMEW, the model achieves 76.00% accuracy, further confirming its generalization ability. The obtained results show that both the input and the proposed architecture play important roles in improving the performance of MER. Moreover, they provide a solid foundation for further research and practical applications in the fields of affective computing, lie detection, and human-computer interaction.
[125] Multi Camera Connected Vision System with Multi View Analytics: A Comprehensive Survey
Muhammad Munsif,Waqas Ahmad,Amjid Ali,Mohib Ullah,Adnan Hussain,Sung Wook Baik
Main category: cs.CV
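TL;DR: This paper presents the first comprehensive, integrated survey of multi-view multi-camera (MVMC) systems, covering MVMC tracking, re-identification (Re-ID), and action understanding (AU). It proposes a unified framework and a new taxonomy, systematically summarizes datasets, methods, results, and evaluation metrics, and discusses real-world challenges and future research directions.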
Details
Motivation: Existing surveys mostly focus on isolated tasks or single-view setups and lack a comprehensive analysis of multi-task integration and multi-view collaboration in MVMC systems, which makes it hard to address the complex challenges of real, dynamic environments.
Method: Proposes a new taxonomy that divides CVS into four parts (MVMC tracking, Re-ID, AU, and combined methods) and systematically organizes existing work, datasets, methods, and evaluation metrics.
Result: Summarizes recent progress in MVMC, identifies the limitations of current techniques in handling occlusion, viewpoint variation, and environmental diversity, and outlines open problems and the prospects of emerging technologies such as lifelong learning, privacy protection, and federated learning.
Conclusion: Stresses the need to further improve the robustness, efficiency, and adaptability of CVS in complex real-world applications and identifies key research directions toward the next generation of intelligent, adaptive CVS.
Abstract: Connected Vision Systems (CVS) are transforming a variety of applications, including autonomous vehicles, smart cities, surveillance, and human-robot interaction. These systems harness multi-view multi-camera (MVMC) data to provide enhanced situational awareness through the integration of MVMC tracking, re-identification (Re-ID), and action understanding (AU). However, deploying CVS in real-world, dynamic environments presents a number of challenges, particularly in addressing occlusions, diverse viewpoints, and environmental variability. Existing surveys have focused primarily on isolated tasks such as tracking, Re-ID, and AU, often neglecting their integration into a cohesive system. These reviews typically emphasize single-view setups, overlooking the complexities and opportunities provided by multi-camera collaboration and multi-view data analysis. To the best of our knowledge, this survey is the first to offer a comprehensive and integrated review of MVMC that unifies MVMC tracking, Re-ID, and AU into a single framework. We propose a unique taxonomy to better understand the critical components of CVS, dividing it into four key parts: MVMC tracking, Re-ID, AU, and combined methods. We systematically arrange and summarize the state-of-the-art datasets, methodologies, results, and evaluation metrics, providing a structured view of the field's progression. Furthermore, we identify and discuss the open research questions and challenges, along with emerging technologies such as lifelong learning, privacy, and federated learning, that need to be addressed for future advancements. The paper concludes by outlining key research directions for enhancing the robustness, efficiency, and adaptability of CVS in complex, real-world applications. We hope this survey will inspire innovative solutions and guide future research toward the next generation of intelligent and adaptive CVS.
[126] Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping
Dwip Dalal,Gautam Vashishtha,Utkarsh Mishra,Jeonghwan Kim,Madhav Kanda,Hyeonjeong Ha,Svetlana Lazebnik,Heng Ji,Unnat Jain
Main category: cs.CV
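TL;DR: AttWarp is a lightweight method that applies rectilinear warping to the input image based on the cross-modal attention of a multimodal large language model (MLLM), dynamically reallocating spatial resolution to highlight query-relevant regions while keeping the global context intact. Without modifying model weights or architecture, it improves MLLM accuracy on fine-grained perceptual grounding, compositional reasoning, and resistance to hallucination.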
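To make the idea of reallocating resolution along an axis concrete, here is a rough one-dimensional sketch. It is not the paper's algorithm: it assumes (hypothetically) that a per-column attention profile is available and maps output columns to source coordinates through the inverse cumulative distribution of that profile, so high-attention columns receive more output pixels. Applying such a map independently to rows and columns keeps grid lines straight, which is one way to read "rectilinear"; the function name and toy numbers are illustrative.

```python
def warp_sample_positions(attention, out_size):
    """Return, for each output column, the source coordinate to sample from.

    `attention` holds one strictly positive weight per source column.
    Output columns are spread uniformly over the cumulative attention mass,
    so columns with more attention are stretched and the rest are compressed.
    """
    if any(w <= 0 for w in attention):
        raise ValueError("attention weights must be strictly positive")
    total = float(sum(attention))
    # cdf[i] = share of attention mass to the left of source column edge i.
    cdf = [0.0]
    for w in attention:
        cdf.append(cdf[-1] + w / total)
    positions = []
    src = 0
    for k in range(out_size):
        q = (k + 0.5) / out_size  # quantile at the center of output column k
        while src < len(attention) - 1 and cdf[src + 1] < q:
            src += 1
        # Linear interpolation inside the source column that contains q.
        frac = (q - cdf[src]) / (cdf[src + 1] - cdf[src])
        positions.append(src + frac)
    return positions


# Hypothetical profile: the middle column holds 60% of the attention mass.
positions = warp_sample_positions([1, 1, 6, 1, 1], out_size=10)
magnified = sum(1 for p in positions if int(p) == 2)
print(positions, magnified)
```

In this toy case six of the ten output columns sample from the single high-attention source column, while every source column is still visited, which mirrors the claim that the layout is preserved but redistributed non-uniformly.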
Details
Motivation: Existing multimodal large language models (MLLMs) tend to miss details and spatial relations in complex scenes, leading to errors in fine-grained perceptual grounding. A method is therefore needed that improves recognition of small objects and subtle relations without modifying the model architecture.
Method: Proposes AttWarp, which uses the MLLM's cross-modal attention maps at inference time to apply an invertible rectilinear spatial warp to the input image, allocating more resolution to high-attention regions and compressing irrelevant ones, thereby optimizing the input representation while retaining all original information and the global layout.
Result: Validated on five benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU) and four mainstream MLLMs (LLaVA, Qwen-VL, InternVL, InstructBLIP); AttWarp consistently improves accuracy, strengthens compositional reasoning, reduces hallucinations, and outperforms four competitive test-time image preprocessing baselines.
Conclusion: Through attention-guided spatial remapping, AttWarp markedly improves an MLLM's perception of key details without changing the model, showing that optimizing how information is distributed in the input is an effective way to improve multimodal models.
Abstract: Multimodal large language models (MLLMs) often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. We introduce AttWarp, a lightweight method that allocates more resolution to query-relevant content while compressing less informative areas, all while preserving global context. At test time, the approach uses an MLLM's cross-modal attention to perform rectilinear warping of the input image, reallocating spatial resolution toward regions the model deems important, without changing model weights or architecture. This attention-guided warping preserves all original image information but redistributes it non-uniformly, so small objects and subtle relationships become easier for the same model to read while the global layout remains intact. Across five benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU) and four MLLMs (LLaVA, Qwen-VL, InternVL, and InstructBLIP), AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines that manipulate raw images at test time. Together, these results show that attention-guided warping prioritizes information relevant to the query while preserving context, and that the same MLLMs perform better when given such warped inputs.
[127] Towards Understanding Ambiguity Resolution in Multimodal Inference of Meaning
Yufei Wang,Adriana Kovashka,Loretta Fernández,Marc N. Coutanche,Seth Wiener
Main category: cs.CV
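TL;DR: Studies a new setting in which learners infer the meaning of unfamiliar foreign-language words from a multimodal context of paired images and text, analyzes the factors that affect human participants' performance, and examines how well AI systems can reason about that performance.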
Details
Motivation: To explore how foreign-language learners infer the meaning of unfamiliar words from context in a multimodal setting, and to identify the key factors behind successful inference.
Method: Human participants with different language backgrounds complete image-text tasks; the study analyzes correlations between data features and participant performance and evaluates the ability of AI systems to reason about human performance.
Result: Only some intuitive features correlate strongly with participant performance, and AI systems show promise in predicting human performance.
Conclusion: Further research is needed on more effective predictive features and on improving the ability of AI systems to model and reason about human learning.
Abstract: We investigate a new setting for foreign language learning, where learners infer the meaning of unfamiliar words in a multimodal context of a sentence describing a paired image. We conduct studies with human participants using different image-text pairs. We analyze the features of the data (i.e., images and texts) that make it easier for participants to infer the meaning of a masked or unfamiliar word, and what language backgrounds of the participants correlate with success. We find that only some intuitive features have strong correlations with participant performance, prompting the need for further investigation of predictive features for success in these tasks. We also analyze the ability of AI systems to reason about participant performance, and discover promising future directions for improving this reasoning ability.
[128] Task-Aware Resolution Optimization for Visual Large Language Models
Weiqing Luo,Zhen Tan,Yifan Li,Xinyu Zhao,Kwonjoon Lee,Behzad Dariush,Tianlong Chen
Main category: cs.CV
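TL;DR: Proposes an empirical formula that determines the optimal input resolution for a vision-language task from image complexity and the model's uncertainty variance, together with an efficient fine-tuning method, to improve existing visual large language models across tasks.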
Details
Motivation: Existing vision-language models typically use a fixed resolution, which leads to subpar performance across tasks and fails to meet the varied perceptual-granularity demands of real applications.
Method: Analyzes the resolution preferences of different tasks and, combining image complexity with the model's uncertainty variance across resolutions, proposes an empirical formula for the optimal resolution; also designs a parameter-efficient fine-tuning technique to extend the input resolution of pre-trained models.
Result: Experiments on multiple vision-language tasks validate the effectiveness of the proposed method.
Conclusion: The method adapts the resolution to what each task requires, improving the generalization and practicality of vision-language models.
Abstract: Real-world vision-language applications demand varying levels of perceptual granularity. However, most existing visual large language models (VLLMs), such as LLaVA, pre-assume a fixed resolution for downstream tasks, which leads to subpar performance. To address this problem, we first conduct a comprehensive and pioneering investigation into the resolution preferences of different vision-language tasks, revealing a correlation of resolution preferences with image complexity and with the uncertainty variance of the VLLM at different image input resolutions. Building on this insight, we propose an empirical formula to determine the optimal resolution for a given vision-language task, combining these two factors. Second, based on rigorous experiments, we propose a novel parameter-efficient fine-tuning technique to extend the visual input resolution of pre-trained VLLMs to the identified optimal resolution. Extensive experiments on various vision-language tasks validate the effectiveness of our method.
[129] Post Processing of image segmentation using Conditional Random Fields
Aashish Dhawan,Pankaj Bodani,Vishal Garg
Main category: cs.CV
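TL;DR: This study aims to improve the clarity of satellite image segmentation by evaluating different types of Conditional Random Fields (CRFs), particularly with low-quality features, and compares them experimentally on satellite imagery and high-quality aerial photographs.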
Details
Motivation: Because satellite images have low-quality features, segmentation results are usually not clear, so a suitable CRF model is needed to improve the segmentation.
Method: Compares several types of CRF models, analyzes their suitability, and evaluates them experimentally on two datasets (low-quality satellite imagery and high-quality aerial photographs).
Result: The experiments compare how different CRFs perform on the two kinds of imagery and reveal the strengths and weaknesses of each approach.
Conclusion: Some CRF models show potential for improving segmentation clarity on low-quality satellite images, but their effectiveness depends on the characteristics of the data and the choice of model.
Abstract: The output of the image segmentation process is usually not very clear due to the low-quality features of satellite images. The purpose of this study is to find a suitable Conditional Random Field (CRF) to achieve better clarity in a segmented image. We started with different types of CRFs and studied why they are or are not suitable for our purpose. We evaluated our approach on two different datasets - satellite imagery having low-quality features and high-quality aerial photographs. During the study we experimented with various CRFs to find which CRF gives the best results on images and compared our results on these datasets to show the pitfalls and potentials of different approaches.
[130] Exploration of Incremental Synthetic Non-Morphed Images for Single Morphing Attack Detection
David Benavente-Rios,Juan Ruiz Rodriguez,Gustavo Gatica
Main category: cs.CV
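TL;DR: Investigates using synthetic face data to enhance Single-Morphing Attack Detection (S-MAD); adding a moderate amount of synthetic data improves generalization, but relying entirely on synthetic data degrades performance.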
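Results for this entry are reported as Equal Error Rate (EER), the operating point where the two error rates of a detector coincide; lower is better. The sketch below approximates EER by sweeping thresholds over observed scores. The score convention (higher means more likely an attack), the function name, and the toy scores are assumptions for illustration and do not come from the paper.

```python
def equal_error_rate(bona_fide_scores, attack_scores):
    """Approximate EER by sweeping thresholds over all observed scores.

    Assumed convention: a sample is flagged as an attack when score >= threshold.
    """
    thresholds = sorted(set(bona_fide_scores) | set(attack_scores))
    best_gap, eer = None, None
    for thr in thresholds:
        # Bona fide images wrongly flagged as attacks.
        false_alarm = sum(s >= thr for s in bona_fide_scores) / len(bona_fide_scores)
        # Attacks that slip under the threshold.
        miss = sum(s < thr for s in attack_scores) / len(attack_scores)
        gap = abs(false_alarm - miss)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (false_alarm + miss) / 2
    return eer


# Hypothetical detector scores.
bona_fide = [0.1, 0.2, 0.3, 0.6]
attacks = [0.4, 0.7, 0.8, 0.9]
eer_value = equal_error_rate(bona_fide, attacks)
print(eer_value)
```

With these toy scores one bona fide image and one attack are misclassified at the balancing threshold, so both error rates are 25%. Production evaluations usually interpolate between thresholds rather than picking the nearest observed one.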
Details
Motivation: Privacy concerns make large-scale datasets of real face images hard to obtain, so the potential of synthetic data for S-MAD needs to be explored.
Method: Uses multiple morphing tools and cross-dataset evaluation schemes, and assesses generalization with an incremental testing protocol that adds synthetic images step by step.
Result: Adding a moderate number of synthetic images, or gradually adding bona fide images, improves generalization, whereas indiscriminate use of synthetic data gives sub-optimal results; using only synthetic data yields the highest Equal Error Rate (EER), i.e., the worst performance.
Conclusion: In practice, S-MAD should not rely solely on synthetic data; real and synthetic data should be combined with care to obtain the best performance.
Abstract: This paper investigates the use of synthetic face data to enhance Single-Morphing Attack Detection (S-MAD), addressing the limited availability of large-scale datasets of bona fide images due to privacy concerns. Various morphing tools and cross-dataset evaluation schemes were utilized to conduct this study. An incremental testing protocol was implemented to assess the generalization capabilities as more and more synthetic images were added. The results of the experiments show that generalization can be improved by carefully incorporating a controlled number of synthetic images into existing datasets or by gradually adding bona fide images during training. However, indiscriminate use of synthetic data can lead to sub-optimal performance. Moreover, the use of only synthetic data (morphed and non-morphed images) achieves the highest Equal Error Rate (EER), which means that in operational scenarios the best option is not to rely only on synthetic data for S-MAD.
[131] Cell Instance Segmentation: The Devil Is in the Boundaries
Peixian Liang,Yifan Ding,Yizhe Zhang,Jianxu Chen,Hao Zheng,Hongxiao Wang,Yejia Zhang,Guangyu Meng,Tim Weninger,Michael Niemier,X. Sharon Hu,Danny Z Chen
Main category: cs.CV
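TL;DR: Proposes Ceb, a new pixel clustering method for cell instance segmentation that uses cell boundary features and labels to partition foreground pixels, overcoming the limitations of conventional pixel-wise objectives in preserving cell geometry.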
Details
Motivation: Existing deep-learning methods for cell instance segmentation rely on pixel-wise objectives (such as distance maps and heat diffusion maps), which tend to lose important geometric properties of cell instances (such as shape, curvature, and convexity). A method that better preserves this global geometric information is needed.
Method: Starting from probability maps produced by semantic segmentation, Ceb first extracts potential foreground-foreground boundaries with a revised Watershed algorithm; it then builds a feature representation for each boundary candidate (the boundary signature) that includes local boundaries; a boundary classifier predicts a binary boundary label from the signature; finally, neighboring regions are divided or merged according to the predicted labels to obtain the cell instances.
Result: Experiments on six datasets show that Ceb outperforms existing pixel clustering methods on semantic segmentation probability maps and is highly competitive with state-of-the-art cell instance segmentation methods.
Conclusion: By explicitly modeling and classifying cell boundaries, Ceb improves cell instance segmentation and demonstrates the importance of boundary features for preserving cell geometry.
Abstract: State-of-the-art (SOTA) methods for cell instance segmentation are based on deep learning (DL) semantic segmentation approaches, focusing on distinguishing foreground pixels from background pixels. In order to identify cell instances from foreground pixels (e.g., pixel clustering), most methods decompose instance information into pixel-wise objectives, such as distances to foreground-background boundaries (distance maps), heat gradients with the center point as heat source (heat diffusion maps), and distances from the center point to foreground-background boundaries with fixed angles (star-shaped polygons). However, pixel-wise objectives may lose significant geometric properties of the cell instances, such as shape, curvature, and convexity, which require a collection of pixels to represent. To address this challenge, we present a novel pixel clustering method, called Ceb (for Cell boundaries), to leverage cell boundary features and labels to divide foreground pixels into cell instances. Starting with probability maps generated from semantic segmentation, Ceb first extracts potential foreground-foreground boundaries with a revised Watershed algorithm. For each boundary candidate, a boundary feature representation (called boundary signature) is constructed by sampling pixels from the current foreground-foreground boundary as well as the neighboring background-foreground boundaries. Next, a boundary classifier is used to predict its binary boundary label based on the corresponding boundary signature. Finally, cell instances are obtained by dividing or merging neighboring regions based on the predicted boundary labels. Extensive experiments on six datasets demonstrate that Ceb outperforms existing pixel clustering methods on semantic segmentation probability maps. Moreover, Ceb achieves highly competitive performance compared to SOTA cell instance segmentation methods.
[132] Cluster-Aware Prompt Ensemble Learning for Few-Shot Vision-Language Model Adaptation
Zhi Chen,Xin Yu,Xiaohui Tao,Yan Li,Zi Huang
Main category: cs.CV
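TL;DR: Proposes CAPEL, a cluster-aware prompt ensemble learning framework that ensembles prompts in the classification logits space rather than the feature space, preserving the cluster nature of context prompts and improving the zero-shot transfer performance of vision-language models.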
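The contrast between averaging prompt features and ensembling in logits space can be illustrated with a toy example. This is a sketch under loud assumptions, not CAPEL itself: the two-dimensional vectors are made up, the log-sum-exp aggregation over per-prompt logits is only one plausible way to combine them, and the scale of 10 is arbitrary.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def predict_feature_average(image, class_prompts):
    """Average each class's prompt features first, then score once per class."""
    scores = {}
    for name, prompts in class_prompts.items():
        dim = len(prompts[0])
        mean = [sum(p[i] for p in prompts) / len(prompts) for i in range(dim)]
        scores[name] = cosine(image, mean)
    return max(scores, key=scores.get)


def predict_logit_ensemble(image, class_prompts, scale=10.0):
    """Score every prompt separately, then combine the logits per class.

    Log-sum-exp is an assumed aggregation rule chosen for illustration.
    """
    scores = {}
    for name, prompts in class_prompts.items():
        logits = [scale * cosine(image, p) for p in prompts]
        scores[name] = logsumexp(logits)
    return max(scores, key=scores.get)


# Hypothetical features: class A has two distinct clusters, class B has one.
class_prompts = {
    "A": [[1.0, 0.0], [0.0, 1.0]],
    "B": [[0.8, 0.6], [0.8, 0.6]],
}
image = [1.0, 0.0]  # lies exactly on the first cluster of class A
by_feature = predict_feature_average(image, class_prompts)
by_logit = predict_logit_ensemble(image, class_prompts)
print(by_feature, by_logit)
```

In this toy setup the averaged centroid of class A falls between its two clusters and loses to class B, while the per-prompt logits keep the matching cluster's evidence, which is the centroid-shift effect the entry describes.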
Details
Motivation: Conventional prompt ensembling averages textual features, which shifts class centroids away from the true distribution and hurts model performance.
Method: Ensembles prompts in the logits space and introduces a cluster-preserving regularization term and an adaptive prompt weighting mechanism, so that prompts stay specialized and their weights are adjusted dynamically.
Result: CAPEL is more robust and performs better across multiple datasets and tasks than conventional feature-space averaging.
Conclusion: CAPEL preserves the cluster structure of prompts and improves the zero-shot classification performance of vision-language models.
Abstract: Vision-language models (VLMs) such as CLIP achieve zero-shot transfer across various tasks by pre-training on numerous image-text pairs. These models often benefit from using an ensemble of context prompts to represent a class. Despite being effective, conventional prompt ensembling that averages textual features of context prompts often yields suboptimal results. This is because feature averaging shifts the class centroids away from the true class distribution. To address this issue, we propose the Cluster-Aware Prompt Ensemble Learning (CAPEL) framework, which preserves the cluster nature of context prompts. CAPEL classifies images into one of several class clusters, each represented by a distinct prompt. Instead of ensembling prompts in the feature space, we perform ensembling in the classification logits space, aligning better with the visual feature distribution. To further optimize prompt fine-tuning while maintaining cluster-specific discriminative power, we introduce a cluster-preserving regularization term. This ensures that prompts remain distinct and specialized for different clusters, preventing collapse into a uniform direction. Additionally, we integrate an adaptive prompt weighting technique to dynamically adjust the attention weights for flawed or ambiguous prompts, ensuring robust performance across diverse datasets and tasks.
[133] Fast Self-Supervised depth and mask aware Association for Multi-Object Tracking
Milad Khanchi,Maria Amer,Charalambos Poullis
Main category: cs.CV
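TL;DR: Proposes a new multi-object tracking method that fuses depth and mask features and uses a self-supervised encoder to produce stable object representations, avoiding segmentation-mask IoU computation and outperforming existing tracking-by-detection methods in complex scenes.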
Details
Motivation: Conventional MOT methods rely on IoU for association, which is unreliable when objects are similar or occluded, and computing IoU on segmentation masks is expensive. A more robust and efficient matching mechanism is needed.
Method: Obtains depth maps from a zero-shot depth estimator and object masks from a promptable visual segmentation model, fuses depth and mask features, and feeds them to a compact encoder trained with self-supervision to produce stable object representations, which are used for matching together with bounding box IoU and re-identification features.
Result: On challenging benchmarks with non-linear motion, occlusion, and crowded scenes, such as SportsMOT and DanceTrack, the TBD method outperforms the tracking-by-detection state of the art on most metrics, and it is competitive on simpler benchmarks such as MOT17.
Conclusion: This is the first MOT work to use a self-supervised encoder to refine segmentation mask representations without computing mask IoU; the method markedly improves tracking in complex scenes while remaining efficient.
Abstract: Multi-object tracking (MOT) methods often rely on Intersection-over-Union (IoU) for association. However, this becomes unreliable when objects are similar or occluded. Also, computing IoU for segmentation masks is computationally expensive. In this work, we use segmentation masks to capture object shapes, but we do not compute segmentation IoU. Instead, we fuse depth and mask features and pass them through a compact encoder trained in a self-supervised manner. This encoder produces stable object representations, which we use as an additional similarity cue alongside bounding box IoU and re-identification features for matching. We obtain depth maps from a zero-shot depth estimator and object masks from a promptable visual segmentation model to obtain fine-grained spatial cues. Our MOT method is the first to use the self-supervised encoder to refine segmentation masks without computing mask IoU. MOT can be divided into joint detection-ReID (JDR) and tracking-by-detection (TBD) models. The latter are computationally more efficient. Experiments with our TBD method on challenging benchmarks with non-linear motion, occlusion, and crowded scenes, such as SportsMOT and DanceTrack, show that our method outperforms the TBD state-of-the-art on most metrics, while achieving competitive performance on simpler benchmarks with linear motion, such as MOT17.
[134] CHUG: Crowdsourced User-Generated HDR Video Quality Dataset
Shreshth Saini,Alan C. Bovik,Neil Birkbeck,Yilin Wang,Balu Adsumilli
Main category: cs.CV
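TL;DR: Presents CHUG, the first high dynamic range video quality assessment dataset for user-generated content, with 856 source videos and 5,992 videos in total after transcoding; a large-scale crowdsourced study collected 211,848 subjective ratings, with the aim of advancing no-reference HDR-VQA research.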
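The dataset sizes quoted for this entry can be cross-checked with simple arithmetic. The sketch below uses only the three reported counts; the interpretation of the ratios (versions per source, average ratings per video) is an assumption, since the entry does not say whether the untouched sources are counted among the 5,992 videos or how ratings are distributed.

```python
source_videos = 856
total_videos = 5992
perceptual_ratings = 211_848

# Total videos per source video; whether this includes the original source
# itself is not stated in the entry (assumption left open).
versions_per_source = total_videos / source_videos

# Average ratings per video, assuming ratings cover all 5,992 videos.
ratings_per_video = perceptual_ratings / total_videos

print(versions_per_source, round(ratings_per_video, 2))
```

The counts are consistent with exactly seven videos per source and roughly 35 ratings per video on average.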
Details
Motivation: Existing HDR video quality assessment datasets focus mainly on professionally generated content and offer little insight into HDR degradations in real-world user-generated content (UGC), so a quality assessment dataset for UGC-HDR is needed.
Method: Builds CHUG, a large-scale dataset of 856 UGC-HDR source videos transcoded across multiple resolutions and bitrates for a total of 5,992 videos, and runs a large-scale subjective quality rating study on Amazon Mechanical Turk.
Result: The resulting CHUG dataset contains 5,992 videos and 211,848 subjective perceptual quality ratings, providing a benchmark for UGC-HDR video quality assessment.
Conclusion: CHUG fills a gap in UGC-HDR video quality assessment, is expected to advance no-reference HDR-VQA models, and supports video quality research in real-world settings.
Abstract: High Dynamic Range (HDR) videos enhance visual experiences with superior brightness, contrast, and color depth. The surge of User-Generated Content (UGC) on platforms like YouTube and TikTok introduces unique challenges for HDR video quality assessment (VQA) due to diverse capture conditions, editing artifacts, and compression distortions. Existing HDR-VQA datasets primarily focus on professionally generated content (PGC), leaving a gap in understanding real-world UGC-HDR degradations. To address this, we introduce CHUG: Crowdsourced User-Generated HDR Video Quality Dataset, the first large-scale subjective study on UGC-HDR quality. CHUG comprises 856 UGC-HDR source videos, transcoded across multiple resolutions and bitrates to simulate real-world scenarios, totaling 5,992 videos. A large-scale study via Amazon Mechanical Turk collected 211,848 perceptual ratings. CHUG provides a benchmark for analyzing UGC-specific distortions in HDR videos. We anticipate CHUG will advance No-Reference (NR) HDR-VQA research by offering a large-scale, diverse, and real-world UGC dataset. The dataset is publicly available at: https://shreshthsaini.github.io/CHUG/.
[135] Geometry-Aware Scene Configurations for Novel View Synthesis
Minkwan Kim,Changwoon Choi,Young Min Kim
Main category: cs.CV
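TL;DR: Proposes scene-adaptive strategies that use geometric priors to optimize basis placement and virtual viewpoint generation, improving the efficiency and quality of immersive rendering of indoor scenes from incomplete observations.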
Details
Motivation: Indoor scenes often have complex layouts, occlusions, and clutter, and conventional NeRF methods struggle to produce high-quality renderings efficiently under limited resources, so a smarter resource allocation strategy is needed.
Method: Records observation statistics on a geometric scaffold obtained from pre-processing to guide the optimal placement of bases, and introduces scene-adaptive virtual viewpoints that compensate for deficiencies in the view configuration of the input trajectory and strengthen regularization.
Result: Validated on several large-scale indoor scenes, the method clearly improves rendering quality and memory usage over baselines that use regular placements.
Conclusion: Scene-adaptive basis allocation and virtual viewpoints improve the representation efficiency and rendering quality of NeRF in complex indoor environments, especially when observations are incomplete.
Abstract: We propose scene-adaptive strategies to efficiently allocate representation capacity for generating immersive experiences of indoor environments from incomplete observations. Indoor scenes with multiple rooms often exhibit irregular layouts with varying complexity, containing clutter, occlusion, and flat walls. We maximize the utilization of limited resources with guidance from geometric priors, which are often readily available after pre-processing stages. We record observation statistics on the estimated geometric scaffold and guide the optimal placement of bases, which greatly improves upon the uniform basis arrangements adopted by previous scalable Neural Radiance Field (NeRF) representations. We also suggest scene-adaptive virtual viewpoints to compensate for geometric deficiencies inherent in view configurations in the input trajectory and impose the necessary regularization. We present a comprehensive analysis and discussion regarding rendering quality and memory requirements in several large-scale indoor scenes, demonstrating significant enhancements compared to baselines that employ regular placements.
[136] LTGS: Long-Term Gaussian Scene Chronology From Sparse View Updates
Minkwan Kim,Seungmin Lee,Junho Kim,Young Min Kim
Main category: cs.CV
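TL;DR: Proposes LTGS, a long-term Gaussian scene chronology representation that efficiently models long-term changes in everyday environments from sparse-view updates and suits unstructured, incomplete casual captures.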
Details
Motivation: Existing novel-view synthesis methods need dense spatial and temporal observations to cope with frequent scene changes in everyday environments, which makes them hard to apply to sparse, unstructured casual captures. A dynamic scene representation that works with long-term, sparse input is therefore needed.
Method: Proposes the LTGS framework, which models objects as template Gaussians that act as reusable structural priors; a refinement pipeline modulates these priors from a few observations to adapt to environmental changes; simple transformations generalize across time steps and support fast, lightweight updates.
Result: Experiments on self-collected real-world datasets show better reconstruction quality than the baselines together with fast, lightweight scene updates.
Conclusion: LTGS offers a scalable, efficient approach to long-term dynamic scene modeling that robustly captures the spatiotemporal evolution of real-world environments from sparse input.
Abstract: Recent advances in novel-view synthesis can create photo-realistic visualizations of real-world environments from conventional camera captures. However, acquiring everyday environments from casual captures faces challenges due to frequent scene changes, which require dense observations both spatially and temporally. We propose long-term Gaussian scene chronology from sparse-view updates, coined LTGS, an efficient scene representation that can embrace everyday changes from highly under-constrained casual captures. Given an incomplete and unstructured Gaussian splatting representation obtained from an initial set of input images, we robustly model the long-term chronology of the scene despite abrupt movements and subtle environmental variations. We construct objects as template Gaussians, which serve as structural, reusable priors for shared object tracks. Then, the object templates undergo a further refinement pipeline that modulates the priors to adapt to temporally varying environments based on few-shot observations. Once trained, our framework is generalizable across multiple time steps through simple transformations, significantly enhancing the scalability for the temporal evolution of 3D environments. As existing datasets do not explicitly represent the long-term real-world changes with a sparse capture setup, we collect real-world datasets to evaluate the practicality of our pipeline. Experiments demonstrate that our framework achieves superior reconstruction quality compared to other baselines while enabling fast and light-weight updates.
[137] An uncertainty-aware framework for data-efficient multi-view animal pose estimation
Lenny Aharon,Keemin Lee,Karan Sikka,Selmaan Chettih,Cole Hurwitz,Liam Paninski,Matthew R Whiteway
Main category: cs.CV
TL;DR: 提出了一种多视角变换器(MVT)框架,结合新型训练、后处理和模型蒸馏技术,实现高效、准确且具有不确定性感知的动物姿态估计。
Details
Motivation: Existing methods struggle to track accurately with limited labeled data and give poor uncertainty estimates, which limits their use in real scientific settings.
Method: Uses a multi-view transformer (MVT) with pretrained backbones and a patch masking scheme that learns cross-view correspondences; exploits geometric consistency through 3D augmentation and a triangulation loss; extends the Ensemble Kalman Smoother (EKS) to the nonlinear case and improves its uncertainty quantification; and designs a distillation procedure that uses high-quality pseudo-labels to reduce dependence on manual labels.
Result: Outperforms existing methods on data from three different animals (flies, mice, chickadees); the components bring complementary gains and clearly improve both pose accuracy and uncertainty estimation.
Conclusion: The framework is a practical, reliable, uncertainty-aware multi-view pose estimation system suited to behavioral analysis under real-world data constraints.
Abstract: Multi-view pose estimation is essential for quantifying animal behavior in scientific research, yet current methods struggle to achieve accurate tracking with limited labeled data and suffer from poor uncertainty estimates. We address these challenges with a comprehensive framework combining novel training and post-processing techniques, and a model distillation procedure that leverages the strengths of these techniques to produce a more efficient and effective pose estimator. Our multi-view transformer (MVT) utilizes pretrained backbones and enables simultaneous processing of information across all views, while a novel patch masking scheme learns robust cross-view correspondences without camera calibration. For calibrated setups, we incorporate geometric consistency through 3D augmentation and a triangulation loss. We extend the existing Ensemble Kalman Smoother (EKS) post-processor to the nonlinear case and enhance uncertainty quantification via a variance inflation technique. Finally, to leverage the scaling properties of the MVT, we design a distillation procedure that exploits improved EKS predictions and uncertainty estimates to generate high-quality pseudo-labels, thereby reducing dependence on manual labels. Our framework components consistently outperform existing methods across three diverse animal species (flies, mice, chickadees), with each component contributing complementary benefits. The result is a practical, uncertainty-aware system for reliable pose estimation that enables downstream behavioral analyses under real-world data constraints.
[138] SpectralCA: Bi-Directional Cross-Attention for Next-Generation UAV Hyperspectral Vision
D. V. Brovko
Main category: cs.CV
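TL;DR: Proposes a hybrid 2D/3D convolutional architecture that combines hyperspectral imaging with deep learning; the SpectralCA block improves UAV perception in complex environments, enabling real-time navigation and object recognition.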
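This entry evaluates classification with Overall Accuracy (OA), Average Accuracy (AA), and the Kappa coefficient. As a minimal sketch of how these three are commonly derived from a confusion matrix (the function name and the toy matrix are illustrative assumptions, not data from the paper):

```python
def classification_metrics(confusion):
    """Return (OA, AA, Kappa) from a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    # OA: fraction of all samples classified correctly.
    oa = correct / n
    # AA: mean of per-class recalls, so small classes weigh as much as large ones.
    aa = sum(confusion[i][i] / sum(confusion[i]) for i in range(k)) / k
    # Expected agreement by chance, from row (true) and column (predicted) totals.
    pe = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(k)
    ) / (n * n)
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa


# Hypothetical two-class confusion matrix.
toy_confusion = [[40, 10],
                 [5, 45]]
oa, aa, kappa = classification_metrics(toy_confusion)
print(oa, aa, kappa)
```

OA and AA coincide here because both classes have 50 samples; on imbalanced hyperspectral scenes they can differ noticeably, which is why they are usually reported together with Kappa.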
Details
Motivation: Conventional navigation is unreliable in complex environments because of interference, poor visibility, or camouflage, whereas hyperspectral imaging (HSI) offers fine-grained material recognition that meets UAV needs in navigation, surveillance, agriculture, and environmental monitoring.
Method: Modifies the Mobile 3D Vision Transformer (MDvT) by introducing the SpectralCA block, which fuses spectral and spatial features with bi-directional cross-attention, and trains and evaluates the model on the WHU-Hi-HongHu dataset.
Result: Experiments on WHU-Hi-HongHu show strong Overall Accuracy, Average Accuracy, and Kappa coefficient, with improved perception efficiency, fewer parameters, and shorter inference time, supporting real-time operation.
Conclusion: The proposed SpectralCA architecture strengthens UAV perception in complex environments, is suited to navigation, object detection, and terrain classification, and has practical application potential.
Abstract: The relevance of this research lies in the growing demand for unmanned aerial vehicles (UAVs) capable of operating reliably in complex environments where conventional navigation becomes unreliable due to interference, poor visibility, or camouflage. Hyperspectral imaging (HSI) provides unique opportunities for UAV-based computer vision by enabling fine-grained material recognition and object differentiation, which are critical for navigation, surveillance, agriculture, and environmental monitoring. The aim of this work is to develop a deep learning architecture integrating HSI into UAV perception for navigation, object detection, and terrain classification. Objectives include: reviewing existing HSI methods, designing a hybrid 2D/3D convolutional architecture with spectral-spatial cross-attention, training, and benchmarking. The methodology is based on the modification of the Mobile 3D Vision Transformer (MDvT) by introducing the proposed SpectralCA block. This block employs bi-directional cross-attention to fuse spectral and spatial features, enhancing accuracy while reducing parameters and inference time. Experimental evaluation was conducted on the WHU-Hi-HongHu dataset, with results assessed using Overall Accuracy, Average Accuracy, and the Kappa coefficient. The findings confirm that the proposed architecture improves UAV perception efficiency, enabling real-time operation for navigation, object recognition, and environmental monitoring tasks.
Keywords: SpectralCA, deep learning, computer vision, hyperspectral imaging, unmanned aerial vehicle, object detection, semi-supervised learning.
[139] HeadsUp! High-Fidelity Portrait Image Super-Resolution
Renjie Li,Zihao Zhu,Xiaoyu Wang,Zhengzhong Tu
Main category: cs.CV
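TL;DR: Studies the portrait image super-resolution (PortraitISR) problem and proposes HeadsUp, a single-step diffusion model that seamlessly restores and upscales portrait images end to end.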
Details
Motivation: Existing image super-resolution techniques usually handle portrait photos by blending several models, which produces blending or boundary artifacts around the face, while human perception is highly sensitive to facial fidelity. A method that processes the whole portrait in a unified way is therefore needed.
Method: Builds HeadsUp on a single-step diffusion model and develops a face supervision mechanism that guides the model to focus on the facial region; adds a reference-based mechanism to help identity restoration and reduce ambiguity when restoring low-quality faces; and constructs PortraitSR-4K, a high-quality 4K portrait dataset for training and evaluation.
Result: Experiments show that HeadsUp achieves state-of-the-art performance on the PortraitISR task while maintaining comparable or higher performance on general image and aligned face datasets.
Conclusion: With an end-to-end single-model approach, HeadsUp addresses blending artifacts in portrait super-resolution and performs well in both facial detail restoration and overall image quality.
Abstract: Portrait pictures, which typically feature both human subjects and natural backgrounds, are one of the most prevalent forms of photography on social media. Existing image super-resolution (ISR) techniques generally focus either on generic real-world images or strictly aligned facial images (i.e., face super-resolution). In practice, separate models are blended to handle portrait photos: the face specialist model handles the face region, and the general model processes the rest. However, these blending approaches inevitably introduce blending or boundary artifacts around the facial regions due to different model training recipes, while human perception is particularly sensitive to facial fidelity. To overcome these limitations, we study the portrait image super-resolution (PortraitISR) problem, and propose HeadsUp, a single-step diffusion model that is capable of seamlessly restoring and upscaling portrait images in an end-to-end manner. Specifically, we build our model on top of a single-step diffusion model and develop a face supervision mechanism to guide the model in focusing on the facial region. We then integrate a reference-based mechanism to help with identity restoration, reducing face ambiguity in low-quality face restoration. Additionally, we have built a high-quality 4K portrait image ISR dataset dubbed PortraitSR-4K, to support model training and benchmarking for portrait images. Extensive experiments show that HeadsUp achieves state-of-the-art performance on the PortraitISR task while maintaining comparable or higher performance on both general image and aligned face datasets.
[140] Denoising Diffusion as a New Framework for Underwater Images
Nilesh Jain,Elie Alhajjar
Main category: cs.CV
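TL;DR: Proposes expanding underwater image datasets with a denoising diffusion model and using ControlNet to improve image quality, in order to overcome the limited diversity and low quality of existing datasets and thereby improve marine ecosystem research.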
Details
Motivation: Existing underwater image datasets lack diversity and are of low quality, which limits the generalization and practical effect of image enhancement methods; richer, higher-quality datasets are urgently needed.
Method: Uses a denoising diffusion model to generate several types of underwater images (such as stereo, wide-angle, and macro) and applies ControlNet for image enhancement and quality evaluation to raise the overall quality of the datasets.
Result: The proposed approach can expand underwater image datasets and improve image clarity, color fidelity, and detail, making subsequent ecological studies more reliable.
Conclusion: Combining a generative model with a controllable enhancement network can markedly improve underwater image quality and dataset diversity, supporting marine environmental monitoring and ecological research.
Abstract: Underwater images play a crucial role in ocean research and marine environmental monitoring since they provide quality information about the ecosystem. However, the complex and remote nature of the environment results in poor image quality with issues such as low visibility, blurry textures, color distortion, and noise. In recent years, research in image enhancement has proven to be effective but also presents its own limitations, like poor generalization and heavy reliance on clean datasets. One of the challenges herein is the lack of diversity and the low quality of images included in these datasets. Also, most existing datasets consist only of monocular images, a fact that limits the representation of different lighting conditions and angles. In this paper, we propose a new plan of action to overcome these limitations. On one hand, we call for expanding the datasets using a denoising diffusion model to include a variety of image types such as stereo, wide-angled, macro, and close-up images. On the other hand, we recommend enhancing the images using ControlNet to evaluate and increase the quality of the corresponding datasets, and hence improve the study of the marine ecosystem.
Tags - Underwater Images, Denoising Diffusion, Marine ecosystem, ControlNet
[141] Semi-disentangled spatiotemporal implicit neural representations of longitudinal neuroimaging data for trajectory classification
Agampreet Aulakh,Nils D. Forkert,Matthias Wilms
Main category: cs.CV
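TL;DR: Proposes a fully data-driven method based on implicit neural representations (INRs) for modeling individual brain aging trajectories from longitudinal T1-weighted MRI data; it outperforms a conventional deep learning model under irregular sampling.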
Details
Motivation: Conventional deep learning methods struggle with the discrete nature and irregular spatiotemporal sampling of longitudinal neuroimaging data and cannot effectively represent the continuous biological aging process. Method: Implicit neural representations (INRs) model each subject's longitudinal MRI data as a continuous function; a new INR architecture partially disentangles spatial and temporal parameters, and a classification framework operates directly on the INR parameter space. Result: On simulated data for 450 healthy and dementia-like subjects, the method reaches 81.3% accuracy in brain aging trajectory classification under irregular temporal sampling, compared with 73.7% for the baseline. Conclusion: The proposed INR method models continuous trajectories of brain structural change more effectively, suits irregularly sampled longitudinal brain imaging analysis, and shows strong classification performance and application potential. Abstract: The human brain undergoes dynamic, potentially pathology-driven, structural changes throughout a lifespan. Longitudinal Magnetic Resonance Imaging (MRI) and other neuroimaging data are valuable for characterizing trajectories of change associated with typical and atypical aging. However, the analysis of such data is highly challenging given their discrete nature with different spatial and temporal image sampling patterns within individuals and across populations. This leads to computational problems for most traditional deep learning methods that cannot represent the underlying continuous biological process. To address these limitations, we present a new, fully data-driven method for representing aging trajectories across the entire brain by modelling subject-specific longitudinal T1-weighted MRI data as continuous functions using Implicit Neural Representations (INRs). To this end, we introduce a novel INR architecture capable of partially disentangling spatial and temporal trajectory parameters and design an efficient framework that directly operates on the INRs' parameter space to classify brain aging trajectories. To evaluate our method in a controlled data environment, we develop a biologically grounded trajectory simulation and generate T1-weighted 3D MRI data for 450 healthy and dementia-like subjects at regularly and irregularly sampled timepoints. In the more realistic irregular sampling experiment, our INR-based method achieves 81.3% accuracy for the brain aging trajectory classification task, outperforming a standard deep learning baseline model (73.7%).
[142] Explainable Human-in-the-Loop Segmentation via Critic Feedback Signals
Pouya Shaeri,Ryan T. Woo,Yasaman Mohammadpour,Ariane Middel
Main category: cs.CV
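TL;DR: Proposes a human-in-the-loop framework that uses human corrections of segmentation results as interventional signals, helping the model move away from dataset-specific spurious correlations and learn more robust, semantically consistent features; this markedly improves cross-domain generalization and reduces annotation cost.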
Details
Motivation: Existing segmentation models perform well on benchmarks but fail in real-world scenes because they rely on spurious features such as color and texture and generalize poorly. A method is needed that can identify and correct the model's misplaced dependencies. Method: A human-in-the-loop system treats human corrections of segmentation outputs as interventional signals that reveal the model's reliance on superficial features; corrections are propagated across visually similar images to steer the model toward semantically relevant features and to systematically correct failure modes across the dataset. Result: On challenging cubemap data, segmentation accuracy improves by up to 9 mIoU points (a 12-15% relative improvement), annotation effort drops 3-4× compared with standard retraining, and performance on benchmark datasets remains competitive. Conclusion: The framework improves the robustness, generalization, and data efficiency of segmentation models and suits real-world applications prone to dataset bias, such as urban climate monitoring and autonomous driving. Abstract: Segmentation models achieve high accuracy on benchmarks but often fail in real-world domains by relying on spurious correlations instead of true object boundaries. We propose a human-in-the-loop interactive framework that enables interventional learning through targeted human corrections of segmentation outputs. Our approach treats human corrections as interventional signals that show when reliance on superficial features (e.g., color or texture) is inappropriate. The system learns from these interventions by propagating correction-informed edits across visually similar images, effectively steering the model toward robust, semantically meaningful features rather than dataset-specific artifacts. Unlike traditional annotation approaches that simply provide more training data, our method explicitly identifies when and why the model fails and then systematically corrects these failure modes across the entire dataset. Through iterative human feedback, the system develops increasingly robust representations that generalize better to novel domains and resist artifactual correlations. We demonstrate that our framework improves segmentation accuracy by up to 9 mIoU points (12-15% relative improvement) on challenging cubemap data and yields 3-4$\times$ reductions in annotation effort compared to standard retraining, while maintaining competitive performance on benchmark datasets.
This work provides a practical framework for researchers and practitioners seeking to build segmentation systems that are accurate, robust to dataset biases, data-efficient, and adaptable to real-world domains such as urban climate monitoring and autonomous driving.
[143] A Multi-Strategy Framework for Enhancing Shatian Pomelo Detection in Real-World Orchards
Pan Wang,Yihao Hu,Xiaodong Bai,Aiping Yang,Xiangxiang Li,Meiping Ding,Jianguo Yao
Main category: cs.CV
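TL;DR: This paper proposes a multi-strategy framework for automatic Shatian pomelo detection. It builds a multi-scenario dataset, applies lighting augmentations, and designs the REAS-Det network (with RFAConv, C3RFEM, and MultiSEAM modules) to cope with device differences, lighting changes, scale variation, and occlusion, achieving high detection accuracy in real orchards.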
Details
Motivation: Existing methods work well under specific theoretical or dataset conditions but degrade in real orchards because of imaging devices, lighting conditions, fruit scale variation, and occlusion, and so fall short of the commercial needs of automated Shatian pomelo detection. Method: 1) Build STP-AgriData, a multi-scenario dataset combining real orchard images with internet-sourced images; 2) apply data augmentations such as contrast and brightness adjustment to simulate complex lighting; 3) design the REAS-Det network, which uses RFAConv and C3RFEM modules to enlarge the receptive field for scale variation and combines the MultiSEAM multi-head feature selection structure with soft-NMS to improve detection under occlusion. Result: The experiments yield 87.6% precision, 74.9% recall, 82.8% mAP@.50, and 53.3% mAP@.50:.95, outperforming several mainstream object detection methods. Conclusion: The proposed multi-strategy framework improves Shatian pomelo detection accuracy in complex environments, is robust, and is applicable to fruit detection in automated agricultural production. Abstract: As a specialty agricultural product with a large market scale, Shatian pomelo necessitates the adoption of automated detection to ensure accurate quantity and meet commercial demands for lean production. Existing research often involves specialized networks tailored for specific theoretical or dataset scenarios, but these methods tend to degrade in real-world settings. By analyzing the factors behind this issue, this study identifies four key challenges that affect the accuracy of Shatian pomelo detection: imaging devices, lighting conditions, object scale variation, and occlusion. To mitigate these challenges, a multi-strategy framework is proposed in this paper. Firstly, to effectively solve tone variation introduced by diverse imaging devices and complex orchard environments, we utilize a multi-scenario dataset, STP-AgriData, which is constructed by integrating real orchard images with internet-sourced data. Secondly, to simulate the inconsistent illumination conditions, specific data augmentations, such as adjusting contrast and changing brightness, are applied to the above dataset. Thirdly, to address the issues of object scale variation and occlusion in fruit detection, an REAS-Det network is designed in this paper. For scale variation, RFAConv and C3RFEM modules are designed to expand and enhance the receptive fields. For occlusion, a multi-scale, multi-head feature selection structure (MultiSEAM) and soft-NMS are introduced to better handle occluded fruit and improve detection accuracy.
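The soft-NMS step mentioned above can be illustrated with a minimal pure-Python sketch. It assumes the common Gaussian variant, in which the score of each remaining box is decayed by exp(-IoU²/σ) instead of the box being discarded; the digest does not say which variant or which parameter values REAS-Det uses, so the function names, `sigma`, and `score_thresh` below are illustrative assumptions, not the authors' implementation.

```python
import math

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian soft-NMS (assumed variant): decay overlapping scores rather than drop boxes.

    Returns a list of (box_index, final_score) in the order boxes were selected.
    """
    remaining = list(enumerate(scores))
    kept = []
    while remaining:
        # Select the box with the highest current score.
        best_pos = max(range(len(remaining)), key=lambda k: remaining[k][1])
        best_idx, best_score = remaining.pop(best_pos)
        kept.append((best_idx, best_score))
        # Decay the scores of the other boxes according to their overlap with it.
        decayed = []
        for idx, score in remaining:
            overlap = iou(boxes[best_idx], boxes[idx])
            score *= math.exp(-(overlap ** 2) / sigma)
            if score >= score_thresh:
                decayed.append((idx, score))
        remaining = decayed
    return kept
```

With hard NMS, a partially occluded fruit whose box overlaps a neighbor heavily would be removed outright; here it survives with a reduced score, which is the usual argument for soft-NMS in occlusion-heavy scenes.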
The experiments achieved a precision (P) of 87.6%, a recall (R) of 74.9%, a mAP@.50 of 82.8%, and a mAP@.50:.95 of 53.3%. Our proposed network demonstrates superior performance compared to other state-of-the-art detection methods.
[144] J-RAS: Enhancing Medical Image Segmentation via Retrieval-Augmented Joint Training
Salma J. Ahmed,Emad A. Mohammed,Azam Asilian Bidgoli
Main category: cs.CV
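TL;DR: This paper proposes Joint Retrieval Augmented Segmentation (J-RAS), which jointly trains a segmentation model and a retrieval model and uses retrieved image-mask pairs to improve medical image segmentation, yielding consistent improvements across several backbone models and datasets.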
Details
Motivation: Manual segmentation is time-consuming, costly, and subject to inter-observer variability, while existing AI methods depend on large annotated datasets and generalize poorly across inter-patient variability and rare cases; a more efficient and robust automatic segmentation method is needed. Method: The J-RAS framework jointly optimizes a segmentation model and a retrieval model: the segmentation model uses retrieved similar image-mask pairs to enrich its anatomical understanding, while the retrieval model learns segmentation-relevant features rather than visual similarity alone, so the two reinforce each other during training. Result: J-RAS is validated on the ACDC and M&Ms datasets and improves several segmentation models, including U-Net, TransUNet, SAM, and SegFormer. On ACDC, for example, SegFormer with J-RAS raises the mean Dice score from 0.8708 to 0.9115 and lowers the mean Hausdorff distance from 1.8130 to 1.1489. Conclusion: By jointly optimizing segmentation and retrieval, J-RAS improves the accuracy and robustness of medical image segmentation, generalizes across models, and is especially suited to settings with scarce annotations. Abstract: Image segmentation, the process of dividing images into meaningful regions, is critical in medical applications for accurate diagnosis, treatment planning, and disease monitoring. Although manual segmentation by healthcare professionals produces precise outcomes, it is time-consuming, costly, and prone to variability due to differences in human expertise. Artificial intelligence (AI)-based methods have been developed to address these limitations by automating segmentation tasks; however, they often require large, annotated datasets that are rarely available in practice and frequently struggle to generalize across diverse imaging conditions due to inter-patient variability and rare pathological cases. In this paper, we propose Joint Retrieval Augmented Segmentation (J-RAS), a joint training method for guided image segmentation that integrates a segmentation model with a retrieval model. Both models are jointly optimized, enabling the segmentation model to leverage retrieved image-mask pairs to enrich its anatomical understanding, while the retrieval model learns segmentation-relevant features beyond simple visual similarity. This joint optimization ensures that retrieval actively contributes meaningful contextual cues to guide boundary delineation, thereby enhancing the overall segmentation performance. We validate J-RAS across multiple segmentation backbones, including U-Net, TransUNet, SAM, and SegFormer, on two benchmark datasets: ACDC and M&Ms, demonstrating consistent improvements.
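For readers unfamiliar with the two metrics reported for J-RAS, the following is a minimal sketch of the Dice score and the symmetric Hausdorff distance on toy masks. Masks are represented here as sets of (row, column) foreground pixels, which is an assumption made for brevity; the paper's exact evaluation protocol (for example boundary-only points, physical units, or a percentile variant of the distance) is not stated in this digest.

```python
import math

def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two sets of foreground pixel coordinates."""
    if not pred and not target:
        return 1.0  # two empty masks agree perfectly by convention
    return 2.0 * len(pred & target) / (len(pred) + len(target))

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))

# Toy example: two 2x2 squares offset by one column.
pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
target = {(0, 1), (1, 1), (0, 2), (1, 2)}
print(dice_score(pred, target))          # overlap-based agreement, higher is better
print(hausdorff_distance(pred, target))  # worst-case boundary error, lower is better
```

The two metrics are complementary: Dice rewards overall overlap, while the Hausdorff distance is driven by the single worst mismatch, so an improvement in both suggests better regions and fewer stray boundary errors.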
For example, on the ACDC dataset, SegFormer without J-RAS achieves a mean Dice score of 0.8708$\pm$0.042 and a mean Hausdorff Distance (HD) of 1.8130$\pm$2.49, whereas with J-RAS, the performance improves substantially to a mean Dice score of 0.9115$\pm$0.031 and a mean HD of 1.1489$\pm$0.30. These results highlight the method's effectiveness and its generalizability across architectures and datasets.
[145] Scaling Traffic Insights with AI and Language Model-Powered Camera Systems for Data-Driven Transportation Decision Making
Fan Zuo,Donglin Zhou,Jingqin Gao,Kaan Ozbay
Main category: cs.CV
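TL;DR: This study presents an end-to-end framework for large-scale, high-resolution traffic monitoring built on existing traffic cameras and AI. Using a fine-tuned YOLOv11 model, a graph-based viewpoint normalization method, and a domain-specific large language model, it provides a longitudinal traffic analysis of the early rollout of New York City's congestion pricing policy.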
Details
Motivation: Traditional traffic monitoring is limited by the high cost of sensor deployment, and existing video analytics struggle with dynamic camera viewpoints and massive data volumes; a low-cost, scalable solution is needed. Method: A fine-tuned YOLOv11 model extracts multimodal traffic density and classification metrics in real time; a graph-based viewpoint normalization method handles inconsistencies from non-stationary PTZ cameras; a domain-specific large language model automatically summarizes the massive data produced by 24/7 video streams. Result: Validation on over 9 million images from about 1,000 cameras shows a 9% decline in weekday passenger vehicle density within the Congestion Relief Zone, an initial drop in truck volume with signs of rebound, and consistent increases in pedestrian and cyclist activity at corridor and zonal scales; example-based prompts improved the LLM's numerical accuracy and reduced hallucinations. Conclusion: The framework shows that existing camera infrastructure can support large-scale, policy-relevant traffic monitoring with minimal human intervention, and it has practical value for deployment. Abstract: Accurate, scalable traffic monitoring is critical for real-time and long-term transportation management, particularly during disruptions such as natural disasters, large construction projects, or major policy changes like New York City's first-in-the-nation congestion pricing program. However, widespread sensor deployment remains limited due to high installation, maintenance, and data management costs. While traffic cameras offer a cost-effective alternative, existing video analytics struggle with dynamic camera viewpoints and massive data volumes from large camera networks. This study presents an end-to-end AI-based framework leveraging existing traffic camera infrastructure for high-resolution, longitudinal analysis at scale. A fine-tuned YOLOv11 model, trained on localized urban scenes, extracts multimodal traffic density and classification metrics in real time. To address inconsistencies from non-stationary pan-tilt-zoom cameras, we introduce a novel graph-based viewpoint normalization method. A domain-specific large language model was also integrated to process massive data from a 24/7 video stream to generate frequent, automated summaries of evolving traffic patterns, a task far exceeding manual capabilities. We validated the system using over 9 million images from roughly 1,000 traffic cameras during the early rollout of NYC congestion pricing in 2025.
Results show a 9% decline in weekday passenger vehicle density within the Congestion Relief Zone, early truck volume reductions with signs of rebound, and consistent increases in pedestrian and cyclist activity at corridor and zonal scales. Experiments showed that example-based prompts improved the LLM's numerical accuracy and reduced hallucinations. These findings demonstrate the framework's potential as a practical, infrastructure-ready solution for large-scale, policy-relevant traffic monitoring with minimal human intervention.
[146] FlareX: A Physics-Informed Dataset for Lens Flare Removal via 2D Synthesis and 3D Rendering
Lishen Qu,Zhihao Liu,Jinshan Pan,Shihao Zhou,Jinglei Shi,Duosheng Chen,Jufeng Yang
Main category: cs.CV
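TL;DR: Proposes a physics-based method for generating lens flare data, builds FlareX, a mixed dataset with both 2D and 3D perspectives, and designs a masking approach to evaluate flare removal in real-world scenes.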
Details
Motivation: Existing synthetic lens flare datasets lack diversity and ignore physical principles, so models trained on them generalize poorly to real scenes. Method: A three-stage physics-informed generation method (parameterized template creation, illumination-aware 2D synthesis rules, and physics-engine-based 3D rendering) is used to build the mixed FlareX dataset, and a masking strategy is designed to obtain real flare-free images. Result: FlareX contains 9,500 2D templates and 3,000 pairs of 3D-rendered images; experiments show that the generated data improves flare removal on real-world scenes. Conclusion: The physics-informed synthesis method and the FlareX dataset improve the generalization of flare removal models to real scenes and the reliability of their evaluation. Abstract: Lens flare occurs when shooting towards strong light sources, significantly degrading the visual quality of images. Due to the difficulty in capturing flare-corrupted and flare-free image pairs in the real world, existing datasets are typically synthesized in 2D by overlaying artificial flare templates onto background images. However, the lack of flare diversity in templates and the neglect of physical principles in the synthesis process hinder models trained on these datasets from generalizing well to real-world scenarios. To address these challenges, we propose a new physics-informed method for flare data generation, which consists of three stages: parameterized template creation, 2D synthesis that respects the laws of illumination, and physical engine-based 3D rendering. The result is a mixed flare dataset that incorporates both 2D and 3D perspectives, namely FlareX. This dataset offers 9,500 2D templates derived from 95 flare patterns and 3,000 flare image pairs rendered from 60 3D scenes. Furthermore, we design a masking approach to obtain real-world flare-free images from their corrupted counterparts to measure the performance of the model on real-world images. Extensive experiments demonstrate the effectiveness of our method and dataset.
[147] BurstDeflicker: A Benchmark Dataset for Flicker Removal in Dynamic Scenes
Lishen Qu,Zhihao Liu,Shihao Zhou,Yaqi Luo,Jie Liang,Hui Zeng,Lei Zhang,Jufeng Yang
Main category: cs.CV
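TL;DR: This paper presents BurstDeflicker, a scalable benchmark dataset for flicker removal research that combines synthesis, real-world capture, and a green-screen method to produce diverse flicker images and advance the field.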
Details
Motivation: Progress in flicker removal has been limited by the lack of a large-scale, realistic flicker dataset; this work aims to build a high-quality, diverse dataset to support such research. Method: Three complementary strategies are used: a Retinex-based synthesis pipeline that controls flicker attributes; capture of 4,000 real-world flicker images; and a green-screen technique that introduces motion from dynamic scenes while preserving real flicker. Result: Experiments show that the dataset helps models understand and generalize to real flicker and supports the development of flicker removal algorithms. Conclusion: BurstDeflicker is a practical, scalable flicker removal benchmark and a useful resource for future research. Abstract: Flicker artifacts in short-exposure images are caused by the interplay between the row-wise exposure mechanism of rolling shutter cameras and the temporal intensity variations of alternating current (AC)-powered lighting. These artifacts typically appear as uneven brightness distribution across the image, forming noticeable dark bands. Beyond compromising image quality, this structured noise also affects high-level tasks, such as object detection and tracking, where reliable lighting is crucial. Despite the prevalence of flicker, the lack of a large-scale, realistic dataset has been a significant barrier to advancing research in flicker removal. To address this issue, we present BurstDeflicker, a scalable benchmark constructed using three complementary data acquisition strategies. First, we develop a Retinex-based synthesis pipeline that redefines the goal of flicker removal and enables controllable manipulation of key flicker-related attributes (e.g., intensity, area, and frequency), thereby facilitating the generation of diverse flicker patterns. Second, we capture 4,000 real-world flicker images from different scenes, which help the model better understand the spatial and temporal characteristics of real flicker artifacts and generalize more effectively to wild scenarios. Finally, due to the non-repeatable nature of dynamic scenes, we propose a green-screen method to incorporate motion into image pairs while preserving real flicker degradation. Comprehensive experiments demonstrate the effectiveness of our dataset and its potential to advance research in flicker removal.
[148] MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output
Yanyuan Chen,Dexuan Xu,Yu Huang,Songkun Zhan,Hanpin Wang,Dongxue Chen,Xueping Wang,Meikang Qiu,Hang Li
Main category: cs.CV
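TL;DR: Proposes MIMO, a unified medical vision language model with visual referring multimodal input and pixel grounding multimodal output, and builds MIMOSeg, a large-scale dataset of 895K samples; experiments show that MIMO performs well at combining visual clues with textual instructions to understand complex medical images.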
Details
Motivation: Existing medical vision language models take only text instructions as input and produce outputs with no link to key image regions, which limits their ability to understand and explain complex medical images. Method: The MIMO model supports visual referring multimodal input and pixel grounding multimodal output; MIMOSeg, a large-scale medical multimodal dataset, covers basic instruction following and complex question answering. Result: Experiments on several downstream medical multimodal tasks show that MIMO combines visual referring and pixel grounding capabilities that previous models lack. Conclusion: By fusing visual and textual information, MIMO achieves precise understanding and grounding of medical images and terminology, advancing medical vision language models. Abstract: Currently, medical vision language models are widely used in medical visual question answering tasks. However, existing models are confronted with two issues: for input, the model only relies on text instructions and lacks direct understanding of visual clues in the image; for output, the model only gives text answers and lacks connection with key areas in the image. To address these issues, we propose a unified medical vision language model MIMO, with visual referring Multimodal Input and pixel grounding Multimodal Output. MIMO can not only combine visual clues and textual instructions to understand complex medical images and semantics, but can also ground medical terminologies in textual output within the image. To overcome the scarcity of relevant data in the medical field, we propose MIMOSeg, a comprehensive medical multimodal dataset including 895K samples. MIMOSeg is constructed from four different perspectives, covering basic instruction following and complex question answering with multimodal input and multimodal output. We conduct experiments on several downstream medical multimodal tasks. Extensive experimental results verify that MIMO can uniquely combine visual referring and pixel grounding capabilities, which are not available in previous models.
[149] Q-Adapter: Visual Query Adapter for Extracting Textually-related Features in Video Captioning
Junan Chen,Trung Thanh Nguyen,Takahiro Komamizu,Ichiro Ide
Main category: cs.CV
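TL;DR: This paper proposes Q-Adapter, a lightweight visual adapter module for efficient fine-tuning of multimodal large models on video captioning. By adding learnable query tokens and a gating layer to the vision encoder, it extracts sparse, caption-relevant features and achieves state-of-the-art performance among PEFT methods on two benchmark datasets while using only 1.4% of the parameters.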
Details
Motivation: Existing parameter-efficient fine-tuning (PEFT) methods focus mainly on the language components of multimodal large language models and neglect visual information, which limits performance on multimodal tasks such as video captioning; a lightweight adapter that strengthens visual fine-tuning is needed. Method: Q-Adapter adds learnable query tokens and a gating mechanism to the vision encoder to extract the sparse visual features most relevant to caption generation, enabling end-to-end parameter-efficient fine-tuning without external textual supervision. Result: On the MSR-VTT and MSVD video captioning datasets, Q-Adapter outperforms existing PEFT methods on BLEU@4, METEOR, ROUGE-L, and CIDEr and is competitive with full fine-tuning while requiring only 1.4% of the parameters. Conclusion: Q-Adapter improves the balance between parameter efficiency and performance for multimodal large models on video captioning, demonstrates scalability for video-language modeling, and offers a direction for future efficient multimodal fine-tuning. Abstract: Recent advances in video captioning are driven by large-scale pretrained models, which follow the standard "pre-training followed by fine-tuning" paradigm, where the full model is fine-tuned for downstream tasks. Although effective, this approach becomes computationally prohibitive as the model size increases. The Parameter-Efficient Fine-Tuning (PEFT) approach offers a promising alternative, but primarily focuses on the language components of Multimodal Large Language Models (MLLMs). Despite recent progress, PEFT remains underexplored in multimodal tasks and lacks sufficient understanding of visual information during fine-tuning. To bridge this gap, we propose Query-Adapter (Q-Adapter), a lightweight visual adapter module designed to enhance MLLMs by enabling efficient fine-tuning for the video captioning task. Q-Adapter introduces learnable query tokens and a gating layer into the vision encoder, enabling effective extraction of sparse, caption-relevant features without relying on external textual supervision. We evaluate Q-Adapter on two well-known video captioning datasets, MSR-VTT and MSVD, where it achieves state-of-the-art performance among the methods that take the PEFT approach across BLEU@4, METEOR, ROUGE-L, and CIDEr metrics. Q-Adapter also achieves competitive performance compared to methods that take the full fine-tuning approach while requiring only 1.4% of the parameters. We further analyze the impact of key hyperparameters and design choices on fine-tuning effectiveness, providing insights into optimization strategies for adapter-based learning.
These results highlight the strong potential of Q-Adapter in balancing caption quality and parameter efficiency, demonstrating its scalability for video-language modeling.
[150] P-4DGS: Predictive 4D Gaussian Splatting with 90$\times$ Compression
Henan Wang,Hanxin Zhu,Xinliang Gong,Tianyu He,Xin Li,Zhibo Chen
Main category: cs.CV
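TL;DR: This paper proposes P-4DGS, a new dynamic 3D Gaussian Splatting representation for compact 4D scene modeling. Borrowing intra- and inter-frame prediction from video compression, it exploits spatial-temporal correlations and combines adaptive quantization with context-based entropy coding to cut storage substantially, achieving up to 40× and 90× compression on synthetic and real-world scenes, respectively, while maintaining high-quality reconstruction and real-time rendering.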
Details
Motivation: Existing dynamic 3D Gaussian Splatting algorithms overlook the spatial and temporal redundancy in dynamic scenes, which leads to excessive memory consumption and hinders practical deployment; a more compact representation is needed to improve storage and computational efficiency. Method: A 3D anchor point-based spatial-temporal prediction module exploits the spatial-temporal correlations among Gaussian primitives; an adaptive quantization strategy and context-based entropy coding further compress the anchor data. Result: The rate-distortion performance of P-4DGS is validated on several synthetic and real-world datasets; average storage is only about 1MB, compression reaches up to 40× (synthetic) and 90× (real-world), and the method offers the fastest rendering speed with excellent reconstruction quality. Conclusion: By bringing ideas from video compression to dynamic 3DGS, P-4DGS reduces storage requirements substantially while keeping high-performance rendering, offering a workable approach to large-scale dynamic scene modeling. Abstract: 3D Gaussian Splatting (3DGS) has garnered significant attention due to its superior scene representation fidelity and real-time rendering performance, especially for dynamic 3D scene reconstruction (i.e., 4D reconstruction). However, despite achieving promising results, most existing algorithms overlook the substantial temporal and spatial redundancies inherent in dynamic scenes, leading to prohibitive memory consumption. To address this, we propose P-4DGS, a novel dynamic 3DGS representation for compact 4D scene modeling. Inspired by intra- and inter-frame prediction techniques commonly used in video compression, we first design a 3D anchor point-based spatial-temporal prediction module to fully exploit the spatial-temporal correlations across different 3D Gaussian primitives. Subsequently, we employ an adaptive quantization strategy combined with context-based entropy coding to further reduce the size of the 3D anchor points, thereby achieving enhanced compression efficiency. To evaluate the rate-distortion performance of our proposed P-4DGS in comparison with other dynamic 3DGS representations, we conduct extensive experiments on both synthetic and real-world datasets. Experimental results demonstrate that our approach achieves state-of-the-art reconstruction quality and the fastest rendering speed, with a remarkably low storage footprint (around 1MB on average), achieving up to 40$\times$ and 90$\times$ compression on synthetic and real-world scenes, respectively.
[151] Complementary and Contrastive Learning for Audio-Visual Segmentation
Sitong Gong,Yunzhi Zhuge,Lu Zhang,Pingping Zhang,Huchuan Lu
Main category: cs.CV
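TL;DR: This paper proposes CCFormer, a new framework for audio-visual segmentation that uses an early integration module, a multi-query Transformer module, and bi-modal contrastive learning to combine local and global information and capture spatial-temporal context comprehensively, achieving state-of-the-art performance on several benchmark datasets.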
Details
Motivation: Traditional CNN methods are limited by local receptive fields and cannot fully model cross-modal interactions; existing Transformer methods still fall short in extracting multimodal coefficients and temporal dynamics, so audio and visual information must be fused more effectively and temporal coherence captured. Method: The CCFormer framework comprises: 1) an Early Integration Module (EIM) with a parallel bilateral structure that merges multi-scale visual features with audio data; 2) a Multi-query Transformer Module (MTM) that gives audio queries dynamic learning capabilities and models frame-level and video-level relations simultaneously; 3) Bi-modal Contrastive Learning (BCL) that promotes modality alignment in a unified feature space. Result: The method sets a new state of the art on the S4, MS3, and AVSS audio-visual segmentation datasets, supporting its effectiveness and generalization. Conclusion: By strengthening cross-modal complementarity, modeling spatial-temporal context, and aligning modalities, CCFormer improves audio-visual segmentation performance and offers an effective solution for the field. Abstract: Audio-Visual Segmentation (AVS) aims to generate pixel-wise segmentation maps that correlate with the auditory signals of objects. This field has seen significant progress with numerous CNN and Transformer-based methods enhancing the segmentation accuracy and robustness. Traditional CNN approaches manage audio-visual interactions through basic operations like padding and multiplications but are restricted by CNNs' limited local receptive field. More recently, Transformer-based methods treat auditory cues as queries, utilizing attention mechanisms to enhance audio-visual cooperation within frames. Nevertheless, they typically struggle to extract multimodal coefficients and temporal dynamics adequately. To overcome these limitations, we present the Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information and capturing spatial-temporal context comprehensively. Our CCFormer initiates with the Early Integration Module (EIM) that employs a parallel bilateral architecture, merging multi-scale visual features with audio data to boost cross-modal complementarity. To extract the intra-frame spatial features and facilitate the perception of temporal coherence, we introduce the Multi-query Transformer Module (MTM), which dynamically endows audio queries with learning capabilities and models the frame and video-level relations simultaneously. Furthermore, we propose the Bi-modal Contrastive Learning (BCL) to promote the alignment across both modalities in the unified feature space.
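To make the idea of bi-modal contrastive alignment concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over paired audio and visual embeddings. The digest does not give the exact form of the BCL objective, so the use of InfoNCE, the cosine similarity, the `temperature` value, and all function names are assumptions chosen only to illustrate the general technique.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def info_nce(audio, visual, temperature=0.07):
    """Symmetric InfoNCE (assumed objective): pair i of audio/visual is the positive."""
    logits = [[cosine(a, v) / temperature for v in visual] for a in audio]
    transposed = [list(col) for col in zip(*logits)]
    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))
```

The loss is near zero when each audio embedding is closest to its own visual embedding and grows when pairs are mismatched, which is the behavior one wants from any objective meant to align two modalities in a shared feature space.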
Through the effective combination of those designs, our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets. Our source code and model weights will be made publicly available at https://github.com/SitongGong/CCFormer
[152] Think Twice to See More: Iterative Visual Reasoning in Medical VLMs
Kaitao Chen,Shaohao Rui,Yankai Jiang,Jiamin Wu,Qihao Zheng,Chunfeng Song,Xiaosong Wang,Mu Zhou,Mianxin Liu
Main category: cs.CV
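TL;DR: ViTAR is a new medical vision-language model framework that emulates the "think-act-rethink-answer" cognitive chain of human experts to perform multi-step interactive visual reasoning, improving the accuracy and trustworthiness of medical image diagnosis.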
Details
Motivation: Existing medical vision-language models rely on single-pass reasoning and neglect localized visual cues, unlike clinical experts, who focus and adjust step by step during diagnosis; a model that imitates this iterative human thinking process is needed. Method: The ViTAR framework treats medical images as interactive objects and supports multi-step visual reasoning; an instruction dataset of 1K interactive examples and a 16K fine-grained visual question answering dataset are curated; training has two stages, supervised fine-tuning to guide cognitive trajectories followed by reinforcement learning to optimize decision-making. Result: Experiments show that ViTAR outperforms current state-of-the-art models; attention analysis shows that it progressively focuses on clinically critical regions during reasoning and keeps high attention on visual tokens. Conclusion: Embedding expert-style iterative thinking chains into vision-language models improves the performance of medical AI and also its interpretability and trustworthiness. Abstract: Medical vision-language models (VLMs) excel at image-text understanding but typically rely on a single-pass reasoning that neglects localized visual cues. In clinical practice, however, human experts iteratively scan, focus, and refine the regions of interest before reaching a final diagnosis. To narrow this machine-human perception gap, we introduce ViTAR, a novel VLM framework that emulates the iterative reasoning process of human experts through a cognitive chain of "think-act-rethink-answer". ViTAR treats medical images as interactive objects, enabling models to engage in multi-step visual reasoning. To support this approach, we curate a high-quality instruction dataset comprising 1K interactive examples that encode expert-like diagnostic behaviors. In addition, we curate 16K visual question answering training samples for fine-grained visual diagnosis. We introduce a two-stage training strategy that begins with supervised fine-tuning to guide cognitive trajectories, followed by reinforcement learning to optimize decision-making. Extensive evaluations demonstrate that ViTAR outperforms strong state-of-the-art models. Visual attention analysis reveals that from the "think" to "rethink" rounds, ViTAR increasingly anchors visual grounding to clinically critical regions and maintains high attention allocation to visual tokens during reasoning, providing mechanistic insight into its improved performance. These findings demonstrate that embedding expert-style iterative thinking chains into VLMs enhances both performance and trustworthiness of medical AI.
[153] DREAM: A Benchmark Study for Deepfake REalism AssessMent
Bo Peng,Zichuan Wang,Sheng Yu,Xiaochuan Jin,Wei Wang,Jing Dong
Main category: cs.CV
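TL;DR: This paper presents DREAM, a comprehensive benchmark for assessing the visual realism of deepfake videos, which aims to approximate human subjective perception of deepfakes and to advance research in this area.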
Details
Motivation: Existing research focuses mainly on the objective detection of deepfakes, while the computational modeling and imitation of their subjective perception, especially visual realism, has not been adequately studied. Method: The authors build a dataset of deepfake videos of diverse quality, collect 140,000 realism scores and textual descriptions from 3,500 annotators, and comprehensively evaluate 16 representative realism assessment methods, including large vision language model based methods and a newly proposed description-aligned CLIP method. Result: The DREAM benchmark comprises the dataset, large-scale human annotation, and evaluation results for multiple methods, and reveals the gaps and challenges current methods face in approximating human perception. Conclusion: The study lays a foundation for assessing the visual realism of deepfake videos, helps evaluate the impact of deepfakes and improve the generation process, and advances related areas. Abstract: Deep learning based face-swap videos, widely known as deepfakes, have drawn wide attention due to their threat to information credibility. Recent works mainly focus on the problem of deepfake detection that aims to reliably tell deepfakes apart from real ones, in an objective way. On the other hand, the subjective perception of deepfakes, especially its computational modeling and imitation, is also a significant problem but lacks adequate study. In this paper, we focus on the visual realism assessment of deepfakes, which is defined as the automatic assessment of deepfake visual realism that approximates human perception of deepfakes. It is important for evaluating the quality and deceptiveness of deepfakes, which can be used for predicting the influence of deepfakes on the Internet, and it also has potential for improving the deepfake generation process by serving as a critic. This paper promotes this new direction by presenting a comprehensive benchmark called DREAM, which stands for Deepfake REalism AssessMent. It is comprised of a deepfake video dataset of diverse quality, a large scale annotation that includes 140,000 realism scores and textual descriptions obtained from 3,500 human annotators, and a comprehensive evaluation and analysis of 16 representative realism assessment methods, including recent large vision language model based methods and a newly proposed description-aligned CLIP method.
The benchmark and insights included in this study can lay the foundation for future research in this direction and other related areas.
[154] Collaborative Learning of Semantic-Aware Feature Learning and Label Recovery for Multi-Label Image Recognition with Incomplete Labels
Zhi-Fen He,Ren-Dong Xie,Bo Li,Bin Liu,Jin-Yan Hu
Main category: cs.CV
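TL;DR: Proposes CLSL, a collaborative learning framework for multi-label image recognition with incomplete labels that unifies semantic-aware feature learning and missing label recovery.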
Details
Motivation: To address the two core challenges of multi-label image recognition with incomplete labels: semantic-aware feature learning and missing label recovery. Method: A semantic-related feature learning module and a semantic-guided feature enhancement module are designed, and a collaborative learning framework lets feature learning and label recovery reinforce each other. Result: Experiments on MS-COCO, VOC2007, and NUS-WIDE show that CLSL outperforms existing state-of-the-art methods. Conclusion: Through its collaborative learning mechanism, CLSL improves multi-label image recognition under incomplete labels. Abstract: Multi-label image recognition with incomplete labels is a critical learning task and has emerged as a focal topic in computer vision. However, this task is confronted with two core challenges: semantic-aware feature learning and missing label recovery. In this paper, we propose a novel Collaborative Learning of Semantic-aware feature learning and Label recovery (CLSL) method for multi-label image recognition with incomplete labels, which addresses the two aforementioned challenges in a unified learning framework. More specifically, we design a semantic-related feature learning module to learn robust semantic-related features by discovering semantic information and label correlations. Then, a semantic-guided feature enhancement module is proposed to generate high-quality discriminative semantic-aware features by effectively aligning visual and semantic feature spaces. Finally, we introduce a collaborative learning framework that integrates semantic-aware feature learning and label recovery, which can not only dynamically enhance the discriminability of semantic-aware features but also adaptively infer and recover missing labels, forming a mutually reinforced loop between the two processes. Extensive experiments on three widely used public datasets (MS-COCO, VOC2007, and NUS-WIDE) demonstrate that CLSL outperforms the state-of-the-art multi-label image recognition methods with incomplete labels.
[155] Probabilistic Hyper-Graphs using Multiple Randomly Masked Autoencoders for Semi-supervised Multi-modal Multi-task Learning
Pîrvu Mihai-Cristian,Leordeanu Marius
Main category: cs.CV
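TL;DR: This paper proposes PHG-MAE, a new model that combines neural graphs with masked autoencoders. It uses modality-level random masking and a unified training procedure to improve multi-modal vision tasks, supports inference-time ensembles and knowledge distillation, and targets multi-modal MTL settings such as UAV scenes.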
Details
Motivation: To address the lack of a framework that effectively integrates neural graphs with masked autoencoders for multi-modal vision tasks, while reducing reliance on labeled data. Method: The Probabilistic Hyper-Graph Masked Autoencoder (PHG-MAE) randomly masks entire modalities on each forward pass to sample from the distribution of hyper-edges, merges pre-training and fine-tuning into a single training loop, and supports inference-time ensembles and knowledge distillation. Result: The model performs well on UAV scenes and improves prediction performance and consistency; small models (<1M parameters) lose little performance after knowledge distillation; an automatically generated extension of the Dronescapes dataset and the full code are released. Conclusion: PHG-MAE provides an efficient, scalable, unified framework for multi-modal multi-task learning, with potential applications in areas such as autonomous driving and indoor robotics. Abstract: The computer vision domain has greatly benefited from an abundance of data across many modalities to improve on various visual tasks. Recently, there has been a lot of focus on self-supervised pre-training methods through Masked Autoencoders (MAE) \cite{he2022masked,bachmann2022multimae}, usually used as a first step before optimizing for a downstream task, such as classification or regression. This is very useful as it doesn't require any manually labeled data. In this work, we introduce Probabilistic Hyper-Graphs using Masked Autoencoders (PHG-MAE): a novel model that unifies the classical work on neural graphs \cite{leordeanu2021semi} with the modern approach of masked autoencoders under a common theoretical framework. Through random masking of entire modalities, not just patches, the model samples from the distribution of hyper-edges on each forward pass. Additionally, the model adapts the standard MAE algorithm by combining pre-training and fine-tuning into a single training loop. Moreover, our approach enables the creation of inference-time ensembles which, through aggregation, boost the final prediction performance and consistency. Lastly, we show that we can apply knowledge distillation on top of the ensembles with little loss in performance, even with models that have fewer than 1M parameters. While our work mostly focuses on outdoor UAV scenes that contain multiple world interpretations and modalities, the same steps can be followed in other similar domains, such as autonomous driving or indoor robotics.
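The combination of modality-level masking and inference-time ensembling described above can be sketched in a few lines of Python. Everything here is a toy illustration under stated assumptions: `toy_model` simply averages the visible inputs and stands in for a trained network, and `keep_prob`, `n_members`, and the function names are hypothetical rather than taken from the PHG-MAE code release.

```python
import random

def mask_modalities(sample, keep_prob=0.5, rng=random):
    """Randomly hide entire modalities (not patches), always keeping at least one."""
    names = list(sample)
    kept = [name for name in names if rng.random() < keep_prob]
    if not kept:
        kept = [rng.choice(names)]
    # Masked modalities are represented by None.
    return {name: (sample[name] if name in kept else None) for name in names}

def toy_model(masked_sample):
    """Stand-in predictor (assumption): mean of whichever modalities are visible."""
    visible = [value for value in masked_sample.values() if value is not None]
    return sum(visible) / len(visible)

def ensemble_predict(model, sample, n_members=8, seed=0):
    """Average the model's predictions over several random modality masks."""
    rng = random.Random(seed)
    predictions = [model(mask_modalities(sample, rng=rng)) for _ in range(n_members)]
    return sum(predictions) / len(predictions)

# Each scalar stands in for one modality's input (e.g., an RGB frame or a depth map).
sample = {"rgb": 1.0, "depth": 3.0, "edges": 2.0}
print(ensemble_predict(toy_model, sample))
```

Each random mask corresponds to one sampled subset of input modalities, so averaging over masks aggregates several views of the same scene; a distilled student would then be trained to reproduce this aggregated output in a single pass.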
To streamline the integration of external pre-trained experts in computer vision multi-modal multi-task learning (MTL) scenarios, we developed data-pipeline software. Using this tool, we have created and released a fully automated extension of the Dronescapes dataset. All the technical details, code and reproduction steps are publicly released.
[156] Tracking the Spatiotemporal Evolution of Landslide Scars Using a Vision Foundation Model: A Novel and Universal Framework
Meijun Zhou,Gang Mei,Zhengjing Ma,Nengxiong Xu,Jianbing Peng
Main category: cs.CV
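TL;DR: Proposes a universal framework based on a vision foundation model that reconstructs discrete optical remote sensing images into a continuous video sequence, enabling continuous tracking of the spatiotemporal evolution of large landslide scars.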
Details
Motivation: Existing studies mostly focus on single-phase or pre- and post-failure dual-phase landslide identification and struggle to continuously track the spatiotemporal evolution of landslide scars, which limits the understanding of landslide evolution mechanisms and early-warning precursors.
Method: Builds a knowledge-guided, auto-propagation, and interactive refinement framework that uses a vision foundation model to segment the reconstructed continuous video sequence, thereby identifying and tracking landslide scars continuously.
Result: The framework was validated on two cases, the Baige landslide and the Sela landslide (2017-2025); it tracks landslide scars continuously and captures both failure precursors and post-failure evolution.
Conclusion: The proposed framework effectively achieves continuous spatiotemporal tracking of large landslide scars, helping to improve early-warning capability and the assessment of secondary hazards and long-term stability.
Abstract: Tracking the spatiotemporal evolution of large-scale landslide scars is critical for understanding the evolution mechanisms and failure precursors, enabling effective early warning. However, most existing studies have focused on single-phase or pre- and post-failure dual-phase landslide identification. Although these approaches delineate post-failure landslide boundaries, it is challenging to track the spatiotemporal evolution of landslide scars. To address this problem, this study proposes a novel and universal framework for tracking the spatiotemporal evolution of large-scale landslide scars using a vision foundation model. The key idea behind the proposed framework is to reconstruct discrete optical remote sensing images into a continuous video sequence. This transformation enables a vision foundation model, which is developed for video segmentation, to be used for tracking the evolution of landslide scars. The proposed framework operates within a knowledge-guided, auto-propagation, and interactive refinement paradigm to ensure the continuous and accurate identification of landslide scars. The proposed framework was validated through application to two representative cases: the post-failure Baige landslide and the active Sela landslide (2017-2025). Results indicate that the proposed framework enables continuous tracking of landslide scars, capturing both failure precursors critical for early warning and post-failure evolution essential for assessing secondary hazards and long-term stability.
[157] Gesplat: Robust Pose-Free 3D Reconstruction via Geometry-Guided Gaussian Splatting
Jiahui Lu,Haihong Xiao,Xueyan Zhao,Wenxiong Kang
Main category: cs.CV
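TL;DR: This paper proposes Gesplat, a 3D Gaussian Splatting-based framework that achieves robust novel view synthesis and geometrically consistent 3D reconstruction from unposed, sparse views.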
Details
Motivation: NeRF and 3DGS depend on accurate camera poses and dense viewpoints for 3D reconstruction, which limits their use in sparse-view settings. A method that does not rely on accurate pose estimation is therefore needed.
Method: Uses the VGGT foundation model to obtain more reliable initial poses and dense point clouds; adopts a hybrid Gaussian representation with dual position-shape optimization, a graph-guided attribute refinement module, and flow-based depth regularization.
Result: On forward-facing and large-scale complex datasets, it shows better qualitative and quantitative results than other pose-free methods.
Conclusion: Gesplat achieves robust novel view synthesis and geometrically consistent reconstruction from sparse, unposed images, markedly improving performance under low supervision.
Abstract: Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have advanced 3D reconstruction and novel view synthesis, but remain heavily dependent on accurate camera poses and dense viewpoint coverage. These requirements limit their applicability in sparse-view settings, where pose estimation becomes unreliable and supervision is insufficient. To overcome these challenges, we introduce Gesplat, a 3DGS-based framework that enables robust novel view synthesis and geometrically consistent reconstruction from unposed sparse images. Unlike prior works that rely on COLMAP for sparse point cloud initialization, we leverage the VGGT foundation model to obtain more reliable initial poses and dense point clouds. Our approach integrates several key innovations: 1) a hybrid Gaussian representation with dual position-shape optimization enhanced by inter-view matching consistency; 2) a graph-guided attribute refinement module to enhance scene details; and 3) flow-based depth regularization that improves depth estimation accuracy for more effective supervision. Comprehensive quantitative and qualitative experiments demonstrate that our approach achieves more robust performance on both forward-facing and large-scale complex datasets compared to other pose-free methods.
[158] Cooperative Pseudo Labeling for Unsupervised Federated Classification
Kuangpu Guo,Lijun Sheng,Yongcan Yu,Jian Liang,Zilei Wang,Ran He
Main category: cs.CV
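TL;DR: This paper extends unsupervised federated learning (UFL) to CLIP-based classification for the first time and proposes a new method, FedCoPL, which adjusts pseudo label distributions on the server to alleviate class imbalance and designs a partial prompt aggregation protocol for effective collaboration and personalization.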
Details
Motivation: Existing UFL research mainly focuses on representation learning and clustering. The strong zero-shot capability of vision language models such as CLIP makes previously infeasible classification tasks possible, yet they remain largely unexplored. This paper aims to fill the gap for UFL on classification tasks.
Method: Proposes FedCoPL: clients upload their pseudo label distributions, and the server adjusts and redistributes them to balance classes globally; a partial prompt aggregation strategy aggregates visual prompts at the server to share general features while keeping text prompts local to retain personalized knowledge.
Result: Extensive experiments show that FedCoPL significantly outperforms baseline methods on multiple benchmarks, with excellent classification performance and model collaboration.
Conclusion: The paper successfully applies UFL to CLIP-based classification; through pseudo label optimization and partial prompt aggregation, FedCoPL effectively improves unsupervised federated classification and offers a new direction for future research.
Abstract: Unsupervised Federated Learning (UFL) aims to collaboratively train a global model across distributed clients without sharing data or accessing label information. Previous UFL works have predominantly focused on representation learning and clustering tasks. Recently, vision language models (e.g., CLIP) have gained significant attention for their powerful zero-shot prediction capabilities. Leveraging this advancement, classification problems that were previously infeasible under the UFL paradigm now present promising new opportunities, yet remain largely unexplored. In this paper, we extend UFL to the classification problem with CLIP for the first time and propose a novel method, Federated Cooperative Pseudo Labeling (FedCoPL). Specifically, clients estimate and upload their pseudo label distribution, and the server adjusts and redistributes them to avoid global imbalance among classes. Moreover, we introduce a partial prompt aggregation protocol for effective collaboration and personalization. In particular, visual prompts containing general image features are aggregated at the server, while text prompts encoding personalized knowledge are retained locally. Extensive experiments demonstrate the superior performance of our FedCoPL compared to baseline methods. Our code is available at https://github.com/krumpguo/FedCoPL.
[159] Answer-Consistent Chain-of-thought Reinforcement Learning For Multi-modal Large Language Models
Minbin Huang,Runhui Huang,Chuanyang Zheng,Jingyao Li,Guoxuan Chen,Han Shi,Hong Cheng
Main category: cs.CV
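TL;DR: Proposes ACRE, a new reinforcement learning method that uses a consistency-verification reward to improve both the consistency between reasoning chains and answers and the accuracy of multimodal large models on video and math reasoning tasks.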
Details
Motivation: Existing reinforcement learning methods such as GRPO can improve answer accuracy but may leave the reasoning process inconsistent with the final answer, undermining model reliability.
Method: Adds an auxiliary consistency check to GRPO: after generating a reasoning chain and an initial answer, the answer options are shuffled and the model predicts the answer again with the same reasoning chain; a consistency-verification reward grants a high reward only when the two answers agree and are correct.
Result: On multiple video reasoning and multimodal math reasoning benchmarks, ACRE improves over the GRPO baseline by an average of 2.2% (video) and 1.5% (math).
Conclusion: ACRE effectively mitigates the disconnect between reasoning chains and answers, making the model more stable and trustworthy.
Abstract: Recent advances in large language models (LLMs) have demonstrated that reinforcement learning with verifiable rewards (RLVR) can significantly enhance reasoning abilities by directly optimizing correctness, rather than relying solely on supervised imitation. This paradigm has been extended to multimodal LLMs for complex video and image understanding tasks. However, while outcome-driven RL improves answer accuracy, it can inadvertently decouple the reasoning chain from the final answer, leading to situations where the reasoning trace and the final answer are inconsistent. In our experiments on multiple-choice visual question-answering tasks, the standard GRPO method yields only 79.7% consistency on MMVU between the reasoning steps and the chosen answers, indicating frequent mismatches between answers and reasoning. To this end, we propose Answer-Consistent Reinforcement Learning (ACRE) that modifies the GRPO algorithm with an auxiliary consistency check. After the model generates a chain of thought and an initial answer for a given question, we shuffle the answer options and prompt the model again with the same reasoning trace to predict a second answer. We design a consistency-verification reward that grants a high reward only if both the original and the post-shuffle answers agree and are correct; otherwise, a lower reward is assigned accordingly. This mechanism penalizes reasoning-answer misalignment and discourages the model from relying on spurious patterns, such as option ordering biases.
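The shuffle-and-recheck reward described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the index bookkeeping, and the concrete reward values (`high`, `partial`, `low`) are assumptions, since the abstract only states that a high reward requires both answers to agree and be correct and that a lower reward is assigned otherwise.

```python
import random


def shuffle_options(options, rng):
    """Shuffle answer options; also return a map from old index to new index."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    old_to_new = {old: new for new, old in enumerate(order)}
    return shuffled, old_to_new


def consistency_reward(first_idx, second_idx, correct_idx, old_to_new,
                       high=1.0, partial=0.5, low=0.0):
    """Reward for one sample (reward values are illustrative assumptions).

    first_idx   : option chosen before shuffling (original ordering)
    second_idx  : option chosen after shuffling (shuffled ordering)
    correct_idx : ground-truth option (original ordering)
    """
    first_correct = first_idx == correct_idx
    second_correct = second_idx == old_to_new[correct_idx]
    if first_correct and second_correct:   # consistent and correct
        return high
    if first_correct or second_correct:    # correct only once: inconsistent
        return partial
    return low                             # wrong both times


if __name__ == "__main__":
    rng = random.Random(0)
    options = ["cat", "dog", "bird", "fish"]
    shuffled, old_to_new = shuffle_options(options, rng)
    print(shuffled, old_to_new)
    print(consistency_reward(2, old_to_new[2], 2, old_to_new))
```

Because the second prediction is made against a reordered option list, a model that merely memorizes option positions cannot collect the high reward reliably.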
We evaluate ACRE on challenging video reasoning benchmarks and multimodal math reasoning benchmarks, achieving average improvements of 2.2% and 1.5% on video reasoning and math reasoning tasks, respectively, over the GRPO baseline.
[160] Uncertainty-Aware Post-Detection Framework for Enhanced Fire and Smoke Detection in Compact Deep Learning Models
Aniruddha Srinivas Joshi,Godwyn James William,Shreyas Srinivas Joshi
Main category: cs.CV
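TL;DR: Proposes an uncertainty-aware post-detection framework that recalibrates detection confidence by combining statistical uncertainty with visual cues (such as color, edges, and texture), improving the accuracy and robustness of compact models for fire and smoke detection.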
Details
Motivation: Existing vision-based fire and smoke detection methods struggle to balance efficiency and reliability; compact models are prone to false positives and missed detections, and conventional NMS-style methods rely only on spatial overlap and cannot handle detections in complex scenes effectively.
Method: Designs a lightweight Confidence Refinement Network that, without modifying the base model, fuses the model's uncertainty estimates with domain-relevant color, edge, and texture features to rescore the detection confidences of YOLO-family models.
Result: Experiments on the D-Fire dataset show that the method significantly improves precision, recall, and mean average precision (mAP) over conventional NMS and baseline models, with small computational overhead.
Conclusion: The proposed uncertainty-aware post-detection framework effectively strengthens the robustness of compact deep learning models for real-world fire and smoke detection, and suits resource-constrained devices such as UAVs and surveillance systems.
Abstract: Accurate fire and smoke detection is critical for safety and disaster response, yet existing vision-based methods face challenges in balancing efficiency and reliability. Compact deep learning models such as YOLOv5n and YOLOv8n are widely adopted for deployment on UAVs, CCTV systems, and IoT devices, but their reduced capacity often results in false positives and missed detections. Conventional post-detection methods such as Non-Maximum Suppression and Soft-NMS rely only on spatial overlap, which can suppress true positives or retain false alarms in cluttered or ambiguous fire scenes. To address these limitations, we propose an uncertainty-aware post-detection framework that rescales detection confidences using both statistical uncertainty and domain-relevant visual cues. A lightweight Confidence Refinement Network integrates uncertainty estimates with color, edge, and texture features to adjust detection scores without modifying the base model. Experiments on the D-Fire dataset demonstrate improved precision, recall, and mean average precision compared to existing baselines, with only modest computational overhead. These results highlight the effectiveness of post-detection rescoring in enhancing the robustness of compact deep learning models for real-world fire and smoke detection.
[161] Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization
Rui Chen,Bin Liu,Changtao Miao,Xinghao Wang,Yi Li,Tao Gong,Qi Chu,Nenghai Yu
Main category: cs.CV
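TL;DR: Proposes ICFC, a training-free image manipulation localization framework that uses multimodal large language models for interpretable pixel-level localization and outperforms existing methods on multiple benchmarks.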
Details
Motivation: Existing weakly supervised or training-free methods for image manipulation localization underperform and lack interpretability; a more efficient, annotation-free method is needed.
Method: Builds a reliable knowledge base through objectified rule construction and adaptive filtering, and designs a multi-step progressive reasoning pipeline in which a multimodal large language model reasons from coarse to fine.
Result: ICFC surpasses state-of-the-art training-free methods on multiple benchmarks and achieves competitive or superior performance compared with weakly and fully supervised methods.
Conclusion: ICFC offers an efficient, interpretable, training-free paradigm for image manipulation localization and demonstrates the potential of multimodal large models in digital forensics.
Abstract: Advances in image tampering pose serious security threats, underscoring the need for effective image manipulation localization (IML). While supervised IML achieves strong performance, it depends on costly pixel-level annotations. Existing weakly supervised or training-free alternatives often underperform and lack interpretability. We propose the In-Context Forensic Chain (ICFC), a training-free framework that leverages multi-modal large language models (MLLMs) for interpretable IML tasks. ICFC integrates an objectified rule construction with adaptive filtering to build a reliable knowledge base and a multi-step progressive reasoning pipeline that mirrors expert forensic workflows from coarse proposals to fine-grained forensics results. This design enables systematic exploitation of MLLM reasoning for image-level classification, pixel-level localization, and text-level interpretability. Across multiple benchmarks, ICFC not only surpasses state-of-the-art training-free methods but also achieves competitive or superior performance compared to weakly and fully supervised approaches.
[162] ImmerIris: A Large-Scale Dataset and Benchmark for Immersive Iris Recognition in Open Scenes
Yuxi Mi,Qiuyang Yuan,Zhizhou Zhong,Xuan Zhao,Jiaogen Zhou,Fubao Zhu,Jihong Guan,Shuigeng Zhou
Main category: cs.CV
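TL;DR: This paper presents ImmerIris, currently the largest public dataset for immersive off-axis iris recognition, containing 499,791 ocular images from 564 subjects, and proposes a new normalization-free recognition paradigm that significantly improves recognition performance in complex scenes.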
Details
Motivation: Existing iris recognition datasets mainly target on-axis acquisition, whereas off-axis iris recognition in immersive applications faces new challenges such as perspective distortion, quality degradation, and intra-class variation, and lacks corresponding data and effective methods.
Method: Collects large-scale off-axis iris images with VR headsets to build the ImmerIris dataset and designs evaluation protocols; proposes a normalization-free method that learns directly from ocular images, reducing dependence on the traditional normalization step.
Result: Existing normalization-based methods perform poorly in off-axis settings, while the proposed normalization-free method significantly outperforms traditional methods on ImmerIris and is more robust.
Conclusion: The normalization-free learning paradigm provides a more effective solution for iris recognition in immersive environments and marks an important step toward practical off-axis iris recognition.
Abstract: In egocentric applications such as augmented and virtual reality, immersive iris recognition is emerging as an accurate and seamless way to identify persons. While classic systems acquire iris images on-axis, i.e., via dedicated frontal sensors in controlled settings, the immersive setup primarily captures off-axis irises through tilt-placed headset cameras, with only mild control in open scenes. This yields unique challenges, including perspective distortion, intensified quality degradations, and intra-class variations in iris texture. Datasets capturing these challenges remain scarce. To fill this gap, this paper introduces ImmerIris, a large-scale dataset collected via VR headsets, containing 499,791 ocular images from 564 subjects. It is, to the best of current knowledge, the largest public dataset and among the first dedicated to off-axis acquisition. Based on ImmerIris, evaluation protocols are constructed to benchmark recognition methods under different challenging factors. Current methods, primarily designed for classic on-axis imagery, perform unsatisfactorily on the immersive setup, mainly due to reliance on fallible normalization. To this end, this paper further proposes a normalization-free paradigm that directly learns from ocular images with minimal adjustment. Despite its simplicity, this approach consistently outperforms normalization-based counterparts, pointing to a promising direction for robust immersive recognition.
[163] Multi Class Parkinsons Disease Detection Based on Finger Tapping Using Attention-Enhanced CNN BiLSTM
Abu Saleh Musa Miah,Najmul Hassan,Md Maruf Al Hossain,Yuichi Okuyama,Jungpil Shin
Main category: cs.CV
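TL;DR: Proposes a multi-class Parkinson's disease (PD) severity detection system based on finger-tapping videos that combines CNN, BiLSTM, and attention mechanisms, effectively improving the accuracy of automatic PD severity recognition.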
Details
Motivation: Existing gesture-based Parkinson's disease recognition systems are not accurate enough; more precise non-invasive methods are needed to support clinical assessment and intervention.
Method: Collects finger-tapping videos and extracts temporal, frequency, and amplitude features from wrist and hand movements; builds a hybrid deep learning model that fuses CNN, BiLSTM, and attention, using Conv1D-MaxPooling to capture spatial dependencies, BiLSTM to model temporal dynamics, and attention to focus on key features.
Result: The model performs well in distinguishing five PD severity classes, validating the combination of spatio-temporal features with attention.
Conclusion: The method effectively improves automated PD severity detection and has potential as a tool for clinical PD monitoring and progression tracking.
Abstract: Effective clinical management and intervention development depend on accurate evaluation of Parkinson's disease (PD) severity. Many researchers have worked on developing gesture-based PD recognition systems; however, their accuracy is not satisfactory. In this study, we propose a multi-class Parkinson's disease detection system based on finger tapping using an attention-enhanced CNN BiLSTM. We collected finger tapping videos and derived temporal, frequency, and amplitude based features from wrist and hand movements. Then, we proposed a hybrid deep learning framework integrating CNN, BiLSTM, and attention mechanisms for multi-class PD severity classification from video-derived motion features. First, the input sequence is reshaped and passed through a Conv1D MaxPooling block to capture local spatial dependencies. The resulting feature maps are fed into a BiLSTM layer to model temporal dynamics. An attention mechanism focuses on the most informative temporal features, producing a context vector that is further processed by a second BiLSTM layer. CNN-derived features and attention-enhanced BiLSTM outputs are concatenated, followed by dense and dropout layers, before the final softmax classifier outputs the predicted PD severity level. The model demonstrated strong performance in distinguishing between the five severity classes, suggesting that integrating spatial temporal representations with attention mechanisms can improve automated PD severity detection, making it a promising non-invasive tool to support clinicians in PD monitoring and progression tracking.
[164] DeepFusionNet: Autoencoder-Based Low-Light Image Enhancement and Super-Resolution
Halil Hüseyin Çalışkan,Talha Koruk
Main category: cs.CV
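TL;DR: This paper proposes DeepFusionNet, a lightweight architecture for low-light image enhancement and super-resolution that achieves high SSIM and PSNR while keeping the parameter count low.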
Details
Motivation: Existing low-light image enhancement methods typically have many parameters, high computational cost, and limited reconstruction quality (low SSIM and PSNR), while high-resolution image restoration mostly relies on computationally complex GAN models; an efficient, well-performing unified architecture is therefore needed.
Method: Uses the autoencoder-based DeepFusionNet architecture for low-light image enhancement and for super-resolution; the model has few parameters, does not rely on GANs, and improves image quality through end-to-end training.
Result: On the LOL-v1 dataset, the low-light enhancement task reaches an SSIM of 92.8% and a PSNR of 26.30; for super-resolution, the model has only about 100 thousand parameters and achieves a PSNR of 25.30 and an SSIM of 80.7%.
Conclusion: DeepFusionNet performs well on both low-light enhancement and super-resolution, substantially reducing model complexity while maintaining good image restoration performance, and suits resource-constrained scenarios such as real-time image transmission.
Abstract: Computer vision and image processing applications suffer from dark and low-light images, particularly during real-time image transmission. Currently, low-light and dark images are converted to bright and colored forms using autoencoders; however, these methods often achieve low SSIM and PSNR scores and require high computational power due to their large number of parameters. To address these challenges, the DeepFusionNet architecture has been developed. According to the results obtained with the LOL-v1 dataset, DeepFusionNet achieved an SSIM of 92.8% and a PSNR score of 26.30, while containing only approximately 2.5 million parameters. On the other hand, conversion of blurry and low-resolution images into high-resolution and blur-free images has gained importance in image processing applications. Unlike GAN-based super-resolution methods, an autoencoder-based super-resolution model has been developed that contains approximately 100 thousand parameters and uses the DeepFusionNet architecture. According to the test results, the DeepFusionNet-based super-resolution method achieved a PSNR of 25.30 and an SSIM score of 80.7% on the validation set.
[165] YOLOv11-Litchi: Efficient Litchi Fruit Detection based on UAV-Captured Agricultural Imagery in Complex Orchard Environments
Hongxing Peng,Haopei Xie,Weijia Lia,Huanai Liuc,Ximing Li
Main category: cs.CV
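TL;DR: This paper proposes YOLOv11-Litchi, a lightweight and robust detection model for litchi fruit detection in UAV imagery. Through a multi-scale residual module, a lightweight feature fusion method, and an occlusion-oriented detection head, the model addresses small targets, large model size, and frequent occlusion, improving detection accuracy and real-time performance while reducing parameters by 32.5%.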
Details
Motivation: Traditional manual fruit selection cannot meet the demands of modern litchi production, so an efficient, low-cost automated detection solution is needed. Combining UAV aerial imagery with deep learning can improve orchard management efficiency but faces challenges such as difficult small-target detection, high model deployment cost, and severe fruit occlusion.
Method: Builds YOLOv11-Litchi on the YOLOv11 framework: a multi-scale residual module strengthens cross-scale contextual feature extraction; a lightweight feature fusion method reduces model parameters and computational cost; a dedicated litchi occlusion detection head mitigates occlusion by emphasizing target regions and suppressing background interference.
Result: YOLOv11-Litchi has a parameter size of 6.35 MB, 32.5% smaller than the YOLOv11 baseline; mAP improves by 2.5% to 90.1% and F1-Score by 1.4% to 85.5%, and the model runs at 57.2 FPS, meeting real-time detection requirements.
Conclusion: YOLOv11-Litchi shows excellent detection performance and deployment efficiency in complex orchard environments, is suitable for UAV-based litchi detection, and has potential for wide use in precision agriculture.
Abstract: Litchi is a high-value fruit, yet traditional manual selection methods are increasingly inadequate for modern production demands. Integrating UAV-based aerial imagery with deep learning offers a promising solution to enhance efficiency and reduce costs. This paper introduces YOLOv11-Litchi, a lightweight and robust detection model specifically designed for UAV-based litchi detection. Built upon the YOLOv11 framework, the proposed model addresses key challenges such as small target size, large model parameters hindering deployment, and frequent target occlusion. To tackle these issues, three major innovations are incorporated: a multi-scale residual module to improve contextual feature extraction across scales, a lightweight feature fusion method to reduce model size and computational costs while maintaining high accuracy, and a litchi occlusion detection head to mitigate occlusion effects by emphasizing target regions and suppressing background interference. Experimental results validate the model's effectiveness. YOLOv11-Litchi achieves a parameter size of 6.35 MB, 32.5% smaller than the YOLOv11 baseline, while improving mAP by 2.5% to 90.1% and F1-Score by 1.4% to 85.5%. Additionally, the model achieves a frame rate of 57.2 FPS, meeting real-time detection requirements.
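The reported figures can be unpacked with a little arithmetic. The sketch below is only a reading aid under stated assumptions: the helper names are hypothetical, "32.5% smaller" is taken as a relative reduction of the baseline size, and the implied baseline value is derived here rather than reported in the abstract.

```python
def implied_baseline_size(size_mb, reduction_pct):
    """Baseline size implied by a relative reduction (assumed interpretation)."""
    return size_mb / (1.0 - reduction_pct / 100.0)


def frame_latency_ms(fps):
    """Average time per frame, in milliseconds, for a given frame rate."""
    return 1000.0 / fps


if __name__ == "__main__":
    # 6.35 MB being 32.5% smaller implies a baseline of roughly 9.4 MB.
    print(round(implied_baseline_size(6.35, 32.5), 2))
    # 57.2 FPS corresponds to about 17.5 ms per frame.
    print(round(frame_latency_ms(57.2), 2))
```

The mAP and F1-Score gains are left out of the sketch because the abstract does not say whether "2.5%" and "1.4%" are absolute percentage points or relative improvements.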
These findings demonstrate the suitability of YOLOv11-Litchi for UAV-based litchi detection in complex orchard environments, showcasing its potential for broader applications in precision agriculture.
[166] Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer
Yecong Wan,Mingwen Shao,Renlong Wu,Wangmeng Zuo
Main category: cs.CV
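TL;DR: Proposes Color3D, a colorization framework adaptable to both static and dynamic 3D scenes that achieves cross-view and cross-time consistency through single-view colorization and a personalized colorizer, while preserving chromatic richness and user controllability.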
Details
Motivation: Existing methods are limited to static scenes and sacrifice color diversity and controllability by averaging colors; they lack support for dynamic scenes and flexible user-guided control.
Method: Colorizes only a single key view and fine-tunes a personalized colorizer whose inductive bias propagates color to novel views and time steps; uses a Lab color space Gaussian splatting representation for 3D reconstruction.
Result: Experiments on multiple static and dynamic 3D colorization benchmarks show that the method produces more consistent and chromatically richer renderings and supports precise user control.
Conclusion: Color3D successfully recasts the complicated 3D colorization problem as a more tractable single-image paradigm, achieving high-fidelity, diverse, and controllable 3D scene colorization.
Abstract: In this work, we present Color3D, a highly adaptable framework for colorizing both static and dynamic 3D scenes from monochromatic inputs, delivering visually diverse and chromatically vibrant reconstructions with flexible user-guided control. In contrast to existing methods that focus solely on static scenarios and enforce multi-view consistency by averaging color variations, which inevitably sacrifices both chromatic richness and controllability, our approach is able to preserve color diversity and steerability while ensuring cross-view and cross-time consistency. In particular, the core insight of our method is to colorize only a single key view and then fine-tune a personalized colorizer to propagate its color to novel views and time steps. Through personalization, the colorizer learns a scene-specific deterministic color mapping underlying the reference view, enabling it to consistently project corresponding colors to the content in novel views and video frames via its inherent inductive bias. Once trained, the personalized colorizer can be applied to infer consistent chrominance for all other images, enabling direct reconstruction of colorful 3D scenes with a dedicated Lab color space Gaussian splatting representation. The proposed framework ingeniously recasts complicated 3D colorization as a more tractable single image paradigm, allowing seamless integration of arbitrary image colorization models with enhanced flexibility and controllability. Extensive experiments across diverse static and dynamic 3D colorization benchmarks substantiate that our method can deliver more consistent and chromatically rich renderings with precise user control.
Project page: https://yecongwan.github.io/Color3D/
[167] Stroke Locus Net: Occluded Vessel Localization from MRI Modalities
Mohamed Hamad,Muhammad Khan,Tamer Khattab,Mohamed Mabrok
Main category: cs.CV
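TL;DR: This paper proposes Stroke Locus Net, an end-to-end deep learning method that uses MRI scans for the detection, segmentation, and localization of occluded vessels in ischemic stroke.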
Details
Motivation: Accurately identifying the occluded vessel is a key challenge in ischemic stroke diagnosis; existing machine learning methods focus mainly on lesion segmentation, with little work on vessel localization.
Method: Combines an nnUNet segmentation branch for lesion detection, an arterial atlas for vessel mapping and identification, and a pGAN generation branch that synthesizes MRA images from MRI.
Result: Achieves promising occluded vessel localization results on T1 MRI scans of stroke patients.
Conclusion: The method may improve the speed and accuracy of stroke diagnosis and provide more information for clinical decision-making.
Abstract: A key challenge in ischemic stroke diagnosis using medical imaging is the accurate localization of the occluded vessel. Current machine learning methods focus primarily on lesion segmentation, with limited work on vessel localization. In this study, we introduce Stroke Locus Net, an end-to-end deep learning pipeline for detection, segmentation, and occluded vessel localization using only MRI scans. The proposed system combines a segmentation branch using nnUNet for lesion detection with an arterial atlas for vessel mapping and identification, and a generation branch using pGAN to synthesize MRA images from MRI. Our implementation demonstrates promising results in localizing occluded vessels on stroke-affected T1 MRI scans, with potential for faster and more informed stroke diagnosis.
[168] ReMix: Towards a Unified View of Consistent Character Generation and Editing
Benjia Zhou,Bin Fu,Pei Cheng,Yanru Wang,Jiayuan Fan,Tao Chen
Main category: cs.CV
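TL;DR: This paper proposes ReMix, a unified framework for character-consistent image generation and editing that combines the ReMix Module and IP-ControlNet to achieve semantic consistency and spatial controllability without fine-tuning.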
Details
Motivation: Existing methods struggle to deliver both high-quality character generation and editing in a single framework, and fall short in identity consistency, spatial control, and instruction alignment.
Method: Proposes the ReMix framework with two core components: the ReMix Module uses an MLLM for semantic feature editing and instruction adaptation; IP-ControlNet extends ControlNet, decouples semantic and layout cues, and introduces an ε-equivariant latent space that jointly denoises within a shared noise space to improve feature alignment.
Result: Experiments show that ReMix performs well across tasks including personalized generation, image editing, style transfer, and multi-condition synthesis, achieving high identity consistency and pose controllability.
Conclusion: ReMix is an effective and efficient unified framework that supports character-consistent image generation and editing without fine-tuning.
Abstract: Recent advances in large-scale text-to-image diffusion models (e.g., FLUX.1) have greatly improved visual fidelity in consistent character generation and editing. However, existing methods rarely unify these tasks within a single framework. Generation-based approaches struggle with fine-grained identity consistency across instances, while editing-based methods often lose spatial controllability and instruction alignment. To bridge this gap, we propose ReMix, a unified framework for character-consistent generation and editing. It comprises two core components: the ReMix Module and IP-ControlNet. The ReMix Module leverages the multimodal reasoning ability of MLLMs to edit semantic features of input images and adapt instruction embeddings to the native DiT backbone without fine-tuning. While this ensures coherent semantic layouts, pixel-level consistency and pose controllability remain challenging. To address this, IP-ControlNet extends ControlNet to decouple semantic and layout cues from reference images and introduces an ε-equivariant latent space that jointly denoises the reference and target images within a shared noise space. Inspired by convergent evolution and quantum decoherence, i.e., where environmental noise drives state convergence, this design promotes feature alignment in the hidden space, enabling consistent object generation while preserving identity. ReMix supports a wide range of tasks, including personalized generation, image editing, style transfer, and multi-condition synthesis.
Extensive experiments validate its effectiveness and efficiency as a unified framework for character-consistent image generation and editing.
[169] SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation
Zhenjie Mao,Yuhuan Yang,Chaofan Ma,Dongsheng Jiang,Jiangchao Yao,Ya Zhang,Yanfeng Wang
Main category: cs.CV
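TL;DR: This paper proposes SaFiRe, a new framework that tackles referential ambiguity in referring image segmentation, particularly for complex object-distracting and category-implicit expressions. It exploits the strengths of Mamba for efficient multi-cycle refinement and introduces a new benchmark, aRefCOCO, for evaluation.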
Details
Motivation: Existing methods mainly handle simple expressions and struggle with the referentially ambiguous, complex language found in the real world, which limits their applicability in real scenarios.
Method: Proposes the SaFiRe framework, which mimics the human two-phase cognitive process of first forming a global understanding and then refining it through detail-oriented inspection; Mamba's scan-then-update property supports the phased design and enables multi-cycle refinement with linear complexity.
Result: On standard datasets and the newly proposed aRefCOCO benchmark, SaFiRe outperforms existing state-of-the-art methods, validating its effectiveness on complex referring expressions.
Conclusion: By imitating human cognition and building on Mamba, SaFiRe improves the understanding and segmentation of complex, ambiguous referring expressions and moves RIS closer to real-world applications.
Abstract: Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. While recent methods leverage pre-trained vision backbones and larger training corpora to achieve impressive results, they predominantly focus on simple expressions: short, clear noun phrases like "red car" or "left girl". This simplification often reduces RIS to a key word/concept matching problem, limiting the model's ability to handle referential ambiguity in expressions. In this work, we identify two challenging real-world scenarios: object-distracting expressions, which involve multiple entities with contextual cues, and category-implicit expressions, where the object class is not explicitly stated. To address the challenges, we propose a novel framework, SaFiRe, which mimics the human two-phase cognitive process: first forming a global understanding, then refining it through detail-oriented inspection. This is naturally supported by Mamba's scan-then-update property, which aligns with our phased design and enables efficient multi-cycle refinement with linear complexity. We further introduce aRefCOCO, a new benchmark designed to evaluate RIS models under ambiguous referring expressions. Extensive experiments on both standard and proposed datasets demonstrate the superiority of SaFiRe over state-of-the-art baselines.
[170] SparseUWSeg: Active Sparse Point-Label Augmentation for Underwater Semantic Segmentation
César Borja,Carlos Plou,Rubén Martinez-Cantín,Ana C. Murillo
Main category: cs.CV
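TL;DR: This paper proposes SparseUWSeg, a new framework that addresses high annotation cost and sparse label propagation in underwater image semantic segmentation. It combines an active sampling strategy with a hybrid label propagation method to significantly improve segmentation performance, and provides an interactive annotation tool that helps ecology researchers efficiently generate high-quality segmentation masks.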
Details
Motivation: Fine-grained analysis of underwater scenes is limited by the high cost of dense annotation; sparse point labels are easier to obtain but raise the problems of which points to annotate and how to propagate them.
Method: Proposes the SparseUWSeg framework, which uses an active sampling strategy to guide annotation and combines SAM2 with superpixel-based methods to propagate sparse labels.
Result: Experiments on two underwater datasets show up to a 5% mIoU improvement over existing methods such as D+NN.
Conclusion: SparseUWSeg balances annotation efficiency and quality in underwater image segmentation, and the interactive tool gives ecology researchers a practical means of automated analysis.
Abstract: Semantic segmentation is essential to automate underwater imagery analysis for ecological monitoring purposes. Unfortunately, fine-grained underwater scene analysis is still an open problem even for top-performing segmentation models. The high cost of obtaining dense, expert-annotated segmentation labels hinders the supervision of models in this domain. While sparse point-labels are easier to obtain, they introduce challenges regarding which points to annotate and how to propagate the sparse information. We present SparseUWSeg, a novel framework that addresses both issues. SparseUWSeg employs an active sampling strategy to guide annotators, maximizing the value of their point labels. Then, it propagates these sparse labels with a hybrid approach that leverages the best of both SAM2 and superpixel-based methods. Experiments on two diverse underwater datasets demonstrate the benefits of SparseUWSeg over state-of-the-art approaches, achieving up to +5% mIoU over D+NN. Our main contribution is the design and release of a simple but effective interactive annotation tool, integrating our algorithms. It enables ecology researchers to leverage foundation models and computer vision to efficiently generate high-quality segmentation masks to process their data.
[171] ViConEx-Med: Visual Concept Explainability via Multi-Concept Token Transformer for Medical Image Analysis
Cristiano Patrício,Luís F. Teixeira,João C. Neves
Main category: cs.CV
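TL;DR: This paper proposes ViConEx-Med, a transformer-based framework for visual concept explainability that introduces multi-concept learnable tokens to jointly predict and localize visual concepts, producing concept-level localization maps while maintaining high predictive accuracy.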
Details
Motivation: Existing concept-based models mostly treat concepts as numerical attributes and lack visual explanations that localize the predicted concepts, which limits their use in high-stakes settings such as medicine.
Method: Proposes the ViConEx-Med framework, which uses multi-concept learnable tokens and specialized attention layers for processing visual and text-based concept tokens to jointly predict and localize concepts.
Result: Experiments on synthetic and real-world medical datasets show that ViConEx-Med outperforms prior concept-based models in concept detection and localization precision and is competitive with black-box models.
Conclusion: ViConEx-Med offers a promising direction for building inherently interpretable models grounded in visual concepts.
Abstract: Concept-based models aim to explain model decisions with human-understandable concepts. However, most existing approaches treat concepts as numerical attributes, without providing complementary visual explanations that could localize the predicted concepts. This limits their utility in real-world applications and particularly in high-stakes scenarios, such as medical use-cases. This paper proposes ViConEx-Med, a novel transformer-based framework for visual concept explainability, which introduces multi-concept learnable tokens to jointly predict and localize visual concepts. By leveraging specialized attention layers for processing visual and text-based concept tokens, our method produces concept-level localization maps while maintaining high predictive accuracy. Experiments on both synthetic and real-world medical datasets demonstrate that ViConEx-Med outperforms prior concept-based models and achieves competitive performance with black-box models in terms of both concept detection and localization precision. Our results suggest a promising direction for building inherently interpretable models grounded in visual concepts. Code is publicly available at https://github.com/CristianoPatricio/viconex-med.
[172] HccePose(BF): Predicting Front & Back Surfaces to Construct Ultra-Dense 2D-3D Correspondences for Pose Estimation
Yulin Wang,Mengting Hu,Hongli Li,Chen Luo
Main category: cs.CV
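TL;DR: This paper proposes a new pose estimation method that predicts the 3D coordinates of an object's front and back surfaces and densely samples points between them to create ultra-dense 2D-3D correspondences. Combined with the proposed Hierarchical Continuous Coordinate Encoding (HCCE), it significantly improves PnP-based pose estimation accuracy and outperforms existing SOTA methods on seven BOP datasets.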
Details
Motivation: Existing pose estimation methods focus on precisely predicting 3D coordinates of the object's front surface and overlook the potential value of the back surface and interior, which limits the density of 2D-3D correspondences and pose estimation accuracy.
Method: Predicts 3D coordinates of both the front and back surfaces and densely samples 3D coordinates between them to build denser 2D-3D correspondences; also designs Hierarchical Continuous Coordinate Encoding (HCCE) to represent front and back surface coordinates more efficiently and accurately.
Result: Experiments on seven classic core datasets from the BOP website show that the method outperforms current state-of-the-art methods.
Conclusion: Making full use of the object's full surface and interior effectively improves pose estimation accuracy, and the proposed ultra-dense 2D-3D correspondences and HCCE offer a new direction for future research.
Abstract: In pose estimation for seen objects, a prevalent pipeline involves using neural networks to predict dense 3D coordinates of the object surface on 2D images, which are then used to establish dense 2D-3D correspondences. However, current methods primarily focus on more efficient encoding techniques to improve the precision of predicted 3D coordinates on the object's front surface, overlooking the potential benefits of incorporating the back surface and interior of the object. To better utilize the full surface and interior of the object, this study predicts 3D coordinates of both the object's front and back surfaces and densely samples 3D coordinates between them. This process creates ultra-dense 2D-3D correspondences, effectively enhancing pose estimation accuracy based on the Perspective-n-Point (PnP) algorithm. Additionally, we propose Hierarchical Continuous Coordinate Encoding (HCCE) to provide a more accurate and efficient representation of front and back surface coordinates. Experimental results show that the proposed approach outperforms existing state-of-the-art (SOTA) methods listed on the BOP website across seven classic BOP core datasets. Code is available at https://github.com/WangYuLin-SEU/HCCEPose.
[173] TCMA: Text-Conditioned Multi-granularity Alignment for Drone Cross-Modal Text-Video Retrieval
Zixu Zhao,Yang Zhan
Main category: cs.CV
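TL;DR: This paper presents DVTMD, a new dataset for drone video-text retrieval, and TCMA, a multi-granularity alignment framework, which together significantly improve text-video matching performance.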
Details
Motivation: Existing drone video datasets suffer from coarse and redundant annotations, which holds back text-video retrieval; a finer-grained, semantically diverse dataset and a matching retrieval method are needed.
Method: Builds the DVTMD dataset with 2,864 videos and 14,320 fine-grained captions; proposes the TCMA framework, which combines global video-sentence alignment, sentence-guided frame aggregation, and word-guided patch alignment, and adds a Word and Patch Selection module and a Text-Adaptive Dynamic Temperature Mechanism to refine local alignment.
Result: Establishes the first complete benchmark for drone text-video retrieval on DVTMD and CapERA; TCMA reaches 45.5% R@1 in text-to-video retrieval and 42.8% R@1 in video-to-text retrieval, the best reported performance.
Conclusion: The dataset and the TCMA framework improve the precision and semantic matching of text-video retrieval in the drone domain and provide a foundation for follow-up research.
Abstract: Unmanned aerial vehicles (UAVs) have become powerful platforms for real-time, high-resolution data collection, producing massive volumes of aerial videos. Efficient retrieval of relevant content from these videos is crucial for applications in urban management, emergency response, security, and disaster relief. While text-video retrieval has advanced in natural video domains, the UAV domain remains underexplored due to limitations in existing datasets, such as coarse and redundant captions. Thus, in this work, we construct the Drone Video-Text Match Dataset (DVTMD), which contains 2,864 videos and 14,320 fine-grained, semantically diverse captions. The annotations capture multiple complementary aspects, including human actions, objects, background settings, environmental conditions, and visual style, thereby enhancing text-video correspondence and reducing redundancy. Building on this dataset, we propose the Text-Conditioned Multi-granularity Alignment (TCMA) framework, which integrates global video-sentence alignment, sentence-guided frame aggregation, and word-guided patch alignment. To further refine local alignment, we design a Word and Patch Selection module that filters irrelevant content, as well as a Text-Adaptive Dynamic Temperature Mechanism that adapts attention sharpness to text type. Extensive experiments on DVTMD and CapERA establish the first complete benchmark for drone text-video retrieval. Our TCMA achieves state-of-the-art performance, including 45.5% R@1 in text-to-video and 42.8% R@1 in video-to-text retrieval, demonstrating the effectiveness of our dataset and method.
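For readers unfamiliar with the metric, the R@1 figures quoted above are Recall@1 scores: the fraction of queries whose correct item is ranked first. The following is a minimal illustrative sketch, not the paper's evaluation code. The function name `recall_at_k`, the `gt` index list, and the toy similarity values are all assumptions introduced here.

```python
from typing import Sequence


def recall_at_k(sim: Sequence[Sequence[float]], gt: Sequence[int], k: int = 1) -> float:
    """Fraction of queries whose ground-truth item is among the top-k scores.

    sim[i][j] is the similarity between query (caption) i and candidate (video) j.
    gt[i] is the index of the correct candidate for query i (assumed: one per query).
    """
    hits = 0
    for scores, target in zip(sim, gt):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(gt)


# Toy data, invented for illustration: 4 captions scored against 3 videos.
sim = [
    [0.9, 0.1, 0.2],  # caption 0, correct video 0 is ranked first
    [0.3, 0.8, 0.4],  # caption 1, correct video 1 is ranked first
    [0.6, 0.5, 0.4],  # caption 2, correct video 2 is ranked last
    [0.2, 0.7, 0.3],  # caption 3, correct video 0 is ranked last
]
gt = [0, 1, 2, 0]
r1 = recall_at_k(sim, gt, k=1)
r3 = recall_at_k(sim, gt, k=3)
```

This sketch covers the text-to-video direction with one correct video per caption. In the video-to-text direction a video may have several correct captions, which would need a set of targets per query rather than a single index.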
The code and dataset will be released.
[174] Fairness Without Labels: Pseudo-Balancing for Bias Mitigation in Face Gender Classification
Haohua Dong,Ana Manzano Rodríguez,Camille Guinaudeau,Shin'ichi Satoh
Main category: cs.CV
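TL;DR: Proposes pseudo-balancing, which mitigates bias in face gender classification models by enforcing demographic balance during pseudo-label selection in semi-supervised learning; using only unlabeled images, it markedly improves both fairness and accuracy.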
Details
Motivation: Face gender classification models often perform unevenly across gender and racial subgroups because of biases in their training data, and these biases need to be mitigated effectively.
Method: Proposes a pseudo-balancing strategy that enforces demographic balance at the pseudo-label selection stage, using unlabeled images from a race-balanced dataset and no ground-truth annotations.
Result: Validated with the FairFace and AAF datasets: overall accuracy rises by 6.53% to 79.81%, and the gender accuracy gap shrinks by 44.17%; in the East Asian subgroup, the gap narrows from over 49% to 5.01%.
Conclusion: Even without label supervision, a demographically balanced or moderately skewed unlabeled dataset is enough to debias existing vision models effectively.
Abstract: Face gender classification models often reflect and amplify demographic biases present in their training data, leading to uneven performance across gender and racial subgroups. We introduce pseudo-balancing, a simple and effective strategy for mitigating such biases in semi-supervised learning. Our method enforces demographic balance during pseudo-label selection, using only unlabeled images from a race-balanced dataset without requiring access to ground-truth annotations. We evaluate pseudo-balancing under two conditions: (1) fine-tuning a biased gender classifier using unlabeled images from the FairFace dataset, and (2) stress-testing the method with intentionally imbalanced training data to simulate controlled bias scenarios. In both cases, models are evaluated on the All-Age-Faces (AAF) benchmark, which contains a predominantly East Asian population. Our results show that pseudo-balancing consistently improves fairness while preserving or enhancing accuracy. The method achieves 79.81% overall accuracy (a 6.53% improvement over the baseline) and reduces the gender accuracy gap by 44.17%. In the East Asian subgroup, where baseline disparities exceeded 49%, the gap is narrowed to just 5.01%. These findings suggest that even in the absence of label supervision, access to a demographically balanced or moderately skewed unlabeled dataset can serve as a powerful resource for debiasing existing computer vision models.
[175] B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding
Feng Xiao,Hongbin Xu,Hai Ci,Wenxiong Kang
Main category: cs.CV
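TL;DR: Proposes a progressive framework based on n-ary relational learning for 3D object grounding; by extending relation modeling to n-ary combinations and designing a grouped supervision loss, it surpasses existing methods on the ReferIt3D and ScanRefer datasets.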
Details
Motivation: Existing methods model spatial relationships only between pairs of objects and ignore the global perceptual significance of n-ary combinations in multi-modal understanding, which makes it hard to match natural language descriptions to 3D scenes accurately.
Method: Proposes a progressive relational learning framework that extends relational learning from binary to n-ary; builds a scene graph with n-ary relationships and localizes the target with a multi-modal network that uses hybrid attention; designs a grouped supervision loss to support n-ary relational learning without explicit annotations.
Result: Experiments on the ReferIt3D and ScanRefer benchmarks show that the method outperforms the current state of the art, and ablation studies confirm the value of n-ary relational perception for 3D localization.
Conclusion: N-ary relation modeling better captures the semantics of complex spatial descriptions and improves 3D-language alignment; the framework gives 3D object grounding a more global and precise understanding of relationships.
Abstract: Localizing 3D objects using natural language is essential for robotic scene understanding. The descriptions often involve multiple spatial relationships to distinguish similar objects, making 3D-language alignment difficult. Current methods only model pairwise object relationships, ignoring the global perceptual significance of n-ary combinations in multi-modal relational understanding. To address this, we propose a novel progressive relational learning framework for 3D object grounding. We extend relational learning from binary to n-ary to identify visual relations that match the referential description globally. Given the absence of specific annotations for referred objects in the training data, we design a grouped supervision loss to facilitate n-ary relational learning. In the scene graph created with n-ary relationships, we use a multi-modal network with hybrid attention mechanisms to further localize the target within the n-ary combinations. Experiments and ablation studies on the ReferIt3D and ScanRefer benchmarks demonstrate that our method outperforms the state of the art and confirm the advantages of n-ary relational perception in 3D localization.
[176] From Generic to Specialized: A Subspecialty Diagnostic System Powered by Self-Supervised Learning for Cervical Histopathology
Yizhi Wang,Li Chen,Qiang Huang,Tian Guan,Xi Deng,Zhiyuan Shen,Jiawen Li,Xinrui Chen,Bin Hu,Xitong Ling,Taojie Zhu,Zirui Huang,Deshui Yu,Yan Liu,Jiurun Chen,Lianghui Zhu,Qiming He,Yiqing Liu,Diwei Shi,Hanzhong Liu,Junbo Hu,Hongyi Gao,Zhen Song,Xilong Zhao,Chao He,Ming Zhao,Yonghong He
Main category: cs.CV
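TL;DR: Proposes CerS-Path, a cervical subspecialty pathology diagnostic system built with self-supervised learning and multimodal-enhancement pretraining; it performs strongly across eight diagnostic functions, with high sensitivity and good generalizability, and is suited to cervical cancer screening.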
Details
Motivation: Existing deep learning models fall short in accuracy and generalizability, and general foundation models struggle to capture subspecialty-specific features, so a diagnostic system dedicated to cervical pathology is needed.
Method: Uses two synergistic pretraining stages: self-supervised learning on approximately 190 million tissue patches to build a cervical-specific feature extractor, followed by multimodal enhancement with 2.5 million image-text pairs, and then integration with multiple downstream diagnostic functions.
Result: In prospective testing on 3,173 cases across five centers, CerS-Path maintained 99.38% screening sensitivity and showed excellent generalizability and clinical applicability.
Conclusion: CerS-Path clearly outperforms prior foundation models in cervical pathology diagnosis and has potential for broad use in subspecialty diagnostic translation and cervical cancer screening.
Abstract: Cervical cancer remains a major malignancy, necessitating extensive and complex histopathological assessments and comprehensive support tools. Although deep learning shows promise, these models still lack accuracy and generalizability. General foundation models offer a broader reach but remain limited in capturing subspecialty-specific features and task adaptability. We introduce the Cervical Subspecialty Pathology (CerS-Path) diagnostic system, developed through two synergistic pretraining stages: self-supervised learning on approximately 190 million tissue patches from 140,000 slides to build a cervical-specific feature extractor, and multimodal enhancement with 2.5 million image-text pairs, followed by integration with multiple downstream diagnostic functions. Supporting eight diagnostic functions, including rare cancer classification and multimodal Q&A, CerS-Path surpasses prior foundation models in scope and clinical applicability. Comprehensive evaluations demonstrate a significant advance in cervical pathology, with prospective testing on 3,173 cases across five centers maintaining 99.38% screening sensitivity and excellent generalizability, highlighting its potential for subspecialty diagnostic translation and cervical cancer screening.
[177] A Style-Based Metric for Quantifying the Synthetic-to-Real Gap in Autonomous Driving Image Datasets
Dingyi Yao,Xinyao Han,Ruibo Ming,Zhihang Song,Lihui Peng,Jianming Hu,Danya Yao,Yi Zhang
Main category: cs.CV
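TL;DR: This paper proposes a systematic framework and a new evaluation metric, SEDD, for quantifying the domain gap between synthetic and real data in autonomous driving. Built on style extraction and the discrepancy between style embedding distributions, it is validated on multiple datasets and provides a standardized tool for quality control and improvement of synthetic data.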
Details
Motivation: Because real-world testing is costly and often impractical, synthetic data has become an important alternative for testing autonomous driving perception systems, but the domain gap between synthetic and real data limits model generalization; an effective way to quantify this gap is needed to improve training.
Method: Proposes Gram matrix-based style extraction combined with metric learning optimized for intra-class compactness and inter-class separation to obtain style embeddings, defines Style Embedding Distribution Discrepancy (SEDD) as the evaluation metric, and builds a benchmark on public datasets.
Result: Experiments on a variety of datasets and sim-to-real methods show that SEDD effectively quantifies the synthetic-to-real gap and correlates well with model performance.
Conclusion: The SEDD metric and framework provide a standardized tool for assessing synthetic data quality, helping to diagnose and enhance synthetic data and advance data-driven autonomous driving systems.
Abstract: Ensuring the reliability of autonomous driving perception systems requires extensive environment-based testing, yet real-world execution is often impractical. Synthetic datasets have therefore emerged as a promising alternative, offering advantages such as cost-effectiveness, bias-free labeling, and controllable scenarios. However, the domain gap between synthetic and real-world datasets remains a critical bottleneck for the generalization of AI-based autonomous driving models. Quantifying this synthetic-to-real gap is thus essential for evaluating dataset utility and guiding the design of more effective training pipelines. In this paper, we establish a systematic framework for quantifying the synthetic-to-real gap in autonomous driving systems, and propose Style Embedding Distribution Discrepancy (SEDD) as a novel evaluation metric. Our framework combines Gram matrix-based style extraction with metric learning optimized for intra-class compactness and inter-class separation to extract style embeddings. Furthermore, we establish a benchmark using publicly available datasets. Experiments are conducted on a variety of datasets and sim-to-real methods, and the results show that our method is capable of quantifying the synthetic-to-real gap. This work provides a standardized quality control tool that enables systematic diagnosis and targeted enhancement of synthetic datasets, advancing future development of data-driven autonomous driving systems.
[178] Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images
Chuangchuang Tan,Xiang Ming,Jinglu Wang,Renshuai Tao,Bin Li,Yunchao Wei,Yao Zhao,Yan Lu
Main category: cs.CV
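TL;DR: This paper introduces AnomReason, a large-scale benchmark for detecting and reasoning about semantic anomalies in AI-generated images, with structured annotations produced by the multi-agent pipeline AnomAgent. Models fine-tuned on it show clear gains in semantic anomaly detection, and the benchmark supports explainable deepfake detection and the assessment of semantic plausibility in image generators.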
Details
Motivation: AI-generated content (AIGC) is increasingly realistic visually but often contains semantic anomalies (such as violations of physical laws or commonsense errors) that undermine its trustworthiness, so the semantic plausibility of AIGC images needs to be modeled and detected systematically.
Method: Defines the task of semantic anomaly detection and reasoning and builds AnomReason, a large-scale benchmark annotated with quadruples (Name, Phenomenon, Reasoning, Severity); designs the modular multi-agent pipeline AnomAgent, which together with lightweight human verification produces high-quality structured data; introduces the semantic matching metrics SemAP and SemF1 to evaluate models.
Result: Models fine-tuned on AnomReason clearly outperform strong baselines on SemAP and SemF1; AnomAgent processed approximately 4.17 billion GPT-4o tokens during construction, which supports the scalability of the approach; the work shows practical value in explainable deepfake detection and in assessing the semantic plausibility of image generators.
Conclusion: AnomReason and AnomAgent provide a foundation for measuring and improving the semantic trustworthiness of AI-generated images, advancing research on semantic authenticity assessment and interpretable AIGC forensics.
Abstract: The rapid advancement of AI-generated content (AIGC) has enabled the synthesis of visually convincing images; however, many such outputs exhibit subtle semantic anomalies, including unrealistic object configurations, violations of physical laws, or commonsense inconsistencies, which compromise the overall plausibility of the generated scenes. Detecting these semantic-level anomalies is essential for assessing the trustworthiness of AIGC media, especially in AIGC image analysis, explainable deepfake detection, and semantic authenticity assessment. In this paper, we formalize semantic anomaly detection and reasoning for AIGC images and introduce AnomReason, a large-scale benchmark with structured annotations as quadruples (Name, Phenomenon, Reasoning, Severity). Annotations are produced by a modular multi-agent pipeline (AnomAgent) with lightweight human-in-the-loop verification, enabling scale while preserving quality. At construction time, AnomAgent processed approximately 4.17 billion GPT-4o tokens, which indicates the scale of the resulting structured annotations. We further show that models fine-tuned on AnomReason achieve consistent gains over strong vision-language baselines under our proposed semantic matching metrics (SemAP and SemF1). Applications to explainable deepfake detection and semantic reasonableness assessment of image generators demonstrate practical utility.
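To make the annotation format concrete, here is a minimal sketch of one (Name, Phenomenon, Reasoning, Severity) quadruple as a Python dataclass. The class name `AnomalyAnnotation`, the field types, the integer severity, and the example values are all assumptions for illustration; the abstract does not specify the released schema or the severity scale.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AnomalyAnnotation:
    """One structured annotation; field types are assumed, not taken from the paper."""
    name: str        # the object or region involved
    phenomenon: str  # what looks wrong
    reasoning: str   # why it is implausible
    severity: int    # assumed ordinal score; the real scale is not given in the abstract


# Hypothetical record, invented for illustration.
example = AnomalyAnnotation(
    name="coffee cup",
    phenomenon="cup floats above the table with no support",
    reasoning="unsupported objects fall under gravity",
    severity=3,
)
record = asdict(example)  # plain dict, convenient for JSON serialization
```

Keeping the four fields separate, rather than storing one free-text caption, is what would let a matching metric compare predicted and reference anomalies field by field.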
In summary, AnomReason and AnomAgent serve as a foundation for measuring and improving the semantic plausibility of AI-generated images. We will release code, metrics, data, and task-aligned models to support reproducible research on semantic authenticity and interpretable AIGC forensics.
[179] MRI Brain Tumor Detection with Computer Vision
Jack Krolik,Jake Lynn,John Henry Rudden,Dmytro Vremenko
Main category: cs.CV
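TL;DR: This study explores deep learning for the automated detection and segmentation of brain tumors in MRI scans, using models such as logistic regression, CNNs, and ResNet for classification, and U-Net and EfficientDet to improve tumor localization and identification. The results indicate good diagnostic accuracy and efficiency and show the potential of deep learning in medical imaging.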
Details
Motivation: To improve the accuracy and efficiency of brain tumor diagnosis by using deep learning to assist medical image analysis and improve clinical outcomes.
Method: Uses logistic regression, Convolutional Neural Networks (CNNs), and Residual Networks (ResNet) for tumor classification, U-Net for semantic segmentation, and EfficientDet for object detection.
Result: The approach shows high accuracy and efficiency in brain tumor detection and segmentation and significantly outperforms traditional methods.
Conclusion: Deep learning has great potential for the automated detection and segmentation of brain tumors; it can improve medical image analysis and support clinical decision-making.
Abstract: This study explores the application of deep learning techniques in the automated detection and segmentation of brain tumors from MRI scans. We employ several machine learning models, including basic logistic regression, Convolutional Neural Networks (CNNs), and Residual Networks (ResNet) to classify brain tumors effectively. Additionally, we investigate the use of U-Net for semantic segmentation and EfficientDet for anchor-based object detection to enhance the localization and identification of tumors. Our results demonstrate promising improvements in the accuracy and efficiency of brain tumor diagnostics, underscoring the potential of deep learning in medical imaging and its significance in improving clinical outcomes.
[180] Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging?
Yuxiang Lai,Jike Zhong,Ming Li,Yuheng Li,Xiaofeng Yang
Main category: cs.CV
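TL;DR: This study examines whether a large vision model (LVM) that was never trained on medical data can perform several medical imaging tasks zero-shot (organ segmentation, denoising, super-resolution, and motion prediction). It performs well on many of them; in radiotherapy motion prediction in particular, it shows anatomical consistency and temporal coherence and surpasses specialized baselines.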
Details
Motivation: Inspired by the cross-domain zero-shot generalization of large generative models, the authors test whether autoregressive video modeling can be applied directly to medical imaging tasks even though the model has never seen medical data.
Method: Evaluates a large vision model (LVM) in a zero-shot setting on four representative medical tasks: organ segmentation, denoising, super-resolution, and 3D phase motion prediction in 4D CT sequences.
Result: Without domain-specific fine-tuning, the LVM accurately delineates anatomical structures in CT images and performs well on denoising and super-resolution; on 4D CT data from 122 patients (over 1,820 3D volumes in total), its motion predictions are anatomically plausible and temporally coherent, and its spatial accuracy surpasses DVF-based and generative baselines, reaching the state of the art.
Conclusion: General-purpose video models show potential for zero-shot medical video modeling and could serve as unified medical foundation models, laying the groundwork for general models that can be applied to clinical image analysis without specialized training.
Abstract: Recent advances in large generative models have shown that simple autoregressive formulations, when scaled appropriately, can exhibit strong zero-shot generalization across domains. Motivated by this trend, we investigate whether autoregressive video modeling principles can be directly applied to medical imaging tasks, despite the model never being trained on medical data. Specifically, we evaluate a large vision model (LVM) in a zero-shot setting across four representative tasks: organ segmentation, denoising, super-resolution, and motion prediction. Remarkably, even without domain-specific fine-tuning, the LVM can delineate anatomical structures in CT scans and achieve competitive performance on segmentation, denoising, and super-resolution. Most notably, in radiotherapy motion prediction, the model forecasts future 3D CT phases directly from prior phases of a 4D CT scan, producing anatomically consistent predictions that capture patient-specific respiratory dynamics with realistic temporal coherence. We evaluate the LVM on 4D CT data from 122 patients, totaling over 1,820 3D CT volumes. Despite no prior exposure to medical data, the model achieves strong performance across all tasks and surpasses specialized DVF-based and generative baselines in motion prediction, achieving state-of-the-art spatial accuracy.
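The motion-prediction setup described above, forecasting later phases from earlier ones, can be pictured as an autoregressive rollout. The sketch below is a toy illustration under stated assumptions, not the paper's method: `rollout`, `predict_next`, and `linear_extrapolate` are hypothetical names, scalars stand in for 3D CT volumes, and whether the actual model feeds its own predictions back as context is not stated in the abstract.

```python
from typing import Callable, List, Sequence


def rollout(context: Sequence[float],
            predict_next: Callable[[Sequence[float]], float],
            steps: int,
            window: int = 2) -> List[float]:
    """Autoregressively extend `context` by `steps` predicted items."""
    history = list(context)
    for _ in range(steps):
        # Condition on the most recent `window` items, including earlier predictions.
        history.append(predict_next(history[-window:]))
    return history[len(context):]


def linear_extrapolate(recent: Sequence[float]) -> float:
    """Toy stand-in predictor: repeat the last observed step."""
    return recent[-1] + (recent[-1] - recent[-2])


# Two observed "phases" followed by three predicted ones.
future = rollout([0.0, 1.0], linear_extrapolate, steps=3)
```

In a real system the predictor would be the video model and each item a full 3D volume; errors can accumulate over the rollout because later predictions depend on earlier ones.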
These findings reveal the emergence of zero-shot capabilities in medical video modeling and highlight the potential of general-purpose video models to serve as unified learners and reasoners, laying the groundwork for future medical foundation models built on video models.
[181] Opacity-Gradient Driven Density Control for Compact and Efficient Few-Shot 3D Gaussian Splatting
Abdelrhman Elrawy,Emad A. Mohammed
Main category: cs.CV
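TL;DR: This paper proposes a framework that improves the reconstruction efficiency of 3D Gaussian Splatting (3DGS) in few-shot settings. It replaces the usual positional gradient heuristic with the opacity gradient as a lightweight proxy for rendering error and combines it with a conservative pruning strategy and a depth-correlation loss, making models much more compact and setting a new frontier in the quality-efficiency trade-off.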
Details
Motivation: In few-shot settings 3DGS tends to overfit and produce bloated reconstructions; existing methods improve quality but add many primitives and do not optimize for efficiency.
Method: Proposes a new densification trigger that uses the opacity gradient instead of the positional gradient, paired with a more conservative pruning schedule and a depth-correlation loss for geometric guidance.
Result: Uses over 40% fewer primitives than FSGS on the 3-view LLFF dataset (32k vs. 57k) and about 70% fewer on the Mip-NeRF 360 dataset, a substantial gain in compactness.
Conclusion: The method achieves more efficient and compact 3D reconstruction from few input views and improves on existing methods along the quality-efficiency Pareto frontier.
Abstract: 3D Gaussian Splatting (3DGS) struggles in few-shot scenarios, where its standard adaptive density control (ADC) can lead to overfitting and bloated reconstructions. While state-of-the-art methods like FSGS improve quality, they often do so by significantly increasing the primitive count. This paper presents a framework that revises the core 3DGS optimization to prioritize efficiency. We replace the standard positional gradient heuristic with a novel densification trigger that uses the opacity gradient as a lightweight proxy for rendering error. We find this aggressive densification is only effective when paired with a more conservative pruning schedule, which prevents destructive optimization cycles. Combined with a standard depth-correlation loss for geometric guidance, our framework demonstrates a fundamental improvement in efficiency. On the 3-view LLFF dataset, our model is over 40% more compact (32k vs. 57k primitives) than FSGS, and on the Mip-NeRF 360 dataset, it achieves a reduction of approximately 70%. This dramatic gain in compactness is achieved with a modest trade-off in reconstruction metrics, establishing a new state-of-the-art on the quality-vs-efficiency Pareto frontier for few-shot view synthesis.
[182] VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework
Donglin Huang,Yongyuan Li,Tianhang Liu,Junming Huang,Xiaoda Yang,Chi Wang,Weiwei Xu
Main category: cs.CV
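TL;DR: This paper proposes VividAnimator, an end-to-end framework for half-body human animation driven by audio and sparse hand poses. With a pre-trained Hand Clarity Codebook, a Dual-Stream Audio-Aware Module, and a Pose Calibration Trick, it markedly improves hand detail, motion naturalness, and identity consistency.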
Details
Motivation: Existing audio-driven animation methods fall short in head motion and hand clarity, mainly because audio correlates only weakly with head motion and hands are structurally complex.
Method: Introduces three innovations: 1) a pre-trained Hand Clarity Codebook (HCC) that preserves hand texture detail; 2) a Dual-Stream Audio-Aware Module (DSAA) that models lip synchronization and head dynamics separately while letting them interact; 3) a Pose Calibration Trick (PCT) that improves the alignment of pose conditions.
Result: Experiments show that VividAnimator outperforms existing methods in hand detail, motion realism, and identity consistency, reaching SOTA in both quantitative and qualitative evaluations.
Conclusion: VividAnimator effectively addresses blurry hands and stiff head motion in audio-driven animation and generates high-quality, natural half-body human animations.
Abstract: Existing audio- and pose-driven human animation methods often struggle with stiff head movements and blurry hands, primarily due to the weak correlation between audio and head movements and the structural complexity of hands. To address these issues, we propose VividAnimator, an end-to-end framework for generating high-quality, half-body human animations driven by audio and sparse hand pose conditions. Our framework introduces three key innovations. First, to overcome the instability and high cost of online codebook training, we pre-train a Hand Clarity Codebook (HCC) that encodes rich, high-fidelity hand texture priors, significantly mitigating hand degradation. Second, we design a Dual-Stream Audio-Aware Module (DSAA) to model lip synchronization and natural head pose dynamics separately while enabling interaction. Third, we introduce a Pose Calibration Trick (PCT) that refines and aligns pose conditions by relaxing rigid constraints, ensuring smooth and natural gesture transitions. Extensive experiments demonstrate that VividAnimator achieves state-of-the-art performance, producing videos with superior hand detail, gesture realism, and identity consistency, validated by both quantitative metrics and qualitative evaluations.
[183] Bridging Perspectives: Foundation Model Guided BEV Maps for 3D Object Detection and Tracking
Markus Käppeler,Özgün Çiçek,Daniele Cattaneo,Claudius Gläser,Yakov Miron,Abhinav Valada
Main category: cs.CV
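TL;DR: This paper proposes DualViewDistill, a dual-view detection and tracking framework that combines perspective-view (PV) and bird's-eye-view (BEV) features and enriches the BEV representation with semantic and geometric features distilled from the DINOv2 foundation model. Using deformable aggregation, it improves 3D object detection and tracking and reaches state-of-the-art results on nuScenes and Argoverse 2.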
Details
Motivation: Existing methods usually rely on either perspective-view or bird's-eye-view features alone, so they cannot exploit fine-grained object detail and structured spatial information at the same time, which limits perception for autonomous driving.
Method: Proposes the DualViewDistill framework, which fuses PV and BEV features; introduces foundation model-guided BEV maps in which DINOv2 features are distilled into the BEV representation through a novel distillation process, then combines them with PV features via deformable aggregation to strengthen 3D detection and tracking.
Result: Achieves leading performance on the nuScenes and Argoverse 2 benchmarks, confirming the method's effectiveness for 3D object detection and tracking.
Conclusion: By fusing PV and BEV features with the rich semantic and geometric information supplied by a foundation model, DualViewDistill markedly improves perception in autonomous driving scenes and shows the potential of foundation model-generated BEV maps.
Abstract: Camera-based 3D object detection and tracking are essential for perception in autonomous driving. Current state-of-the-art approaches often rely exclusively on either perspective-view (PV) or bird's-eye-view (BEV) features, limiting their ability to leverage both fine-grained object details and spatially structured scene representations. In this work, we propose DualViewDistill, a hybrid detection and tracking framework that incorporates both PV and BEV camera image features to leverage their complementary strengths. Our approach introduces BEV maps guided by foundation models, leveraging descriptive DINOv2 features that are distilled into BEV representations through a novel distillation process. By integrating PV features with BEV maps enriched with semantic and geometric features from DINOv2, our model leverages this hybrid representation via deformable aggregation to enhance 3D object detection and tracking. Extensive experiments on the nuScenes and Argoverse 2 benchmarks demonstrate that DualViewDistill achieves state-of-the-art performance. The results showcase the potential of foundation model BEV maps to enable more reliable perception for autonomous driving. We make the code and pre-trained models available at https://dualviewdistill.cs.uni-freiburg.de .
[184] SAM2LoRA: Composite Loss-Guided, Parameter-Efficient Finetuning of SAM2 for Retinal Fundus Segmentation
Sayan Mandal,Divyadarshini Karthikeyan,Manas Paldhe
Main category: cs.CV
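TL;DR: SAM2LoRA is a parameter-efficient fine-tuning method for fundus image segmentation that adds low-rank adapters to SAM2's image encoder and mask decoder, achieving state-of-the-art performance with fewer than 5% of the trainable parameters.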
Details
Motivation: SAM2 has strong multi-scale feature decoding, but fine-tuning it is challenging, and cross-dataset fundus image segmentation in particular needs an efficient adaptation strategy.
Method: Proposes SAM2LoRA, which integrates low-rank adapters into SAM2's image encoder and mask decoder and optimizes the network with a composite loss combining BCE, SoftDice, and FocalTversky.
Result: Evaluated on 11 fundus segmentation datasets, SAM2LoRA reaches Dice scores of up to 0.86 and 0.93 and AUC values of up to 0.98 and 0.99 for blood vessel and optic disc segmentation, respectively, achieving SOTA performance while substantially reducing training overhead.
Conclusion: SAM2LoRA fine-tunes SAM2 effectively with very few parameters and suits cross-dataset fundus image segmentation in resource-constrained settings.
Abstract: We propose SAM2LoRA, a parameter-efficient fine-tuning strategy that adapts the Segment Anything Model 2 (SAM2) for fundus image segmentation. SAM2 employs a masked autoencoder-pretrained Hierarchical Vision Transformer for multi-scale feature decoding, enabling rapid inference in low-resource settings; however, fine-tuning remains challenging. To address this, SAM2LoRA integrates a low-rank adapter into both the image encoder and mask decoder, requiring fewer than 5% of the original trainable parameters. Our analysis indicates that for cross-dataset fundus segmentation tasks, a composite loss function combining segmentation BCE, SoftDice, and FocalTversky losses is essential for optimal network tuning. Evaluated on 11 challenging fundus segmentation datasets, SAM2LoRA demonstrates high performance in both blood vessel and optic disc segmentation under cross-dataset training conditions. It achieves Dice scores of up to 0.86 and 0.93 for blood vessel and optic disc segmentation, respectively, and AUC values of up to 0.98 and 0.99, achieving state-of-the-art performance while substantially reducing training overhead.
[185] From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries
Joy Hsu,Emily Jin,Jiajun Wu,Niloy J. Mitra
Main category: cs.CV
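TL;DR: Proposes the FactoredScenes framework, which generates realistic 3D scenes by factoring each scene into a hierarchical representation of room programs and object poses.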
Details
Motivation: Real-world scene data is limited, and generating realistic scenes with varied object poses is challenging.
Method: Uses a hierarchical representation, combining a learned library of layout patterns with large language models to generate high-level programs, and trains a program-conditioned model that predicts object poses hierarchically.
Result: Generates realistic 3D rooms that are difficult to distinguish from real ScanNet scenes.
Conclusion: FactoredScenes effectively synthesizes realistic and diverse 3D indoor scenes.
Abstract: Real-world scenes, such as those in ScanNet, are difficult to capture, with highly limited data available. Generating realistic scenes with varied object poses remains an open and challenging task. In this work, we propose FactoredScenes, a framework that synthesizes realistic 3D scenes by leveraging the underlying structure of rooms while learning the variation of object poses from lived-in scenes. We introduce a factored representation that decomposes scenes into hierarchically organized concepts of room programs and object poses. To encode structure, FactoredScenes learns a library of functions capturing reusable layout patterns from which scenes are drawn, then uses large language models to generate high-level programs, regularized by the learned library. To represent scene variations, FactoredScenes learns a program-conditioned model to hierarchically predict object poses, and retrieves and places 3D objects in a scene. We show that FactoredScenes generates realistic, real-world rooms that are difficult to distinguish from real ScanNet scenes.
[186] Ordinal Scale Traffic Congestion Classification with Multi-Modal Vision-Language and Motion Analysis
Yu-Hsuan Lin
Main category: cs.CV
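TL;DR: This paper proposes a multimodal traffic congestion classification framework that combines vision-language reasoning, object detection, and motion analysis to predict congestion levels (1 to 5) in a semantically aligned and temporally consistent way; it significantly outperforms unimodal baselines in accuracy, F1 score, and weighted Kappa.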
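The weighted Kappa mentioned above is the Quadratic Weighted Kappa (QWK), which penalizes each ordinal error by the squared distance between the predicted and true levels. A minimal pure-Python sketch of the standard formula follows; the function name, the default 1 to 5 level range, and the toy labels are illustrative choices, and this is not the paper's evaluation code.

```python
from typing import Sequence


def quadratic_weighted_kappa(y_true: Sequence[int], y_pred: Sequence[int],
                             min_level: int = 1, max_level: int = 5) -> float:
    """QWK = 1 - sum(w * observed) / sum(w * expected), with w = (i - j)^2 / (N - 1)^2."""
    n_levels = max_level - min_level + 1
    n = len(y_true)
    # Confusion matrix: rows are true levels, columns are predicted levels.
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_level][p - min_level] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    num = 0.0
    den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            expected = true_hist[i] * pred_hist[j] / n  # chance agreement
            num += weight * observed[i][j]
            den += weight * expected
    # Note: den is zero when every label and every prediction share one level.
    return 1.0 - num / den


perfect = quadratic_weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
reversed_kappa = quadratic_weighted_kappa([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
```

Perfect agreement gives 1, chance-level agreement gives about 0, and systematically reversed predictions give a negative value, which is why QWK suits an ordinal scale better than plain accuracy.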
Details
Motivation: Accurate traffic congestion classification is essential for intelligent transportation systems and real-time urban traffic management, and existing methods fall short in semantic alignment, temporal consistency, and interpretability.
Method: Combines the open-vocabulary vision-language model CLIP, YOLO-World object detection, and motion analysis via MOG2 background subtraction, adds motion-based confidence weighting, and generates annotated visual outputs to improve interpretability.
Result: The model reaches 76.7% accuracy, an F1 score of 0.752, and a Quadratic Weighted Kappa of 0.684, significantly outperforming unimodal baselines and confirming that multimodal fusion preserves the ordinal structure of congestion levels.
Conclusion: The multimodal framework performs well on traffic congestion classification, effectively fuses vision-language semantics with motion information, and offers good interpretability and application potential; future work will add vehicle sizing and refined density metrics.
Abstract: Accurate traffic congestion classification is essential for intelligent transportation systems and real-time urban traffic management. This paper presents a multimodal framework combining open-vocabulary visual-language reasoning (CLIP), object detection (YOLO-World), and motion analysis via MOG2-based background subtraction. The system predicts congestion levels on an ordinal scale from 1 (free flow) to 5 (severe congestion), enabling semantically aligned and temporally consistent classification. To enhance interpretability, we incorporate motion-based confidence weighting and generate annotated visual outputs. Experimental results show the model achieves 76.7% accuracy, an F1 score of 0.752, and a Quadratic Weighted Kappa (QWK) of 0.684, significantly outperforming unimodal baselines. These results demonstrate the framework's effectiveness in preserving ordinal structure and leveraging visual-language and motion modalities. Future enhancements include incorporating vehicle sizing and refined density metrics.
[187] Ortho-Fuse: Orthomosaic Generation for Sparse High-Resolution Crop Health Maps Through Intermediate Optical Flow Estimation
Rugved Katole,Christopher Stewart
Main category: cs.CV
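TL;DR: This paper proposes Ortho-Fuse, an optical flow-based framework that produces reliable orthomosaics from low-overlap aerial images, reducing the dependence of traditional photogrammetry on 70-80% image overlap. Experiments show a 20% reduction in the minimum overlap requirement, which could help raise farmer adoption of AI-driven crop health monitoring.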
Details
Motivation: Traditional photogrammetry needs high image overlap (70-80%) for accurate geometric reconstruction, which resource-constrained AI agricultural monitoring systems often cannot meet; reconstruction quality then degrades and farmer adoption suffers.
Method: Proposes the Ortho-Fuse framework, which uses optical flow estimation to synthesize transitional images between consecutive aerial frames, adding feature correspondences to strengthen geometric reconstruction and thus lowering the overlap requirement.
Result: Experiments confirm that the method lowers the minimum image overlap requirement by 20% and markedly improves orthomosaic quality on sparse image sets.
Conclusion: Ortho-Fuse eases the problem of reconstructing aerial imagery under low overlap, makes AI agricultural monitoring systems more practical and reliable, and offers a feasible path toward wider use of AI in precision agriculture.
Abstract: AI-driven crop health mapping systems offer substantial advantages over conventional monitoring approaches through accelerated data acquisition and cost reduction. However, widespread farmer adoption remains constrained by technical limitations in orthomosaic generation from sparse aerial imagery datasets. Traditional photogrammetric reconstruction requires 70-80% inter-image overlap to establish sufficient feature correspondences for accurate geometric registration. AI-driven systems operating under resource-constrained conditions cannot consistently achieve these overlap thresholds, resulting in degraded reconstruction quality that undermines user confidence in autonomous monitoring technologies. In this paper, we present Ortho-Fuse, an optical flow-based framework that enables the generation of a reliable orthomosaic with reduced overlap requirements. Our approach employs intermediate flow estimation to synthesize transitional imagery between consecutive aerial frames, artificially augmenting feature correspondences for improved geometric reconstruction. Experimental validation demonstrates a 20% reduction in minimum overlap requirements. We further analyze adoption barriers in precision agriculture to identify pathways for enhanced integration of AI-driven monitoring systems.
[188] PointMAC: Meta-Learned Adaptation for Robust Test-Time Point Cloud Completion
Linlian Jiang,Rui Ma,Li Gu,Ziqiang Wang,Xinxin Zuo,Yang Wang
Main category: cs.CV
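TL;DR: This paper proposes PointMAC, a meta-learned framework for point cloud completion that adapts at test time through self-supervised auxiliary objectives and a meta-auxiliary learning strategy, markedly improving robustness to novel structural patterns and sensor distortions. It is the first work to apply meta-auxiliary test-time adaptation to point cloud completion.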
Details
Motivation: Existing point cloud completion models rely on inductive biases learned during training and cannot adapt to novel structural patterns or sensor distortions at test time, which limits their robustness in safety-critical applications.
Method: Proposes the PointMAC framework, which optimizes the shared encoder online at test time using two self-supervised auxiliary objectives that simulate structural and sensor-level incompleteness; adopts a MAML-based meta-auxiliary learning strategy and adds an Adaptive λ-Calibration mechanism that balances gradients between the primary and auxiliary tasks.
Result: Achieves state-of-the-art performance on synthetic, simulated, and real-world datasets, refining each sample individually to produce high-quality completions and improving robustness and adaptability.
Conclusion: PointMAC is the first to realize meta-auxiliary test-time adaptation for point cloud completion; it improves completion quality and robustness without additional supervision and offers a new approach to deploying 3D perception systems in practice.
Abstract: Point cloud completion is essential for robust 3D perception in safety-critical applications such as robotics and augmented reality. However, existing models perform static inference and rely heavily on inductive biases learned during training, limiting their ability to adapt to novel structural patterns and sensor-induced distortions at test time. To address this limitation, we propose PointMAC, a meta-learned framework for robust test-time adaptation in point cloud completion. It enables sample-specific refinement without requiring additional supervision. Our method optimizes the completion model under two self-supervised auxiliary objectives that simulate structural and sensor-level incompleteness. A meta-auxiliary learning strategy based on Model-Agnostic Meta-Learning (MAML) ensures that adaptation driven by auxiliary objectives is consistently aligned with the primary completion task. During inference, we adapt the shared encoder on-the-fly by optimizing auxiliary losses, with the decoder kept fixed. To further stabilize adaptation, we introduce Adaptive $\lambda$-Calibration, a meta-learned mechanism for balancing gradients between primary and auxiliary objectives. Extensive experiments on synthetic, simulated, and real-world datasets demonstrate that PointMAC achieves state-of-the-art results by refining each sample individually to produce high-quality completions. To the best of our knowledge, this is the first work to apply meta-auxiliary test-time adaptation to point cloud completion.
[189] Vision4PPG: Emergent PPG Analysis Capability of Vision Foundation Models for Vital Signs like Blood Pressure
Saurabh Kataria,Ayca Ermis,Lovely Yeswanth Panchumarthi,Minxiao Wang,Xiao Hu
Main category: cs.CV
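TL;DR: Proposes Vision4PPG, which converts PPG signals into 2D image representations and uses vision foundation models (e.g., DINOv3, SIGLIP-2) to reach SOTA performance in physiological signal analysis, most notably on blood pressure estimation.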
Details
Motivation: To explore the potential of vision models, an unconventional choice, for physiological signal processing, to address the limitations of existing time-series models on PPG tasks, and to improve generalization and computational efficiency.
Method: One-dimensional PPG signals are converted into 2D representations (e.g., STFT, phase maps, recurrence plots), and recent vision foundation models (VFMs) are trained and evaluated with parameter-efficient fine-tuning (PEFT).
Result: The approach reaches SOTA performance on multiple physiological tasks (blood pressure estimation, vital signs, and blood lab measurement prediction), shows generality and robustness on six additional tasks, and remains computationally efficient.
Conclusion: Vision4PPG extends vision foundation models to PPG signal processing and gives clinical research an efficient, general-purpose tool.
Abstract: Photoplethysmography (PPG) sensors in wearable and clinical devices provide valuable physiological insights in a non-invasive and real-time fashion. Specialized Foundation Models (FM) or repurposed time-series FMs are used to benchmark physiological tasks. Our experiments with fine-tuning FMs reveal that Vision FM (VFM) can also be utilized for this purpose and, in fact, surprisingly leads to state-of-the-art (SOTA) performance on many tasks, notably blood pressure estimation. We leverage VFMs by simply transforming one-dimensional PPG signals into image-like two-dimensional representations, such as the Short-Time Fourier transform (STFT). Using the latest VFMs, such as DINOv3 and SIGLIP-2, we achieve promising performance on other vital signs and blood lab measurement tasks as well. Our proposal, Vision4PPG, unlocks a new class of FMs to achieve SOTA performance with notable generalization to other 2D input representations, including STFT phase and recurrence plots. Our work improves upon prior investigations of vision models for PPG by conducting a comprehensive study, comparing them to state-of-the-art time-series FMs, and demonstrating the general PPG processing ability by reporting results on six additional tasks. Thus, we provide clinician-scientists with a new set of powerful tools that is also computationally efficient, thanks to Parameter-Efficient Fine-Tuning (PEFT) techniques.
[190] Self-Supervised Multi-Scale Transformer with Attention-Guided Fusion for Efficient Crack Detection
Blessing Agyei Kyem,Joshua Kofi Asamoah,Eugene Denteh,Andrews Danyo,Armstrong Aboah
Main category: cs.CV
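TL;DR: This paper proposes Crack-Segmenter, a fully self-supervised framework for pavement crack detection that requires no manual annotation and outperforms 13 state-of-the-art supervised methods on ten public datasets.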
Details
Motivation: Conventional pavement crack detection depends on costly, time-consuming pixel-level annotation, which limits the scalability of large-scale infrastructure monitoring. A method is needed that achieves effective pixel-level crack segmentation without manual annotation.
Method: A fully self-supervised framework, Crack-Segmenter, is proposed with three modules: the Scale-Adaptive Embedder (SAE) for multi-scale feature extraction, the Directional Attention Transformer (DAT) for maintaining linear crack continuity, and the Attention-Guided Fusion (AGF) module for adaptive feature fusion.
Result: On ten public datasets, Crack-Segmenter consistently outperforms 13 state-of-the-art supervised methods on the major metrics, including mIoU, Dice score, XOR, and Hausdorff distance.
Conclusion: Annotation-free crack detection is not only feasible but also performs better, giving transportation agencies and infrastructure managers a scalable, cost-effective monitoring option and advancing self-supervised learning for pavement crack detection.
Abstract: Pavement crack detection has long depended on costly and time-intensive pixel-level annotations, which limit its scalability for large-scale infrastructure monitoring. To overcome this barrier, this paper examines the feasibility of achieving effective pixel-level crack segmentation entirely without manual annotations. Building on this objective, a fully self-supervised framework, Crack-Segmenter, is developed, integrating three complementary modules: the Scale-Adaptive Embedder (SAE) for robust multi-scale feature extraction, the Directional Attention Transformer (DAT) for maintaining linear crack continuity, and the Attention-Guided Fusion (AGF) module for adaptive feature integration. Through evaluations on ten public datasets, Crack-Segmenter consistently outperforms 13 state-of-the-art supervised methods across all major metrics, including mean Intersection over Union (mIoU), Dice score, XOR, and Hausdorff Distance (HD). These findings demonstrate that annotation-free crack detection is not only feasible but also superior, enabling transportation agencies and infrastructure managers to conduct scalable and cost-effective monitoring. This work advances self-supervised learning and motivates further research on pavement crack detection.
[191] Identifying bias in CNN image classification using image scrambling and transforms
Sai Teja Erukude
Main category: cs.CV
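TL;DR: This paper examines hidden biases that can arise in convolutional neural network (CNN) image classification and proposes two ways to identify dataset bias without relying on blank backgrounds: shuffling image tiles, and applying combinations of image transforms (Fourier, wavelet, and median filtering). Experiments show that these methods can distinguish contextual information from background noise.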
Details
Motivation: Because CNNs extract features automatically in a data-driven way, they are often regarded as "black boxes" whose decisions may be influenced by background bias that is hard to detect. Methods are needed to reveal and identify such hidden biases.
Method: Two methods are proposed. The first splits images into non-overlapping tiles of various sizes and shuffles them randomly. The second applies image transforms, including the Fourier transform, the wavelet transform, median filtering, and their combinations, to recover and analyze the background noise information that a CNN uses for classification.
Result: Experiments on six natural, synthetic, and hybrid datasets show that the proposed methods can identify whether a CNN relies on background noise for classification and can distinguish contextual information from irrelevant background noise, detecting bias even when no blank background is available.
Conclusion: Tile shuffling and image transforms can reveal dataset bias in CNNs when blank backgrounds are unavailable, improving model interpretability and reliability.
Abstract: CNNs are now prevalent as the primary choice for most machine vision problems due to their superior rate of classification and the availability of user-friendly libraries. These networks effortlessly identify and select features in a non-intuitive data-driven manner, making it difficult to determine which features were most influential. That leads to a ``black box'', where users cannot know how the image data are analyzed but rely on empirical results. Therefore, the decision-making process can be biased by background information that is difficult to detect. Here we discuss examples of such hidden biases and propose techniques for identifying them, methods to distinguish between contextual information and background noise, and explore whether CNNs learn from irrelevant features. One effective approach to identify dataset bias is to classify blank background parts of the images. However, in some situations a blank background in the images is not available, making it more difficult to separate the foreground information from the blank background. Such parts of the image can also be considered contextual learning, not necessarily bias. To overcome this, we propose two approaches that were tested on six different datasets, including natural, synthetic, and hybrid datasets. The first method involves dividing images into smaller, non-overlapping tiles of various sizes, which are then shuffled randomly, making classification more challenging. The second method involves the application of several image transforms, including Fourier, Wavelet transforms, and Median filter, and their combinations.
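The tile-shuffling idea described in this abstract (split an image into non-overlapping tiles, then permute them at random) can be illustrated with a minimal sketch. This is not the authors' code: the function name `shuffle_tiles`, the nested-list image format, and the requirement that the image dimensions be exact multiples of the tile size are all assumptions made for illustration.

```python
import random


def shuffle_tiles(image, tile, seed=None):
    """Split a 2D image (list of equal-length rows) into non-overlapping
    tile x tile blocks and return a new image with the blocks permuted.

    Hypothetical helper: the paper does not specify how edge remainders
    are handled, so this sketch simply rejects sizes that do not divide.
    """
    height, width = len(image), len(image[0])
    if height % tile or width % tile:
        raise ValueError("image dimensions must be multiples of the tile size")
    rows, cols = height // tile, width // tile

    # Top-left corner of every tile, in row-major order.
    corners = [(r * tile, c * tile) for r in range(rows) for c in range(cols)]
    sources = corners[:]
    random.Random(seed).shuffle(sources)

    out = [[None] * width for _ in range(height)]
    # Copy the tile found at each source corner into the destination corner.
    for (dy, dx), (sy, sx) in zip(corners, sources):
        for i in range(tile):
            out[dy + i][dx:dx + tile] = image[sy + i][sx:sx + tile]
    return out


# Example: a 4x4 image whose pixel values are 0..15, shuffled in 2x2 tiles.
image = [[y * 4 + x for x in range(4)] for y in range(4)]
shuffled = shuffle_tiles(image, tile=2, seed=0)
```

Calling the function again with a different `tile` value gives the "various sizes" variation mentioned in the abstract. Smaller tiles destroy more of the spatial context while leaving per-pixel statistics unchanged.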
These transforms help recover background noise information used by the CNN to classify images. Results indicate that this method can effectively distinguish between contextual information and background noise, and alert on the presence of background noise even without the need to use background information.
[192] AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
Xinlong Chen,Yue Ding,Weihong Lin,Jingyun Hua,Linli Yao,Yang Shi,Bozhou Li,Yuanxing Zhang,Qiang Liu,Pengfei Wan,Liang Wang,Tieniu Tan
Main category: cs.CV
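TL;DR: This paper presents AVoCaDO, an audiovisual video captioner driven by temporal orchestration between the audio and visual modalities. With a two-stage post-training strategy (SFT and GRPO), it significantly outperforms existing open-source models on multiple benchmarks.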
Details
Motivation: To improve the temporal alignment and semantic richness of events in audiovisual video captioning, and to strengthen the model's ability to understand and generate from multimodal information.
Method: A two-stage post-training pipeline is proposed. The model is first fine-tuned with supervision on 107K high-quality, temporally aligned captions (AVoCaDO SFT), then optimized with group relative policy optimization using tailored reward functions (AVoCaDO GRPO) to improve temporal coherence and dialogue accuracy.
Result: AVoCaDO significantly outperforms existing open-source models on four audiovisual captioning benchmarks and is also competitive on the visual-only VDC and DREAM-1K tasks.
Conclusion: Through fine-grained multimodal temporal modeling and its optimization strategy, AVoCaDO improves the quality and robustness of audiovisual video captioning.
Abstract: Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. In this paper, we present AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. We propose a two-stage post-training pipeline: (1) AVoCaDO SFT, which fine-tunes the model on a newly curated dataset of 107K high-quality, temporally-aligned audiovisual captions; and (2) AVoCaDO GRPO, which leverages tailored reward functions to further enhance temporal coherence and dialogue accuracy while regularizing caption length and reducing collapse. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance on the VDC and DREAM-1K benchmarks under visual-only settings.
[193] Mesh-Gait: A Unified Framework for Gait Recognition Through Multi-Modal Representation Learning from 2D Silhouettes
Zhao-Yang Wang,Jieneng Chen,Jiang Liu,Yuxiang Guo,Rama Chellappa
Main category: cs.CV
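TL;DR: Proposes Mesh-Gait, an end-to-end multi-modal gait recognition framework that reconstructs 3D heatmaps directly from 2D silhouettes as an intermediate representation, achieving state-of-the-art recognition accuracy while remaining computationally efficient.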
Details
Motivation: Existing 2D gait recognition methods degrade under viewpoint variation, occlusion, and noise, while multi-modal 3D methods are more robust but computationally expensive and hard to use in real-time settings.
Method: Mesh-Gait reconstructs 3D heatmaps from 2D silhouettes as an intermediate representation, fusing 2D silhouette features with 3D geometric information. Under supervised learning, it progressively optimizes reconstruction losses on 3D joints, virtual markers, and meshes to achieve precise spatial alignment.
Result: Across multiple experiments, Mesh-Gait significantly outperforms existing methods and reaches state-of-the-art recognition accuracy while keeping computational cost low.
Conclusion: Mesh-Gait balances the accuracy of 3D geometric modeling against computational efficiency, providing a practical, high-performing solution for real-time multi-modal gait recognition.
Abstract: Gait recognition, a fundamental biometric technology, leverages unique walking patterns for individual identification, typically using 2D representations such as silhouettes or skeletons. However, these methods often struggle with viewpoint variations, occlusions, and noise. Multi-modal approaches that incorporate 3D body shape information offer improved robustness but are computationally expensive, limiting their feasibility for real-time applications. To address these challenges, we introduce Mesh-Gait, a novel end-to-end multi-modal gait recognition framework that directly reconstructs 3D representations from 2D silhouettes, effectively combining the strengths of both modalities. Compared to existing methods, directly learning 3D features from 3D joints or meshes is complex and difficult to fuse with silhouette-based gait features. To overcome this, Mesh-Gait reconstructs 3D heatmaps as an intermediate representation, enabling the model to effectively capture 3D geometric information while maintaining simplicity and computational efficiency. During training, the intermediate 3D heatmaps are gradually reconstructed and become increasingly accurate under supervised learning, where the loss is calculated between the reconstructed 3D joints, virtual markers, and 3D meshes and their corresponding ground truth, ensuring precise spatial alignment and consistent 3D structure. Mesh-Gait extracts discriminative features from both silhouettes and reconstructed 3D heatmaps in a computationally efficient manner.
This design enables the model to capture spatial and structural gait characteristics while avoiding the heavy overhead of direct 3D reconstruction from RGB videos, allowing the network to focus on motion dynamics rather than irrelevant visual details. Extensive experiments demonstrate that Mesh-Gait achieves state-of-the-art accuracy. The code will be released upon acceptance of the paper.
[194] Guided Image Feature Matching using Feature Spatial Order
Chin-Hung Teng,Ben-Jian Dong
Main category: cs.CV
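TL;DR: This paper proposes a progressive feature matching method that combines feature spatial order with epipolar geometry. A feature spatial order model predicts the spatial range of subsequent matches, which filters out invalid matches and improves matching efficiency and accuracy.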
Details
Motivation: Conventional image feature matching is inefficient when there are many features and is sensitive to image rotation, so a more efficient and robust matching method is needed.
Method: The concept of feature spatial order is introduced into a progressive matching framework. Initial matches are used to build a spatial order model, which is combined with epipolar geometry as a constraint. An image alignment method based on the fundamental matrix is also proposed to remove the effect of rotation.
Result: Experiments on a standard dataset, simulated images, and real images show that the method significantly improves matching efficiency and accuracy.
Conclusion: The proposed method improves the efficiency and accuracy of feature matching and is particularly suited to scenes with high feature density and rotation changes.
Abstract: Image feature matching plays a vital role in many computer vision tasks. Although many image feature detection and matching techniques have been proposed over the past few decades, it is still time-consuming to match feature points in two images, especially for images with a large number of detected features. Feature spatial order can estimate the probability that a pair of features is correct. Since it is a completely independent concept from epipolar geometry, it can be used to complement epipolar geometry in guiding feature matching in a target region so as to improve matching efficiency. In this paper, we integrate the concept of feature spatial order into a progressive matching framework. We use some of the initially matched features to build a computational model of feature spatial order and employ it to calculate the possible spatial range of subsequent feature matches, thus filtering out unnecessary feature matches. We also integrate it with epipolar geometry to further improve matching efficiency and accuracy. Since the spatial order of feature points is affected by image rotation, we propose a suitable image alignment method from the fundamental matrix of epipolar geometry to remove the effect of image rotation. To verify the feasibility of the proposed method, we conduct a series of experiments, including a standard benchmark dataset, self-generated simulated images, and real images. The results demonstrate that our proposed method is significantly more efficient and has more accurate feature matching than the traditional method.
[195] Combo-Gait: Unified Transformer Framework for Multi-Modal Gait Recognition and Attribute Analysis
Zhao-Yang Wang,Zhimin Shao,Jieneng Chen,Rama Chellappa
Main category: cs.CV
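TL;DR: This paper proposes a multi-modal, multi-task gait recognition framework that combines 2D silhouettes with 3D SMPL features, fused by a unified transformer. On the large-scale BRIAR datasets it outperforms existing methods in recognition and accurately estimates human attributes such as age, BMI, and gender.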
Details
Motivation: A single modality (2D or 3D) cannot fully capture the geometric and dynamic complexity of human walking, and existing methods perform poorly in challenging conditions such as long range and low resolution.
Method: A multi-modal, multi-task framework is proposed that fuses 2D temporal silhouettes with 3D SMPL features through a unified transformer and jointly performs gait recognition and human attribute estimation (age, BMI, gender).
Result: Experiments on the BRIAR datasets show that the method outperforms state-of-the-art approaches in both gait recognition and human attribute estimation, including under challenging conditions with distances up to 1 km and pitch angles up to 50°.
Conclusion: Multi-modal and multi-task learning improves gait recognition and human understanding in complex scenes and shows promise for practical use.
Abstract: Gait recognition is an important biometric for human identification at a distance, particularly under low-resolution or unconstrained environments. Current works typically focus on either 2D representations (e.g., silhouettes and skeletons) or 3D representations (e.g., meshes and SMPLs), but relying on a single modality often fails to capture the full geometric and dynamic complexity of human walking patterns. In this paper, we propose a multi-modal and multi-task framework that combines 2D temporal silhouettes with 3D SMPL features for robust gait analysis. Beyond identification, we introduce a multitask learning strategy that jointly performs gait recognition and human attribute estimation, including age, body mass index (BMI), and gender. A unified transformer is employed to effectively fuse multi-modal gait features and better learn attribute-related representations, while preserving discriminative identity cues. Extensive experiments on the large-scale BRIAR datasets, collected under challenging conditions such as long-range distances (up to 1 km) and extreme pitch angles (up to 50°), demonstrate that our approach outperforms state-of-the-art methods in gait recognition and provides accurate human attribute estimation. These results highlight the promise of multi-modal and multitask learning for advancing gait-based human understanding in real-world scenarios.
[196] Towards Cybersickness Severity Classification from VR Gameplay Videos Using Transfer Learning and Temporal Modeling
Jyotirmay Nag Setu,Kevin Desai,John Quarles
Main category: cs.CV
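TL;DR: This study uses InceptionV3 and an LSTM to extract visual features from VR gameplay videos and predict cybersickness severity, reaching 68.4% accuracy and outperforming existing video-only models, which offers a new way to improve the VR user experience.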
Details
Motivation: Existing research has paid little attention to predicting cybersickness from video-based features, and cybersickness continues to limit the wide adoption of VR.
Method: Transfer learning is used: an InceptionV3 model pretrained on ImageNet extracts high-level visual features from VR gameplay videos, and these features are fed to an LSTM network that captures temporal dynamics and predicts cybersickness severity over time.
Result: The model reaches 68.4% accuracy in classifying cybersickness severity, outperforming existing models trained only on video data.
Conclusion: The method makes effective use of the temporal nature of video data, gives VR developers a practical tool for evaluating and mitigating cybersickness, and lays the groundwork for future research on video-based temporal modeling.
Abstract: With the rapid advancement of virtual reality (VR) technology, its adoption across domains such as healthcare, education, and entertainment has grown significantly. However, the persistent issue of cybersickness, marked by symptoms resembling motion sickness, continues to hinder widespread acceptance of VR. While recent research has explored multimodal deep learning approaches leveraging data from integrated VR sensors like eye and head tracking, there remains limited investigation into the use of video-based features for predicting cybersickness. In this study, we address this gap by utilizing transfer learning to extract high-level visual features from VR gameplay videos using the InceptionV3 model pretrained on the ImageNet dataset. These features are then passed to a Long Short-Term Memory (LSTM) network to capture the temporal dynamics of the VR experience and predict cybersickness severity over time. Our approach effectively leverages the time-series nature of video data, achieving a 68.4% classification accuracy for cybersickness severity. This surpasses the performance of existing models trained solely on video data, providing a practical tool for VR developers to evaluate and mitigate cybersickness in virtual environments. Furthermore, this work lays the foundation for future research on video-based temporal modeling for enhancing user comfort in VR applications.
[197] Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs
Suyang Xi,Chenxi Yang,Hong Ding,Yiqing Ni,Catherine C. Liu,Yunhao Liu,Chengqi Zhang
Main category: cs.CV
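TL;DR: Proposes HuLiRAG, a framework whose "what-where-reweight" cascade improves the grounding accuracy and factual consistency of multimodal large language models in fine-grained visual question answering and reduces hallucinations.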
Details
Motivation: Existing MLLMs tend to hallucinate about object identities, positions, and relations in fine-grained visual question answering, and RAG methods lack local detail and human-like reasoning.
Method: The framework uses open-vocabulary detection (what), SAM-generated masks for spatial localization (where), and reweighting through the trade-off between local and global alignment (reweight). Mask-guided fine-tuning explicitly injects spatial evidence into the generation process.
Result: Experiments show that the method markedly improves grounding fidelity and factual consistency and reduces hallucinations.
Conclusion: With its human-like three-stage cascade, HuLiRAG moves multimodal question answering toward more reliable reasoning.
Abstract: Multimodal large language models (MLLMs) often fail in fine-grained visual question answering, producing hallucinations about object identities, positions, and relations because textual queries are not explicitly anchored to visual referents. Retrieval-augmented generation (RAG) alleviates some errors, but it fails to align with human-like processing at both the retrieval and augmentation levels. Specifically, it focuses only on global-level image information but lacks local detail and limits reasoning about fine-grained interactions. To overcome this limitation, we present Human-Like Retrieval-Augmented Generation (HuLiRAG), a framework that stages multimodal reasoning as a ``what--where--reweight'' cascade. Queries are first anchored to candidate referents via open-vocabulary detection (what), then spatially resolved with SAM-derived masks to recover fine-grained precision (where), and adaptively prioritized through the trade-off between local and global alignment (reweight). Mask-guided fine-tuning further injects spatial evidence into the generation process, transforming grounding from a passive bias into an explicit constraint on answer formulation. Extensive experiments demonstrate that this human-like cascade improves grounding fidelity and factual consistency while reducing hallucinations, advancing multimodal question answering toward trustworthy reasoning.
[198] MonoSE(3)-Diffusion: A Monocular SE(3) Diffusion Framework for Robust Camera-to-Robot Pose Estimation
Kangjian Zhu,Haobo Jiang,Yigong Zhang,Jianjun Qian,Jian Yang,Jin Xie
Main category: cs.CV
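TL;DR: Proposes MonoSE(3)-Diffusion, a monocular robot pose estimation framework based on conditional denoising diffusion. By introducing visibility constraints and a timestep-aware reverse process, it significantly outperforms existing methods on two benchmarks.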
Details
Motivation: Existing methods use fixed-scale perturbations for pose estimation, which lack diversity and can push training samples out of the field of view, limiting generalization.
Method: Pose estimation is modeled as a conditional diffusion process, with a visibility-constrained diffusion process that generates diverse training poses and a timestep-aware reverse process that progressively refines the pose estimate.
Result: The method improves results on two benchmarks, DREAM and RoboKeyGen, reaching an AUC of 66.75 on the most challenging dataset, a 32.3% gain over the current state of the art.
Conclusion: Through controllable noise generation and progressive refinement, the proposed diffusion framework improves the robustness and accuracy of monocular, markerless robot pose estimation.
Abstract: We propose MonoSE(3)-Diffusion, a monocular SE(3) diffusion framework that formulates markerless, image-based robot pose estimation as a conditional denoising diffusion process. The framework consists of two processes: a visibility-constrained diffusion process for diverse pose augmentation and a timestep-aware reverse process for progressive pose refinement. The diffusion process progressively perturbs ground-truth poses to noisy transformations for training a pose denoising network. Importantly, we integrate visibility constraints into the process, ensuring the transformations remain within the camera field of view. Compared to the fixed-scale perturbations used in current methods, the diffusion process generates in-view and diverse training poses, thereby improving the network generalization capability. Furthermore, the reverse process iteratively predicts the poses by the denoising network and refines pose estimates by sampling from the diffusion posterior of current timestep, following a scheduled coarse-to-fine procedure. Moreover, the timestep indicates the transformation scales, which guide the denoising network to achieve more accurate pose predictions. The reverse process demonstrates higher robustness than direct prediction, benefiting from its timestep-aware refinement scheme. Our approach demonstrates improvements across two benchmarks (DREAM and RoboKeyGen), achieving a notable AUC of 66.75 on the most challenging dataset, representing a 32.3% gain over the state-of-the-art.
[199] On the Problem of Consistent Anomalies in Zero-Shot Industrial Anomaly Detection
Tai Le-Gia,Ahn Jaehyun
Main category: cs.CV
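TL;DR: This paper proposes CoDeGraph, a new method for zero-shot image anomaly classification and segmentation that builds an image-level graph and uses community detection to filter consistent anomalies, significantly outperforming existing methods on multiple benchmarks.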
Details
Motivation: Existing representation-based methods handle consistent anomalies, which recur across industrial images, poorly, and this degrades anomaly classification and segmentation performance.
Method: The CoDeGraph algorithm identifies consistent anomalies through the "neighbor-burnout" phenomenon: the similarity of normal patches rises gradually, whereas anomalous patches show an abrupt spike after a limited set of matches is exhausted. An image-level graph is built and community detection is applied to filter these anomalies, with Extreme Value Theory providing theoretical support.
Result: On MVTec AD with the ViT-L-14-336 backbone, the method reaches 98.3% AUROC, 66.8% F1 (+4.2%), and 68.1% AP (+5.4%). With the DINOv2 backbone, it improves further to 69.1% F1 (+6.5%) and 71.9% AP (+9.2%).
Conclusion: CoDeGraph effectively identifies and filters consistent anomalies, significantly improves zero-shot anomaly classification and segmentation, and is robust across backbones.
Abstract: Zero-shot image anomaly classification (AC) and segmentation (AS) are vital for industrial quality control, detecting defects without prior training data. Existing representation-based methods compare patch features with nearest neighbors in unlabeled test images but struggle with consistent anomalies -- similar defects recurring across multiple images -- resulting in poor AC/AS performance. We introduce Consistent-Anomaly Detection Graph (CoDeGraph), a novel algorithm that identifies and filters consistent anomalies from similarity computations. Our key insight is that normal patches in industrial images show stable, gradually increasing similarity to other test images, while consistent-anomaly patches exhibit abrupt similarity spikes after exhausting a limited set of similar matches, a phenomenon we term ``neighbor-burnout.'' CoDeGraph constructs an image-level graph, with images as nodes and edges connecting those with shared consistent-anomaly patterns, using community detection to filter these anomalies. We provide a theoretical foundation using Extreme Value Theory to explain the effectiveness of our approach. Experiments on MVTec AD with the ViT-L-14-336 backbone achieve 98.3% AUROC for AC and AS performance of 66.8% (+4.2%) F1 and 68.1% (+5.4%) AP over state-of-the-art zero-shot methods. Using the DINOv2 backbone further improves segmentation, yielding 69.1% (+6.5%) F1 and 71.9% (+9.2%) AP, demonstrating robustness across architectures.
[200] Learning from Disagreement: A Group Decision Simulation Framework for Robust Medical Image Segmentation
Chen Zhong,Yuxuan Yang,Xinyue Zhang,Ruohan Ma,Yong Guo,Gang Li,Jupeng Li
Main category: cs.CV
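TL;DR: Proposes a new group decision simulation framework that mimics the collaborative decision-making of a clinical expert panel, treating disagreement among expert annotations as a useful signal rather than noise, and markedly improves medical image segmentation performance.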
Details
Motivation: Medical image segmentation annotations exhibit inter-rater variability (IRV), and conventional methods that simply average the labels discard valuable information about clinical uncertainty.
Method: An Expert Signature Generator (ESG) learns a style representation for each annotator, and a Simulated Consultation Module (SCM) generates the final segmentation from the latent space.
Result: The method reaches state-of-the-art Dice scores of 92.11% and 90.72% on CBCT and MRI datasets.
Conclusion: Modeling expert disagreement as a useful signal improves the robustness and trustworthiness of AI systems in healthcare settings.
Abstract: Medical image segmentation annotation suffers from inter-rater variability (IRV) due to differences in annotators' expertise and the inherent blurriness of medical images. Standard approaches that simply average expert labels are flawed, as they discard the valuable clinical uncertainty revealed in disagreements. We introduce a fundamentally new approach with our group decision simulation framework, which works by mimicking the collaborative decision-making process of a clinical panel. Under this framework, an Expert Signature Generator (ESG) learns to represent individual annotator styles in a unique latent space. A Simulated Consultation Module (SCM) then intelligently generates the final segmentation by sampling from this space. This method achieved state-of-the-art results on challenging CBCT and MRI datasets (92.11% and 90.72% Dice scores). By treating expert disagreement as a useful signal instead of noise, our work provides a clear path toward more robust and trustworthy AI systems for healthcare.
[201] Post-TIPS Prediction via Multimodal Interaction: A Multi-Center Dataset and Framework for Survival, Complication, and Portal Pressure Assessment
Junhao Dong,Dejia Liu,Ruiqi Ding,Zongxing Chen,Yingjie Huang,Zhu Meng,Jianbo Zhao,Zhicheng Zhao,Fei Su
Main category: cs.CV
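TL;DR: This paper presents MultiTIPS, the first public multi-center dataset for prognosis prediction after transjugular intrahepatic portosystemic shunt (TIPS), and builds on it a new multimodal prognostic framework whose dual-option segmentation, multimodal interaction, and multi-task prediction modules yield a more accurate, robust, and interpretable prognostic model.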
Details
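Motivation: Existing TIPS prognosis models suffer from labor-intensive annotation, poor generalization of unimodal methods, and incomplete assessment from a single endpoint. The lack of public datasets further constrains research.
Method: A three-part multimodal framework is proposed: (1) dual-option segmentation, which combines semi-supervised and foundation-model pipelines to make ROI segmentation more efficient; (2) multimodal interaction, which introduces multi-grained radiomics attention (MGRA), progressive orthogonal disentanglement (POD), and clinically guided prognostic enhancement (CGPE) to promote cross-modal feature fusion; and (3) multi-task prediction, which uses a staged training strategy to jointly optimize prediction of survival, portal pressure gradient (PPG), and overt hepatic encephalopathy (OHE).
Result: Experiments on the MultiTIPS dataset show that the method outperforms state-of-the-art approaches and offers good cross-domain generalization and interpretability.
Conclusion: The proposed multimodal framework performs strongly on TIPS prognosis prediction and has potential for clinical use, and the release of the MultiTIPS dataset should help advance research in this area.
Abstract: Transjugular intrahepatic portosystemic shunt (TIPS) is an established procedure for portal hypertension, but provides variable survival outcomes and frequent overt hepatic encephalopathy (OHE), indicating the necessity of accurate preoperative prognostic modeling. Current studies typically build machine learning models from preoperative CT images or clinical characteristics, but face three key challenges: (1) labor-intensive region-of-interest (ROI) annotation, (2) poor reliability and generalizability of unimodal methods, and (3) incomplete assessment from single-endpoint prediction. Moreover, the lack of publicly accessible datasets constrains research in this field. Therefore, we present MultiTIPS, the first public multi-center dataset for TIPS prognosis, and propose a novel multimodal prognostic framework based on it.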
The framework comprises three core modules: (1) dual-option segmentation, which integrates semi-supervised and foundation model-based pipelines to achieve robust ROI segmentation with limited annotations and facilitate subsequent feature extraction; (2) multimodal interaction, where three techniques, multi-grained radiomics attention (MGRA), progressive orthogonal disentanglement (POD), and clinically guided prognostic enhancement (CGPE), are introduced to enable cross-modal feature interaction and complementary representation integration, thus improving model accuracy and robustness; and (3) multi-task prediction, where a staged training strategy is used to perform stable optimization of survival, portal pressure gradient (PPG), and OHE prediction for comprehensive prognostic assessment. Extensive experiments on MultiTIPS demonstrate the superiority of the proposed method over state-of-the-art approaches, along with strong cross-domain generalization and interpretability, indicating its promise for clinical application. The dataset and code are available.
[202] When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance
Jinjin Cao,Zhiyang Chen,Zijun Wang,Liyuan Ma,Weijian Luo,Guojun Qi
Main category: cs.CV
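TL;DR: Proposes Cross-Modal Guidance (CMG), a training-free decoding method that reduces language bias in vision-language models by contrasting the original model with one whose vision-language attention has been weakened, thereby mitigating hallucinations.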
Details
Motivation: Existing vision-language models (VLMs) hallucinate severely, producing responses that are fluent but unrelated to the image content, mainly because the language modality is overly dominant (language bias).
Method: CMG adaptively masks, during decoding, the attention weights of the most influential image tokens in key transformer layers to build a model with degraded vision-language attention. The difference between its output distribution and that of the original model then guides generation and strengthens perception of the visual context.
Result: CMG is validated on several hallucination-specific benchmarks, where it markedly reduces hallucinations across different VLMs without additional training or conditions, and it generalizes well.
Conclusion: CMG is an efficient, general, training-free decoding strategy that uses a cross-modal contrast mechanism to mitigate hallucinations caused by language bias in VLMs, improving model reliability.
Abstract: Vision-Language Models (VLMs) have shown solid ability for multimodal understanding of both visual and language contexts. However, existing VLMs often face severe challenges of hallucinations, meaning that VLMs tend to generate responses that are only fluent in the language but irrelevant to images in previous contexts. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance (CMG), a training-free decoding method that addresses the hallucinations by leveraging the difference between the output distributions of the original model and the one with degraded visual-language attention. In practice, we adaptively mask the attention weight of the most influential image tokens in selected transformer layers to corrupt the visual-language perception as a concrete type of degradation. Such a degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the ability of VLMs. In experiment sections, we conduct comprehensive studies. All results demonstrate the superior advantages of CMG with neither additional conditions nor training costs. We also quantitatively show CMG can improve different VLMs' performance on hallucination-specific benchmarks and generalize effectively.
[203] DAGLFNet: Deep Attention-Guided Global-Local Feature Fusion for Pseudo-Image Point Cloud Segmentation
Chuang Chen,Wenyi Ge
Main category: cs.CV
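TL;DR: Proposes DAGLFNet, a pseudo-image-based semantic segmentation framework that uses global-local feature fusion, multi-branch feature extraction, and a deep feature-guided attention mechanism to improve both the accuracy and the real-time capability of point cloud semantic segmentation.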
Details
Motivation: Existing pseudo-image methods often overlook structural and semantic detail when processing unstructured point clouds, which weakens feature fusion and makes it hard to balance efficiency and performance.
Method: A Global-Local Feature Fusion Encoding module strengthens the correlation among local features and captures global context. A Multi-Branch Feature Extraction network improves the discriminability of contour features. A deep feature-guided attention mechanism refines cross-channel feature fusion.
Result: The method reaches 69.83% and 78.65% mIoU on the SemanticKITTI and nuScenes validation sets, respectively, combining high performance with real-time capability.
Conclusion: DAGLFNet significantly improves LiDAR point cloud semantic segmentation accuracy while retaining real-time processing, and shows good potential for practical use.
Abstract: Environmental perception systems play a critical role in high-precision mapping and autonomous navigation, with LiDAR serving as a core sensor that provides accurate 3D point cloud data. How to efficiently process unstructured point clouds while extracting structured semantic information remains a significant challenge, and in recent years, numerous pseudo-image-based representation methods have emerged to achieve a balance between efficiency and performance. However, they often overlook the structural and semantic details of point clouds, resulting in limited feature fusion and discriminability. In this work, we propose DAGLFNet, a pseudo-image-based semantic segmentation framework designed to extract discriminative features. First, the Global-Local Feature Fusion Encoding module is used to enhance the correlation among local features within a set and capture global contextual information. Second, the Multi-Branch Feature Extraction network is employed to capture more neighborhood information and enhance the discriminability of contour features. Finally, a Feature Fusion via Deep Feature-guided Attention mechanism is introduced to improve the precision of cross-channel feature fusion. Experimental evaluations show that DAGLFNet achieves 69.83% and 78.65% mIoU on the validation sets of SemanticKITTI and nuScenes, respectively. The method balances high performance with real-time capability, demonstrating great potential for LiDAR-based real-time applications.
[204] MSF-Mamba: Motion-aware State Fusion Mamba for Efficient Micro-Gesture Recognition
Deng Li,Jun Shao,Bohao Xing,Rong Gao,Bihan Wen,Heikki Kälviäinen,Xin Liu
Main category: cs.CV
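TL;DR: Proposes motion-aware state fusion Mamba (MSF-Mamba) for micro-gesture recognition, which fuses local contextual states to better model both long-range and local spatiotemporal dependencies, and introduces a multiscale variant, MSF-Mamba+. Both achieve state-of-the-art performance on public datasets.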
Details
Motivation: Existing models such as CNNs struggle to capture long-range dependencies, Transformers are computationally expensive, and vanilla Mamba cannot model local spatiotemporal dependencies or motion awareness, which limits its performance in micro-gesture recognition.
Method: A motion-aware state fusion module based on central frame difference (CFD) fuses local neighboring states to strengthen Mamba's local spatiotemporal modeling. A multiscale variant, MSF-Mamba+, adds an adaptive scale weighting module that dynamically fuses states across scales.
Result: Experiments on two public micro-gesture recognition datasets show that even the lightweight MSF-Mamba outperforms existing CNN-, Transformer-, and SSM-based models, reaching state-of-the-art results while remaining efficient.
Conclusion: MSF-Mamba overcomes vanilla Mamba's lack of local spatiotemporal modeling and motion awareness in micro-gesture recognition. Its explicit motion-aware state fusion mechanism balances high performance with high efficiency.
Abstract: Micro-gesture recognition (MGR) targets the identification of subtle and fine-grained human motions and requires accurate modeling of both long-range and local spatiotemporal dependencies. While CNNs are effective at capturing local patterns, they struggle with long-range dependencies due to their limited receptive fields. Transformer-based models address this limitation through self-attention mechanisms but suffer from high computational costs. Recently, Mamba has shown promise as an efficient model, leveraging state space models (SSMs) to enable linear-time processing. However, directly applying the vanilla Mamba to MGR may not be optimal. This is because Mamba processes inputs as 1D sequences, with state updates relying solely on the previous state, and thus lacks the ability to model local spatiotemporal dependencies. In addition, previous methods lack a design of motion-awareness, which is crucial in MGR. To overcome these limitations, we propose motion-aware state fusion mamba (MSF-Mamba), which enhances Mamba with local spatiotemporal modeling by fusing local contextual neighboring states. Our design introduces a motion-aware state fusion module based on central frame difference (CFD). Furthermore, a multiscale version named MSF-Mamba+ has been proposed. Specifically, MSF-Mamba+ supports multiscale motion-aware state fusion, as well as an adaptive scale weighting module that dynamically weighs the fused states across different scales.
These enhancements explicitly address the limitations of vanilla Mamba by enabling motion-aware local spatiotemporal modeling, allowing MSF-Mamba and MSF-Mamba to effectively capture subtle motion cues for MGR. Experiments on two public MGR datasets demonstrate that even the lightweight version, namely, MSF-Mamba, achieves SoTA performance, outperforming existing CNN-, Transformer-, and SSM-based models while maintaining high efficiency.[205] Towards Self-Refinement of Vision-Language Models with Triangular Consistency
Yunlong Deng,Guangyi Chen,Tianpei Gu,Lingjing Kong,Yan Li,Zeyu Tang,Kun Zhang
Main category: cs.CV
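TL;DR: Proposes a self-refinement framework based on a triangular consistency principle that lets vision-language models (VLMs) automatically generate high-quality instruction data from unlabeled images and learn autonomously, achieving consistent improvements without external supervision.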
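The filtering step of this entry (keep a synthetic image-query-answer triplet only if each masked element can be reconstructed from the other two) can be illustrated with a small sketch. This is not the authors' implementation: `answer_fn`, `question_fn`, the word-overlap similarity, and the 0.5 threshold are all hypothetical stand-ins chosen for illustration, and only the question and answer corners of the triangle are checked here.

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (image_id, question, answer)


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; a stand-in similarity measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def triangular_filter(
    triplets: List[Triplet],
    answer_fn: Callable[[str, str], str],    # hypothetical: (image, question) -> answer
    question_fn: Callable[[str, str], str],  # hypothetical: (image, answer) -> question
    threshold: float = 0.5,                  # assumed cut-off, not from the paper
) -> List[Triplet]:
    """Keep triplets whose masked answer and masked question are both recovered."""
    kept = []
    for image, question, answer in triplets:
        rebuilt_answer = answer_fn(image, question)   # mask the answer, rebuild it
        rebuilt_question = question_fn(image, answer)  # mask the question, rebuild it
        if (token_overlap(rebuilt_answer, answer) >= threshold
                and token_overlap(rebuilt_question, question) >= threshold):
            kept.append((image, question, answer))
    return kept


# Toy "model": a lookup table standing in for a VLM.
FACTS = {"img1": ("what color is the car", "red")}


def toy_answer(image: str, question: str) -> str:
    return FACTS[image][1]


def toy_question(image: str, answer: str) -> str:
    return FACTS[image][0]


candidates = [
    ("img1", "what color is the car", "red"),   # consistent triplet
    ("img1", "what color is the car", "blue"),  # inconsistent answer
]
kept = triangular_filter(candidates, toy_answer, toy_question)
```

In this toy run only the first candidate survives, because the second one's answer cannot be reconstructed from the image and question.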
Details
Motivation: To explore whether vision-language models have an inherent self-refinement capability that enables autonomous learning without external supervision such as human annotations or environmental feedback.
Method: A self-refinement framework based on triangular consistency (image-query-answer): multi-task instruction tuning gives the model instruction generation ability, triplets are generated from unlabeled images, the triangular consistency principle filters the synthetic data, and the filtered data is used to update the model iteratively. A theoretical analysis from a causal perspective is also provided.
Result: With LLaVA-1.5 as the baseline, experiments show that the model autonomously achieves consistent but modest improvements across multiple benchmarks, supporting the claim that VLMs have an inherent self-refinement capability.
Conclusion: Vision-language models have the potential to refine themselves without external supervision; the proposed framework and the triangular consistency principle offer a new perspective on how VLMs learn and may encourage future research on autonomous learning.
Abstract: Vision-Language Models (VLMs) integrate visual knowledge with the analytical capabilities of Large Language Models (LLMs) through supervised visual instruction tuning, using image-question-answer triplets. However, the potential of VLMs trained without supervised instruction remains largely unexplored. This study validates that VLMs possess inherent self-refinement capabilities, enabling them to generate high-quality supervised data without external inputs and thereby learn autonomously. Specifically, to stimulate the self-refinement ability of VLMs, we propose a self-refinement framework based on a Triangular Consistency principle: within the image-query-answer triangle, any masked elements should be consistently and accurately reconstructed. The framework involves three steps: (1) We enable the instruction generation ability of VLMs by adding multi-task instruction tuning like image$\rightarrow$question-answer or image-answer$\rightarrow$question. (2) We generate image-query-answer triplets from unlabeled images and use the Triangular Consistency principle for filtering. (3) The model is further updated using the filtered synthetic data. To investigate the underlying mechanisms behind this self-refinement capability, we conduct a theoretical analysis from a causal perspective. Using the widely recognized LLaVA-1.5 as our baseline, our experiments reveal that the model can autonomously achieve consistent, though deliberately modest, improvements across multiple benchmarks without any external supervision, such as human annotations or environmental feedback. We expect that the insights of this study on the self-refinement ability of VLMs can inspire future research on the learning mechanism of VLMs. Code is available at https://github.com/dengyl20/SRF-LLaVA-1.5.
[206] Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation
Jiaye Li,Baoyou Chen,Hui Li,Zilong Dong,Jingdong Wang,Siyu Zhu
Main category: cs.CV
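TL;DR: Proposes HARoPE, a head-wise adaptive extension of rotary positional encoding that introduces a learnable linear transformation (parameterized via SVD) to dynamically reallocate frequencies, semantically align rotary planes, and tailor positional receptive fields to individual attention heads, significantly improving Transformer performance in fine-grained image generation.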
Details
Motivation: Standard multi-dimensional RoPE suffers from rigid frequency allocation, axis-wise independence, and uniform treatment of heads in image generation, which makes it hard to model complex structural priors and limits performance on fine-grained spatial relations, color cues, and object counting.
Method: HARoPE inserts a learnable linear transformation, parameterized via SVD, before RoPE, enabling head-wise adaptive frequency reallocation and semantic alignment of rotary planes while preserving RoPE's relative-position property.
Result: On class-conditional ImageNet generation and text-to-image generation (Flux, MMDiT), HARoPE consistently outperforms strong RoPE baselines and other variants, with better generation quality and structural modeling.
Conclusion: HARoPE is a lightweight, plug-and-play improvement to positional encoding that strengthens positional awareness in Transformer-based image generation and offers a principled direction for designing multi-dimensional RoPE.
Abstract: Transformers rely on explicit positional encoding to model structure in data. While Rotary Position Embedding (RoPE) excels in 1D domains, its application to image generation reveals significant limitations in areas such as fine-grained spatial relation modeling, color cues, and object counting. This paper identifies key limitations of standard multi-dimensional RoPE (rigid frequency allocation, axis-wise independence, and uniform head treatment) in capturing the complex structural biases required for fine-grained image generation. We propose HARoPE, a head-wise adaptive extension that inserts a learnable linear transformation parameterized via singular value decomposition (SVD) before the rotary mapping. This lightweight modification enables dynamic frequency reallocation, semantic alignment of rotary planes, and head-specific positional receptive fields while rigorously preserving RoPE's relative-position property. Extensive experiments on class-conditional ImageNet and text-to-image generation (Flux and MMDiT) demonstrate that HARoPE consistently improves performance over strong RoPE baselines and other extensions. The method serves as an effective drop-in replacement, offering a principled and adaptable solution for enhancing positional awareness in transformer-based image generative models.
[207] Jigsaw3D: Disentangled 3D Style Transfer via Patch Shuffling and Masking
Yuteng Ye,Zheng Zhang,Qinchuan Zhang,Di Wang,Youjia Zhang,Wenxiao Zhang,Wei Yang,Yuan Liu
Main category: cs.CV
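TL;DR: Jigsaw3D is a controllable 3D style transfer method built on a multi-view diffusion model; its "jigsaw operation" (spatial shuffling and random masking) separates style from content, enabling fast, view-consistent retexturing of 3D assets.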
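The jigsaw operation named in this entry (shuffle reference patches spatially, then mask some at random) can be sketched on a toy single-channel image. This is an illustrative sketch only: the function name `jigsaw`, the patch size, the mask ratio, and the choice of zero as the mask value are assumptions, not details taken from the paper.

```python
import random
from typing import List

Image = List[List[int]]  # H x W grid of pixel values


def jigsaw(image: Image, patch: int, mask_ratio: float, seed: int = 0) -> Image:
    """Shuffle non-overlapping patches and zero out a random fraction of them.

    Assumes the image height and width are divisible by `patch`.
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    # Top-left corner of every patch, in row-major order.
    coords = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    # Cut the image into patches (each a list of row slices).
    patches = [[row[c:c + patch] for row in image[r:r + patch]] for r, c in coords]
    # Spatial shuffling destroys the global layout (object semantics).
    rng.shuffle(patches)
    # Random masking: pick which output slots stay empty.
    n_mask = int(len(patches) * mask_ratio)
    masked = set(rng.sample(range(len(patches)), n_mask))
    out = [[0] * w for _ in range(h)]
    for idx, (r, c) in enumerate(coords):
        if idx in masked:
            continue  # leave this slot zeroed
        for dr in range(patch):
            out[r + dr][c:c + patch] = patches[idx][dr]
    return out


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
styled = jigsaw(image, patch=2, mask_ratio=0.5, seed=0)
```

Local statistics inside each surviving patch (colors, strokes, texture) are untouched, while the arrangement that would reveal the object is scrambled.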
Details
Motivation: Existing 3D style transfer methods rely on direct style injection or score distillation from 2D diffusion models, which leads to slow per-scene optimization and tends to entangle style with semantic content. A method that decouples style from content while improving efficiency and consistency is therefore needed.
Method: A jigsaw operation spatially shuffles and randomly masks the reference image to suppress object semantics and extract pure style features (such as color, strokes, and texture); reference-to-view cross-attention injects the style into a multi-view diffusion model, which produces view-consistent stylized renderings that are then baked onto the 3D surface as seamless textures.
Result: On standard 3D style transfer benchmarks, Jigsaw3D achieves higher style fidelity and multi-view consistency with substantially lower latency; it also supports partially masked reference stylization, multi-object scene styling, and tileable texture generation.
Conclusion: Jigsaw3D effectively decouples style from content and provides an efficient, flexible, and general 3D style transfer solution that performs well across a range of application scenarios.
Abstract: Controllable 3D style transfer seeks to restyle a 3D asset so that its textures match a reference image while preserving the integrity and multi-view consistency. The prevalent methods either rely on direct reference style token injection or score-distillation from 2D diffusion models, which incurs heavy per-scene optimization and often entangles style with semantic content. We introduce Jigsaw3D, a multi-view diffusion based pipeline that decouples style from content and enables fast, view-consistent stylization. Our key idea is to leverage the jigsaw operation - spatial shuffling and random masking of reference patches - to suppress object semantics and isolate stylistic statistics (color palettes, strokes, textures). We integrate these style cues into a multi-view diffusion model via reference-to-view cross-attention, producing view-consistent stylized renderings conditioned on the input mesh. The renders are then style-baked onto the surface to yield seamless textures. Across standard 3D stylization benchmarks, Jigsaw3D achieves high style fidelity and multi-view consistency with substantially lower latency, and generalizes to masked partial reference stylization, multi-object scene styling, and tileable texture generation. Project page is available at: https://babahui.github.io/jigsaw3D.github.io/
[208] VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
Qunzhong Wang,Jie Liu,Jiajun Liang,Yilei Jiang,Yuanxing Zhang,Jinyuan Chen,Yaozhi Zheng,Xintao Wang,Pengfei Wan,Xiangyu Yue,Jiaheng Liu
Main category: cs.CV
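TL;DR: Proposes VideoReward Thinker (VR-Thinker), a thinking-with-image framework that introduces visual reasoning operations and a configurable visual memory window to address the limited context and information forgetting that existing multimodal reward models face when processing video.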
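The training pipeline for this entry ends with Group Relative Policy Optimization (GRPO). As a rough illustration of the "group relative" idea, the sketch below normalizes the rewards of several responses sampled for the same prompt against the group's own mean and spread. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration; implementations differ in these details, and this is not the paper's training code.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score each sampled response relative to the other samples in its group.

    `rewards` holds one scalar reward per response sampled for the same prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled judgments for one video pair: 1.0 = correct preference, 0.0 = wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct samples get a positive advantage and wrong ones a negative advantage, without a separate learned value model; a group whose rewards are all equal yields zero advantage for every sample.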
Details
Motivation: Existing visual reward models are limited because visual inputs consume too much of the context budget, which restricts the number of frames and loses detail, and they are prone to hallucination and forgetting during reasoning.
Method: The VR-Thinker framework supports visual reasoning operations such as frame selection and a dynamic visual memory window; a reinforcement fine-tuning pipeline consisting of cold start, rejection sampling fine-tuning, and Group Relative Policy Optimization (GRPO) strengthens reasoning.
Result: State-of-the-art accuracy among open-source models on several video preference benchmarks: the 7B model reaches 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video, with particularly strong results on longer videos.
Conclusion: By actively acquiring and updating visual evidence, VR-Thinker improves the reasoning fidelity and reliability of multimodal reward models within limited context, validating the thinking-with-image paradigm.
Abstract: Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
[209] Receptive Field Expanded Look-Up Tables for Vision Inference: Advancing from Low-level to High-level Tasks
Xi Zhang,Xiaolin Wu
Main category: cs.CV
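TL;DR: Proposes a LUT method based on optimal lattice vector quantization that adaptively allocates quantization resolution and expands the receptive field, improving the speed and accuracy of CNN inference without increasing storage.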
Details
Motivation: Existing look-up table (LUT) methods suffer from a limited convolution kernel receptive field because of combinatorial explosion in table size, which hurts CNN performance in fast inference.
Method: Learn an optimal lattice vector quantizer and combine it with irregular dilated convolutions and a U-shaped cascaded LUT structure to expand the receptive field at a fixed table size while approximating convolution kernels more accurately.
Result: At the same space complexity, the method significantly improves LUT-driven CNN inference and balances speed, accuracy, and memory efficiency.
Conclusion: The proposed LUT optimization overcomes the limited receptive field of earlier methods and offers a more efficient and accurate solution for fast CNN inference.
Abstract: Recently, several look-up table (LUT) methods were developed to greatly expedite the inference of CNNs in a classical strategy of trading space for speed. However, these LUT methods suffer from a common drawback of limited receptive field of the convolution kernels due to the combinatorial explosion of table size. This research aims to expand the CNN receptive field with a fixed table size, thereby enhancing the performance of LUT-driven fast CNN inference while maintaining the same space complexity. To achieve this goal, various techniques are proposed. The main contribution is a novel approach of learning an optimal lattice vector quantizer that adaptively allocates the quantization resolution across data dimensions based on their significance to the inference task. In addition, the lattice vector quantizer offers an inherently more accurate approximation of CNN kernels than scalar quantizer as used in current practice. Furthermore, we introduce other receptive field expansion strategies, including irregular dilated convolutions and a U-shaped cascaded LUT structure, designed to capture multi-level contextual information without inflating table size. Together, these innovations allow our approach to effectively balance speed, accuracy, and memory efficiency, demonstrating significant improvements over existing LUT methods.
[210] Unified Open-World Segmentation with Multi-Modal Prompts
Yang Liu,Yufei Yin,Chenchen Jing,Muzhi Zhu,Hao Chen,Yuling Xi,Bo Feng,Hao Wang,Shiyu Li,Chunhua Shen
Main category: cs.CV
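TL;DR: COSINE is a unified open-world segmentation model that combines open-vocabulary segmentation and in-context segmentation, supports multi-modal prompts (such as text and images), and generates masks across granularities with a single architecture.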
Details
Motivation: Existing open-vocabulary and in-context segmentation methods differ in architecture, learning objectives, and representation learning strategies, which limits performance gains.
Method: Foundation models extract representations of the image and the multi-modal prompts; a SegDecoder aligns these representations, models their interaction, and generates the specified masks.
Result: Performance improves significantly on both open-vocabulary and in-context segmentation tasks, and the synergy between prompt modalities strengthens generalization.
Conclusion: COSINE integrates multi-modal prompts in a unified framework and shows strong performance and generalization in open-world segmentation.
Abstract: In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multi-modal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and corresponding multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and obtain masks specified by input prompts across different granularities. In this way, COSINE overcomes architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE has significant performance improvements in both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergistic collaboration between using visual and textual prompts leads to significantly improved generalization over single-modality approaches.
[211] Layout-Independent License Plate Recognition via Integrated Vision and Language Models
Elham Shabaninia,Fatemeh Asadi-zeydabadi,Hossein Nezamabadi-pour
Main category: cs.CV
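TL;DR: Proposes a pattern-aware automatic license plate recognition (ALPR) framework that combines a Transformer-based vision model with an iterative language modeling mechanism for accurate, robust recognition across diverse plate layouts and challenging real-world conditions.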
Details
Motivation: Traditional ALPR systems handle different plate layouts with heuristic rules or manual classification and adapt poorly to diverse and non-standard cases, so a method is needed that learns plate structure patterns adaptively without explicit rules.
Method: A high-precision detection network localizes the plate; a unified recognition stage that integrates a Transformer vision model with iterative language modeling then jointly optimizes visual and linguistic cues, combining character recognition and post-OCR refinement.
Result: On multiple international datasets (IR-LPR, UFPR-ALPR, AOLP), the system outperforms existing segmentation-free methods, with higher recognition accuracy and robustness to noise, distortion, and unconventional fonts.
Conclusion: Embedding pattern analysis in the recognition stage effectively fuses computer vision and language modeling, improving the adaptability and performance of ALPR systems in complex real-world scenes for intelligent transportation and surveillance applications.
Abstract: This work presents a pattern-aware framework for automatic license plate recognition (ALPR), designed to operate reliably across diverse plate layouts and challenging real-world conditions. The proposed system consists of a modern, high-precision detection network followed by a recognition stage that integrates a transformer-based vision model with an iterative language modelling mechanism. This unified recognition stage performs character identification and post-OCR refinement in a seamless process, learning the structural patterns and formatting rules specific to license plates without relying on explicit heuristic corrections or manual layout classification. Through this design, the system jointly optimizes visual and linguistic cues, enables iterative refinement to improve OCR accuracy under noise, distortion, and unconventional fonts, and achieves layout-independent recognition across multiple international datasets (IR-LPR, UFPR-ALPR, AOLP). Experimental results demonstrate superior accuracy and robustness compared to recent segmentation-free approaches, highlighting how embedding pattern analysis within the recognition stage bridges computer vision and language modelling for enhanced adaptability in intelligent transportation and surveillance applications.
[212] MCE: Towards a General Framework for Handling Missing Modalities under Imbalanced Missing Rates
Binyu Zhao,Wei Zhang,Zhaonian Zou
Main category: cs.CV
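TL;DR: Proposes MCE, a multi-modal learning method for handling missing modalities, especially under imbalanced missing rates; by enhancing both learning capability and representation capability, it outperforms existing methods on multiple benchmarks.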
Details
Motivation: Address imbalanced modality missingness in multi-modal learning, and avoid the inconsistent learning and representational degradation that modalities with high missing rates suffer because they receive fewer updates.
Method: MCE has two components: LCE dynamically balances the learning progress of each modality through multi-level factors, and RCE improves feature semantics and robustness through subset prediction and cross-modal completion tasks.
Result: Experiments on four multi-modal benchmarks show that MCE outperforms current state-of-the-art methods under a variety of missing configurations.
Conclusion: MCE mitigates the inconsistent learning and representational degradation caused by imbalanced missing rates and significantly improves multi-modal model performance.
Abstract: Multi-modal learning has made significant advances across diverse pattern recognition applications. However, handling missing modalities, especially under imbalanced missing rates, remains a major challenge. This imbalance triggers a vicious cycle: modalities with higher missing rates receive fewer updates, leading to inconsistent learning progress and representational degradation that further diminishes their contribution. Existing methods typically focus on global dataset-level balancing, often overlooking critical sample-level variations in modality utility and the underlying issue of degraded feature quality. We propose Modality Capability Enhancement (MCE) to tackle these limitations. MCE includes two synergistic components: i) Learning Capability Enhancement (LCE), which introduces multi-level factors to dynamically balance modality-specific learning progress, and ii) Representation Capability Enhancement (RCE), which improves feature semantics and robustness through subset prediction and cross-modal completion tasks. Comprehensive evaluations on four multi-modal benchmarks show that MCE consistently outperforms state-of-the-art methods under various missing configurations. The journal preprint version is now available at https://doi.org/10.1016/j.patcog.2025.112591. Our code is available at https://github.com/byzhaoAI/MCE.
[213] GLOFNet -- A Multimodal Dataset for GLOF Monitoring and Prediction
Zuha Fatima,Muhammad Anser Sohaib,Muhammad Talha,Sidra Sultana,Ayesha Kanwal,Nazia Perwaiz
Main category: cs.CV
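TL;DR: Presents GLOFNet, a multimodal dataset for glacial lake outburst flood (GLOF) monitoring and prediction that integrates Sentinel-2 imagery, glacier velocity, and land surface temperature data; the data is preprocessed, harmonized, and publicly released to support research on high-mountain hazard prediction.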
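One of the preprocessing steps listed for this dataset is cyclical encoding. A common way to do this, shown here as a generic sketch rather than the dataset's actual pipeline, maps a periodic value such as day of year onto a sine/cosine pair so that the end of one cycle sits next to the start of the next. The function name and the 365-day period are assumptions for illustration.

```python
import math
from typing import Tuple


def cyclical_encode(value: float, period: float) -> Tuple[float, float]:
    """Map a periodic value to a point on the unit circle (sin, cos)."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


# Day-of-year features for 31 December, 1 January, and 2 January (365-day year).
dec31 = cyclical_encode(365, 365)
jan01 = cyclical_encode(1, 365)
jan02 = cyclical_encode(2, 365)
```

With the raw integer, 31 December and 1 January are 364 apart; after encoding, they are exactly as close as 1 January and 2 January, which is what a model of seasonal glacier behavior would want.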
Details
Motivation: Existing GLOF prediction research is limited by fragmented, unimodal data and lacks a unified dataset that combines visual features with physical precursors, making it hard to support predictive model development.
Method: Build the GLOFNet dataset by integrating Sentinel-2 multispectral imagery, NASA ITS_LIVE glacier velocity products, and MODIS land surface temperature data; preprocess with cloud masking, quality filtering, normalization, temporal interpolation, augmentation, and cyclical encoding; and harmonize the modalities.
Result: The analysis reveals seasonal glacier velocity variation, a long-term warming trend of about 0.8 K per decade, and spatial heterogeneity in cryospheric conditions; the dataset is publicly released to support benchmarking of multimodal deep learning models for rare hazards.
Conclusion: GLOFNet provides a structured multimodal data foundation for GLOF prediction, addresses class imbalance, cloud contamination, and coarse resolution, and advances deep learning research on high-mountain hazard prediction.
Abstract: Glacial Lake Outburst Floods (GLOFs) are rare but destructive hazards in high mountain regions, yet predictive research is hindered by fragmented and unimodal data. Most prior efforts emphasize post-event mapping, whereas forecasting requires harmonized datasets that combine visual indicators with physical precursors. We present GLOFNet, a multimodal dataset for GLOF monitoring and prediction, focused on the Shisper Glacier in the Karakoram. It integrates three complementary sources: Sentinel-2 multispectral imagery for spatial monitoring, NASA ITS_LIVE velocity products for glacier kinematics, and MODIS Land Surface Temperature records spanning over two decades. Preprocessing included cloud masking, quality filtering, normalization, temporal interpolation, augmentation, and cyclical encoding, followed by harmonization across modalities. Exploratory analysis reveals seasonal glacier velocity cycles, long-term warming of ~0.8 K per decade, and spatial heterogeneity in cryospheric conditions. The resulting dataset, GLOFNet, is publicly available to support future research in glacial hazard prediction. By addressing challenges such as class imbalance, cloud contamination, and coarse resolution, GLOFNet provides a structured foundation for benchmarking multimodal deep learning approaches to rare hazard prediction.
[214] MRS-YOLO Railroad Transmission Line Foreign Object Detection Based on Improved YOLO11 and Channel Pruning
Siyuan Liu,Junting Lin
Main category: cs.CV
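TL;DR: Proposes MRS-YOLO, an improved algorithm for detecting foreign objects on transmission lines in railway environments, which raises detection accuracy and efficiency through multi-scale feature fusion, a re-calibration feature pyramid network, and a spatial and channel reconstruction detection head.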
Details
Motivation: Address missed detections, false detections, and low detection efficiency in foreign object detection on transmission lines in railway environments.
Method: Propose the C3k2_MAKDF module, the RCFPN structure, and the SC_Detect detection head, and optimize the model with channel pruning.
Result: mAP50 and mAP50:95 reach 94.8% and 86.4%, which are 0.7 and 2.3 percentage points above the baseline, while parameters and GFLOPs drop by 44.2% and 17.5%, respectively.
Conclusion: The improved MRS-YOLO algorithm delivers notable gains in both detection accuracy and efficiency and is better suited to foreign object detection on railway transmission lines.
Abstract: Aiming at the problems of missed detection, false detection and low detection efficiency in transmission line foreign object detection under railway environment, we propose an improved algorithm, MRS-YOLO, based on YOLO11. Firstly, a multi-scale Adaptive Kernel Depth Feature Fusion (MAKDF) module is proposed and fused with the C3k2 module to form C3k2_MAKDF, which enhances the model's feature extraction capability for foreign objects of different sizes and shapes. Secondly, a novel Re-calibration Feature Fusion Pyramid Network (RCFPN) is designed as a neck structure to enhance the model's ability to integrate and utilize multi-level features effectively. Then, a Spatial and Channel Reconstruction Detect Head (SC_Detect) based on spatial and channel preprocessing is designed to enhance the model's overall detection performance. Finally, the channel pruning technique is used to reduce the redundancy of the improved model, drastically reduce parameters and giga floating-point operations (GFLOPs), and improve the detection efficiency. The experimental results show that the mAP50 and mAP50:95 of the MRS-YOLO algorithm proposed in this paper are improved to 94.8% and 86.4%, respectively, which are 0.7 and 2.3 percentage points higher compared to the baseline, while parameters and GFLOPs are reduced by 44.2% and 17.5%, respectively. It is demonstrated that the improved algorithm can be better applied to the task of foreign object detection in railroad transmission lines.
[215] Deep semi-supervised approach based on consistency regularization and similarity learning for weeds classification
Farouq Benchallal,Adel Hafiane,Nicolas Ragot,Raphael Canals
Main category: cs.CV
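TL;DR: Proposes a deep semi-supervised method for weed classification that combines consistency regularization with similarity learning, making effective use of unlabeled data and improving classification performance when labeled data is scarce.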
Details
Motivation: Weeds resemble crops morphologically, field conditions vary, and labeling is time-consuming and laborious, so deep learning for weed recognition faces a shortage of labeled data and needs to make better use of unlabeled data.
Method: A semi-supervised method built on a deep auto-encoder architecture combines consistency regularization and similarity learning, training jointly on a small amount of labeled data and a large amount of unlabeled data.
Result: Experiments on the DeepWeeds dataset show that the method outperforms existing fully supervised deep learning models under both normal and noisy conditions, and ablation studies validate the proposed joint learning strategy.
Conclusion: The proposed semi-supervised method makes effective use of unlabeled data and achieves robust, high-performing weed classification when labeled data is limited, with potential for use in precision agriculture.
Abstract: Weed species classification is an important step in developing automated targeting systems that enable precision agriculture practices and reduce the costs and yield losses caused by weeds. Identifying weeds is challenging because of their similarity to crop plants, the variability across weed types, and the variations caused by changing field conditions. Moreover, to fully benefit from deep learning-based methods, large fully annotated datasets are needed. This requires a time-intensive and laborious data labeling process, which is a limitation in agricultural applications. Hence, to improve the use of unlabeled data when labeled data is scarce during the learning phase, and to provide robust and high classification performance, we propose a deep semi-supervised approach that combines consistency regularization with similarity learning. Using our deep auto-encoder architecture, experiments on the DeepWeeds dataset and inference in noisy conditions demonstrate the effectiveness and robustness of our method in comparison to state-of-the-art fully supervised deep learning models. Furthermore, we carried out ablation studies for an extended analysis of our proposed joint learning strategy.
[216] UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
Zhengrong Yue,Haiyu Zhang,Xiangyu Zeng,Boyu Chen,Chenting Wang,Shaobin Zhuang,Lu Dong,KunPeng Du,Yi Wang,Limin Wang,Yali Wang
Main category: cs.CV
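TL;DR: Proposes UniFlow, a generic and unified visual tokenizer that uses adaptive self-distillation and a lightweight patch-wise pixel flow decoder to achieve a win-win between visual understanding and generation.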
Details
Motivation: Existing tokenizers face a performance trade-off between visual understanding and generation that stems from the inherent conflict between high-level semantic abstraction and low-level pixel reconstruction; this work aims to resolve that conflict to advance universal modeling.
Method: Layer-wise adaptive self-distillation lets a pre-trained visual encoder retain strong semantic features while adapting to fine-grained generation; a lightweight patch-wise pixel flow decoder achieves high-fidelity reconstruction by modeling a flow from the noisy state back to the pixel domain, conditioned on the semantic features to ease training conflicts.
Result: UniFlow is validated on 13 benchmarks spanning 7 task types. The 7B UniFlow-XL surpasses the 14B TokenFlow-XL by 7.75% on average on understanding tasks and also outperforms UniTok on reconstruction and generation, by 0.15 in rFID and 0.09 in gFID (without guidance).
Conclusion: UniFlow jointly optimizes visual understanding and generation, breaks the performance trade-off of conventional tokenizers, and offers an efficient, practical solution for unified visual modeling.
Abstract: The tokenizer is a crucial component for both visual understanding and generation. To advance toward the ultimate goal of universal modeling, recent research has focused on developing a unified tokenizer. However, existing tokenizers face a significant performance trade-off between understanding and generation, stemming from the inherent conflict between high-level semantic abstraction and low-level pixel reconstruction. To tackle this challenge, we propose a generic and unified tokenizer, namely UniFlow, by flexibly adapting any visual encoder with a concise reconstruction decoder. Specifically, we introduce layer-wise adaptive self-distillation applied to the well-pretrained visual encoders, which enables UniFlow to simultaneously inherit the strong semantic features for visual understanding and flexibly adapt to model fine-grained details for visual generation. Moreover, we propose a lightweight patch-wise pixel flow decoder, which efficiently achieves high-fidelity pixel reconstruction by modeling a conditional flow from the noisy state back to the patch-wise pixel domain. By leveraging the semantic features as visual conditions for the decoder, we effectively alleviate the training conflicts between understanding and generation. Furthermore, the patch-wise learning strategy simplifies the data distribution, thereby improving training efficiency. Extensive experiments across 13 challenging benchmarks spanning 7 widely studied visual understanding and generation tasks demonstrate that UniFlow achieves a win-win outcome. For instance, our 7B UniFlow-XL not only surpasses the 14B TokenFlow-XL by 7.75% on average understanding benchmarks, but also achieves competitive results in both visual reconstruction and generation, surpassing UniTok by 0.15 in rFID and 0.09 in gFID (without guidance), respectively.
[217] Injecting Frame-Event Complementary Fusion into Diffusion for Optical Flow in Challenging Scenes
Haonan Wang,Hanyu Zhou,Haoyue Liu,Luxin Yan
Main category: cs.CV
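TL;DR: Proposes Diff-ABFlow, a diffusion-based framework that combines appearance-boundary fusion of frame and event cameras to improve optical flow estimation in high-speed and low-light scenes.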
Details
Motivation: Conventional optical flow methods face motion blur and insufficient illumination in high-speed and low-light scenes, which weakens texture, amplifies noise, and degrades appearance saturation and boundary completeness; existing methods struggle to preserve appearance features.
Method: The Diff-ABFlow framework uses a diffusion model to learn the mapping from noisy flow to clear flow, which avoids dependence on degraded visual features, and performs appearance-boundary fusion that combines the dense appearance saturation of the frame camera with the dense boundary completeness of the event camera.
Result: The method yields more complete and accurate optical flow estimation under adverse visual conditions, improving boundary completeness and motion recovery accuracy.
Conclusion: Through diffusion modeling and multimodal fusion, Diff-ABFlow addresses optical flow estimation in highly dynamic, complex scenes and outperforms conventional methods that rely on discriminative or generative models.
Abstract: Optical flow estimation has achieved promising results in conventional scenes but faces challenges in high-speed and low-light scenes, which suffer from motion blur and insufficient illumination. These conditions lead to weakened texture and amplified noise and deteriorate the appearance saturation and boundary completeness of frame cameras, which are necessary for motion feature matching. In degraded scenes, the frame camera provides dense appearance saturation but sparse boundary completeness due to its long imaging time and low dynamic range. In contrast, the event camera offers sparse appearance saturation, while its short imaging time and high dynamic range give rise to dense boundary completeness. Traditionally, existing methods utilize feature fusion or domain adaptation to introduce event data to improve boundary completeness. However, the appearance features are still deteriorated, which severely affects the widely adopted discriminative models that learn the mapping from visual features to motion fields and generative models that generate motion fields based on given visual features. So we introduce diffusion models that learn the mapping from noisy flow to clear flow, which is not affected by the deteriorated visual features. Therefore, we propose a novel optical flow estimation framework Diff-ABFlow based on diffusion models with frame-event appearance-boundary fusion.
[218] Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection
Shizhen Zhao,Jiahui Liu,Xin Wen,Haoru Tan,Xiaojuan Qi
Main category: cs.CV
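TL;DR: Studies pre-trained vision foundation models for out-of-distribution (OOD) detection, finds that DINOv2 performs strongly without fine-tuning, and proposes the MoFE module and a Dynamic-β Mixup strategy to handle large semantic spaces, significantly improving OOD detection.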
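The Dynamic-β Mixup strategy in this entry samples interpolation weights from a Beta distribution whose parameter changes with learning difficulty. The sketch below illustrates that general idea only: the linear `dynamic_beta` schedule, its `low` and `high` bounds, and the use of a per-class difficulty score in [0, 1] are hypothetical choices, since the summary does not say how the paper sets the parameter.

```python
import random
from typing import List, Tuple


def dynamic_beta(difficulty: float, low: float = 0.2, high: float = 1.0) -> float:
    """Map a difficulty score in [0, 1] to a Beta concentration (assumed schedule).

    Harder classes get a larger value, which pushes sampled weights toward 0.5
    (stronger mixing); easier classes get weights nearer 0 or 1.
    """
    d = min(max(difficulty, 0.0), 1.0)
    return low + (high - low) * d


def mixup(x1: List[float], x2: List[float], difficulty: float,
          rng: random.Random) -> Tuple[List[float], float]:
    """Interpolate two feature vectors with a weight drawn from Beta(b, b)."""
    b = dynamic_beta(difficulty)
    lam = rng.betavariate(b, b)  # interpolation weight in [0, 1]
    mixed = [lam * a + (1.0 - lam) * c for a, c in zip(x1, x2)]
    return mixed, lam


rng = random.Random(0)
mixed, lam = mixup([0.0, 0.0], [1.0, 1.0], difficulty=0.8, rng=rng)
```

In training, the same weight would also be applied to the two labels; that step is omitted here to keep the sketch short.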
Details
Motivation: Although pre-trained vision foundation models perform well across many vision tasks, their potential for OOD detection has not been fully explored; this work studies the question systematically.
Method: Analyze the OOD detection ability of foundation models such as DINOv2 without fine-tuning; propose the Mixture of Feature Experts (MoFE) module to partition features into subspaces, combined with a Dynamic-β Mixup strategy that dynamically adjusts interpolation weights to improve feature learning under complex category distributions.
Result: Experiments show that the method significantly outperforms baselines on multiple benchmarks and is especially effective in large semantic spaces.
Conclusion: Pre-trained vision foundation models are naturally suited to OOD detection, and MoFE with Dynamic-β Mixup handles the decision boundary difficulties of highly complex semantic spaces, improving detection performance.
Abstract: Pre-trained vision foundation models have transformed many computer vision tasks. Despite their strong ability to learn discriminative and generalizable features crucial for out-of-distribution (OOD) detection, their impact on this task remains underexplored. Motivated by this gap, we systematically investigate representative vision foundation models for OOD detection. Our findings reveal that a pre-trained DINOv2 model, even without fine-tuning on in-domain (ID) data, naturally provides a highly discriminative feature space for OOD detection, achieving performance comparable to existing state-of-the-art methods without requiring complex designs. Beyond this, we explore how fine-tuning foundation models on in-domain (ID) data can enhance OOD detection. However, we observe that the performance of vision foundation models remains unsatisfactory in scenarios with a large semantic space. This is due to the increased complexity of decision boundaries as the number of categories grows, which complicates the optimization process. To mitigate this, we propose the Mixture of Feature Experts (MoFE) module, which partitions features into subspaces, effectively capturing complex data distributions and refining decision boundaries. Further, we introduce a Dynamic-$\beta$ Mixup strategy, which samples interpolation weights from a dynamic beta distribution. This adapts to varying levels of learning difficulty across categories, improving feature learning for more challenging categories. Extensive experiments demonstrate the effectiveness of our approach, significantly outperforming baseline methods.
[219] A Simple and Better Baseline for Visual Grounding
Jingchao Wang,Wenlong Zhang,Dingjiang Huang,Hong Wang,Yefeng Zheng
Main category: cs.CV
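TL;DR: Proposes FSVG, a simple yet effective visual grounding baseline that integrates the linguistic and visual modalities directly, uses language to guide visual feature extraction, and keeps only language-related visual features via a feature selection mechanism to cut computation, achieving a better accuracy-efficiency balance on several benchmark datasets.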
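The similarity-based feature selection in this entry (keep only the visual features most related to the language input) can be illustrated with a top-k filter over cosine similarity. This is a minimal sketch under assumptions: the function names, the single pooled text vector, and the fixed `keep` budget are hypothetical, and the actual mechanism in FSVG may differ.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_visual_tokens(visual_tokens: List[Vector], text_vec: Vector,
                         keep: int) -> Tuple[List[int], List[Vector]]:
    """Keep the `keep` visual tokens most similar to the text vector.

    Returns the kept indices (in original spatial order) and the kept tokens.
    """
    scores = [cosine(tok, text_vec) for tok in visual_tokens]
    ranked = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(ranked[:keep])  # restore spatial order
    return kept_idx, [visual_tokens[i] for i in kept_idx]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
text = [1.0, 0.0]
kept_idx, kept_tokens = select_visual_tokens(tokens, text, keep=2)
```

Later layers then process only the kept tokens, which is where the computational saving would come from.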
Details
Motivation: Existing methods run iteratively over multiple image scales and must cache linguistic and visual features at every iteration, which adds overhead. This work aims for a more efficient visual grounding method that is simpler to implement and cheaper to compute.
Method: Integrate the linguistic and visual modalities directly into one network architecture, avoiding complicated iterative procedures; use language in parallel as guidance to promote cross-modal interaction; and introduce a similarity-based feature selection mechanism that extracts only language-related visual features for faster prediction.
Result: Experiments on several benchmark datasets show that FSVG achieves a better balance between localization accuracy and computational efficiency than current state-of-the-art methods.
Conclusion: FSVG is a simple, effective feature selection-based method that improves the efficiency and practicality of visual grounding without complex iterative procedures.
Abstract: Visual grounding aims to predict the locations of target objects specified by textual descriptions. For this task with linguistic and visual modalities, a recent line of research focuses on selecting only the linguistic-relevant visual regions for object localization to reduce the computational overhead. Albeit achieving impressive performance, it is iteratively performed on different image scales, and at every iteration, linguistic features and visual features need to be stored in a cache, incurring extra overhead. To facilitate the implementation, in this paper, we propose a feature selection-based simple yet effective baseline for visual grounding, called FSVG. Specifically, we directly encapsulate the linguistic and visual modalities into an overall network architecture without complicated iterative procedures, and utilize the language in parallel as guidance to facilitate the interaction between the linguistic and visual modalities for extracting effective visual features. Furthermore, to reduce the computational cost, during the visual feature learning, we introduce a similarity-based feature selection mechanism to only exploit language-related visual features for faster prediction. Extensive experiments conducted on several benchmark datasets comprehensively substantiate that the proposed FSVG achieves a better balance between accuracy and efficiency beyond the current state-of-the-art methods. Code is available at https://github.com/jcwang0602/FSVG.
[220] ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
Yuqi Liu,Liangyu Chen,Jiazhen Liu,Mingkang Zhu,Zhisheng Zhong,Bei Yu,Jiaya Jia
Main category: cs.CV
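TL;DR: Proposes ViSurf, a unified post-training paradigm for large vision-and-language models that combines the strengths of supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR); by injecting ground-truth labels into RLVR rollouts and introducing new reward control strategies, it outperforms SFT alone, RLVR alone, and two-stage approaches.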
Details
Motivation: SFT injects new knowledge through external guidance but often yields sub-optimal performance; RLVR strengthens reasoning but struggles with tasks beyond the model's internal knowledge. A method combining the strengths of both is needed.
Method: ViSurf derives a unified objective from the SFT and RLVR objectives and builds a single-stage joint training framework that injects ground-truth labels into the RLVR rollouts, providing external supervision and internal reinforcement together; three new reward control strategies stabilize training.
Result: Experiments on multiple benchmarks show that ViSurf clearly outperforms SFT alone, RLVR alone, and the two-stage SFT $\rightarrow$ RLVR approach; ablation studies validate each component.
Conclusion: ViSurf offers a more efficient, unified post-training paradigm for LVLMs that fuses external supervision with internal reinforcement and improves overall performance and generalization.
Abstract: Typical post-training paradigms for Large Vision-and-Language Models (LVLMs) include Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). SFT leverages external guidance to inject new knowledge, whereas RLVR utilizes internal reinforcement to enhance reasoning capabilities and overall performance. However, our analysis reveals that SFT often leads to sub-optimal performance, while RLVR struggles with tasks that exceed the model's internal knowledge base. To address these limitations, we propose ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning), a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage. We analyze the derivation of the SFT and RLVR objectives to establish the ViSurf objective, providing a unified perspective on these two paradigms. The core of ViSurf involves injecting ground-truth labels into the RLVR rollouts, thereby providing simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to stabilize and optimize the training process. Extensive experiments across several diverse benchmarks demonstrate the effectiveness of ViSurf, outperforming individual SFT, RLVR, and two-stage SFT $\rightarrow$ RLVR. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.
[221] OmniQuality-R: Advancing Reward Models Through All-Encompassing Quality Assessment
Yiting Lu,Fengbin Guan,Yixin Gao,Yan Zhong,Xinge Peng,Jiakang Yuan,Yihao Liu,Bo Zhang,Xin Li,Zhibo Chen,Weisi Lin
Main category: cs.CV
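TL;DR: OmniQuality-R is a unified reward modeling framework that turns multi-task quality reasoning into continuous, interpretable reward signals for policy optimization.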
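The abstract for this entry names a Gaussian-based reward used with Group Relative Policy Optimization (GRPO) and a standard deviation (STD) filter. The following is a minimal sketch of how such pieces could fit together, not the authors' implementation. The function names, the `sigma` width, the `min_std` threshold, and the choice to skip near-constant groups are all assumptions made for illustration; the digest does not specify any of them.

```python
import math
from statistics import mean, pstdev


def gaussian_reward(predicted, target, sigma=0.5):
    """Continuous reward in (0, 1]; equals 1 when prediction matches target.

    sigma is a hypothetical width parameter, not a value from the paper.
    """
    return math.exp(-((predicted - target) ** 2) / (2.0 * sigma ** 2))


def group_relative_advantages(rewards, min_std=1e-3):
    """GRPO-style normalization of rewards within one sampled group.

    Returns None when the rewards are nearly constant. Treating this as
    the "STD filtering" step is an assumption: such a group carries no
    usable relative signal, so one plausible choice is to skip it.
    """
    spread = pstdev(rewards)
    if spread < min_std:
        return None
    centre = mean(rewards)
    return [(r - centre) / spread for r in rewards]


# Four sampled score predictions for an image whose reference score is 4.0.
predictions = [3.0, 3.5, 4.0, 5.0]
rewards = [gaussian_reward(p, target=4.0) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Compared with a 0/1 correctness reward, the Gaussian form gives partial credit to near misses, which is one way to read the abstract's "continuous score prediction".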
Details
Motivation: Existing visual evaluation methods are usually limited to a single task and lack unified cross-task modeling. Human subjective experiments suggest that different tasks call for different assessment principles, so a quality assessment framework that supports multiple tasks and remains interpretable is needed.
Method: The OmniQuality-R framework builds a reward modeling dataset with reasoning trajectories via rejection sampling, yielding reliable chain-of-thought (CoT) data for supervised fine-tuning. It then applies Group Relative Policy Optimization (GRPO) with a Gaussian-based reward for continuous score prediction, and adds standard deviation filtering and entropy gating to stabilize reinforcement learning.
Result: The method is validated on three key image quality assessment tasks (aesthetic quality assessment, technical quality evaluation, and text-image alignment), showing good training stability and downstream generalization.
Conclusion: OmniQuality-R unifies multi-task quality assessment. Through structured reasoning and improved optimization mechanisms, it produces continuous, interpretable reward signals that improve policy optimization.
Abstract: Current visual evaluation approaches are typically constrained to a single task. To address this, we propose OmniQuality-R, a unified reward modeling framework that transforms multi-task quality reasoning into continuous and interpretable reward signals for policy optimization. Its structured design is inspired by subjective experiments, in which participants are given task-specific instructions outlining distinct assessment principles prior to evaluation. To enable this, we construct a reasoning-enhanced reward modeling dataset by sampling informative plan-reason trajectories via rejection sampling, forming a reliable chain-of-thought (CoT) dataset for supervised fine-tuning (SFT). Building on this, we apply Group Relative Policy Optimization (GRPO) for post-training, using a Gaussian-based reward to support continuous score prediction. To further stabilize the training and improve downstream generalization, we incorporate standard deviation (STD) filtering and entropy gating mechanisms during reinforcement learning. These techniques suppress unstable updates and reduce variance in policy optimization. We evaluate OmniQuality-R on three key IQA tasks: aesthetic quality assessment, technical quality evaluation, and text-image alignment.
[222] GraphTARIF: Linear Graph Transformer with Augmented Rank and Improved Focus
Zhaolin Hu,Kun Li,Hehe Fan,Yi Yang
Main category: cs.CV
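TL;DR: Proposes a hybrid framework that strengthens linear attention by raising the rank of the attention matrix and introducing a learnable log-power function, improving the separability of node representations in Graph Transformers.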
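To make the "sharpen focus by reducing entropy" idea concrete, here is a small sketch. The exact log-power function used in the paper is not given in this digest, so the code substitutes a plain power applied in log space (weights proportional to `score ** power`); the function names and the example scores are hypothetical. In a trained model the exponent would be a learnable parameter rather than a fixed number.

```python
import math


def power_sharpened_weights(scores, power):
    """Normalize positive scores after raising them to `power`.

    Computed in log space for numerical stability. This is a stand-in
    for the paper's learnable log-power function, not its actual form.
    """
    logits = [power * math.log(s) for s in scores]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(weights):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


# Positive similarity scores of one node towards four others.
scores = [1.0, 2.0, 3.0, 4.0]
flat = power_sharpened_weights(scores, power=1.0)   # plain normalization
sharp = power_sharpened_weights(scores, power=3.0)  # more focused
```

With `power` above 1, probability mass concentrates on the largest scores, so `entropy(sharp)` is lower than `entropy(flat)`; this mirrors the entropy-reduction claim in the abstract at toy scale.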
Details
Motivation: Existing linear attention models lose expressiveness because of their low-rank projection structure and overly uniform attention distributions, which hurts node classification performance.
Method: A gated local graph network branch is attached to the value matrix to raise the rank of the attention map, and a learnable log-power function is introduced to lower the entropy of the attention distribution and sharpen focus.
Result: Theory shows that the proposed method improves the separability of the learned embeddings. Experiments on homophilic and heterophilic graph benchmarks show competitive performance while preserving the scalability of linear attention.
Conclusion: The hybrid framework mitigates the expressiveness degradation of linear attention, improving model performance while remaining efficient.
Abstract: Linear attention mechanisms have emerged as efficient alternatives to full self-attention in Graph Transformers, offering linear time complexity. However, existing linear attention models often suffer from a significant drop in expressiveness due to low-rank projection structures and overly uniform attention distributions. We theoretically prove that these properties reduce the class separability of node representations, limiting the model's classification ability. To address this, we propose a novel hybrid framework that enhances both the rank and focus of attention. Specifically, we enhance linear attention by attaching a gated local graph network branch to the value matrix, thereby increasing the rank of the resulting attention map. Furthermore, to alleviate the excessive smoothing effect inherent in linear attention, we introduce a learnable log-power function into the attention scores to reduce entropy and sharpen focus. We theoretically show that this function decreases entropy in the attention distribution, enhancing the separability of learned embeddings. Extensive experiments on both homophilic and heterophilic graph benchmarks demonstrate that our method achieves competitive performance while preserving the scalability of linear attention.
[223] DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis
Peiyin Chen,Zhuowei Yang,Hui Feng,Sheng Jiang,Rui Yan
Main category: cs.CV
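TL;DR: DEMO is a flow-matching generative framework for audio-driven talking-portrait video synthesis that provides disentangled, high-fidelity control of lip motion, head pose, and eye gaze.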
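For readers unfamiliar with flow matching, the sketch below shows the standard straight-line (optimal-transport) conditional path and its regression target on plain Python lists. It illustrates the general technique only and is not DEMO's model: the function names are hypothetical, the vectors stand in for motion latents, and the audio conditioning and transformer predictor described in the abstract are omitted.

```python
def ot_path_sample(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t.

    Also returns the target velocity, which is constant along the path.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    diffs = [(p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)]
    return sum(diffs) / len(diffs)


def euler_integrate(x0, velocity_fn, steps=10):
    """Generate a sample by integrating dx/dt = velocity_fn(x, t) from t=0 to 1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.0, 1.0, -1.0]    # stand-in for a sampled noise vector
latent = [2.0, 0.0, 1.0]    # stand-in for a target motion latent
x_half, target_v = ot_path_sample(noise, latent, t=0.5)

# A perfect predictor would return the target velocity exactly.
generated = euler_integrate(noise, lambda x, t: target_v)
```

During training, a network would be fit to minimize `flow_matching_loss` at random times `t`; at inference, `euler_integrate` would be run with the learned network in place of the lambda.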
Details
Motivation: Existing audio-driven talking-head generation methods still struggle with temporal coherence and fine-grained motion control.
Method: A motion auto-encoder builds a disentangled motion latent space in which factors are represented independently and approximately orthogonalized. On top of this space, optimal-transport flow matching is combined with a transformer predictor to generate smooth motion trajectories.
Result: Experiments on multiple benchmarks show that DEMO outperforms prior methods in video realism, lip-audio synchronization, and motion fidelity.
Conclusion: Combining fine-grained motion disentanglement with flow-based generative modeling offers an effective new paradigm for controllable talking-head video synthesis.
Abstract: Audio-driven talking-head generation has advanced rapidly with diffusion-based generative models, yet producing temporally coherent videos with fine-grained motion control remains challenging. We propose DEMO, a flow-matching generative framework for audio-driven talking-portrait video synthesis that delivers disentangled, high-fidelity control of lip motion, head pose, and eye gaze. The core contribution is a motion auto-encoder that builds a structured latent space in which motion factors are independently represented and approximately orthogonalized. On this disentangled motion space, we apply optimal-transport-based flow matching with a transformer predictor to generate temporally smooth motion trajectories conditioned on audio. Extensive experiments across multiple benchmarks show that DEMO outperforms prior methods in video realism, lip-audio synchronization, and motion fidelity. These results demonstrate that combining fine-grained motion disentanglement with flow-based generative modeling provides a powerful new paradigm for controllable talking-head video synthesis.
[224] A Machine Learning Perspective on Automated Driving Corner Cases
Sebastian Schmidt,Julius Körner,Stephan Günnemann
Main category: cs.CV
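TL;DR: Proposes a machine learning approach grounded in the data distribution for effective corner case recognition in autonomous driving perception, unifying existing taxonomies and performing strongly on multiple benchmarks.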
Details
Motivation: Traditional example-based corner case categorization does not scale and ignores how machine learning models generalize from their training data; it lacks a data coverage perspective.
Method: A new framework takes the data distribution as its starting point to recognize corner cases on individual samples, and a fog-augmented Lost & Found dataset is introduced to analyze combined corner cases.
Result: The method performs strongly on several standard benchmarks (extended from out-of-distribution detection benchmarks), unifies existing scenario-based corner case taxonomies, and supports analysis of combined corner cases.
Conclusion: The approach provides a principled basis for corner case recognition that needs no manual specification, with good scalability and practical potential.
Abstract: For high-stakes applications, like autonomous driving, a safe operation is necessary to prevent harm, accidents, and failures. Traditionally, difficult scenarios have been categorized into corner cases and addressed individually. However, this example-based categorization is not scalable and lacks a data coverage perspective, neglecting how machine learning models generalize from their training data. In our work, we propose a novel machine learning approach that takes the underlying data distribution into account. Based on our novel perspective, we present a framework for effective corner case recognition for perception on individual samples. In our evaluation, we show that our approach (i) unifies existing scenario-based corner case taxonomies under a distributional perspective, (ii) achieves strong performance on corner case detection tasks across standard benchmarks for which we extend established out-of-distribution detection benchmarks, and (iii) enables analysis of combined corner cases via a newly introduced fog-augmented Lost & Found dataset. These results provide a principled basis for corner case recognition, underlining our manual specification-free definition.
[225] Stability Under Scrutiny: Benchmarking Representation Paradigms for Online HD Mapping
Hao Shan,Ruikai Li,Han Jiang,Yizhe Fan,Ziyang Yan,Bohan Li,Xiaoshuai Hao,Hao Zhao,Zhiyong Cui,Yilong Ren,Haiyang Yu
Main category: cs.CV
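TL;DR: This paper presents the first comprehensive benchmark for evaluating the temporal stability of online HD mapping models. It designs a multi-dimensional stability evaluation framework with new metrics for presence, localization, and shape stability, integrated into a unified mean Average Stability (mAS) score. Experiments show that accuracy (mAP) and stability (mAS) are largely independent performance dimensions. The study also analyzes how different model design choices affect both, and argues that temporal stability should be one of the core evaluation criteria.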
Details
Motivation: Existing online map construction models focus mainly on single-frame mapping accuracy and overlook the instability that sensor displacement causes in real dynamic environments; mapping stability has not been studied systematically.
Method: A multi-dimensional stability evaluation framework is proposed with three new metrics (presence, localization, and shape stability), combined into a unified mean Average Stability (mAS) score. Extensive experiments on 42 models analyze how architectural and training factors affect accuracy and stability.
Result: Accuracy (mAP) and stability (mAS) turn out to be largely independent performance dimensions. The study identifies key design factors that contribute to high accuracy, high stability, or both, and shows that the proposed framework effectively measures the temporal stability of online HD maps.
Conclusion: Temporal stability should be treated as a core evaluation criterion for online HD mapping, on par with accuracy. The work advances more reliable autonomous driving systems and aims to draw attention to stability through a public benchmark, toolkit, and models.
Abstract: As one of the fundamental modules in autonomous driving, online high-definition (HD) maps have attracted significant attention due to their cost-effectiveness and real-time capabilities. Since vehicles always cruise in highly dynamic environments, spatial displacement of onboard sensors inevitably causes shifts in real-time HD mapping results, and such instability poses fundamental challenges for downstream tasks. However, existing online map construction models tend to prioritize improving each frame's mapping accuracy, while the mapping stability has not yet been systematically studied. To fill this gap, this paper presents the first comprehensive benchmark for evaluating the temporal stability of online HD mapping models. We propose a multi-dimensional stability evaluation framework with novel metrics for Presence, Localization, and Shape Stability, integrated into a unified mean Average Stability (mAS) score. Extensive experiments on 42 models and variants show that accuracy (mAP) and stability (mAS) represent largely independent performance dimensions. We further analyze the impact of key model design choices on both criteria, identifying architectural and training factors that contribute to high accuracy, high stability, or both. To encourage broader focus on stability, we will release a public benchmark. Our work highlights the importance of treating temporal stability as a core evaluation criterion alongside accuracy, advancing the development of more reliable autonomous driving systems. The benchmark toolkit, code, and models will be available at https://stablehdmap.github.io/.
[226] Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection
Gaojian Wang,Feng Lin,Tong Wu,Zhisheng Yan,Kui Ren
Main category: cs.CV
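TL;DR: This paper proposes FS-VFM, a scalable self-supervised pre-training framework that combines masked image modeling and instance discrimination to learn robust representations of real faces, together with FS-Adapter, a lightweight adapter for efficient transfer to a range of face security tasks. It outperforms existing methods on multiple benchmarks.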
Details
Motivation: How can abundant unlabeled real face data be used to learn transferable facial representations with strong generalization, so as to improve cross-domain face security tasks such as deepfake detection and face anti-spoofing?
Method: The FS-VFM framework introduces the 3C learning objectives, combining masked image modeling (MIM) with instance discrimination (ID). A CRFR-P masking strategy promotes intra-region consistency and inter-region coherency, and a self-distillation mechanism establishes local-to-global correspondence. After pre-training, FS-Adapter is used for efficient transfer.
Result: Experiments on 11 public benchmarks show that FS-VFM outperforms existing vision foundation models across ViT scales and supervision paradigms, and even surpasses task-specific SOTA methods, while FS-Adapter achieves a favorable efficiency-performance trade-off.
Conclusion: FS-VFM learns general, robust facial representations and markedly improves face security performance across datasets, domains, and unseen generative models, offering a viable path toward vision foundation models for face security.
Abstract: With abundant, unlabeled real faces, how can we learn robust and transferable facial representations to boost generalization across various face security tasks? We make the first attempt and propose FS-VFM, a scalable self-supervised pre-training framework, to learn fundamental representations of real face images. We introduce three learning objectives, namely 3C, that synergize masked image modeling (MIM) and instance discrimination (ID), empowering FS-VFM to encode both local patterns and global semantics of real faces. Specifically, we formulate various facial masking strategies for MIM and devise a simple yet effective CRFR-P masking, which explicitly prompts the model to pursue meaningful intra-region Consistency and challenging inter-region Coherency. We present a reliable self-distillation mechanism that seamlessly couples MIM with ID to establish underlying local-to-global Correspondence. After pre-training, vanilla vision transformers (ViTs) serve as universal Vision Foundation Models for downstream Face Security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forensics. To efficiently transfer the pre-trained FS-VFM, we further propose FS-Adapter, a lightweight plug-and-play bottleneck atop the frozen backbone with a novel real-anchor contrastive objective. Extensive experiments on 11 public benchmarks demonstrate that our FS-VFM consistently generalizes better than diverse VFMs, spanning natural and facial domains, fully, weakly, and self-supervised paradigms, small, base, and large ViT scales, and even outperforms SOTA task-specific methods, while FS-Adapter offers an excellent efficiency-performance trade-off. The code and models are available on https://fsfm-3c.github.io/fsvfm.html.
[227] AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes
Yu Li,Menghan Xia,Gongye Liu,Jianhong Bai,Xintao Wang,Conglang Zhang,Yuxuan Lin,Ruihang Chu,Pengfei Wan,Yujiu Yang
Main category: cs.CV
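TL;DR: Proposes a two-stage method that adapts pre-trained text-to-video (T2V) models for viewpoint prediction in 4D scenes. An adaptive learning branch and a camera extrinsic diffusion branch allow natural viewpoints to be extracted from generated dynamic videos.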
Details
Motivation: Inspired by the ability of T2V models to simulate real-world geometry and physical laws, the work explores their potential as implicit world models for viewpoint planning in 4D scenes.
Method: First, an adaptive learning branch injects the 4D scene representation into the pre-trained T2V model so that it generates videos that visually embed viewpoint information. Then a hybrid-condition guided camera extrinsic denoising process is built by adding a camera extrinsic diffusion branch, which takes the generated video and the 4D scene as input to extract viewpoints.
Result: Experiments show that the method outperforms existing competitors, and ablation studies confirm the effectiveness of the key technical designs.
Conclusion: The work demonstrates the potential of video generation models for 4D interaction in the real world.
Abstract: Recent Text-to-Video (T2V) models have demonstrated powerful capability in visual simulation of real-world geometry and physical laws, indicating their potential as implicit world models. Inspired by this, we explore the feasibility of leveraging the video generation prior for viewpoint planning from given 4D scenes, since videos internally accompany dynamic scenes with natural viewpoints. To this end, we propose a two-stage paradigm to adapt pre-trained T2V models for viewpoint prediction, in a compatible manner. First, we inject the 4D scene representation into the pre-trained T2V model via an adaptive learning branch, where the 4D scene is viewpoint-agnostic and the conditionally generated video embeds the viewpoints visually. Then, we formulate viewpoint extraction as a hybrid-condition guided camera extrinsic denoising process. Specifically, a camera extrinsic diffusion branch is further introduced onto the pre-trained T2V model, taking the generated video and 4D scene as input. Experimental results show the superiority of our proposed method over existing competitors, and ablation studies validate the effectiveness of our key technical designs. To some extent, this work proves the potential of video generation models toward 4D interaction in the real world.
[228] Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey
Jinxuan Li,Chaolei Tan,Haoxuan Chen,Jianxin Ma,Jian-Fang Hu,Wei-Shi Zheng,Jianhuang Lai
Main category: cs.CV
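TL;DR: This paper gives the first comprehensive survey of image-to-video transfer learning. It classifies methods into frozen-feature and modified-feature approaches, analyzes their application and effectiveness across video-text tasks, and outlines current challenges and future research directions.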
Details
Motivation: Training video-language foundation models requires large amounts of data and compute, so researchers want to use existing image-language foundation models (ILFM) through transfer learning to improve video-text understanding.
Method: The survey summarizes existing ILFM, divides methods for transferring them to the video domain into two categories ("frozen features" and "modified features"), and organizes their applications by task granularity (for example, spatio-temporal grounding and video question answering). It also includes an experimental analysis of the effectiveness of the different paradigms.
Result: The survey systematizes the main strategies for image-to-video transfer learning, reveals how different methods differ in downstream task performance, and provides detailed empirical analysis.
Conclusion: The survey offers a structured roadmap for ILFM-based video-text learning, identifies current challenges, and proposes promising directions for future research.
Abstract: Image-Language Foundation Models (ILFM) have demonstrated remarkable success in image-text understanding/generation tasks, providing transferable multimodal representations that generalize across diverse downstream image-based tasks. The advancement of video-text research has spurred growing interest in extending image-based models to the video domain. This paradigm, known as image-to-video transfer learning, succeeds in alleviating the substantial data and computational requirements associated with training video-language foundation models from scratch for video-text learning. This survey provides the first comprehensive review of this emerging field, which begins by summarizing the widely used ILFM and their capabilities. We then systematically classify existing image-to-video transfer learning strategies into two categories: frozen features and modified features, depending on whether the original representations from ILFM are preserved or undergo modifications. Building upon the task-specific nature of image-to-video transfer, this survey methodically elaborates these strategies and details their applications across a spectrum of video-text learning tasks, ranging from fine-grained (e.g., spatio-temporal video grounding) to coarse-grained (e.g., video question answering). We further present a detailed experimental analysis to investigate the efficacy of different image-to-video transfer learning paradigms on a range of downstream video understanding tasks. Finally, we identify prevailing challenges and highlight promising directions for future research. By offering a comprehensive and structured overview, this survey aims to establish a roadmap for advancing video-text learning based on existing ILFM, and to inspire future research directions in this rapidly evolving domain.
[229] MSM-Seg: A Modality-and-Slice Memory Framework with Category-Agnostic Prompting for Multi-Modal Brain Tumor Segmentation
Yuxiang Luo,Qing Xu,Hai Huang,Yuqi Ouyang,Zhen Chen,Wenting Duan
Main category: cs.CV
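TL;DR: Proposes MSM-Seg, a new multi-modal brain tumor segmentation framework that integrates multi-modal and inter-slice information through a dual-memory segmentation paradigm and uses category-agnostic prompts for efficient, accurate tumor segmentation.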
Details
Motivation: Existing prompt-based segmentation methods ignore cross-modal correlations and depend on labor-intensive category-specific prompts, which limits their use in clinical practice.
Method: A modality-and-slice memory attention (MSMA) exploits cross-modal and inter-slice relationships among the input scans; a multi-scale category-agnostic prompt encoder (MCP-Encoder) provides tumor region guidance for decoding; and a modality-adaptive fusion decoder (MF-Decoder) improves segmentation accuracy.
Result: Experiments on multiple MRI datasets show that MSM-Seg outperforms state-of-the-art methods on multi-modal metastases and glioma segmentation.
Conclusion: By effectively fusing multi-modal and inter-slice information with category-agnostic prompts, MSM-Seg markedly improves the accuracy and practicality of brain tumor segmentation.
Abstract: Multi-modal brain tumor segmentation is critical for clinical diagnosis, and it requires accurate identification of distinct internal anatomical subregions. While the recent prompt-based segmentation paradigms enable interactive experiences for clinicians, existing methods ignore cross-modal correlations and rely on labor-intensive category-specific prompts, limiting their applicability in real-world scenarios. To address these issues, we propose an MSM-Seg framework for multi-modal brain tumor segmentation. MSM-Seg introduces a novel dual-memory segmentation paradigm that synergistically integrates multi-modal and inter-slice information with the efficient category-agnostic prompt for brain tumor understanding. To this end, we first devise a modality-and-slice memory attention (MSMA) to exploit the cross-modal and inter-slice relationships among the input scans. Then, we propose a multi-scale category-agnostic prompt encoder (MCP-Encoder) to provide tumor region guidance for decoding. Moreover, we devise a modality-adaptive fusion decoder (MF-Decoder) that leverages the complementary decoding information across different modalities to improve segmentation accuracy. Extensive experiments on different MRI datasets demonstrate that our MSM-Seg framework outperforms state-of-the-art methods in multi-modal metastases and glioma tumor segmentation. The code is available at https://github.com/xq141839/MSM-Seg.
[230] Action-Dynamics Modeling and Cross-Temporal Interaction for Online Action Understanding
Xinyu Yang,Zheheng Jiang,Feixiang Zhou,Yihang Zhu,Na Lv,Nan Xing,Huiyu Zhou
Main category: cs.CV
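TL;DR: Proposes a new State-Specific Model (SSM) that jointly improves action detection and anticipation through critical state compression, action pattern learning, and cross-temporal interaction.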