cs.CL
[1] Benchmark for Assessing Olfactory Perception of Large Language Models
Eftychia Makri,Nikolaos Nakis,Laura Sisson,Gigi Minsky,Leandros Tassiulas,Vahid Satarifard,Nicholas A. Christakis
Main category: cs.CL
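TL;DR: This paper introduces the Olfactory Perception (OP) benchmark for assessing how well large language models (LLMs) reason about smell, covering 1,010 questions across eight task categories. Experiments show that models rely more on lexical associations than on molecular-structure reasoning; the best model reaches 64.4% overall accuracy, and multilingual ensembling improves performance (AUROC = 0.86).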
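The multilingual ensembling result can be illustrated with a minimal sketch. The entry does not say how predictions were aggregated across languages, so the simple mean used here is an assumption, and the language codes, labels, and scores are made up for illustration.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def mean_ensemble(scores_by_language):
    """Average per-question scores across languages (assumed aggregation rule)."""
    columns = zip(*scores_by_language.values())
    return [sum(col) / len(col) for col in columns]


# Made-up scores for four questions; 1 = descriptor applies, 0 = it does not.
labels = [1, 0, 1, 0]
scores_by_language = {
    "en": [0.9, 0.6, 0.4, 0.3],
    "el": [0.5, 0.2, 0.8, 0.6],
}

for lang, scores in scores_by_language.items():
    print(lang, auroc(labels, scores))
print("ensemble", auroc(labels, mean_ensemble(scores_by_language)))
```

In this toy case each language alone reaches an AUROC of 0.75, while the averaged ensemble ranks every positive above every negative, because the two languages err on different questions.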
Details
Motivation: Current LLM research concentrates on the visual and auditory modalities, while olfaction, an important sense, has not been systematically evaluated; a dedicated benchmark is needed to test models' reasoning in this domain.
Method: Build the OP benchmark of 1,010 questions across eight olfactory task categories, presenting each question in two molecular representations (compound names and isomeric SMILES); evaluate 21 model configurations, and extend a subset to 21 languages to analyze multilingual ensembling.
Result: Compound-name prompts clearly outperform SMILES prompts (mean gain of about 7 percentage points); the best model reaches 64.4% overall accuracy; the best multilingual ensemble reaches AUROC = 0.86 on the subset.
Conclusion: LLMs show early olfactory reasoning ability but rely mainly on lexical statistical associations and lack genuine understanding of molecular structure; combining languages improves performance, suggesting that cross-lingual knowledge transfer helps cover the semantic blind spots of any single language.
Abstract: Here we introduce the Olfactory Perception (OP) benchmark, designed to assess the capability of large language models (LLMs) to reason about smell. The benchmark contains 1,010 questions across eight task categories spanning odor classification, odor primary descriptor identification, intensity and pleasantness judgments, multi-descriptor prediction, mixture similarity, olfactory receptor activation, and smell identification from real-world odor sources. Each question is presented in two prompt formats, compound names and isomeric SMILES, to evaluate the effect of molecular representations. Evaluating 21 model configurations across major model families, we find that compound-name prompts consistently outperform isomeric SMILES, with gains ranging from +2.4 to +18.9 percentage points (mean approximately +7 points), suggesting current LLMs access olfactory knowledge primarily through lexical associations rather than structural molecular reasoning. The best-performing model reaches 64.4% overall accuracy, which highlights both emerging capabilities and substantial remaining gaps in olfactory reasoning. We further evaluate a subset of the OP across 21 languages and find that aggregating predictions across languages improves olfactory prediction, with AUROC = 0.86 for the best performing language ensemble model. LLMs should be able to handle olfactory and not just visual or aural information.
[2] A Reliability Evaluation of Hybrid Deterministic-LLM Based Approaches for Academic Course Registration PDF Information Extraction
Muhammad Anis Al Hilmi,Neelansh Khare,Noel Framil Iglesias
Main category: cs.CL
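TL;DR: This study compares three approaches to extracting information from KRS (academic course registration) documents: LLM only, a hybrid regex + LLM method, and a Camelot pipeline with LLM fallback. The last performs best in both accuracy and computational efficiency and is especially suited to compute-constrained environments.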
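The exact match (EM) and Levenshtein similarity (LS) scoring used in this entry's evaluation, with its 0.7 threshold, can be sketched as follows. The entry does not define how LS is normalised, so dividing the edit distance by the longer string's length is an assumption, as are the function names and the sample field values.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalising by the longer string is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def score_field(predicted: str, expected: str, threshold: float = 0.7) -> dict:
    """Score one extracted field with EM and thresholded LS."""
    predicted, expected = predicted.strip(), expected.strip()
    ls = levenshtein_similarity(predicted, expected)
    return {"exact_match": predicted == expected, "ls": ls, "ls_pass": ls >= threshold}


# Hypothetical extracted value versus ground truth, with a one-character OCR slip.
print(score_field("Basis Dota", "Basis Data"))
```

A field with a single wrong character fails EM but still passes the LS threshold, which is why reporting both metrics separates strict correctness from near misses.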
Details
Motivation: Improve reliable, efficient extraction of structured information from academic KRS documents in compute-constrained settings (CPU only, no GPU).
Method: Compare three strategies: (1) LLM only (Gemma 3, Phi 4, and Qwen 2.5, run locally with Ollama); (2) a hybrid deterministic method (regular expressions + LLM); (3) a pipeline that uses Camelot table parsing first, with LLM fallback. Evaluation uses exact match (EM) and Levenshtein similarity (LS) with a threshold of 0.7.
Result: The Camelot pipeline with LLM fallback performs best, with EM and LS of 0.99-1.00 and under 1 second per PDF in most cases; Qwen 2.5:14b is the most robust model; the hybrid method is more efficient than LLM only on deterministic metadata.
Conclusion: Combining deterministic rules with LLMs markedly improves the reliability and efficiency of information extraction from text-based academic documents, particularly for low-compute deployments.
Abstract: This study evaluates the reliability of information extraction approaches from KRS documents using three strategies: LLM only, Hybrid Deterministic-LLM (regex + LLM), and a Camelot-based pipeline with LLM fallback. Experiments were conducted on 140 documents for the LLM-based test and 860 documents for the Camelot-based pipeline evaluation, covering four study programs with varying data in tables and metadata. Three 12-14B LLMs (Gemma 3, Phi 4, and Qwen 2.5) were run locally using Ollama and a consumer-grade CPU without a GPU. Evaluations used exact match (EM) and Levenshtein similarity (LS) metrics with a threshold of 0.7. Although not applicable to all models, the results show that the hybrid approach can improve efficiency compared to LLM only, especially for deterministic metadata. The Camelot-based pipeline with LLM fallback produced the best combination of accuracy (EM and LS up to 0.99-1.00) and computational efficiency (less than 1 second per PDF in most cases). The Qwen 2.5:14b model demonstrated the most consistent performance across all scenarios. These findings confirm that integrating deterministic and LLM methods is increasingly reliable and efficient for information extraction from text-based academic documents in computationally constrained environments.
[3] Scalable Identification and Prioritization of Requisition-Specific Personal Competencies Using Large Language Models
Wanxin Li,Denver McNeney,Nivedita Prabhu,Charlene Zhang,Renee Barr,Matthew Kitching,Khanh Dao Duc,Anthony S. Boyce
Main category: cs.CL
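TL;DR: This paper proposes an LLM-based method for identifying and prioritizing requisition (req)-specific personal competencies (PCs) from job requisitions, with accuracy approaching human expert inter-rater reliability.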
Details
Motivation: Existing AI recruitment tools struggle to capture the PCs that a specific requisition calls for and rely only on job categories, which limits precise identification of strong candidates.
Method: Combine dynamic few-shot prompting, reflection-based self-improvement, similarity-based filtering, and multi-stage validation into an LLM-driven framework for identifying req-specific PCs.
Result: On a dataset of Program Manager reqs, the method reaches an average accuracy of 0.76, approaching human expert inter-rater reliability, with an out-of-scope rate of only 0.07.
Conclusion: The method effectively extracts and ranks the key personal competencies in a requisition, giving AI recruitment systems finer-grained, requisition-oriented competency assessment.
Abstract: AI-powered recruitment tools are increasingly adopted in personnel selection, yet they struggle to capture the requisition (req)-specific personal competencies (PCs) that distinguish successful candidates beyond job categories. We propose a large language model (LLM)-based approach to identify and prioritize req-specific PCs from reqs. Our approach integrates dynamic few-shot prompting, reflection-based self-improvement, similarity-based filtering, and multi-stage validation. Applied to a dataset of Program Manager reqs, our approach correctly identifies the highest-priority req-specific PCs with an average accuracy of 0.76, approaching human expert inter-rater reliability, and maintains a low out-of-scope rate of 0.07.
[4] Dynin-Omni: Omnimodal Unified Large Diffusion Language Model
Jaeik Kim,Woojin Kim,Jihwan Hong,Yejoon Lee,Sieun Hyeon,Mintaek Lim,Yunseok Han,Dogeun Kim,Hoeun Lee,Hyunggeun Kim,Jaeyoung Do
Main category: cs.CL
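TL;DR: Dynin-Omni is the first masked-diffusion-based omnimodal foundation model. It handles text, image, speech, and video in a shared discrete token space with iterative refinement under bidirectional context, outperforming existing open-source unified models and remaining competitive with modality-specific expert systems.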
Details
Motivation: Address the serialization constraints of autoregressive unified models and the dependence on external decoders in compositional unified models, and explore masked diffusion as a truly unified omnimodal modeling paradigm.
Method: Propose a masked-diffusion omnimodal framework with a shared discrete token space and iterative refinement under bidirectional context; design a multi-stage training strategy with model-merging-based modality expansion and omnimodal alignment.
Result: Evaluated on 19 multimodal benchmarks: 87.6 on GSM8K, 1733.6 on MME-P, 61.4 on VideoMME, 0.87 on GenEval, and 2.1 WER on LibriSpeech test-clean; overall better than open-source unified models and competitive with modality-specific expert systems.
Conclusion: Masked diffusion can serve as an efficient, flexible unified paradigm for omnimodal modeling, supporting real-time omnimodal systems, cross-modal retrieval and generation, and embodied multimodal agents.
Abstract: We present Dynin-Omni, the first masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture. Unlike autoregressive unified models that serialize heterogeneous modalities, or compositional unified models that require orchestration with external modality-specific decoders, Dynin-Omni natively formulates omnimodal modeling as masked diffusion over a shared discrete token space, enabling iterative refinement under bidirectional context. Dynin-Omni adopts a multi-stage training strategy with model-merging-based modality expansion and omnimodal alignment. We evaluate Dynin-Omni across 19 multimodal benchmarks spanning language reasoning, image generation and editing, video understanding, and speech recognition and synthesis. Dynin-Omni achieves 87.6 on GSM8K, 1733.6 on MME-P, 61.4 on VideoMME, 0.87 on GenEval, and 2.1 WER on LibriSpeech test-clean, consistently outperforming existing open-source unified models while remaining competitive with strong modality-specific expert systems. These results demonstrate the potential of masked diffusion as a unified paradigm for any-to-any modeling, providing a flexible foundation for real-time omnimodal systems, unified cross-modal retrieval and generation, and embodied multimodal agents.
[5] How Trustworthy Are LLM-as-Judge Ratings for Interpretive Responses? Implications for Qualitative Research Workflows
Songhee Han,Jueun Shin,Jiyoon Han,Bung-Woo Jun,Hilal Ayan Karabatman
Main category: cs.CL
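TL;DR: This study evaluates how well LLM-as-judge ratings assess interpretive quality in qualitative research. The ratings reflect broad model-level trends but diverge substantially from human judgments in score magnitude. Coherence aligns most strongly with human ratings, Faithfulness and Correctness are systematically misaligned on non-literal, nuanced interpretations, and safety-related metrics are unrelated to interpretive quality. LLM-as-judge is therefore better suited to screening out underperforming models than to replacing human judgment.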
Details
Motivation: Qualitative researchers often insert an LLM into the analytic workflow as is, without systematic evaluation of interpretive quality or comparison across models, so model selection goes unexamined despite its possible influence on interpretive outcomes.
Method: Using 712 excerpts from interviews with K-12 mathematics teachers, generate one-sentence interpretations with five widely adopted inference models; score them automatically on five metrics with AWS Bedrock's LLM-as-judge framework; have trained human evaluators independently rate a stratified subset on interpretive accuracy, nuance preservation, and interpretive coherence; compare automated and human ratings.
Result: LLM-as-judge scores capture the broad direction of human evaluations at the model level but differ substantially in magnitude; Coherence aligns best with aggregated human ratings; Faithfulness and Correctness are systematically misaligned on non-literal and nuanced interpretations; safety-related metrics are unrelated to interpretive quality.
Conclusion: LLM-as-judge methods are suitable for initial screening or for eliminating poorly performing models but cannot replace human judgment; the study offers practical guidance for systematically comparing and selecting LLMs in qualitative research.
Abstract: As qualitative researchers show growing interest in using automated tools to support interpretive analysis, a large language model (LLM) is often introduced into an analytic workflow as is, without systematic evaluation of interpretive quality or comparison across models. This practice leaves model selection largely unexamined despite its potential influence on interpretive outcomes. To address this gap, this study examines whether LLM-as-judge evaluations meaningfully align with human judgments of interpretive quality and can inform model-level decision making. Using 712 conversational excerpts from semi-structured interviews with K-12 mathematics teachers, we generated one-sentence interpretive responses using five widely adopted inference models: Command R+ (Cohere), Gemini 2.5 Pro (Google), GPT-5.1 (OpenAI), Llama 4 Scout-17B Instruct (Meta), and Qwen 3-32B Dense (Alibaba). Automated evaluations were conducted using AWS Bedrock's LLM-as-judge framework across five metrics, and a stratified subset of responses was independently rated by trained human evaluators on interpretive accuracy, nuance preservation, and interpretive coherence. Results show that LLM-as-judge scores capture broad directional trends in human evaluations at the model level but diverge substantially in score magnitude. Among automated metrics, Coherence showed the strongest alignment with aggregated human ratings, whereas Faithfulness and Correctness revealed systematic misalignment at the excerpt level, particularly for non-literal and nuanced interpretations. Safety-related metrics were largely irrelevant to interpretive quality. These findings suggest that LLM-as-judge methods are better suited for screening or eliminating underperforming models than for replacing human judgment, offering practical guidance for systematic comparison and selection of LLMs in qualitative research workflows.
[6] Eyla: Toward an Identity-Anchored LLM Architecture with Integrated Biological Priors -- Vision, Implementation Attempt, and Lessons from AI-Assisted Development
Arif Aditto
Main category: cs.CL
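TL;DR: This paper proposes Eyla, an identity-anchored LLM architecture that targets identity consistency (maintaining a coherent self-model under adversarial pressure, admitting uncertainty, and resisting manipulation), and a new benchmark, the Identity Consistency Score (ICS). The author, a non-programmer, tried to implement the architecture with AI coding assistants (Claude Code, Cursor); the attempt cost over $1,000 and failed, producing a 1.27B-parameter model whose 86 "brain" subsystems contribute less than 2% to the output. The paper analyzes five failure modes of AI-assisted development of novel architectures and draws practical lessons for the AI systems and AI-assisted software engineering communities.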
Details
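Motivation: Existing LLMs are optimized for generic helpfulness and do not model identity consistency; the author aims to build an LLM agent operating system that maintains a stable self-model, admits uncertainty, and resists manipulation while running on consumer hardware.
Method: Propose the Eyla architecture, integrating biologically inspired subsystems: HiPPO-initialized state-space models, zero-initialized adapters, episodic memory retrieval, and calibrated uncertainty training; design the ICS as a new evaluation benchmark; attempt an end-to-end implementation as a non-programmer using AI coding assistants (Claude Code, Cursor), documenting the failure and its root causes.
Result: The implementation failed: the final model has 1.27B parameters, and its 86 "brain" subsystems contribute less than 2% to the output. The analysis identifies five systematic failure modes of AI-assisted development for novel architectures and offers concrete recommendations; to the author's knowledge, it is the first paper to combine an architectural vision with a first-person failure analysis of AI-assisted development.
Conclusion: Current AI coding assistants have fundamental limitations when supporting highly original, tightly coupled system-level LLM architecture development; identity consistency is a quantifiable LLM capability worth pursuing in its own right; the failure is itself informative, and its structured post-mortem offers lessons for AI system design and AI-assisted engineering practice.
Abstract: We present the design rationale, implementation attempt, and failure analysis of Eyla, a proposed identity-anchored LLM architecture that integrates biologically-inspired subsystems -- including HiPPO-initialized state-space models, zero-initialized adapters, episodic memory retrieval, and calibrated uncertainty training -- into a unified agent operating system running on consumer hardware. Unlike existing approaches that optimize models for generic helpfulness, Eyla targets identity consistency: the ability to maintain a coherent self-model under adversarial pressure, admit uncertainty, and resist manipulation. We propose the Identity Consistency Score (ICS), a novel benchmark for evaluating this property across LLMs. We then present an honest account of attempting to implement this architecture using AI coding assistants (Claude Code, Cursor) as a non-programmer, documenting a $1,000+ failure that produced a 1.27B parameter model with 86 brain subsystems contributing less than 2% to output. Our analysis identifies five systematic failure modes of AI-assisted development for novel architectures and offers concrete recommendations. To our knowledge, this is the first paper to combine an architectural vision with a documented first-person failure analysis of AI-assisted LLM development, providing lessons for both the AI systems and AI-assisted software engineering communities.
[7] Can LLMs Perceive Time? An Empirical Investigation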
Aniketh Garikaparthi
Main category: cs.CL
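TL;DR: Large language models cannot accurately estimate how long their own tasks take: pre-task estimates overshoot actual duration by 4-7x, and relative ordering and post-hoc recall are also badly off, because the models lack experiential grounding in their own inference time.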
Details
Motivation: Investigate why LLMs cannot accurately estimate their own task duration and what this means for agent scheduling, planning, and time-critical scenarios.
Method: Four experiments across 68 tasks and four model families, covering pre-task estimation, relative ordering, post-hoc recall, and time estimation in multi-step agentic settings.
Result: Pre-task estimates overshoot by 4-7x (p < 0.001); on counter-intuitive task pairs, relative ordering is only 18% correct for GPT-5 (p = 0.033); post-hoc recall is off by an order of magnitude; errors of 5-10x persist in multi-step agentic settings.
Conclusion: Models hold propositional knowledge about duration but lack experiential grounding in their own inference time, which leads to systematic failures of time estimation.
Abstract: Large language models cannot estimate how long their own tasks take. We investigate this limitation through four experiments across 68 tasks and four model families. Pre-task estimates overshoot actual duration by 4--7$\times$ ($p < 0.001$), with models predicting human-scale minutes for tasks completing in seconds. Relative ordering fares no better: on task pairs designed to expose heuristic reliance, models score at or below chance (GPT-5: 18% on counter-intuitive pairs, $p = 0.033$), systematically failing when complexity labels mislead. Post-hoc recall is disconnected from reality -- estimates diverge from actuals by an order of magnitude in either direction. These failures persist in multi-step agentic settings, with errors of 5--10$\times$. The models possess propositional knowledge about duration from training but lack experiential grounding in their own inference time, with practical implications for agent scheduling, planning and time-critical scenarios.
[8] Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms
Mingjie Li,Wai Man Si,Michael Backes,Yang Zhang,Yisen Wang
Main category: cs.CL
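TL;DR: This paper finds that post-training suppresses, but does not remove, the original safety mechanisms of large reasoning models (LRMs). Building on this, it proposes SafeReAct, a lightweight method that restores safety by aligning with LoRA adapters on a few layers, without harming reasoning ability.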
Details
Motivation: Post-training or fine-tuning LLMs to improve performance on specific tasks such as reasoning often reduces safety; the cause needs to be understood and a low-cost fix found.
Method: Analyze how post-training affects a model's safety mechanisms, finding that it masks rather than removes them; design SafeReAct, which restores the suppressed safety behaviors by aligning with LoRA adapters on a few layers.
Result: On four state-of-the-art LRMs, SafeReAct significantly improves safety on harmful prompts without compromising reasoning performance; results on other domain-specific models, such as medical models, indicate that it generalizes.
Conclusion: The safety degradation caused by post-training comes from safety mechanisms being masked rather than removed, and SafeReAct offers a lightweight, efficient, and general way to restore them.
Abstract: Despite the impressive performance of general-purpose large language models (LLMs), they often require fine-tuning or post-training to excel at specific tasks. For instance, large reasoning models (LRMs), such as the DeepSeek-R1 series, demonstrate strong reasoning capabilities after post-training different general large language models on diverse chain-of-thought (CoT) datasets. However, this additional training frequently comes at the cost of reduced safety, as the fine-tuned or post-trained models tend to exhibit more harmful behaviors compared with the regular LLMs before post-training or fine-tuning, potentially leading to harmful outcomes due to their enhanced capabilities. Taking LRMs as an example, we first investigate the underlying cause of this safety degradation in this paper. Our analysis reveals that post-training can mask the original safety mechanisms of the base LLM, while over-amplifying representations related to their post-training ability. But luckily, we also find that LRMs' safety mechanisms still exist instead of being removed during their post-training. Based on these findings, we propose a lightweight and cost-effective solution called SafeReAct that restores the suppressed safety behaviors by aligning with LoRA adapters on a few layers. Experiments on four state-of-the-art LRMs show that our method significantly improves safety on harmful prompts without compromising reasoning performance. Besides LRMs, additional results on other domain-specific LLMs, like medical models, further confirm the generality and effectiveness of our approach.
[9] MSA-Thinker: Discrimination-Calibration Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis
Miaosen Luo,Zhenhao Yang,Jieshen Long,Jinghu Sun,Yichu Liu,Sijie Mai
Main category: cs.CL
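TL;DR: This paper proposes a training framework that combines structured Discrimination-Calibration (DC) reasoning with hint-guided reinforcement learning (Hint-GRPO) to improve the interpretability, robustness, and generalization of multimodal sentiment analysis models.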
Details
Motivation: MLLMs perform well on multimodal sentiment analysis but are black boxes with limited interpretability; CoT annotation is costly, and RL suffers from low exploration efficiency and sparse rewards.
Method: (1) Cold-start supervised fine-tuning on high-quality CoT data with the DC structure, synthesized by a teacher model (Qwen3Omni-30B); (2) Hint-GRPO, which uses the discrimination phase of the DC structure as a verifiable anchor during RL to give directional hints on hard samples and mitigate reward sparsity.
Result: Experiments on Qwen2.5Omni-7B show higher accuracy on fine-grained sentiment regression, high-quality structured reasoning chains, and stronger generalization in cross-domain evaluation.
Conclusion: An explicit DC reasoning structure improves interpretability and robustness, and Hint-GRPO offers a new paradigm for building trustworthy, efficient sentiment analysis systems.
Abstract: Multimodal sentiment analysis aims to understand human emotions by integrating textual, auditory, and visual modalities. Although Multimodal Large Language Models (MLLMs) have achieved state-of-the-art performance via supervised fine-tuning (SFT), their end-to-end "black-box" nature limits interpretability. Existing methods incorporating Chain-of-Thought (CoT) reasoning are hindered by high annotation costs, while Reinforcement Learning (RL) faces challenges such as low exploration efficiency and sparse rewards, particularly on hard samples. To address these issues, we propose a novel training framework that integrates structured Discrimination-Calibration (DC) reasoning with Hint-based Reinforcement Learning. First, we perform cold-start SFT using high-quality CoT data synthesized by a teacher model (Qwen3Omni-30B), which inherently contains the DC structure. This equips the model with a reasoning paradigm that performs macro discrimination followed by fine-grained calibration from the initial stage. Building on this, we propose Hint-GRPO, which leverages the discrimination phase within the DC structure as a verifiable anchor during RL to provide directional hints for hard samples, guiding policy optimization and effectively mitigating the reward sparsity problem. Experiments on the Qwen2.5Omni-7B model demonstrate that our method not only achieves higher accuracy in fine-grained sentiment regression tasks but also generates high-quality structured reasoning chains. Crucially, it exhibits superior generalization capability in cross-domain evaluations. This enhances model interpretability while validating the positive contribution of explicit reasoning steps to model robustness, offering a new paradigm for building trustworthy and efficient sentiment analysis systems.
[10] ASCAT: An Arabic Scientific Corpus and Benchmark for Advanced Translation Evaluation
Serry Sibaee,Khloud Al Jallad,Zineb Yousfi,Israa Elsayed Elhosiny,Yousra El-Ghawi,Batool Balah,Omer Nacar
Main category: cs.CL
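TL;DR: This paper presents ASCAT, a high-quality English-Arabic parallel corpus for evaluating scientific translation. It covers five scientific domains, contains full scientific abstracts produced by multi-engine translation and validated by experts, and is used to benchmark several LLMs.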
Details
Motivation: Existing Arabic-English corpora rely on short sentences or single-domain text; there is no multi-domain, long-text, expert-validated resource for evaluating scientific translation, a gap this work aims to fill.
Method: Build a systematic pipeline of multi-engine translation (Gemini, Hugging Face quickmt-en-ar, Google Translate, DeepL) followed by domain-expert validation at the lexical, syntactic, and semantic levels, applied to full scientific abstracts from five domains: physics, mathematics, computer science, quantum mechanics, and artificial intelligence.
Result: The ASCAT corpus contains 67,293 English tokens and 60,026 Arabic tokens, with an Arabic vocabulary of 17,604 unique words; benchmarking GPT-4o-mini (BLEU 37.07), Gemini-3.0-Flash-Preview (BLEU 30.44), and Qwen3-235B-A22B (BLEU 23.68) on it demonstrates its discriminative power.
Conclusion: ASCAT is described as the first English-Arabic parallel evaluation benchmark for multi-domain scientific abstracts that combines scale, quality, and linguistic coverage; it supports rigorous evaluation of scientific translation quality and training of domain-specific translation models.
Abstract: We present ASCAT (Arabic Scientific Corpus for Advanced Translation), a high-quality English-Arabic parallel benchmark corpus designed for scientific translation evaluation, constructed through a systematic multi-engine translation and human validation pipeline. Unlike existing Arabic-English corpora that rely on short sentences or single-domain text, ASCAT targets full scientific abstracts averaging 141.7 words (English) and 111.78 words (Arabic), drawn from five scientific domains: physics, mathematics, computer science, quantum mechanics, and artificial intelligence. Each abstract was translated using three complementary architectures: generative AI (Gemini), transformer-based models (Hugging Face quickmt-en-ar), and commercial MT APIs (Google Translate, DeepL), and was subsequently validated by domain experts at the lexical, syntactic, and semantic levels. The resulting corpus contains 67,293 English tokens and 60,026 Arabic tokens, with an Arabic vocabulary of 17,604 unique words reflecting the morphological richness of the language. We benchmark three state-of-the-art LLMs on the corpus: GPT-4o-mini (BLEU: 37.07), Gemini-3.0-Flash-Preview (BLEU: 30.44), and Qwen3-235B-A22B (BLEU: 23.68), demonstrating its discriminative power as an evaluation benchmark. ASCAT addresses a critical gap in scientific MT resources for Arabic and is designed to support rigorous evaluation of scientific translation quality and training of domain-specific translation models.
[11] Are they human? Detecting large language models by probing human memory constraints
Simon Schug,Brenden M. Lake
Main category: cs.CL
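TL;DR: This paper proposes a new way to distinguish human online participants from LLMs by exploiting a known cognitive constraint, limited working memory capacity. Cognitive modeling on a standard serial recall task identifies LLMs even when they are instructed to mimic human working memory limits.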
Details
Motivation: General-purpose LLMs now pass many Turing-test-style challenges, so machine detection based on tasks that are easy for humans but hard for machines no longer works; this threatens the validity of online behavioral research and calls for new ways to tell humans from machines.
Method: Apply cognitive modeling to a standard serial recall task and compare the behavior of human participants with that of LLMs, including LLMs explicitly instructed to mimic human working memory limits.
Result: Cognitive modeling robustly separates real human participants from LLMs; even when LLMs deliberately imitate human working memory limits, their behavior differs markedly.
Conclusion: Using established human cognitive constraints, such as limited working memory capacity, as a detection baseline is a viable and promising approach to human-machine discrimination and helps protect the validity of online behavioral research.
Abstract: The validity of online behavioral research relies on study participants being human rather than machine. In the past, it was possible to detect machines by posing simple challenges that were easily solved by humans but not by machines. General-purpose agents based on large language models (LLMs) can now solve many of these challenges, threatening the validity of online behavioral research. Here we explore the idea of detecting humanness by using tasks that machines can solve too well to be human. Specifically, we probe for the existence of an established human cognitive constraint: limited working memory capacity. We show that cognitive modeling on a standard serial recall task can be used to distinguish online participants from LLMs even when the latter are specifically instructed to mimic human working memory constraints. Our results demonstrate that it is viable to use well-established cognitive phenomena to distinguish LLMs from humans.
[12] Semantic Shifts of Psychological Concepts in Scientific and Popular Media Discourse: A Distributional Semantics Analysis of Russian-Language Corpora
Orlova Anastasia
Main category: cs.CL
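TL;DR: This paper uses distributional semantics to analyze semantic shifts of psychological terms between Russian-language scientific and popular media discourse. Scientific discourse emphasizes methodological and clinical terminology, whereas popular discourse foregrounds everyday experience and therapeutic practice; meanings shift from precise professional usage toward generalized, experiential ones.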
Details
Motivation: Examine how the meanings of psychological concepts shift between scientific and popular media discourse, and understand how concepts are transformed across communicative contexts.
Method: Compile Russian-language corpora (a scientific corpus of about 767,000 tokens and a popular science corpus of about 1,199,000 tokens); preprocess them with OCR, lemmatization, and stop-word removal; then apply frequency analysis, clustering, and identification of semantic associations.
Result: Scientific texts focus on methodological and clinical terminology, while popular texts foreground everyday experience and therapeutic practice; key concepts such as burnout and depression are linked to psychological resources, symptomatology, and diagnostic constructs in scientific discourse, and to personal narratives, emotions, and everyday situations in popular discourse.
Conclusion: In popular media, psychological concepts shift clearly from precise professional terminology toward generalized, experiential meanings, and distributional semantics methods are effective at revealing such shifts across contexts.
Abstract: This article examines semantic shifts in psychological concepts across scientific and popular media discourse using methods of distributional semantics applied to Russian-language corpora. Two corpora were compiled: a scientific corpus of approximately 300 research articles from the journals Psychology. Journal of the Higher School of Economics and Vestnik of Saint Petersburg University. Psychology (767,543 tokens) and a popular science corpus consisting of texts from the online psychology platforms Yasno and Chistye kogntsii (1,199,150 tokens). After preprocessing (OCR recognition, lemmatization, removal of stop words and non-informative characters), the corpora were analyzed through frequency analysis, clustering, and the identification of semantic associations. The results reveal significant differences in vocabulary and conceptual framing between the two discourse types: scientific texts emphasize methodological and clinical terminology, while popular science materials foreground everyday experience and therapeutic practice. A comparison of semantic associations for key concepts such as burnout and depression shows that scientific discourse links these terms to psychological resources, symptomatology, and diagnostic constructs, whereas popular science discourse frames them through personal narratives, emotions, and everyday situations. These findings demonstrate a clear shift from precise professional terminology toward more generalized and experiential meanings in popular media discourse and confirm the effectiveness of distributional semantics methods for identifying semantic transformations of psychological concepts across different communicative contexts.
[13] Think Twice Before You Write -- an Entropy-based Decoding Strategy to Enhance LLM Reasoning
Jiashu He,Meizhu Liu,Olaitan P Olaleye,Amit Agarwal,M. Avendi,Yassi Abbasi,Matthew Rowe,Hitesh Laxmichand Patel,Paul Li,Tao Sheng,Sujith Ravi,Dan Roth
Main category: cs.CL
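TL;DR: This paper proposes an entropy-guided adaptive decoding framework that branches selectively at high-uncertainty token positions, maintains a dynamic pool of partial rollouts, and adds a rollout-level Entropy After (EAT) stopping criterion. It improves LLM reasoning accuracy and computational efficiency; on smaller models, performance is comparable to GPT-5.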
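The core branching rule can be illustrated with a toy sketch: compute the Shannon entropy of the next-token distribution and branch only when it is high. This is not the authors' implementation; the threshold of 0.8 nats, the branch width of 2, the function names, and the probability tables are all assumptions, and the rollout pool and EAT stopping criterion are omitted.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def expand(prefix, next_token_probs, threshold=0.8, branch_width=2):
    """Extend a partial rollout by one token.

    Low entropy: continue greedily with the single most likely token.
    High entropy: branch into the `branch_width` most likely tokens.
    """
    ranked = sorted(next_token_probs.items(), key=lambda kv: kv[1], reverse=True)
    uncertain = token_entropy(next_token_probs.values()) > threshold
    width = branch_width if uncertain else 1
    return [prefix + [token] for token, _ in ranked[:width]]


# Hypothetical next-token distributions at two decoding steps.
confident = {"4": 0.97, "5": 0.02, "6": 0.01}
uncertain = {"4": 0.40, "5": 0.35, "6": 0.25}

print(expand(["2", "+", "2", "="], confident))  # one continuation
print(expand(["x", "="], uncertain))            # two continuations
```

Branching only at uncertain positions is what keeps the number of rollouts small compared with sampling many full traces, as self-consistency does.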
Details
Motivation: Traditional decoding strategies (greedy decoding, beam search, sampling) suffer from error propagation or insufficient robustness; self-consistency improves reliability but is computationally expensive, so a decoding mechanism that balances accuracy and efficiency is needed.
Method: Propose an entropy-guided decoding framework: at each step, compute the entropy of the token distribution, identify high-uncertainty positions, and branch selectively on them; maintain and expand a dynamic pool of partial rollouts; apply the EAT stopping criterion, which evaluates entropy only after the full reasoning trace has been generated.
Result: Consistently strong accuracy on GSM8K, AMC2023, and their perturbed variants; on smaller LLMs, performance is comparable to GPT-5 at a fraction of the cost.
Conclusion: Entropy-guided adaptive decoding concentrates computation where uncertainty is greatest and is a viable way to improve the efficiency and robustness of LLM reasoning.
Abstract: Decoding strategies play a central role in shaping the reasoning ability of large language models (LLMs). Traditional methods such as greedy decoding and beam search often suffer from error propagation, while sampling-based approaches introduce randomness without adequate robustness. Self-consistency improves reliability by aggregating multiple rollouts, but incurs significant computational overhead. We propose an entropy-guided decoding framework that introduces token-level adaptivity into generation. At each step, the model computes the entropy of the token distribution, identifies high-uncertainty positions, and selectively branches on these vulnerable points. A dynamic pool of partial rollouts is maintained and expanded until solutions are completed, concentrating computation where uncertainty is greatest and avoiding unnecessary exploration in confident regions. To enable efficient termination, we apply a rollout-level Entropy After (EAT) stopping criterion by performing entropy evaluation after the full reasoning trace, rather than incrementally at every step. Experiments on GSM8K, AMC2023, and their perturbed variants demonstrate that our method achieves consistently strong accuracy. Notably, on smaller LLMs, performance is comparable to GPT-5 while operating at a fraction of the cost.
[14] The Chronicles of RiDiC: Generating Datasets with Controlled Popularity Distribution for Long-form Factuality Evaluation
Pavel Braslavski,Dmitrii Iarosh,Nikita Sushko,Andrey Sakhovskiy,Vasily Konovalov,Elena Tutubalina,Alexander Panchenko
Main category: cs.CL
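TL;DR: This paper presents a configurable pipeline, based on Wikipedia and Wikidata, for generating multilingual entity sets and building evaluation datasets such as RiDiC, which assess the factuality of LLMs' long-form generation with a focus on English and Chinese; the code and data are released.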
Details
Motivation: Existing evaluation relies mostly on short-form QA datasets, which do not adequately test the factuality of LLMs' long-form generation; multilingual entity sets with controllable attributes (domain, geographical location, popularity) are needed for fuller evaluation.
Method: Design and implement a configurable pipeline that extracts multilingual entities with specified attributes from Wikipedia and Wikidata; as an example, build RiDiC, a dataset of 3,000 entities (rivers, natural disasters, car models) with English and Chinese names and content; collect long-form generations from several LLMs in English and Chinese and evaluate them with a third-party factuality checker.
Result: RiDiC shows that even frontier LLMs still hallucinate noticeably in long-form generation; the code, data, and generation/evaluation scripts have been released to support multilingual long-form factuality evaluation.
Conclusion: The pipeline and the RiDiC dataset provide a practical approach and tooling for multilingual long-form factuality evaluation and confirm that current LLMs still fall short when generating knowledge about less common entities.
Abstract: We present a configurable pipeline for generating multilingual sets of entities with specified characteristics, such as domain, geographical location and popularity, using data from Wikipedia and Wikidata. These datasets are intended for evaluating the factuality of LLMs' long-form generation, thereby complementing evaluation based on short-form QA datasets. We present the RiDiC dataset as an example of this approach. RiDiC contains 3,000 entities from three domains -- rivers, natural disasters, and car models -- spanning different popularity tiers. Each entity is accompanied by its geographical location, English and Chinese names (if available) and relevant English and Chinese Wikipedia content, which is used to evaluate LLMs' responses. Generations about RiDiC entities were obtained from three LLMs in English and Chinese. These were then evaluated using a third-party factuality checker, which showed that entities from our dataset caused even frontier models to hallucinate. To facilitate the evaluation of LLMs' long-form factuality in multiple languages, the code, data, and generation/evaluation scripts have been released.
[15] Detecting Abnormal User Feedback Patterns through Temporal Sentiment Aggregation
Yalun Qi,Sichen Zhao,Zhiming Xue,Xianling Zeng,Zihan Yu
Main category: cs.CL
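TL;DR: This paper proposes a time-window sentiment aggregation framework: a pretrained language model such as RoBERTa extracts per-comment sentiment, which is aggregated into time-window-level scores to detect abnormal sudden drops in user feedback sentiment.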
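The aggregation step can be sketched in a few lines. This is an illustrative toy, not the authors' pipeline: per-comment sentiment scores are assumed to be already available (for example from a RoBERTa classifier), the window is one calendar day, and a fixed `min_drop` between consecutive windows stands in for whatever significance test the paper uses. All names and data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_by_day(comments):
    """Average per-comment sentiment within each calendar day.

    `comments` is an iterable of (ISO timestamp, score) pairs, with scores
    assumed to lie in [-1, 1] (negative to positive).
    """
    buckets = defaultdict(list)
    for timestamp, score in comments:
        buckets[datetime.fromisoformat(timestamp).date()].append(score)
    return [(day, mean(scores)) for day, scores in sorted(buckets.items())]


def flag_drops(window_scores, min_drop=0.5):
    """Return the days whose mean score fell by at least `min_drop`
    relative to the previous window (a stand-in for a significance test)."""
    flagged = []
    for (_, previous), (day, current) in zip(window_scores, window_scores[1:]):
        if previous - current >= min_drop:
            flagged.append(day)
    return flagged


# Made-up comment scores over three days.
comments = [
    ("2024-01-01T09:00:00", 0.8), ("2024-01-01T17:30:00", 0.6),
    ("2024-01-02T10:15:00", 0.7), ("2024-01-02T21:00:00", 0.5),
    ("2024-01-03T08:45:00", -0.4), ("2024-01-03T12:00:00", -0.2),
]

windows = aggregate_by_day(comments)
print(windows)
print(flag_drops(windows))
```

Averaging within a window smooths out the noise of individual short comments, so a drop is flagged only when many comments in the same window turn negative together.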
Details
Motivation: Traditional sentiment analysis classifies individual texts and, given the noise and class imbalance in short user comments, cannot capture collective shifts over time; applications such as brand reputation management need timely detection of sentiment anomalies.
Method: Propose a temporal sentiment aggregation framework: use RoBERTa to extract a sentiment signal from each comment, aggregate the signals into time-window-level scores, and identify feedback anomalies by detecting significant downward shifts.
Result: Experiments on real social media data show that the method identifies statistically significant sentiment drops that correspond to coherent complaint patterns.
Conclusion: The framework offers an effective and interpretable solution for monitoring anomalies in user feedback.
Abstract: In many real-world applications, such as customer feedback monitoring, brand reputation management, and product health tracking, understanding the temporal dynamics of user sentiment is crucial for early detection of anomalous events such as malicious review campaigns or sudden declines in user satisfaction. Traditional sentiment analysis methods focus on individual text classification, which is insufficient to capture collective behavioral shifts over time due to inherent noise and class imbalance in short user comments. In this work, we propose a temporal sentiment aggregation framework that leverages pretrained transformer-based language models to extract per-comment sentiment signals and aggregates them into time-window-level scores. Significant downward shifts in these aggregated scores are interpreted as potential anomalies in user feedback patterns. We adopt RoBERTa as our core semantic feature extractor and demonstrate, through empirical evaluation on real social media data, that the aggregated sentiment scores reveal meaningful trends and support effective anomaly detection. Experiments on real-world social media data demonstrate that our method successfully identifies statistically significant sentiment drops that correspond to coherent complaint patterns, providing an effective and interpretable solution for feedback anomaly monitoring.
[16] How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models
Hiroki Fukui
Main category: cs.CL
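TL;DR: Using multi-agent simulations, this study examines how different LLMs process ethical instructions. It finds marked differences in internal ethical processing across models, proposes four ethical processing types, and identifies an interaction between processing capacity and instruction format.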
Details
Motivation: Alignment safety research assumes that ethical instructions improve model behavior, but how models process such instructions internally is unknown.
Method: Run over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English); introduce three new metrics (DD, VCAD, ORI) to characterize ethical processing.
Result: The Llama Japanese dissociation pattern is confirmed as model-specific; four ethical processing types are identified; processing capacity (DD) interacts with instruction format: in low-DD models the format has no effect, while in high-DD models reasoned norms and virtue framing produce opposite effects; lexical compliance with instructions does not correlate significantly with any internal processing metric.
Conclusion: Safe output, surface compliance, and actual ethical processing are dissociable; the effect of ethical instructions depends heavily on a model's processing capacity; the processing types correspond structurally to patterns seen in clinical offender treatment, suggesting that AI safety evaluation should examine internal processing rather than outputs alone.
Abstract: Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. We conducted over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English). Confirmatory analysis fully replicated the Llama Japanese dissociation pattern from a prior study ($\mathrm{BF}_{10} > 10$ for all three hypotheses), but none of the other three models reproduced this pattern, establishing it as model-specific. Three new metrics -- Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI) -- revealed four distinct ethical processing types: Output Filter (GPT; safe outputs, no processing), Defensive Repetition (Llama; high consistency through formulaic repetition), Critical Internalization (Qwen; deep deliberation, incomplete integration), and Principled Consistency (Sonnet; deliberation, consistency, and other-recognition co-occurring). The central finding is an interaction between processing capacity and instruction format: in low-DD models, instruction format has no effect on internal processing; in high-DD models, reasoned norms and virtue framing produce opposite effects. Lexical compliance with ethical instructions did not correlate with any processing metric at the cell level ($r = -0.161$ to $+0.256$, all $p > .22$; $N = 24$; power limited), suggesting that safety, compliance, and ethical processing are largely dissociable. These processing types show structural correspondence to patterns observed in clinical offender treatment, where formal compliance without internal processing is a recognized risk signal.
[17] Criterion Validity of LLM-as-Judge for Business Outcomes in Conversational Commerce
Liang Chen,Qi Liu,Wenhuan Lin,Feng Liang
Main category: cs.CL
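TL;DR: A two-phase study on a Chinese matchmaking platform tests the criterion validity of a multi-dimensional dialogue evaluation rubric. Dimensions such as Need Elicitation and Pacing Strategy differ markedly in how well they predict actual business conversion; an equal-weighted composite dilutes validity, while reweighting by conversion outcomes improves predictive power. The study also identifies a mechanism in AI conversations where sales behaviors are executed without building trust, proposes a three-layer evaluation architecture, and advocates criterion validity testing as routine practice in dialogue evaluation.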
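The composite dilution effect described in this entry can be illustrated with a small sketch: compute the Spearman correlation between an outcome and (a) individual rubric dimensions, (b) an equal-weighted composite, and (c) a reweighted composite. The function names, weights, and any scores passed in are hypothetical; this is not the authors' analysis code, only a minimal pure-Python illustration of the statistic involved.

```python
def ranks(values):
    """Return 1-based ranks, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied 1-based positions
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def composite(dim_scores, weights):
    """Weighted mean per conversation; dim_scores holds one row of
    dimension scores per conversation (hypothetical layout)."""
    total = sum(weights)
    return [sum(w * s for w, s in zip(weights, row)) / total for row in dim_scores]
```

Comparing `spearman(composite(rows, equal_weights), converted)` against the same call with outcome-informed weights would show whether uninformative dimensions are pulling the composite down, which is the pattern the entry reports.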
Details
Motivation: Multi-dimensional rubric-based dialogue evaluation is widely used, but its criterion validity (whether scores are actually associated with downstream business outcomes) has long gone untested empirically.
Method: A two-phase empirical study on a real Chinese matchmaking platform: Phase 1 is a small pilot (n=14) mixing human and AI conversations; Phase 2 covers 60 human conversations drawn by stratified sampling with verified conversion labels. A 7-dimension LLM-as-Judge rubric is combined with Spearman correlation, Bonferroni correction, logistic regression, and a Trust-Funnel behavioral analysis framework for attribution and mechanism exploration.
Result: Need Elicitation (D1) and Pacing Strategy (D3) are significantly and positively associated with conversion (rho>0.35, p<0.01), while Contextual Memory (D5) shows no significant association; the equal-weighted composite (rho=0.272) is weaker than the best single dimensions, and reweighting raises it to rho=0.351; D3's effect strengthens after controlling for conversation length (OR=3.18); the "evaluation-outcome paradox" is traced to an AI/human agent-type confound; behavioral analysis indicates that AI agents lack trust-building behaviors.
Conclusion: Dimensions of multi-dimensional rubrics are heterogeneous, and equal-weighted aggregation undermines criterion validity; dimensions should be reweighted against downstream targets such as conversion, and criterion validity testing should be institutionalized; trust building is proposed as a key mediating mechanism in AI conversational effectiveness.
Abstract: Multi-dimensional rubric-based dialogue evaluation is widely used to assess conversational AI, yet its criterion validity -- whether quality scores are associated with the downstream outcomes they are meant to serve -- remains largely untested. We address this gap through a two-phase study on a major Chinese matchmaking platform, testing a 7-dimension evaluation rubric (implemented via LLM-as-Judge) against verified business conversion. Our findings concern rubric design and weighting, not LLM scoring accuracy: any judge using the same rubric would face the same structural issue. The core finding is dimension-level heterogeneity: in Phase 2 (n=60 human conversations, stratified sample, verified labels), Need Elicitation (D1: rho=0.368, p=0.004) and Pacing Strategy (D3: rho=0.354, p=0.006) are significantly associated with conversion after Bonferroni correction, while Contextual Memory (D5: rho=0.018, n.s.) shows no detectable association. This heterogeneity causes the equal-weighted composite (rho=0.272) to underperform its best dimensions -- a composite dilution effect that conversion-informed reweighting partially corrects (rho=0.351). Logistic regression controlling for conversation length confirms D3's association strengthens (OR=3.18, p=0.006), ruling out a length confound. An initial pilot (n=14) mixing human and AI conversations had produced a misleading "evaluation-outcome paradox," which Phase 2 revealed as an agent-type confound artifact. Behavioral analysis of 130 conversations through a Trust-Funnel framework identifies a candidate mechanism: AI agents execute sales behaviors without building user trust. We operationalize these findings in a three-layer evaluation architecture and advocate criterion validity testing as standard practice in applied dialogue evaluation.
[18] Phonological Fossils: Machine Learning Detection of Non-Mainstream Vocabulary in Sulawesi Basic Lexicon
Mukhlis Amien,Go Frendi Gunawan
Main category: cs.CL
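TL;DR: Combining rule-based cognate subtraction with an XGBoost classifier built on 26 phonological features, this study analyzes 1,357 basic vocabulary items from six Austronesian languages of Sulawesi and identifies 266 high-confidence non-mainstream candidates. These forms lack cross-linguistic cognate family support, which does not support the hypothesis of a single pre-Austronesian substrate language, but they show geographic patterning (Sulawesi languages have higher non-mainstream rates than Western Indonesian languages). Phonological machine learning can therefore complement the traditional comparative method, though phonological anomaly does not necessarily imply a shared substrate.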
Details
Motivation: Many Austronesian languages of Sulawesi contain numerous words that cannot be reconstructed to Proto-Austronesian through regular sound change, and their origin (pre-Austronesian substrate vs. independent innovation) has not been tested computationally.
Method: Combines rule-based cognate subtraction with an XGBoost classifier on 26 phonological features; uses 1,357 basic vocabulary forms from six Sulawesi languages in the ABVD; validates through Proto-Austronesian cross-checking, Cohen's kappa for cross-method agreement, clustering analysis (silhouette score), and a cross-linguistic cognate test; extends the analysis to 16 additional languages to test geographic patterning.
Result: Identifies 438 candidate substrate forms (26.5%); the classifier reaches AUC=0.763 and reveals a phonological profile of longer forms, more consonant clusters, higher glottal stop rates, and fewer Austronesian prefixes; cross-method consensus confirms 266 high-confidence non-mainstream forms; clustering yields no coherent word families (silhouette=0.114; cognate test p=0.569); the extended analysis shows higher non-mainstream rates for Sulawesi languages (mean 0.606) than for Western Indonesian languages (0.393).
Conclusion: Phonological machine learning can help identify non-mainstream lexical layers, but the Sulawesi non-mainstream forms lack systematic cognate evidence and do not support a single pre-Austronesian substrate; their geographic clustering more likely reflects regional innovation or multiple substrate influences, a caution against equating phonological anomaly with a shared substrate language.
Abstract: Basic vocabulary in many Sulawesi Austronesian languages includes forms resisting reconstruction to any proto-form with phonological patterns inconsistent with inherited roots, but whether this non-conforming vocabulary represents pre-Austronesian substrate or independent innovation has not been tested computationally. We combine rule-based cognate subtraction with a machine learning classifier trained on phonological features. Using 1,357 forms from six Sulawesi languages in the Austronesian Basic Vocabulary Database, we identify 438 candidate substrate forms (26.5%) through cognate subtraction and Proto-Austronesian cross-checking. An XGBoost classifier trained on 26 phonological features distinguishes inherited from non-mainstream forms with AUC=0.763, revealing a phonological fingerprint: longer forms, more consonant clusters, higher glottal stop rates, and fewer Austronesian prefixes. Cross-method consensus (Cohen's kappa=0.61) identifies 266 high-confidence non-mainstream candidates. However, clustering yields no coherent word families (silhouette=0.114; cross-linguistic cognate test p=0.569), providing no evidence for a single pre-Austronesian language layer. Application to 16 additional languages confirms geographic patterning: Sulawesi languages show higher predicted non-mainstream rates (mean P_sub=0.606) than Western Indonesian languages (0.393). This study demonstrates that phonological machine learning can complement traditional comparative methods in detecting non-mainstream lexical layers, while cautioning against interpreting phonological non-conformity as evidence for a shared substrate language.
[19] WHBench: Evaluating Frontier LLMs with Expert-in-the-Loop Validation on Women's Health Topics
Sneha Maurya,Pragya Saboo,Girish Kumar
Main category: cs.CL
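TL;DR: This paper introduces the Women's Health Benchmark (WHBench) for evaluating the clinical accuracy, safety, and equity of large language models on women's health topics. Existing models perform poorly overall, with no model averaging above 75%, underscoring that AI applications in this area still need expert oversight and improvement.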
Details
Motivation: Women's health is severely under-evaluated in current medical AI benchmarks, which lack dedicated tools for clinically critical failure modes such as outdated guidelines, dosing errors, and equity blind spots.
Method: Builds WHBench with 47 expert-crafted scenarios across 10 women's health topics; applies a 23-criterion rubric (clinical accuracy, safety, communication quality, equity, and more) with safety-weighted penalties and server-side score recalculation; annotates and analyzes 3,102 responses from 22 models with dual annotation.
Result: Every model averages below 75%, with the best reaching only 72.1%; fully correct rates are low and harm rates vary substantially; inter-rater reliability is moderate at the response-label level and high at the model-ranking level.
Conclusion: WHBench is a public, failure-mode-focused benchmark that supports comparative evaluation of women's health AI systems; current models do not yet meet the safety and equity requirements for clinical deployment and need continued expert oversight and targeted improvement.
Abstract: Large language models are increasingly used for medical guidance, but women's health remains under-evaluated in benchmark design. We present the Women's Health Benchmark (WHBench), a targeted evaluation suite of 47 expert-crafted scenarios across 10 women's health topics, designed to expose clinically meaningful failure modes including outdated guidelines, unsafe omissions, dosing errors, and equity-related blind spots. We evaluate 22 models using a 23-criterion rubric spanning clinical accuracy, completeness, safety, communication quality, instruction following, equity, uncertainty handling, and guideline adherence, with safety-weighted penalties and server-side score recalculation. Across 3,102 attempted responses (3,100 scored), no model mean performance exceeds 75 percent; the best model reaches 72.1 percent. Even top models show low fully correct rates and substantial variation in harm rates. Inter-rater reliability is moderate at the response label level but high for model ranking, supporting WHBench utility for comparative system evaluation while highlighting the need for expert oversight in clinical deployment. WHBench provides a public, failure-mode-aware benchmark to track safer and more equitable progress in women's health AI.
[20] Brevity Constraints Reverse Performance Hierarchies in Language Models
MD Azizul Hakim
Main category: cs.CL
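TL;DR: This paper finds that large language models underperform smaller ones on some benchmark problems because scale-dependent verbosity introduces errors. Constraining response length substantially improves large-model performance, indicating that their latent capabilities exceed those of small models and that the key lies in scale-aware prompt engineering, not universal evaluation protocols.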
Details
Motivation: To explain the counterintuitive finding that large language models score below smaller ones under standard evaluation, and to identify its root cause.
Method: Systematically evaluates 31 language models of different sizes (0.5B-405B parameters) on 1,485 problems, and runs causal intervention experiments (such as imposing brevity constraints) along with several contamination tests.
Result: Brevity constraints raise large-model accuracy by 26 percentage points and narrow performance gaps by up to two-thirds; on mathematical and scientific reasoning tasks the hierarchy reverses, with large models leading small models by 7.7-15.9 percentage points; inverse scaling operates continuously across the full parameter spectrum, and each dataset has its own optimal model scale.
Conclusion: Large models have stronger latent capabilities that universal prompting masks; scale-aware prompt engineering is the key to unlocking them, improving accuracy while also reducing computational cost.
Abstract: Standard evaluation protocols reveal a counterintuitive phenomenon: on 7.7% of benchmark problems spanning five datasets, larger language models underperform smaller ones by 28.4 percentage points despite 10-100x more parameters. Through systematic evaluation of 31 models (0.5B-405B parameters) across 1,485 problems, we identify the mechanism as spontaneous scale-dependent verbosity that introduces errors through overelaboration. Causal intervention experiments demonstrate this reflects correctable prompt design rather than fundamental capability limitations. Constraining large models to produce brief responses improves accuracy by 26 percentage points and reduces performance gaps by up to two-thirds. Most critically, brevity constraints completely reverse performance hierarchies on mathematical reasoning and scientific knowledge benchmarks, with large models achieving 7.7-15.9 percentage point advantages over small models -- direct inversions of the original gaps. These reversals prove large models possess superior latent capabilities that universal prompting masks. We validate findings through three independent contamination tests and demonstrate inverse scaling operates continuously across the full parameter spectrum, with dataset-specific optimal scales ranging from 0.5B to 3.0B parameters. Our results establish that maximizing large model performance requires scale-aware prompt engineering rather than universal evaluation protocols, with immediate implications for deployment: prompt adaptation simultaneously improves accuracy and reduces computational costs.
[21] "Who Am I, and Who Else Is Here?" Behavioral Differentiation Without Role Assignment in Multi-Agent LLM Systems
Houssam EL Kandoussi
Main category: cs.CL
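TL;DR: This paper studies whether multiple large language models (LLMs) in a shared conversation develop differentiated social roles or converge toward uniform behavior. A controlled experimental platform has 7 heterogeneous LLMs discuss together on a unified backend while group composition, naming conventions, and prompt structure are systematically varied; messages are coded independently by two LLM judges and validated by humans. Heterogeneous groups show markedly stronger behavioral differentiation, along with spontaneous compensation, naming effects on convergence, and a critical role for prompt scaffolding, indicating that behavioral diversity arises from the combination of interaction, architectural heterogeneity, and prompt design.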
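The cosine-similarity figures reported in this entry compare behavioral profiles across agents. A minimal sketch of that kind of measurement, assuming each agent is summarized as a vector of per-flag rates (the profile layout and the averaging over all agent pairs are assumptions made here for illustration, not details confirmed by the source):

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise_cosine(profiles):
    """Average cosine similarity over all agent pairs.

    profiles maps a (hypothetical) agent name to its behavioral profile,
    e.g. the fraction of its messages carrying each behavioral flag.
    Requires at least two agents.
    """
    pairs = list(combinations(profiles.values(), 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Under this reading, a lower mean pairwise similarity corresponds to stronger behavioral differentiation within the group.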
Details
Motivation: To determine whether social role differentiation emerges spontaneously in multi-LLM conversations, as opposed to simple convergence, and thereby understand emergent behavior in multi-agent interaction.
Method: Builds a controlled multi-agent platform that orchestrates simultaneous discussions among 7 heterogeneous LLMs; designs 12 experimental series (208 runs, 13,786 messages); each message is independently labeled on six behavioral flags by Gemini 3.1 Pro and Claude Sonnet 4.6 with conservative intersection-based adjudication; human validation covers 609 sampled messages; cosine similarity and statistical tests (p-values, effect size r) quantify the degree of behavioral differentiation.
Result: (1) Heterogeneous groups differentiate significantly more than homogeneous groups (cosine similarity 0.56 vs. 0.85, p < 10^-5, r = 0.70); (2) groups show spontaneous compensatory responses when a single agent crashes; (3) revealing real model names significantly increases behavioral convergence (0.56 to 0.77, p = 0.001); (4) removing all prompt scaffolding raises behavioral similarity to the homogeneous-group level (p < 0.001); none of these effects appear when agents run in isolation.
Conclusion: LLMs exhibit structured behavioral diversity in multi-agent interaction, and it depends on architectural heterogeneity, group context, and prompt-level scaffolding instead of being an intrinsic property of individual models; prompt engineering and group configuration play a key role in shaping AI social behavior.
Abstract: When multiple large language models interact in a shared conversation, do they develop differentiated social roles or converge toward uniform behavior? We present a controlled experimental platform that orchestrates simultaneous multi-agent discussions among 7 heterogeneous LLMs on a unified inference backend, systematically varying group composition, naming conventions, and prompt structure across 12 experimental series (208 runs, 13,786 coded messages). Each message is independently coded on six behavioral flags by two LLM judges from distinct model families (Gemini 3.1 Pro and Claude Sonnet 4.6), achieving mean Cohen's kappa = 0.78 with conservative intersection-based adjudication. Human validation on 609 randomly stratified messages confirmed coding reliability (mean kappa = 0.73 vs. Gemini). We find that (1) heterogeneous groups exhibit significantly richer behavioral differentiation than homogeneous groups (cosine similarity 0.56 vs. 0.85; p < 10^-5, r = 0.70); (2) groups spontaneously exhibit compensatory response patterns when an agent crashes; (3) revealing real model names significantly increases behavioral convergence (cosine 0.56 to 0.77, p = 0.001); and (4) removing all prompt scaffolding converges profiles to homogeneous-level similarity (p < 0.001). Critically, these behaviors are absent when agents operate in isolation, confirming that behavioral diversity is a structured, reproducible phenomenon driven by the interaction of architectural heterogeneity, group context, and prompt-level scaffolding.
[22] Multi-lingual Multi-institutional Electronic Health Record based Predictive Model
Kyunghoon Hur,Heeyoung Kwak,Jinsu Jang,Nakhwan Kim,Edward Choi
Main category: cs.CL
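TL;DR: This paper proposes a unified text-based framework for prediction over multilingual, multi-institutional ICU electronic health records (EHRs). LLM-driven word-level translation aligns the languages and substantially improves cross-dataset generalization, outperforming multilingual encoders and traditional approaches that require manual standardization.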
Details
Motivation: Large-scale cross-institutional EHR prediction is limited by heterogeneous schemas and code systems, and multinational data adds language heterogeneity; a scalable solution that needs no manual standardization is required.
Method: Compares two strategies for handling language barriers: (i) modeling multilingual EHR text directly with multilingual encoders; (ii) using LLM-based word-level translation to convert non-English records into English. Performance is evaluated on 7 public ICU datasets and 10 clinical tasks, and transfer under few-shot fine-tuning is examined.
Result: Translation-based alignment gives more stable and reliable cross-dataset performance; the model outperforms strong baselines that require manual feature selection and mapping, as well as single-dataset training; the text-based framework supports effective few-shot transfer learning.
Conclusion: Presented as the first study to pool multilingual, multinational ICU EHR data into one model, offering a scalable path toward language-agnostic clinical prediction and global EHR research.
Abstract: Large-scale EHR prediction across institutions is hindered by substantial heterogeneity in schemas and code systems. Although Common Data Models (CDMs) can standardize records for multi-institutional learning, the manual harmonization and vocabulary mapping are costly and difficult to scale. Text-based harmonization provides an alternative by converting raw EHR into a unified textual form, enabling pooled learning without explicit standardization. However, applying this paradigm to multi-national datasets introduces an additional layer of heterogeneity, namely language, which must be addressed for truly scalable EHR learning. In this work, we investigate multilingual multi-institutional learning for EHR prediction, aiming to enable pooled training across multinational ICU datasets without manual standardization. We compare two practical strategies for handling language barriers: (i) directly modeling multilingual records with multilingual encoders, and (ii) translating non-English records into English via LLM-based word-level translation. Across seven public ICU datasets and ten clinical tasks with multiple prediction windows, translation-based lingual alignment yields more reliable cross-dataset performance than multilingual encoders. The multi-institutional learning model consistently outperforms strong baselines that require manual feature selection and harmonization, and also surpasses single-dataset training. We further demonstrate that the text-based framework with lingual alignment effectively performs transfer learning via few-shot fine-tuning, with additional gains. To our knowledge, this is the first study to aggregate multilingual multinational ICU EHR datasets into one predictive model, providing a scalable path toward language-agnostic clinical prediction and future global multi-institutional EHR research.
[23] Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency
Xingshuai Huang,Derek Li,Bahareh Nikpour,Parsa Omidi
Main category: cs.CL
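TL;DR: This paper proposes Hierarchical Chain-of-Thought (Hi-CoT) prompting, which decomposes reasoning into hierarchical substeps that alternate between instructional planning and step-by-step execution, improving the accuracy and efficiency of large language models on complex multi-step reasoning tasks.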
Details
Motivation: Conventional Chain-of-Thought (CoT) prompting suffers from redundancy, weak logical coherence, and difficulty managing long reasoning horizons, which limits it on complex multi-step reasoning tasks.
Method: Proposes the Hi-CoT prompting paradigm, which structures reasoning into alternating instructional planning and execution substeps to achieve a hierarchical decomposition.
Result: Across multiple LLMs and mathematical reasoning benchmarks, Hi-CoT improves average accuracy by 6.2% (up to 61.4%) and shortens reasoning traces by 13.9%; results are best when the hierarchical structure is strictly followed.
Conclusion: Hi-CoT is a more efficient and robust structured prompting method that notably improves the performance and interpretability of LLMs on complex reasoning tasks.
Abstract: Chain-of-Thought (CoT) prompting has significantly improved the reasoning capabilities of large language models (LLMs). However, conventional CoT often relies on unstructured, flat reasoning chains that suffer from redundancy and suboptimal performance. In this work, we introduce Hierarchical Chain-of-Thought (Hi-CoT) prompting, a structured reasoning paradigm specifically designed to address the challenges of complex, multi-step reasoning. Hi-CoT decomposes the reasoning process into hierarchical substeps by alternating between instructional planning and step-by-step execution. This decomposition enables LLMs to better manage long reasoning horizons and maintain logical coherence. Extensive evaluations across diverse LLMs and mathematical reasoning benchmarks show that Hi-CoT consistently improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while reducing reasoning trace length by 13.9% compared to CoT prompting. We further show that accuracy and efficiency are maximized when models strictly adhere to the hierarchical structure. Our code is available at https://github.com/XingshuaiHuang/Hi-CoT.
[24] Oblivion: Self-Adaptive Agentic Memory Control through Decay-Driven Activation
Ashish Rana,Chia-Chien Hung,Qumeng Sun,Julian Martin Kunkel,Carolin Lawrence
Main category: cs.CL
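TL;DR: This paper proposes Oblivion, a framework that models forgetting as decay in accessibility instead of explicit deletion. By decoupling the read and write paths it achieves human-like memory control, improving the reasoning efficiency and adaptability of LLM agents in long-horizon interaction.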
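To make the idea of forgetting as reduced accessibility (with nothing deleted) concrete, here is a toy sketch. Everything in it is an assumption for illustration: the exponential decay form, the `decay_rate`, `threshold`, and `boost` values, and the class and function names are hypothetical and are not taken from the Oblivion code base.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    strength: float = 1.0
    last_used: float = 0.0

    def activation(self, now, decay_rate=0.1):
        # Assumed exponential decay since last use; the paper's formula may differ.
        return self.strength * math.exp(-decay_rate * (now - self.last_used))


def accessible(items, now, threshold=0.5):
    """Read path (toy version): only items above the threshold are retrievable.
    Items below it stay stored and can become accessible again."""
    return [m for m in items if m.activation(now) >= threshold]


def reinforce(item, now, boost=0.5):
    """Write path (toy version): strengthen a memory that contributed to a response."""
    item.strength = item.activation(now) + boost
    item.last_used = now
```

In this toy model an item that has decayed below the threshold is never removed; a later call to `reinforce` can bring it back above the threshold.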
Details
Motivation: Existing memory-augmented LLM agents use "always-on" retrieval and "flat" memory storage, which causes high interference and latency; human memory instead forgets selectively, adjusting accessibility dynamically according to uncertainty and context.
Method: Oblivion defines forgetting as a decay-driven reduction in accessibility and decouples the memory read path (deciding whether to query based on agent uncertainty and buffer sufficiency) from the write path (reinforcing memories that contributed to the response), supporting hierarchical memory organization.
Result: Evaluated on static and dynamic long-horizon interaction benchmarks: Oblivion adapts memory access and reinforcement dynamically, balances learning and forgetting in changing environments, reduces interference and latency, and improves reasoning performance.
Conclusion: Memory control, and human-like forgetting in particular, is key to long-horizon reasoning in LLM agents; Oblivion offers a new paradigm for building more efficient, adaptive memory-augmented agents.
Abstract: Human memory adapts through selective forgetting: experiences become less accessible over time but can be reactivated by reinforcement or contextual cues. In contrast, memory-augmented LLM agents rely on "always-on" retrieval and "flat" memory storage, causing high interference and latency as histories grow. We introduce Oblivion, a memory control framework that casts forgetting as decay-driven reductions in accessibility, not explicit deletion. Oblivion decouples memory control into read and write paths. The read path decides when to consult memory, based on agent uncertainty and memory buffer sufficiency, avoiding redundant always-on access. The write path decides what to strengthen, by reinforcing memories contributing to forming the response. Together, this enables hierarchical memory organization that maintains persistent high-level strategies while dynamically loading details as needed. We evaluate on both static and dynamic long-horizon interaction benchmarks. Results show that Oblivion dynamically adapts memory access and reinforcement, balancing learning and forgetting under shifting contexts, highlighting that memory control is essential for effective LLM-agentic reasoning. The source code is available at https://github.com/nec-research/oblivion.
[25] Polish phonology and morphology through the lens of distributional semantics
Paula Orzechowska,R. Harald Baayen
Main category: cs.CL
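TL;DR: This study uses distributional semantics to examine the relationship between the form of Polish words (phonological and morphological structure) and their meaning. Semantic vectors encode not only morphosyntactic information but also structural properties of sub-lexical phonological units such as consonant clusters, and the semantic space shows considerable isomorphism with the form space.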
Details
Motivation: To examine whether complex phonological and morphological structures in Polish, such as consonant clusters, are mirrored in semantic space, and whether form and meaning are systematically related.
Method: Applies t-SNE, Linear Discriminant Analysis (LDA), and Linear Discriminative Learning (LDL) to word embeddings to test how well the phonotactic complexity, morphotactic transparency, and morphosyntactic categories of complex Polish words can be predicted.
Result: Semantic vectors predict phonotactic complexity, morphotactic transparency, and morphosyntactic categories (case, gender, aspect, tense, number) without any information about word forms; the discriminative lexicon model using such embeddings gives highly accurate predictions for comprehension and production.
Conclusion: Polish semantic space encodes rich sub-lexical form information, and semantic and form structure are to a considerable extent isomorphic, supporting the view that form and meaning are tightly coupled in distributed representations.
Abstract: This study investigates the relationship between the phonological and morphological structure of Polish words and their meanings using Distributional Semantics. In the present analysis, we ask whether there is a relationship between the form properties of words containing consonant clusters and their meanings. Is the phonological and morphonological structure of complex words mirrored in semantic space? We address these questions for Polish, a language characterized by non-trivial morphology and an impressive inventory of morphologically-motivated consonant clusters. We use statistical and computational techniques, such as t-SNE, Linear Discriminant Analysis and Linear Discriminative Learning, and demonstrate that -- apart from encoding rich morphosyntactic information (e.g. tense, number, case) -- semantic vectors capture information on sub-lexical linguistic units such as phoneme strings. First, phonotactic complexity, morphotactic transparency, and a wide range of morphosyntactic categories available in Polish (case, gender, aspect, tense, number) can be predicted from embeddings without requiring any information about the forms of words. Second, we argue that computational modelling with the discriminative lexicon model using embeddings can provide highly accurate predictions for comprehension and production, exactly because of the existence of extensive information in semantic space that is to a considerable extent isomorphic with structure in the form space.
[26] Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations
Haoran Wang,Li Xiong,Kai Shu
Main category: cs.CL
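TL;DR: This paper presents the first systematic study of how contextual privacy norms are implicitly represented in large language models (LLMs). Grounded in contextual integrity (CI) theory, it finds that the three CI parameters are linearly separable and functionally independent in activation space. Models still leak private information despite this structured representation, which points to a misalignment between representation and behavior. The proposed CI-parametric steering intervenes on each CI dimension separately and makes privacy control markedly more effective and predictable.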
Details
Motivation: LLMs are increasingly deployed in high-stakes settings yet often violate contextual privacy, for example by improperly disclosing private information; the core question is whether LLMs internally encode contextual privacy norms and, if so, why violations persist.
Method: Grounded in contextual integrity (CI) theory, probes multiple LLMs to test whether the three CI parameters (information type, recipient, transmission principle) exist as linearly separable, functionally independent directions in activation space; then proposes CI-parametric steering to intervene on each CI dimension independently.
Result: The three CI parameters are indeed encoded as linearly separable, functionally independent directions; model behavior still leaks private information frequently, revealing a clear gap between having the representation and behaving compliantly; CI-parametric steering reduces privacy violations more effectively and predictably than monolithic steering.
Conclusion: Contextual privacy failures in LLMs stem from misalignment between representation and behavior, not from missing privacy awareness; exploiting the compositional structure of CI enables more reliable, structured privacy control and suggests a path to improving contextual privacy understanding in LLMs.
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes settings, yet they frequently violate contextual privacy by disclosing private information in situations where humans would exercise discretion. This raises a fundamental question: do LLMs internally encode contextual privacy norms, and if so, why do violations persist? We present the first systematic study of contextual privacy as a structured latent representation in LLMs, grounded in contextual integrity (CI) theory. Probing multiple models, we find that the three norm-determining CI parameters (information type, recipient, and transmission principle) are encoded as linearly separable and functionally independent directions in activation space. Despite this internal structure, models still leak private information in practice, revealing a clear gap between concept representation and model behavior. To bridge this gap, we introduce CI-parametric steering, which independently intervenes along each CI dimension. This structured control reduces privacy violations more effectively and predictably than monolithic steering. Our results demonstrate that contextual privacy failures arise from misalignment between representation and behavior rather than missing awareness, and that leveraging the compositional structure of CI enables more reliable contextual privacy control, shedding light on potential improvement of contextual privacy understanding in LLMs.
[27] Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries
Tanay Gondil
Main category: cs.CL
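TL;DR: This paper studies whether large language models can predict their own refusal behavior before responding. All models show high introspective sensitivity, but sensitivity drops markedly at safety boundaries; models differ in accuracy, behavioral variability, and calibration; high-confidence predictions reach 98.3% accuracy, supporting confidence-based routing in safety-critical settings.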
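The sensitivity index d' from signal detection theory, which this entry reports, is the difference between the z-transformed hit rate and false-alarm rate. A minimal sketch follows; mapping "hit" to "predicted refusal and actually refused" and "false alarm" to "predicted refusal but actually complied" is an assumption made here, as is the log-linear correction, since the digest does not state how the authors handled either.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Assumed mapping for the refusal setting:
      hit               = predicted refusal, actually refused
      miss              = predicted compliance, actually refused
      false alarm       = predicted refusal, actually complied
      correct rejection = predicted compliance, actually complied
    """
    # Log-linear correction (an assumption) keeps both rates strictly
    # between 0 and 1, where the inverse normal CDF is defined.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)
```

A d' of 0 means predictions carry no information about actual behavior; larger values mean the model separates "will refuse" from "will comply" more cleanly.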
Details
Motivation: Large language models are trained to refuse harmful requests, but it is unclear whether they can accurately predict their own refusals before actually responding, a capability that matters for the safety and controllability of AI systems.
Method: In a systematic experiment, each model first predicts whether it will refuse a request and then responds in a fresh context; 3,754 datapoints over 300 requests are collected for four frontier models; signal detection theory (SDT) quantifies introspective sensitivity (d'), complemented by accuracy, bias, calibration, and topic-wise analysis.
Result: All models show high introspective sensitivity (d' = 2.4-3.5), with a clear drop at safety boundaries; Claude Sonnet 4.5 (95.7%) outperforms Sonnet 4 (93.0%); GPT-5.2 has lower accuracy (88.9%) and more variable behavior; Llama 3.1 405B has high sensitivity but strong refusal bias and poor calibration (accuracy only 80.0%); weapons-related queries are hardest for introspection; high-confidence predictions reach 98.3% accuracy.
Conclusion: Models have some introspective ability, but it is limited at safety boundaries and varies across models; confidence is a strong predictive signal that can support more reliable safety mechanisms such as confidence-based routing.
Abstract: Large language models are trained to refuse harmful requests, but can they accurately predict when they will refuse before responding? We investigate this question through a systematic study where models first predict their refusal behavior, then respond in a fresh context. Across 3754 datapoints spanning 300 requests, we evaluate four frontier models: Claude Sonnet 4, Claude Sonnet 4.5, GPT-5.2, and Llama 3.1 405B. Using signal detection theory (SDT), we find that all models exhibit high introspective sensitivity (d' = 2.4-3.5), but sensitivity drops substantially at safety boundaries. We observe generational improvement within Claude (Sonnet 4.5: 95.7 percent accuracy vs Sonnet 4: 93.0 percent), while GPT-5.2 shows lower accuracy (88.9 percent) with more variable behavior. Llama 405B achieves high sensitivity but exhibits strong refusal bias and poor calibration, resulting in lower overall accuracy (80.0 percent). Topic-wise analysis reveals weapons-related queries are consistently hardest for introspection. Critically, confidence scores provide actionable signal: restricting to high-confidence predictions yields 98.3 percent accuracy for well-calibrated models, enabling practical confidence-based routing for safety-critical deployments.
[28] A Taxonomy of Programming Languages for Code Generation
Nishat Raihan,Christian Newman,Marcos Zampieri
Main category: cs.CL
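TL;DR: This paper proposes the first reproducible resource taxonomy for programming languages, grouping 646 languages into four resource tiers, and shows that the distribution of resources across programming languages is extremely and systematically imbalanced.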
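A tiering scheme based on each language's share of corpus tokens could look roughly like the sketch below. The cut-off values are placeholders invented for illustration; the digest does not give the paper's actual thresholds, and only the Tier 3 (High) and Tier 0 (Scarce) labels come from the source.

```python
def token_shares(token_counts):
    """Convert raw per-language token counts into fractions of the total."""
    total = sum(token_counts.values())
    return {lang: n / total for lang, n in token_counts.items()}


# Hypothetical cut-offs on token share (NOT the paper's thresholds),
# checked from the highest tier downward.
TIER_CUTOFFS = [(3, 0.01), (2, 0.001), (1, 0.0001)]


def assign_tier(share):
    """Return the tier (3 = High ... 0 = Scarce) for a given token share."""
    for tier, cutoff in TIER_CUTOFFS:
        if share >= cutoff:
            return tier
    return 0


def tier_table(token_counts):
    """Map each language to its tier under the assumed cut-offs."""
    return {lang: assign_tier(s) for lang, s in token_shares(token_counts).items()}
```

With real corpus counts, summing the shares within each tier would reproduce the kind of concentration statistic the entry reports (a small fraction of languages holding most tokens).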
Details
Motivation: Natural language processing already classifies languages by resource availability, but no comparable resource-tier taxonomy exists for programming languages; as large language models become more capable at code generation, such a taxonomy becomes essential.
Method: Analyzes 646 programming languages across seven major corpora and builds a four-tier resource classification from each language's share of tokens; statistical analyses (within-tier inequality, dispersion, and distributional skew) test whether the imbalance is systematic.
Result: Only 1.9% of languages (Tier 3, High) contribute 74.6% of tokens, while 71.7% of languages (Tier 0, Scarce) contribute just 1.0%; the distribution is extremely and systematically imbalanced.
Conclusion: The taxonomy provides a principled framework for dataset construction and evaluation of multilingual LLMs, supporting tier-aware evaluation and data curation.
Abstract: The world's 7,000+ languages vary widely in the availability of resources for NLP, motivating efforts to systematically categorize them by their degree of resourcefulness (Joshi et al., 2020). A similar disparity exists among programming languages (PLs); however, no resource-tier taxonomy has been established for code. As large language models (LLMs) grow increasingly capable of generating code, such a taxonomy becomes essential. To fill this gap, we present the first reproducible PL resource classification, grouping 646 languages into four tiers. We show that only 1.9% of languages (Tier 3, High) account for 74.6% of all tokens in seven major corpora, while 71.7% of languages (Tier 0, Scarce) contribute just 1.0%. Statistical analyses of within-tier inequality, dispersion, and distributional skew confirm that this imbalance is both extreme and systematic. Our results provide a principled framework for dataset curation and tier-aware evaluation of multilingual LLMs.
[29] REM-CTX: Automated Peer Review via Reinforcement Learning with Auxiliary Context
Pawin Taechoyotin,Daniel E. Acuna
Main category: cs.CL
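TL;DR: This paper proposes REM-CTX, a reinforcement-learning system for peer review generation that uses correspondence-aware reward functions to bring auxiliary context such as figures into the review process. In experiments across several scientific fields it clearly outperforms existing baselines, and the analysis highlights the importance of jointly optimizing multi-dimensional rewards.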
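As a rough sketch of how a quality reward and two correspondence rewards might feed a GRPO-style update: each sampled review gets a scalar reward, and its advantage is that reward standardized against the other samples for the same manuscript. The weighted-sum combination and the weight values are assumptions made here for illustration; the group-mean/standard-deviation normalization follows the commonly described GRPO formulation, not code from this paper.

```python
from statistics import mean, pstdev


def combined_reward(quality, corr_a, corr_b, w_quality=1.0, w_a=0.5, w_b=0.5):
    """Hypothetical weighted sum of a quality reward and two correspondence
    rewards (one per type of auxiliary context). Weights are placeholders."""
    return w_quality * quality + w_a * corr_a + w_b * corr_b


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled outputs.

    Samples scoring above the group mean get positive advantages and those
    below get negative ones; eps avoids division by zero when all rewards
    in the group are equal.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, `group_relative_advantages([combined_reward(q, a, b) for q, a, b in samples])` would give one advantage per sampled review, where `samples` is a hypothetical list of reward triples.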
Details
Motivation: Most existing automated peer review systems rely on text alone and neglect auxiliary context such as figures and external scholarly signals, so reviews lack comprehensiveness and contextual grounding.
Method: Proposes REM-CTX, which trains an 8B-parameter language model with Group Relative Policy Optimization (GRPO), jointly optimizing a multi-aspect quality reward and two correspondence rewards, one for each type of auxiliary context.
Result: On manuscripts from Computer, Biological, and Physical Sciences, REM-CTX achieves higher overall review quality than six baselines (including larger commercial models) and surpasses the next-best RL baseline on both quality and contextual grounding metrics; ablations show that the two correspondence rewards are complementary; training dynamics show that the criticism aspect is negatively correlated with other metrics.
Conclusion: Correspondence rewards that are aware of auxiliary context improve the quality and grounding of generated reviews; multi-dimensional rewards should be grouped during optimization to avoid conflicts; REM-CTX offers a new paradigm for context-augmented review generation.
Abstract: Most automated peer review systems rely on textual manuscript content alone, leaving visual elements such as figures and external scholarly signals underutilized. We introduce REM-CTX, a reinforcement-learning system that incorporates auxiliary context into the review generation process via correspondence-aware reward functions. REM-CTX trains an 8B-parameter language model with Group Relative Policy Optimization (GRPO) and combines a multi-aspect quality reward with two correspondence rewards that explicitly encourage alignment with auxiliary context. Experiments on manuscripts across Computer, Biological, and Physical Sciences show that REM-CTX achieves the highest overall review quality among six baselines, outperforming other systems with substantially larger commercial models, and surpassing the next-best RL baseline across both quality and contextual grounding metrics. Ablation studies confirm that the two correspondence rewards are complementary: each selectively improves its targeted correspondence reward while preserving all quality dimensions, and the full model outperforms all partial variants. Analysis of training dynamics reveals that the criticism aspect is negatively correlated with other metrics during training, suggesting that future studies should group multi-dimension rewards for review generation.
[30] LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Filip J. Kucia,Anirban Chakraborty,Anna Wróblewska
Main category: cs.CL
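TL;DR: This paper systematically evaluates instruction-tuned large language models (LLMs) on open essay-scoring datasets. Agreement with human raters is moderate to high on holistic scoring, but Lower-Order Concern traits (such as grammar and conventions) show a stable, substantial negative bias. The study recommends a bias-correction-first strategy that estimates and corrects systematic bias from a small human-labeled set, instead of relying on raw zero-shot scores or large-scale fine-tuning.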
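The sample-size analysis in this entry asks how many labeled essays are needed before a 95% bootstrap confidence interval for the mean bias excludes zero. A minimal sketch of that procedure, assuming bias is measured as model score minus human score per essay and that essays are taken in their given order (the percentile bootstrap, the resample count, and the function names are choices made here, not details from the paper):

```python
import random


def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of diffs (model minus human scores)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = sorted(
        sum(rng.choices(diffs, k=len(diffs))) / len(diffs) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def min_detectable_n(diffs, start=5):
    """Smallest prefix length n at which the CI for the mean bias excludes zero.

    Returns None if no prefix of the data yields an interval excluding zero.
    """
    for n in range(start, len(diffs) + 1):
        lo, hi = bootstrap_ci(diffs[:n])
        if lo > 0 or hi < 0:
            return n
    return None
```

Once a stable offset is detected, the bias-correction-first strategy described in the entry would amount to subtracting the estimated mean bias from the model's raw scores for that trait.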
Details
Motivation: Despite growing interest in LLMs for educational assessment, how closely they agree with human scoring remains unclear and calls for systematic empirical evaluation.
Method: Evaluates instruction-tuned LLMs on holistic and analytic scoring across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, DREsS); analyzes human-model agreement (Quadratic Weighted Kappa), directional bias, and bias stability; compares prompt templates; uses bootstrapping to compute the minimum sample size needed to detect bias.
Result: Strong open-weight models reach moderate to high agreement with humans on holistic scoring (QWK about 0.6) but show large, stable negative bias on Lower-Order Concern (LOC) traits; concise keyword-based prompts outperform long rubric-style prompts; LOC bias can be reliably detected with very small validation sets (on the order of dozens of essays), whereas Higher-Order Concern (HOC) traits usually need larger samples.
Conclusion: A bias-correction-first deployment strategy is recommended: estimate and correct systematic bias with a small human-labeled set, which improves the fairness and reliability of LLM scoring without large-scale fine-tuning.
Abstract: Despite growing interest in using Large Language Models (LLMs) for educational assessment, it remains unclear how closely they align with human scoring. We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring. We analyze agreement with human consensus scores, directional bias, and the stability of bias estimates. Our results show that strong open-weight models achieve moderate to high agreement with humans on holistic scoring (Quadratic Weighted Kappa about 0.6), but this does not transfer uniformly to analytic scoring. In particular, we observe large and stable negative directional bias on Lower-Order Concern (LOC) traits, such as Grammar and Conventions, meaning that models often score these traits more harshly than human raters. We also find that concise keyword-based prompts generally outperform longer rubric-style prompts in multi-trait analytic scoring. To quantify the amount of data needed to detect these systematic deviations, we compute the minimum sample size at which a 95% bootstrap confidence interval for the mean bias excludes zero. This analysis shows that LOC bias is often detectable with very small validation sets, whereas Higher-Order Concern (HOC) traits typically require much larger samples. These findings support a bias-correction-first deployment strategy: instead of relying on raw zero-shot scores, systematic score offsets can be estimated and corrected using small human-labeled bias-estimation sets, without requiring large-scale fine-tuning.
[31] Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study
Zaifu Zhan,Mengyuan Cui,Rui Zhang
Main category: cs.CL
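TL;DR: This paper examines the effectiveness of self-reflective prompting in medical multiple-choice question answering. Its effect varies by dataset and model and does not always improve accuracy, so it is better suited as a tool for analyzing model behavior than as a means of improving reliability.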
Details
Motivation: Self-reflective prompting is widely claimed to improve LLM reliability in safety-critical settings such as medicine, but its actual effectiveness in medical QA remains unclear.
Method: On three medical QA benchmarks (MedQA, HeadQA, PubMedQA), using GPT-4o and GPT-4o-mini, compares standard chain-of-thought (CoT) prompting with iterative self-reflective prompting, tracks how predictions change across reflection steps, and analyzes whether errors are corrected, persist, or are newly introduced.
Result: Self-reflective prompting does not consistently improve accuracy: gains are modest on MedQA and limited or negative on HeadQA and PubMedQA; more reflection steps do not guarantee better performance.
Conclusion: Transparency of reasoning (as in self-reflection) is not the same as correctness of reasoning; self-reflective reasoning is better suited to understanding model behavior than to directly improving medical QA reliability.
Abstract: Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile, self-reflective (self-corrective) prompting has been widely claimed to enhance model reliability by prompting LLMs to critique and revise their own reasoning, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks (MedQA, HeadQA, and PubMedQA). We analyze whether self-reflection leads to error correction, error persistence, or the introduction of new errors. Our results show that self-reflective prompting does not consistently improve accuracy and its impact is highly dataset- and model-dependent: it yields modest gains on MedQA but provides limited or negative benefits on HeadQA and PubMedQA, and increasing the number of reflection steps does not guarantee better performance. These findings highlight a gap between reasoning transparency and reasoning correctness, suggesting that self-reflective reasoning is better viewed as an analytical tool for understanding model behavior rather than a standalone solution for improving medical QA reliability.
[32] Frege in the Flesh: Biolinguistics and the Neural Enforcement of Syntactic Structures
Elliot Murphy
Main category: cs.CL
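TL;DR: This chapter sets out the core claims of biolinguistics, emphasizing that language is an innate biological organ rather than a cultural tool, and argues for mathematical and algebraic models (such as the MERGE operation) to characterize syntactic structure, thereby giving biology, genetics, and neuroscience testable theoretical guidance.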
Details
Motivation: To challenge the behaviorist view of language acquisition, promote a view of language as an innate cognitive system with a biological basis, and provide a formal theoretical framework for interdisciplinary research. Method: The argument proceeds in four logical steps: clarifying the object of inquiry for biolinguistics; showing how a formal view of syntax affects evolutionary explanation; identifying the constraints that algebraic syntactic theory places on neural mechanisms; and assessing how current neurocomputational work turns these constraints into empirically testable hypotheses. Result: Establishes the MERGE-centered algebraic model of syntax as a key bridge between formal linguistics and neurobiology, and sketches an empirically testable interdisciplinary research path. Conclusion: Biolinguistics should hold to formal, algebraic theory construction, which not only clarifies the nature of language but also gives neuroscience and genetics clear, actionable research instructions, although the current theory remains speculative and revisable. Abstract: Biolinguistics is the interdisciplinary scientific study of the biological foundations, evolution, and genetic basis of human language. It treats language as an innate biological organ or faculty of the mind, rather than a cultural tool, and it challenges a behaviorist conception of human language acquisition as being based on stimulus-response associations. Extracting its most essential component, it takes seriously the idea that mathematical, algebraic models of language capture something natural about the world. The syntactic structure-building operation of MERGE is thought to offer the scientific community a "real joint of nature", "a (new) aspect of nature" (Mukherji 2010), not merely a formal artefact. This mathematical theory of language is then seen as being able to offer biologists, geneticists and neuroscientists clearer instructions for how to explore language. The argument of this chapter proceeds in four steps. First, I clarify the object of inquiry for biolinguistics: not speech, communication, or generic sequence processing, but the internal computational system that generates hierarchically structured expressions. Second, I argue that this formal characterization matters for evolutionary explanation, because different conceptions of syntax imply different standards of what must be explained. Third, I suggest that a sufficiently explicit algebraic account of syntax places non-trivial constraints on candidate neural mechanisms. Finally, I consider how recent neurocomputational work begins to transform these constraints into empirically tractable hypotheses, while also noting the speculative and revisable character of the present program.
[33] Asymmetric Actor-Critic for Multi-turn LLM Agents
Shuli Jiang,Zhaoyang Zhang,Yi Zhang,Shuo Yang,Wei Xia,Stefano Soatto
Main category: cs.CL
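TL;DR: This paper proposes an asymmetric actor-critic framework in which a powerful proprietary LLM acts as the actor that generates responses and a lightweight open-source model acts as the critic that supervises and intervenes at runtime, without retries or modifications to the actor, significantly improving reliability and task success in multi-turn conversations.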
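The runtime supervision idea in this entry can be pictured with a small control loop. This is a minimal sketch under stated assumptions: the `Actor` and `Critic` callables, the `[critic]` message convention, and the `max_interventions` cap are all hypothetical stand-ins, since the summary only says that the critic monitors the actor and intervenes within the same interaction trajectory.

```python
from typing import Callable, List, Optional

Actor = Callable[[List[str]], str]                    # context -> proposed action
Critic = Callable[[List[str], str], Optional[str]]    # feedback, or None to approve

def supervised_turn(actor: Actor, critic: Critic, history: List[str],
                    max_interventions: int = 2) -> str:
    """Let a small critic vet the fixed actor's proposal before it is committed."""
    context = list(history)
    proposal = actor(context)
    for _ in range(max_interventions):
        feedback = critic(context, proposal)
        if feedback is None:            # critic approves the proposal
            break
        context.append(f"[critic] {feedback}")
        proposal = actor(context)       # actor revises inside the same trajectory
    return proposal                     # may be unvetted if the cap was reached

# Toy stand-ins for the proprietary actor and the open-source critic.
def toy_actor(context: List[str]) -> str:
    warned = any(msg.startswith("[critic]") for msg in context)
    return "refund order after confirmation" if warned else "refund order"

def toy_critic(context: List[str], proposal: str) -> Optional[str]:
    return None if "confirmation" in proposal else "confirm with the user first"

result = supervised_turn(toy_actor, toy_critic, ["user: please refund my order"])
```

The actor is never retrained or retried from scratch here; only its context changes, which is the property that makes the scheme compatible with a model that cannot be modified.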
Details
Motivation: Existing methods either rely on reflection or post-hoc evaluation (which require multiple attempts) or assume fully trainable models, making them hard to apply to proprietary LLMs that cannot be modified; yet real applications often require success in a single interaction. Method: Builds an asymmetric actor-critic framework with a fixed proprietary LLM as the actor and a lightweight open-source model as the critic; the critic monitors and intervenes in the actor's outputs at runtime; a data generation pipeline that does not modify the actor produces supervision for critic fine-tuning. Result: Significantly outperforms strong single-agent baselines on τ-bench and UserBench; lightweight open-source critics rival or surpass larger proprietary models in the critic role; critic fine-tuning yields further gains over several state-of-the-art methods. Conclusion: The generation-verification asymmetry can be exploited effectively; a fixed actor paired with a fine-tunable lightweight critic is an efficient and practical way to improve the reliability of proprietary LLM agents. Abstract: Large language models (LLMs) exhibit strong reasoning and conversational abilities, but ensuring reliable behavior in multi-turn interactions remains challenging. In many real-world applications, agents must succeed in one-shot settings where retries are impossible. Existing approaches either rely on reflection or post-hoc evaluation, which require additional attempts, or assume fully trainable models that cannot leverage proprietary LLMs. We propose an asymmetric actor-critic framework for reliable conversational agents. A powerful proprietary LLM acts as the actor, while a smaller open-source critic provides runtime supervision, monitoring the actor's actions and intervening within the same interaction trajectory. Unlike training-based actor-critic methods, our framework supervises a fixed actor operating in open-ended conversational environments. The design leverages a generation-verification asymmetry: while high-quality generation requires large models, effective oversight can often be achieved by smaller ones. We further introduce a data generation pipeline that produces supervision signals for critic fine-tuning without modifying the actor. Experiments on $τ$-bench and UserBench show that our approach significantly improves reliability and task success over strong single-agent baselines. Moreover, lightweight open-source critics rival or surpass larger proprietary models in the critic role, and critic fine-tuning yields additional gains over several state-of-the-art methods.
[34] Large Language Models in the Abuse Detection Pipeline
Suraj Kath,Sanket Badhe,Preet Shah,Ashwin Sampathkumar,Shivani Gupta
Main category: cs.CL
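TL;DR: This survey reviews how large language models (LLMs) are applied across the four stages of the Abuse Detection Lifecycle (ADL): label and feature generation, detection, review and appeals, and auditing and governance. It analyzes their strengths, limitations, and production deployment challenges, and identifies latency, cost, determinism, adversarial robustness, and fairness as open problems.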
Details
Motivation: Traditional machine-learning methods struggle with increasingly complex online abuse and changing policy requirements, while LLMs offer new capabilities in contextual reasoning, policy interpretation, explanation generation, and cross-modal understanding, which calls for a systematic account of their use across safety systems. Method: Uses a lifecycle-oriented analytical framework that divides abuse detection into four stages, surveys academic research and industry practice for each, analyzes architectural design considerations, and assesses the strengths and weaknesses of LLM-driven approaches. Result: Identifies how LLMs are applied at each ADL stage, the practical deployment challenges (such as latency, cost, robustness, and fairness), and current technical limitations; proposes that future research focus on improving the reliability and accountability of LLMs in large-scale safety systems. Conclusion: LLMs are poised to become core components of modern online safety systems, but reliable, auditable, and scalable deployment still requires substantial progress in engineering optimization, evaluation benchmarks, and governance mechanisms. Abstract: Online abuse has grown increasingly complex, spanning toxic language, harassment, manipulation, and fraudulent behavior. Traditional machine-learning approaches dependent on static classifiers and labor-intensive labeling struggle to keep pace with evolving threat patterns and nuanced policy requirements. Large Language Models introduce new capabilities for contextual reasoning, policy interpretation, explanation generation, and cross-modal understanding, enabling them to support multiple stages of modern safety systems. This survey provides a lifecycle-oriented analysis of how LLMs are being integrated into the Abuse Detection Lifecycle (ADL), which we define across four stages: (I) Label & Feature Generation, (II) Detection, (III) Review & Appeals, and (IV) Auditing & Governance. For each stage, we synthesize emerging research and industry practices, highlight architectural considerations for production deployment, and examine the strengths and limitations of LLM-driven approaches. We conclude by outlining key challenges, including latency, cost-efficiency, determinism, adversarial robustness, and fairness, and discuss future research directions needed to operationalize LLMs as reliable, accountable components of large-scale abuse-detection and governance systems.
[35] Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning
Eric Hanchen Jiang,Levina Li,Rui Sun,Xiao Liang,Yubei Li,Yuchen Wu,Haozheng Luo,Hengli Li,Zhi Zhang,Zhaolu Kang,Kai-Wei Chang,Ying Nian Wu
Main category: cs.CL
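TL;DR: This paper proposes Agent Q-Mix, a multi-agent reinforcement learning (MARL) framework that models communication topology selection in multi-agent systems as a cooperative optimization problem and uses QMIX value factorization to learn decentralized communication decisions, achieving higher accuracy and better token efficiency on multiple reasoning and coding benchmarks.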
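The property that makes QMIX-style factorization compatible with decentralized execution is monotonicity: if the joint value never decreases when any single agent's value increases, each agent's own greedy choice also maximizes the joint value. The toy below illustrates only that property. The fixed weights, the two Q-tables, and the action names are hypothetical; in QMIX proper the mixing weights are produced by a network conditioned on global state, which is not modeled here.

```python
from itertools import product

def mix(agent_qs, weights, bias=0.0):
    """Monotonic mixer: Q_tot = sum_i |w_i| * Q_i + b, so dQ_tot/dQ_i >= 0."""
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias

# Hypothetical communication actions and per-agent Q-values.
ACTIONS = ["silent", "broadcast", "query_neighbor"]
q_tables = [
    [0.1, 0.7, 0.4],   # agent 0
    [0.5, 0.2, 0.9],   # agent 1
]
weights = [-1.5, 0.3]  # abs() drops the sign, which enforces monotonicity

# Decentralized execution: every agent takes the argmax of its own Q-values.
greedy = tuple(max(range(len(ACTIONS)), key=q.__getitem__) for q in q_tables)

# Centralized check: brute-force the joint action that maximizes Q_tot.
best_joint = max(
    product(range(len(ACTIONS)), repeat=len(q_tables)),
    key=lambda joint: mix([q_tables[i][a] for i, a in enumerate(joint)], weights),
)
chosen = [ACTIONS[a] for a in greedy]
```

Here the per-agent choices and the brute-force joint optimum coincide, so the agents can pick communication actions independently at run time.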
Details
Motivation: Complex tasks require multiple agents to cooperate, but there is still no systematic way to select and connect them; existing frameworks mostly use fixed or heuristic topologies and lack adaptive, learnable mechanisms for optimizing communication structure. Method: Proposes the Agent Q-Mix framework: built on the CTDE paradigm, it combines a topology-aware graph neural network (GNN) encoder, a GRU memory module, and a separate Q-head per agent; QMIX performs joint value factorization over communication actions to build a round-wise communication graph dynamically; the reward function balances task accuracy against token consumption. Result: Achieves the highest average accuracy across seven coding, reasoning, and mathematics benchmarks; reaches 20.8% accuracy on Humanity's Last Exam (HLE), ahead of Microsoft Agent Framework (19.2%), LangGraph (19.2%), AutoGen, and Lobster; also shows better token efficiency and robustness to agent failure. Conclusion: Learned, decentralized optimization of communication topology can substantially improve the reasoning ability and resource efficiency of multi-agent systems, offering a new paradigm for building efficient and robust multi-agent LLM systems. Abstract: Large Language Models (LLMs) have shown remarkable performance in completing various tasks. However, solving complex problems often requires the coordination of multiple agents, raising a fundamental question: how to effectively select and interconnect these agents. In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. Our method learns decentralized communication decisions using QMIX value factorization, where each agent selects from a set of communication actions that jointly induce a round-wise communication graph. At its core, Agent Q-Mix combines a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under a Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy with token cost. Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy compared to existing methods while demonstrating superior token efficiency and robustness against agent failure. Notably, on the challenging Humanity's Last Exam (HLE) using Gemini-3.1-Flash-Lite as a backbone, Agent Q-Mix achieves 20.8% accuracy, outperforming Microsoft Agent Framework (19.2%) and LangGraph (19.2%), followed by AutoGen and Lobster by OpenClaw. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi-agent reasoning.
[36] Locally Confident, Globally Stuck: The Quality-Exploration Dilemma in Diffusion Language Models
Liancheng Fang,Aiwei Liu,Henry Peng Zou,Yankai Chen,Enze Ma,Leyi Pan,Chunyu Miao,Wei-Chieh Huang,Xue Liu,Philip S. Yu
Main category: cs.CL
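TL;DR: This paper proposes a new Independent Metropolis-Hastings sampling method for diffusion large language models (dLLMs) that balances generation quality against exploration of reasoning paths, overcoming the quality-exploration dilemma caused by low-confidence remasking.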
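For readers unfamiliar with the sampler named in this entry, here is a generic Independent Metropolis-Hastings loop over a small finite set. It is a textbook sketch, not the paper's decoder: the three candidate "sequences", the peaked proposal (standing in for a confident decoding rule), and the flatter target (standing in for a distribution that rewards exploration) are all invented for illustration.

```python
import random
from collections import Counter

def independent_mh(target, proposal, n_steps, rng):
    """Independent Metropolis-Hastings over a finite candidate set.

    target, proposal: dicts mapping candidate -> positive probability.
    Candidates are drawn from `proposal` regardless of the current state.
    """
    states = list(proposal)
    q_weights = [proposal[s] for s in states]
    current = rng.choices(states, weights=q_weights)[0]
    samples = []
    for _ in range(n_steps):
        candidate = rng.choices(states, weights=q_weights)[0]
        # Acceptance ratio pi(x') q(x) / (pi(x) q(x')).
        ratio = (target[candidate] * proposal[current]) / (
            target[current] * proposal[candidate])
        if rng.random() < min(1.0, ratio):
            current = candidate
        samples.append(current)
    return samples

proposal = {"seq_a": 0.80, "seq_b": 0.15, "seq_c": 0.05}   # peaked, low exploration
target = {"seq_a": 0.50, "seq_b": 0.30, "seq_c": 0.20}     # flatter, more exploration
samples = independent_mh(target, proposal, n_steps=20000, rng=random.Random(0))
freq = {s: c / len(samples) for s, c in Counter(samples).items()}
```

Although every candidate comes from the peaked proposal, the accept/reject step reweights the chain so that its long-run frequencies follow the target rather than the proposal.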
Details
Motivation: Diffusion large language models in theory support decoding tokens in arbitrary order, but in practice random-order decoding gives poor quality; low-confidence remasking improves single-sample quality but suppresses multi-path exploration, creating an inherent tension between quality and exploration. Method: Gives a unified explanation of the dilemma from an entropy perspective, derives the optimal sequence distribution that balances quality and exploration, and designs a simple, deployable Independent Metropolis-Hastings sampler that approximately realizes this distribution during decoding. Result: On reasoning benchmarks including MATH500, AIME24/25, HumanEval, and MBPP, the method significantly outperforms random decoding and low-confidence remasking on multi-sample metrics such as Pass@$k$, achieving a better exploration-quality trade-off. Conclusion: The quality-exploration trade-off is at root a matter of controlling the entropy of the sequence distribution; explicitly modeling and sampling from an approximately optimal distribution, rather than relying only on confidence heuristics, is an effective way to improve dLLM reasoning. Abstract: Diffusion large language models (dLLMs) theoretically permit token decoding in arbitrary order, a flexibility that could enable richer exploration of reasoning paths than autoregressive (AR) LLMs. In practice, however, random-order decoding often hurts generation quality. To mitigate this, low-confidence remasking improves single-sample quality (e.g., Pass@$1$) by prioritizing confident tokens, but it also suppresses exploration and limits multi-sample gains (e.g., Pass@$k$), creating a fundamental quality-exploration dilemma. In this paper, we provide a unified explanation of this dilemma. We show that low-confidence remasking improves a myopic proxy for quality while provably constraining the entropy of the induced sequence distribution. To overcome this limitation, we characterize the optimal distribution that explicitly balances quality and exploration, and develop a simple Independent Metropolis-Hastings sampler that approximately targets this distribution during decoding. Experiments across a range of reasoning benchmarks including MATH500, AIME24/25, HumanEval, and MBPP show that our approach yields a better exploration-quality trade-off than both random and low-confidence remasking.
[37] TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning
Wenxuan Jiang,Yuxin Zuo,Zijian Zhang,Xuecheng Wu,Zining Fan,Wenxuan Liu,Li Chen,Xiaoyu Li,Xuezhi Cao,Xiaolong Jin,Ninghao Liu
Main category: cs.CL
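TL;DR: This paper proposes the TR-ICRL framework, which uses a test-time rethinking mechanism: without labeled data, majority voting produces pseudo-labels that serve as reward signals, guiding a large language model through iterative in-context reinforcement learning and substantially improving performance on knowledge-intensive tasks such as medicine and mathematics.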
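The pseudo-labeling step at the heart of this entry is plain majority voting over sampled answers. A minimal sketch follows, assuming a binary reward (1 for agreeing with the majority, 0 otherwise); the reward shape and the function names are assumptions made here, not details confirmed by the summary.

```python
from collections import Counter

def pseudo_label(candidates):
    """Return the most frequent candidate answer (ties go to the first one seen)."""
    return Counter(candidates).most_common(1)[0][0]

def proxy_rewards(candidates):
    """Score each candidate against the majority-vote pseudo-label."""
    label = pseudo_label(candidates)
    return label, [1 if answer == label else 0 for answer in candidates]

# Five sampled answers for one retrieved, unlabeled instance.
label, rewards = proxy_rewards(["B", "C", "B", "B", "A"])
```

Because the label comes from the model's own samples, the reward is only as trustworthy as the majority; a confidently wrong majority would be reinforced just the same.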
Details
Motivation: To address the reward estimation problem in ICRL caused by the lack of ground-truth reward signals, especially for reasoning and knowledge-intensive tasks where labeled data is unavailable. Method: TR-ICRL first retrieves relevant instances from an unlabeled evaluation set; the LLM generates multiple candidate answers for each instance; majority voting yields a pseudo-label that serves as a proxy reward; feedback is given on that basis and the answers are refined iteratively; finally the contextual information is merged with the query and the final answer is chosen by another round of majority voting. Result: Improves Qwen2.5-7B by 21.23% on average on MedQA and by as much as 137.59% on AIME2024; ablation studies confirm the method's effectiveness and robustness. Conclusion: TR-ICRL is an efficient new ICRL paradigm that needs no ground-truth labels and relies only on self-supervised test-time rethinking, and it applies to a range of complex reasoning and knowledge-intensive tasks. Abstract: In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground truth during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning and knowledge-intensive tasks. TR-ICRL operates by first retrieving the most relevant instances from an unlabeled evaluation set for a given query. During each ICRL iteration, the LLM generates a set of candidate answers for every retrieved instance. Next, a pseudo-label is derived from this set through majority voting. This label then serves as a proxy to give reward messages and generate formative feedback, guiding the LLM through iterative refinement. In the end, this synthesized contextual information is integrated with the original query to form a comprehensive prompt, with the answer determined through a final round of majority voting. TR-ICRL is evaluated on mainstream reasoning and knowledge-intensive tasks, where it demonstrates significant performance gains. Remarkably, TR-ICRL improves Qwen2.5-7B by 21.23% on average on MedQA and even 137.59% on AIME2024. Extensive ablation studies and analyses further validate the effectiveness and robustness of our approach. Our code is available at https://github.com/pangpang-xuan/TR_ICRL.
[38] Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics
Iyad Ait Hou,Rebecca Hwa
Main category: cs.CL
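TL;DR: This paper shows that neuron activation overlap contains a substantial lexical confound: a neuron's response to different meanings of the same word form (such as 'bank') is often mistaken for concept compression (superposition), when it mainly stems from the shared word form rather than semantic compression.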
Details
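Motivation: Standard metrics attribute a neuron's co-activation for semantically different words (such as 'lender' and 'riverside') to superposition, but the authors suspect this may be a confound caused by overlapping word forms (for example, both involving 'bank'). Method: Uses a 2×2 factorial design (same/different word form × same/different meaning) to test the source of activation overlap across language models from 110M to 70B parameters; further analyzes how the confound appears in sparse autoencoders, how it is distributed across dimensions, and how it affects downstream tasks (word sense disambiguation and knowledge editing). Result: Activation overlap in the lexical-only condition (same word, different meaning) is consistently and significantly higher than in the semantic-only condition (different word, same meaning); the confound sits in ≤1% of activation dimensions, affects 18-36% of sparse autoencoder features, and harms downstream performance; filtering it out improves word sense disambiguation and knowledge-edit selectivity (p = 0.002). Conclusion: Much of the neuron activation overlap attributed to superposition is in fact a lexical-form confound, so word-form and semantic contributions must be separated carefully when interpreting neural representations; correcting for the confound improves model interpretability and controllability. Abstract: If the same neuron activates for both "lender" and "riverside," standard metrics attribute the overlap to superposition: the neuron must be compressing two unrelated concepts. This work explores how much of the overlap is due to a lexical confound: neurons fire for a shared word form (such as "bank") rather than for two compressed concepts. A 2x2 factorial decomposition reveals that the lexical-only condition (same word, different meaning) consistently exceeds the semantic-only condition (different word, same meaning) across models spanning 110M-70B parameters. The confound carries into sparse autoencoders (18-36% of features blend senses), sits in ≤1% of activation dimensions, and hurts downstream tasks: filtering it out improves word sense disambiguation and makes knowledge edits more selective (p = 0.002).
[39] Adapting Text LLMs to Speech via Multimodal Depth Up-Scaling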
Kazuki Yano,Jun Suzuki,Shinji Watanabe
Main category: cs.CL
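TL;DR: This paper proposes Multimodal Depth Up-Scaling, which adapts a frozen text LLM to speech data by inserting new Transformer layers and training only those layers, improving speech recognition performance while preserving text capabilities.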
Details
Motivation: Existing methods that continually pretrain a text LLM into a speech language model tend to damage its original text capabilities. Method: Proposes Multimodal Depth Up-Scaling: new Transformer layers are inserted into a frozen text LLM and only the added layers are trained; E-Branchformer, an architecture designed for speech recognition, is further used as the inserted layers. Result: Experiments with SmolLM2-360M and SmolLM2-1.7B on 48k hours of ASR data show ASR performance comparable to full fine-tuning with far less text degradation; with E-Branchformer, ASR on the larger model matches or surpasses full fine-tuning while text degradation falls by over 75% and trainable parameters by 60%. Conclusion: Multimodal Depth Up-Scaling is an efficient new approach to balancing speech recognition performance against retention of text capabilities, and works best when combined with E-Branchformer. Abstract: Adapting pre-trained text Large Language Models (LLMs) into Speech Language Models (Speech LMs) via continual pretraining on speech data is promising, but often degrades the original text capabilities. We propose Multimodal Depth Up-Scaling, an extension of an emerging strategy in continual LLM pre-training, where new transformer layers are inserted into a frozen text LLM and only the added layers are trained on speech data. Experiments with SmolLM2-360M and SmolLM2-1.7B on 48k hours of English Automatic Speech Recognition (ASR) data show that depth up-scaling achieves ASR comparable to full fine-tuning while causing far less text degradation than both full fine-tuning and Low-Rank Adaptation (LoRA). We further show that incorporating E-Branchformer, an architecture designed for speech recognition, as the inserted layers achieves ASR that matches or surpasses full fine-tuning on the larger model while reducing text degradation by over 75% with 60% fewer trainable parameters.
[40] Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation
Zhiting Fan,Ruizhe Chen,Tianxiang Hu,Ru Peng,Zenan Huang,Haokai Xu,Yixin Chen,Jian Wu,Junbo Zhao,Zuozhu Liu
Main category: cs.CL
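TL;DR: This paper proposes a framework that optimizes the rubrics used for synthetic data generation from target-model feedback: influence estimation quantifies each synthetic sample's training utility for downstream tasks, and the influence score serves as the reward for optimizing the rubric generator with reinforcement learning, giving consistent gains across domains and models without task-specific tuning.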
Details
Motivation: High-quality supervised fine-tuning (SFT) data is scarce in knowledge-intensive domains, and existing synthetic data methods based on handcrafted rubrics depend on expert experience, generalize poorly, and lack a quantifiable performance feedback mechanism. Method: Proposes an influence estimator based on target-model gradients to assess the training utility of synthetic samples; builds a rubric-specialized model that generates task-conditioned rubrics; uses the influence score as the reward to optimize rubric generation with reinforcement learning; adds lightweight guiding text to improve controllability. Result: Delivers consistent improvements across domains including humanities, social sciences, medicine, law, and finance, and across different target models and data generators, without task-specific tuning, showing strong generalization. Conclusion: The quality of synthetic data should be measured by its actual training utility for the target model, not only by handcrafted rules or embedding similarity; automated rubric optimization driven by model feedback is an effective paradigm for improving SFT data quality in knowledge-intensive domains. Abstract: Large language models (LLMs) achieve strong downstream performance largely due to abundant supervised fine-tuning (SFT) data. However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to ensure. Recent work uses synthetic data, typically by prompting a generator over domain documents and filtering outputs with handcrafted rubrics. Yet rubric design is expert-dependent, transfers poorly across domains, and is often optimized through a brittle heuristic loop of writing rubrics, synthesizing data, training, inspecting results, and manually guessing revisions. This process lacks reliable quantitative feedback about how a rubric affects downstream performance. We propose evaluating synthetic data by its training utility on the target model and using this signal to guide data generation. Inspired by influence estimation, we adopt an optimizer-aware estimator that uses gradient information to quantify each synthetic sample's contribution to a target model's objective on specific tasks. Our analysis shows that even when synthetic and real samples are close in embedding space, their influence on learning can differ substantially. Based on this insight, we propose an optimization-based framework that adapts rubrics using target-model feedback. We provide lightweight guiding text and use a rubric-specialized model to generate task-conditioned rubrics. The influence score is used as the reward to optimize the rubric generator with reinforcement learning. Experiments across domains, target models, and data generators show consistent improvements and strong generalization without task-specific tuning.
[41] A Japanese Benchmark for Evaluating Social Bias in Reasoning Based on Attribution Theory
Taihei Shiotani,Masahiro Kaneko,Naoaki Okazaki
Main category: cs.CL
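TL;DR: This paper presents JUBAKU-v2, a new Japanese dataset for evaluating social bias. Grounded in attribution theory, it targets in-group versus out-group attribution bias in the reasoning process rather than bias in the conclusion alone, and is shown to detect performance differences between models more sensitively than existing benchmarks.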
Details
Motivation: Existing Japanese bias benchmarks mostly rely on translated English data, which fails to reflect biases specific to Japanese culture, and they evaluate bias only in the conclusion while ignoring bias in the reasoning process. Method: Based on attribution theory from social psychology, the authors build the JUBAKU-v2 dataset, which fixes the conclusion and specifically evaluates bias in how reasoning attributes behavior to in-groups and out-groups; it contains 216 examples reflecting biases specific to Japanese culture. Result: Experiments show that JUBAKU-v2 detects differences in social bias across large language models more sensitively than existing benchmarks. Conclusion: JUBAKU-v2 provides a more effective and culturally targeted benchmark for evaluating the cultural fit of Japanese LLMs and social bias at the reasoning stage. Abstract: In enhancing the fairness of Large Language Models (LLMs), evaluating social biases rooted in the cultural contexts of specific linguistic regions is essential. However, most existing Japanese benchmarks heavily rely on translating English data, which does not necessarily provide an evaluation suitable for Japanese culture. Furthermore, they only evaluate bias in the conclusion, failing to capture biases lurking in the reasoning. In this study, based on attribution theory in social psychology, we constructed a new dataset, "JUBAKU-v2," which evaluates the bias in attributing behaviors to in-groups and out-groups within reasoning while fixing the conclusion. This dataset consists of 216 examples reflecting cultural biases specific to Japan. Experimental results verified that it can detect performance differences across models more sensitively than existing benchmarks.
[42] More Human, More Efficient: Aligning Annotations with Quantized SLMs
Jiayu Wang,Junyoung Lee
Main category: cs.CL
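TL;DR: This paper proposes fine-tuning a quantized 1.7B-parameter small language model (SLM) on limited human-annotated data to replace proprietary large language models (LLMs), which are biased, irreproducible, and raise privacy concerns, for automatic evaluation and annotation. Using a custom multi-dimensional rubric framework and simple augmentation and regularization techniques, the method surpasses the best proprietary LLM in annotation agreement (Krippendorff's α), generalizes to an emotion classification task, and offers an efficient, reproducible, open-source alternative.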
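Since this entry reports its headline result in Krippendorff's α, a compact reference computation may help in reading the 0.23 gain. The sketch below implements the standard coincidence-matrix formula for nominal labels only. Whether the paper uses the nominal, ordinal, or interval variant is not stated in the summary, so the choice of the nominal form is an assumption, as is the input layout.

```python
from collections import defaultdict
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels given to one item
    by the annotators who rated it. Items with fewer than two labels are skipped.
    """
    coincidence = defaultdict(float)
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):   # ordered pairs of different annotators
            coincidence[(a, b)] += 1.0 / (m - 1)
    totals = defaultdict(float)
    for (a, _), value in coincidence.items():
        totals[a] += value
    n = sum(totals.values())
    observed = sum(v for (a, b), v in coincidence.items() if a != b) / n
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1.0 - observed / expected

perfect = krippendorff_alpha_nominal([["a", "a"], ["b", "b"]])
mixed = krippendorff_alpha_nominal([["x", "x"], ["y", "y"], ["x", "y"]])
opposed = krippendorff_alpha_nominal([["a", "b"], ["a", "b"]])
```

An α of 1 means perfect agreement, 0 means agreement at chance level, and negative values mean systematic disagreement, so a 0.23 difference on this scale is a sizeable shift.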
Details
Motivation: Proprietary LLMs used for automatic evaluation and annotation exhibit systematic bias, lack reproducibility, and raise data privacy concerns, while human annotation cannot keep pace with the explosive growth of text data, creating a need for high-quality, trustworthy, open-source small-model alternatives. Method: Fine-tunes a 4-bit quantized 1.7B-parameter small language model on a small amount of human-annotated data, introduces a custom multi-dimensional rubric framework, and applies simple data augmentation and regularization to improve agreement with human experts. Result: The method scores 0.23 higher in Krippendorff's α than the best proprietary LLM, generalizes well to a separate emotion classification task, and its training pipeline is fully open source. Conclusion: Task-specific alignment and efficient quantized fine-tuning can make small language models open-source evaluation and annotation tools that outperform proprietary LLMs, combining performance, controllability, and practicality. Abstract: As Large Language Model (LLM) capabilities advance, the demand for high-quality annotation of exponentially increasing text corpora has outpaced human capacity, leading to the widespread adoption of LLMs in automatic evaluation and annotation. However, proprietary LLMs often exhibit systematic biases that diverge from human expert consensus, lack reproducibility, and raise data privacy concerns. Our work examines the viability of finetuning a quantized Small Language Model of 1.7B parameter size on limited human-annotated data to serve as a highly aligned, deterministic evaluator and annotator. By implementing a custom, multi-dimensional rubric framework and simple augmentation and regularization techniques, the proposed approach achieves higher inter-annotator agreement (0.23 points increase in Krippendorff's $α$) than the best performing state-of-the-art proprietary LLM. We also demonstrate the generalizability of the proposed training pipeline on a separate emotion classification task. The results show that task-specific alignment and efficient 4-bit quantized fine-tuning provide a superior open-source alternative to using proprietary models for evaluation and annotation. Our finetuning approach is publicly available at https://github.com/jylee-k/slm-judge.
[43] Speech LLMs are Contextual Reasoning Transcribers
Keqi Deng,Ruchao Fan,Bo Ren,Yiming Wang,Jinyu Li
Main category: cs.CL
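TL;DR: This paper proposes chain-of-thought ASR (CoT-ASR), which builds a reasoning chain so that a large language model (LLM) first performs contextual analysis of the speech input and then carries out more accurate speech recognition, with support for user-guided transcription. It also introduces a CTC-guided modality adapter to narrow the gap between speech and text modalities. Experiments show relative reductions of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).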
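One plausible reading of "using CTC non-blank token probabilities to weight LLM embeddings" is a per-frame soft lookup into the LLM's embedding table. The sketch below shows that reading on plain Python lists. It is an assumption-heavy illustration: dropping the blank entry and renormalizing, the blank id of 0, and the function name are all choices made here, and the adapter in the paper may combine the quantities differently.

```python
def ctc_weighted_embedding(frame_posterior, embedding_table, blank_id=0):
    """Blend token embeddings using one frame's CTC posterior.

    frame_posterior: probabilities over the vocabulary, blank included.
    embedding_table: one embedding vector (list of floats) per vocabulary id.
    The blank entry is dropped and the remaining mass renormalized (assumed).
    """
    weights = [p for i, p in enumerate(frame_posterior) if i != blank_id]
    rows = [row for i, row in enumerate(embedding_table) if i != blank_id]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("frame carries no non-blank probability mass")
    dim = len(rows[0])
    return [sum(w / total * row[d] for w, row in zip(weights, rows))
            for d in range(dim)]

# Tiny vocabulary: id 0 is blank, ids 1 and 2 are real tokens.
embedding_table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frame = [0.5, 0.375, 0.125]
vec = ctc_weighted_embedding(frame, embedding_table)
```

The output lives in the same space as the LLM's token embeddings, which is the sense in which such a weighting could pull speech features toward the textual latent space.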
Details
Motivation: Existing ASR relies mainly on direct speech-to-text mapping and struggles to exploit the rich knowledge and contextual understanding of LLMs. Method: Proposes CoT-ASR, which builds a two-stage, single-pass reasoning chain: the LLM first generates a contextual analysis of the speech and then performs recognition based on that analysis; user-provided context can guide transcription; a CTC-guided modality adapter weights LLM embeddings by CTC non-blank token probabilities to align speech encoder outputs with the LLM's textual latent space. Result: Compared with standard LLM-based ASR, CoT-ASR achieves relative reductions of 8.7% in WER and 16.9% in EER. Conclusion: By introducing chain-of-thought reasoning and a modality adaptation mechanism, CoT-ASR markedly improves the effectiveness and controllability of LLMs for ASR and extends what ASR can do. Abstract: Despite extensions to speech inputs, effectively leveraging the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition and completes both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the modality gap, this paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM's textual latent space. Experiments show that, compared to standard LLM-based ASR, CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).
[44] English to Central Kurdish Speech Translation: Corpus Creation, Evaluation, and Orthographic Standardization
Mohammad Mohammadamini,Daban Q. Jaff,Josep Crego,Marie Tahon,Antoine Laurent
Main category: cs.CL
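TL;DR: This paper introduces KUTED, a new dataset for speech-to-text translation (S2TT) into Central Kurdish, and proposes a systematic text standardization method to handle orthographic variation, which substantially improves translation performance.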
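To make "orthographic variation" concrete: Arabic-script text often encodes visually similar letters with different Unicode code points, so the same Kurdish word can appear as several distinct strings. The sketch below folds a few such variants into one form. The specific mappings are commonly used illustrative choices and are assumptions here; they are not the standardization rules from the paper, which the summary does not list.

```python
import unicodedata

# Illustrative variant folding (assumed mappings, not the paper's rule set).
CHAR_MAP = str.maketrans({
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
})

def standardize(text: str) -> str:
    """Fold a few code-point variants and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # HEH followed by ZERO WIDTH NON-JOINER is often typed in place of AE.
    text = text.replace("\u0647\u200C", "\u06D5")
    text = text.translate(CHAR_MAP)
    return " ".join(text.split())

variant_a = "\u0643\u064A"          # KAF + YEH
variant_b = "\u06A9\u06CC"          # KEHEH + FARSI YEH
same = standardize(variant_a) == standardize(variant_b)
```

Applying one such function to both references and system outputs keeps surface-level spelling differences from being counted as translation errors by string-matching metrics such as BLEU.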
Details
Motivation: Central Kurdish lacks a high-quality S2TT dataset, and orthographic variation degrades translation quality, so the authors set out to build a dedicated dataset and propose an effective remedy. Method: Builds the KUTED dataset (derived from TED/TEDx talks, with 91,000 sentence pairs and 170 hours of English audio), designs a systematic text standardization method, and evaluates and improves Seamless and Transformer models as well as a cascaded system (Seamless ASR + NLLB MT). Result: On a held-out TED test set, a fine-tuned Seamless model reaches 15.18 BLEU; on the FLEURS benchmark the approach improves on the Seamless baseline by 3.0 BLEU; the standardization method is shown to improve consistency and performance. Conclusion: KUTED fills a resource gap for Central Kurdish S2TT, and text standardization is a key strategy for improving S2TT in low-resource languages; the proposed method has practical value and potential for wider use. Abstract: We present KUTED, a speech-to-text translation (S2TT) dataset for Central Kurdish, derived from TED and TEDx talks. The corpus comprises 91,000 sentence pairs, including 170 hours of English audio, 1.65 million English tokens, and 1.40 million Central Kurdish tokens. We evaluate KUTED on the S2TT task and find that orthographic variation significantly degrades Kurdish translation performance, producing nonstandard outputs. To address this, we propose a systematic text standardization approach that yields substantial performance gains and more consistent translations. On a test set separated from TED talks, a fine-tuned Seamless model achieves 15.18 BLEU, and we improve the Seamless baseline by 3.0 BLEU on the FLEURS benchmark. We also train a Transformer model from scratch and evaluate a cascaded system that combines Seamless (ASR) with NLLB (MT).
[45] TRIMS: Trajectory-Ranked Instruction Masked Supervision for Diffusion Language Models
Lingjie Chen,Ruizhong Qiu,Yuyu Fan,Yanjun Zhao,Hanghang Tong
Main category: cs.CL
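TL;DR: This paper proposes TRIMS, a framework that introduces decoding-trajectory supervision into the training of diffusion language models (DLMs) through a trajectory-aware masking strategy guided by a lightweight autoregressive teacher, markedly improving the accuracy-parallelism trade-off under parallel decoding at far lower training cost than distillation methods.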
Details
Motivation: Standard DLM training provides no explicit supervision over token reveal order, which causes a train-inference mismatch and makes it hard to realize the low-latency advantage of parallel decoding. Method: Proposes Trajectory-Ranked Instruction Masked Supervision (TRIMS), which uses trajectory signals from a lightweight autoregressive teacher to design a trajectory-aware masking strategy for supervised fine-tuning, avoiding costly DLM distillation. Result: On LLaDA and Dream, TRIMS significantly outperforms standard MDLM training and training-free acceleration baselines on math and coding benchmarks, matching distillation methods at much lower training cost; analysis confirms that it learns better decoding trajectories. Conclusion: Trajectory-guided supervision is important for improving the parallel decoding efficiency of DLMs, and TRIMS delivers an efficient, practical improvement at very small overhead. Abstract: Diffusion language models (DLMs) offer a promising path toward low-latency generation through parallel decoding, but their practical efficiency depends heavily on the decoding trajectory. In practice, this advantage often fails to fully materialize because standard training does not provide explicit supervision over token reveal order, creating a train-inference mismatch that leads to suboptimal decoding behavior. We propose Trajectory-Ranked Instruction Masked Supervision (TRIMS), a simple trajectory-guided supervised fine-tuning framework that injects trajectory supervision into standard Masked Diffusion Language Model (MDLM) training with minimal overhead. Instead of relying on costly DLM-based distillation, TRIMS uses lightweight signals from an autoregressive teacher to guide a trajectory-aware masking strategy, encouraging the model to learn more effective decoding orders. Experiments on LLaDA and Dream across math and coding benchmarks show that TRIMS significantly improves the accuracy-parallelism trade-off over both standard MDLM training and training-free acceleration baselines, while achieving competitive performance with prior distillation-based approaches at substantially lower training cost. Further analysis shows that TRIMS leads to better decoding trajectories, validating the effectiveness of trajectory-guided supervision for DLMs.
[46] Common TF-IDF variants arise as key components in the test statistic of a penalized likelihood-ratio test for word burstiness
Zeyad Ahmed,Paul Sheridan,Michael McIsaac,Aitazaz A. Farooque
Main category: cs.CL
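TL;DR: This paper reinterprets TF-IDF from a statistical perspective, showing that it can be viewed as the test statistic of a penalized likelihood-ratio test designed to capture word burstiness; the term-weighting scheme derived under this framework performs comparably to TF-IDF on document classification.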
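As a point of reference for the quantity this entry reinterprets, the sketch below computes one classical TF-IDF variant: raw term count times log(N / df). This is the textbook baseline only, under the assumption that documents arrive as token lists; it does not implement the penalized likelihood-ratio statistic derived in the paper.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n_docs / doc_freq[term])
         for term, tf in Counter(doc).items()}
        for doc in docs
    ]

docs = [
    ["merge", "merge", "merge", "the"],   # "merge" is bursty: repeated in one doc
    ["the", "model"],
    ["the", "data"],
    ["the", "model", "data"],
]
weights = tf_idf(docs)
```

A term concentrated in a single document ("merge") receives a large weight, while a term spread evenly across the collection ("the") receives zero, which is the burstiness intuition the paper formalizes.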
Details
Motivation: To understand why TF-IDF works, give it a theoretical foundation in statistical principles, and explore new term-weighting methods based on hypothesis testing. Method: The alternative hypothesis models word burstiness with a family of beta-binomial distributions and a gamma penalty term, while the null hypothesis uses a simple binomial distribution; the authors derive the test statistic of this penalized likelihood-ratio test and interpret it as a TF-IDF-like term-weighting formula. Result: The term-weighting scheme derived from the test statistic performs comparably to TF-IDF on document classification tasks. Conclusion: TF-IDF can be understood naturally as the outcome of a statistical test for word burstiness, and the hypothesis-testing framework suggests new ways to design better term-weighting methods. Abstract: TF-IDF is a classical formula that is widely used for identifying important terms within documents. We show that TF-IDF-like scores arise naturally from the test statistic of a penalized likelihood-ratio test setup capturing word burstiness (also known as word over-dispersion). In our framework, the alternative hypothesis captures word burstiness by modeling a collection of documents according to a family of beta-binomial distributions with a gamma penalty term on the precision parameter. In contrast, the null hypothesis assumes that words are binomially distributed in collection documents, a modeling approach that fails to account for word burstiness. We find that a term-weighting scheme derived from this test statistic performs comparably to TF-IDF on document classification tasks. This paper provides insights into TF-IDF from a statistical perspective and underscores the potential of hypothesis testing frameworks for advancing term-weighting scheme development.
[47] OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
Han Zhu,Lingxuan Ye,Wei Kang,Zengwei Yao,Liyong Guo,Fangjun Kuang,Zhifeng Han,Weiji Zhuang,Long Lin,Daniel Povey
Main category: cs.CL
TL;DR: OmniVoice is a massive multilingual zero-shot TTS model supporting over 600 languages, using a novel diffusion language model-style discrete non-autoregressive architecture that directly maps text to acoustic tokens, enabled by full-codebook random masking and initialization from a pre-trained LLM.
Details
Motivation: To overcome performance bottlenecks of conventional two-stage discrete NAR TTS models and achieve broader multilingual coverage with higher intelligibility. Method: Proposes a diffusion language model-style discrete non-autoregressive architecture that directly maps text to multi-codebook acoustic tokens, using full-codebook random masking for training and initialization from a pre-trained LLM. Result: Achieves state-of-the-art performance on Chinese, English, and diverse multilingual benchmarks, with the broadest language coverage to date (600+ languages), trained on a 581k-hour open-source multilingual dataset. Conclusion: OmniVoice demonstrates that simplifying the TTS pipeline via direct text-to-acoustic mapping, made possible by full-codebook random masking and LLM initialization, enables scalable, high-quality zero-shot multilingual speech synthesis. Abstract: We present OmniVoice, a massively multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model-style discrete non-autoregressive (NAR) architecture. Unlike conventional discrete NAR models that suffer from performance bottlenecks in complex two-stage (text-to-semantic-to-acoustic) pipelines, OmniVoice directly maps text to multi-codebook acoustic tokens. This simplified approach is facilitated by two key technical innovations: (1) a full-codebook random masking strategy for efficient training, and (2) initialization from a pre-trained LLM to ensure superior intelligibility. By leveraging a 581k-hour multilingual dataset curated entirely from open-source data, OmniVoice achieves the broadest language coverage to date and delivers state-of-the-art performance across Chinese, English, and diverse multilingual benchmarks. Our code and pre-trained models are publicly available at https://github.com/k2-fsa/OmniVoice.
[48] AfrIFact: Cultural Information Retrieval, Evidence Extraction and Fact Checking for African Languages
Israel Abebe Azime,Jesujoba Oluwadara Alabi,Crystina Zhang,Iffat Maab,Atnafu Lambebo Tonja,Tadesse Destaw Belay,Folasade Peace Alabi,Salomey Osei,Saminu Mohammad Aliyu,Nkechinyere Faith Aguobi,Bontu Fufa Balcha,Blessing Kudzaishe Sibanda,Davis David,Mouhamadane Mboup,Daud Abolade,Neo Putini,Philipp Slusallek,David Ifeoluwa Adelani,Dietrich Klakow
Main category: cs.CL
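TL;DR: This paper introduces AfrIFact, a dataset covering the automatic fact-checking pipeline (information retrieval, evidence extraction, and fact checking) in ten African languages and English. It exposes weaknesses of current models in cross-lingual retrieval and multilingual fact verification, and shows that few-shot prompting and task-specific fine-tuning substantially improve performance.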
Details
Motivation: Assessing the veracity of online claims is critical, and fact-checking in low-resource languages is especially urgent when claims concern topics such as healthcare and culture and target communities with limited access to information. Method: Builds the AfrIFact multilingual fact-checking dataset (ten African languages plus English), systematically evaluates the cross-lingual retrieval ability of embedding models and the multilingual fact-verification performance of LLMs, and runs few-shot prompting and task-specific fine-tuning experiments. Result: Even the best embedding models are weak at cross-lingual retrieval; cultural and news documents are easier to retrieve than healthcare documents; LLMs verify facts poorly in African languages, but few-shot prompting improves performance by up to 43% and task-specific fine-tuning by up to a further 26%. Conclusion: AfrIFact fills a gap in fact-checking research for low-resource languages and supports work on low-resource information retrieval, evidence retrieval, and fact checking. Abstract: Assessing the veracity of a claim made online is a complex and important task with real-world implications. When these claims are directed at communities with limited access to information and the content concerns issues such as healthcare and culture, the consequences intensify, especially in low-resource languages. In this work, we introduce AfrIFact, a dataset that covers the necessary steps for automatic fact-checking (i.e., information retrieval, evidence extraction, and fact checking), in ten African languages and English. Our evaluation results show that even the best embedding models lack cross-lingual retrieval capabilities, and that cultural and news documents are easier to retrieve than healthcare-domain documents, both in large corpora and in single documents. We show that LLMs lack robust multilingual fact-verification capabilities in African languages, while few-shot prompting improves performance by up to 43% in AfriqueQwen-14B, and task-specific fine-tuning further improves fact-checking accuracy by up to 26%. These findings, along with our release of the AfrIFact dataset, encourage work on low-resource information retrieval, evidence retrieval, and fact checking.
[49] To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining
Karan Singh,Michael Yu,Varun Gangal,Zhuofu Tao,Sachin Kumar,Emmy Liu,Steven Y. Feng
Main category: cs.CL
TL;DR: This paper investigates the trade-off between pretraining data and retrieval store size in retrieval-augmented generation (RAG), proposing a three-dimensional scaling framework to guide optimal data allocation under fixed budgets.
Details
Motivation: The relationship between parametric knowledge (from pretraining) and non-parametric knowledge (from retrieval) is poorly understood, especially under fixed data budgets. Method: The authors train OLMo-2-based LMs (30M–3B parameters) on up to 100B tokens of DCLM data, varying both pretraining scale (1–150× parameters) and retrieval store size (1–20×), and evaluate across reasoning, scientific QA, and open-domain QA benchmarks. Result: Retrieval consistently improves performance over parametric-only baselines; they introduce a three-dimensional scaling manifold modeling performance as a function of model size, pretraining tokens, and retrieval corpus size, enabling estimation of optimal data allocation. Conclusion: Marginal utility of retrieval depends strongly on model scale, task type, and pretraining saturation—providing quantitative guidance for when and how retrieval should complement pretraining in scalable LM design. Abstract: Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. 
We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.
[50] LangMARL: Natural Language Multi-Agent Reinforcement Learning
Huaiyuan Yao,Longchao Da,Xiaoou Liu,Charles Fleming,Tianlong Chen,Hua Wei
Main category: cs.CL
TL;DR: LangMARL is a novel framework that adapts multi-agent reinforcement learning (MARL) credit assignment and policy gradient methods to large language model (LLM) agents, enabling more efficient, interpretable, and generalizable coordination in dynamic environments.
Details
Motivation: LLM agents lack fine-grained causal feedback for local policy refinement due to coarse global outcomes, leading to poor autonomous strategy evolution; this is identified as a multi-agent credit assignment problem. Method: LangMARL introduces agent-level language credit assignment, gradient evolution in language space, and causal relation summarization from replayed trajectories to provide dense feedback under sparse rewards. Result: Extensive experiments show LangMARL improves sample efficiency, interpretability, and generalization across diverse cooperative multi-agent tasks. Conclusion: Integrating MARL-inspired credit assignment and gradient-based policy evolution into the language space effectively addresses the credit assignment bottleneck in LLM-based multi-agent systems. Abstract: Large language model (LLM) agents struggle to autonomously evolve coordination strategies in dynamic environments, largely because coarse global outcomes obscure the causal signals needed for local policy refinement. We identify this bottleneck as a multi-agent credit assignment problem, which has long been studied in classical multi-agent reinforcement learning (MARL) but remains underaddressed in LLM-based systems. Building on this observation, we propose LangMARL, a framework that brings credit assignment and policy gradient evolution from cooperative MARL into the language space. LangMARL introduces agent-level language credit assignment, pioneers gradient evolution in language space for policy improvement, and summarizes task-relevant causal relations from replayed trajectories to provide dense feedback and improve convergence under sparse rewards. Extensive experiments across diverse cooperative multi-agent tasks demonstrate improved sample efficiency, interpretability, and strong generalization.
[51] Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention
Zehao Jin,Yanan Sui
Main category: cs.CL
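TL;DR: Inspired by the whole-brain connectome of the fruit fly, this paper proposes Stochastic Attention (SA), which randomly permutes the token sequence before sliding-window attention, turning a fixed local window into a stochastic global one. Within the same compute budget this enlarges the receptive field and improves expressivity, and the approach is validated in language-model pre-training and inference.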
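The receptive-field claim for Stochastic Attention can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: it tracks only which input positions each token can see, assumes a causal window over the (possibly permuted) order, and the names `receptive_fields` and `mean_size` are hypothetical.

```python
import random


def receptive_fields(n, window, layers, permute, seed=0):
    """Return, for every token, the set of input positions visible after `layers` layers."""
    rng = random.Random(seed)
    fields = [{i} for i in range(n)]  # before any layer, a token sees only itself
    for _ in range(layers):
        order = list(range(n))
        if permute:
            rng.shuffle(order)  # fresh, independently sampled permutation per layer
        new_fields = [set() for _ in range(n)]
        for pos, token in enumerate(order):
            # Assumed causal window over the current order: the token in
            # slot `pos` attends to slots pos-window+1 .. pos (n * window work).
            for neighbour in order[max(0, pos - window + 1): pos + 1]:
                new_fields[token] |= fields[neighbour]
        fields = new_fields  # indexing by token id restores the original order
    return fields


def mean_size(fields):
    """Average number of input positions a token can see."""
    return sum(len(f) for f in fields) / len(fields)


swa = receptive_fields(n=256, window=4, layers=4, permute=False)
sa = receptive_fields(n=256, window=4, layers=4, permute=True)
print(f"SWA mean receptive field: {mean_size(swa):.1f}")
print(f"SA  mean receptive field: {mean_size(sa):.1f}")
```

Without permutation the field grows by `window - 1` positions per layer; with a new permutation per layer it grows roughly by a factor of `window`, which is the intuition behind the logarithmic-depth coverage stated in the abstract.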
Details
Motivation: In the fruit fly connectome, long-range connections act as stochastic shortcuts that enable efficient global communication. Taking inspiration from this, the work targets the limited receptive field of sliding-window attention (SWA), which makes long-range dependencies hard to model. Method: Stochastic Attention (SA) applies a random permutation to the input token sequence before sliding-window attention, runs windowed attention, and then applies the inverse permutation to restore the original order; when layers are stacked, independently sampled permutations make the receptive field grow exponentially with depth. Result: When pre-training language models from scratch, a gated SA + SWA combination achieves the best average zero-shot accuracy; in training-free inference on Qwen3-8B and Qwen3-30B-A3B, SA consistently outperforms SWA and matches or exceeds Mixture of Block Attention (MoBA) at comparable compute budgets. Conclusion: Connectome-inspired stochastic routing is a practical and effective way to improve the expressivity of efficient attention, complementary to existing linear and sparse methods. Abstract: The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structured at the circuit level, the network's long-range connections are broadly distributed across brain regions, functioning as stochastic shortcuts that enable efficient global communication. Inspired by this observation, we propose Stochastic Attention (SA), a drop-in enhancement for sliding-window attention (SWA) that applies a random permutation to the token sequence before windowed attention and restores the original order afterward. This transforms the fixed local window into a stochastic global one within the same $O(nw)$ per-layer budget. Through depth, independently sampled permutations yield exponentially growing receptive fields, achieving full sequence coverage in $O(\log_w n)$ layers versus $O(n/w)$ for SWA. We validate SA in two settings: pre-training language models from scratch, where a gated SA + SWA combination achieves the best average zero-shot accuracy, and training-free inference on Qwen3-8B and Qwen3-30B-A3B, where SA consistently outperforms SWA and matches or exceeds Mixture of Block Attention at comparable compute budgets. These results suggest that connectome-inspired stochastic routing is a practical primitive for improving the expressivity of efficient attention, complementary to existing linear and sparse approaches.
[52] From Baselines to Preferences: A Comparative Study of LoRA/QLoRA and Preference Optimization for Mental Health Text Classification
Mihael Arcan
Main category: cs.CL
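TL;DR: This paper systematically compares optimization strategies for mental health text classification. It argues that method selection matters more than simply adding a preference-training stage, and proposes a practical framework: start from transparent baselines, apply controlled tuning step by step, and use preference optimization selectively.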
Details
Motivation: Mental health text classification has widely adopted modern adaptation methods, yet practical guidance on when, why, and how to choose an optimization strategy remains very limited. Method: A systematic comparative study that starts from strong baselines and examines, in turn, classical and encoder references, parameter-efficient fine-tuning (LoRA/QLoRA) under multiple objective and optimization settings, and preference optimization (DPO, ORPO, KTO) combined with class-rebalanced training. Result: Optimization effects are highly method-dependent: some strategies deliver stable, transferable gains, while others are very sensitive to configuration and data balance. Preference optimization in particular varies widely across objectives, indicating that method selection matters more than adding a preference-training stage. Conclusion: The core contribution is a clear optimization narrative for mental health NLP: start from transparent baselines, apply controlled fine-tuning, and use preference optimization selectively only where gains are clear, giving a reproducible, practical guide to training strategy beyond architecture choice. Abstract: Mental health text classification has rapidly adopted modern adaptation methods, yet practical guidance on which optimization strategy to use, when, and why remains limited. This paper presents a systematic comparative study of optimization pathways for a joint mental-health classification task, moving from strong vanilla baselines to progressively more specialized techniques. We first establish classical and encoder references, then examine parameter-efficient supervised fine-tuning with LoRA/QLoRA under multiple objective and optimization settings, and finally evaluate preference-based optimization with DPO, ORPO, and KTO, including class-rebalanced training. Rather than emphasizing a single headline score, we focus on methodological insight: how performance changes with objective formulation, adapter choice, optimizer behavior, context windowing, and class-balance intervention. The results show that optimization effects are highly method-dependent: some approaches deliver stable, transferable gains, while others are sensitive to configuration and data balance. Preference optimization, in particular, exhibits large variation across objectives, indicating that method selection is more consequential than simply adding a preference-training stage. The central contribution is a clear optimization narrative for mental health NLP: start from transparent baselines, apply controlled tuning, and use preference optimization selectively where its gains are demonstrable. This provides a reproducible and practically grounded framework for choosing effective training strategies beyond architecture choice alone.
[53] From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks
Ayan Datta,Mounika Marreddy,Alexander Mehler,Zhixue Zhao,Radhika Mamidi
Main category: cs.CL
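TL;DR: Using character counting as a simple symbolic task, this paper finds that large language models compute the correct answer internally but fail at the output layer. Mechanistic analysis reveals "negative circuits" in late MLP layers that suppress the correct signal, showing that symbolic reasoning failures stem from structured interference inside the model rather than missing representations.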
Details
Motivation: Investigate why LLMs fail on elementary symbolic tasks such as character counting even though they excel on complex benchmarks; existing explanations are insufficient. Method: Use character counting as a controlled probe and combine probing classifiers, activation patching, logit lens analysis, and attention head tracing to study internal representations and information flow in models including LLaMA, Qwen, and Gemma. Result: Character-level information is correctly encoded in early and mid layers but is systematically attenuated by specific "negative circuits" in the penultimate and final-layer MLPs, so the correct answer is not output; the forward pass exhibits a competitive decoding mechanism. Conclusion: Symbolic reasoning failures in LLMs are not due to missing representations or insufficient scale but to structured interference in the computation graph; the phenomenon can worsen under scaling and instruction tuning, suggesting that model design must be improved so that information is reliably encoded and used. Abstract: Large language models (LLMs) exhibit failures on elementary symbolic tasks such as character counting in a word, despite excelling on complex benchmarks. Although this limitation has been noted, the internal reasons remain unclear. We use character counting (e.g., "How many p's are in apple?") as a minimal, controlled probe that isolates token-level reasoning from higher-level confounds. Using this setting, we uncover a consistent phenomenon across modern architectures, including LLaMA, Qwen, and Gemma: models often compute the correct answer internally yet fail to express it at the output layer. Through mechanistic analysis combining probing classifiers, activation patching, logit lens analysis, and attention head tracing, we show that character-level information is encoded in early and mid-layer representations. However, this information is attenuated by a small set of components in later layers, especially the penultimate and final layer MLP. We identify these components as negative circuits: subnetworks that downweight correct signals in favor of higher-probability but incorrect outputs. Our results lead to two contributions. First, we show that symbolic reasoning failures in LLMs are not due to missing representations or insufficient scale, but arise from structured interference within the model's computation graph. This explains why such errors persist and can worsen under scaling and instruction tuning. Second, we provide evidence that LLM forward passes implement a form of competitive decoding, in which correct and incorrect hypotheses coexist and are dynamically reweighted, with final outputs determined by suppression as much as by amplification. These findings carry implications for interpretability and robustness: simple symbolic reasoning exposes weaknesses in modern LLMs, underscoring the need for design strategies that ensure information is encoded and reliably used.
[54] Valency Classification of Mapudungun Verbal Roots. Established by the language's own morphotactics
Andrés Chandía
Main category: cs.CL
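TL;DR: Building on earlier work that confirmed the lexical category of Mapudungun roots identified as verbal, this paper classifies the confirmed verbal roots by valency using the language's own morphotactics, with the aim of improving the morphological analyser Dungupeyum and advancing the theoretical understanding of valency in Mapudungun verbs.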
Details
Motivation: Improve the Mapudungun morphological analyser Dungupeyum and deepen the theoretical understanding of verb valency in Mapudungun. Method: Classify verbal roots by valency using the language's own morphotactics, examining which combinations of suffixes with roots or verbal stems are permitted and which are restricted in the verb form. Result: A valency classification of the Mapudungun roots confirmed to be verbal, with the verified results incorporated into the Dungupeyum morphological analyser. Conclusion: The study improves the analytical coverage of Dungupeyum and offers a more accurate description and theoretical account of valency in Mapudungun verbs. Abstract: In the previous work, a lexical (re)categorisation -- or confirmation of the given category -- of roots identified as verbal was undertaken to determine their original category accurately. Building on this, the present paper offers an account of the valency classification of those Mapudungun roots confirmed to be verbal, using the language's own morphotactics; specifically, by examining the permissible and restricted combinations of various suffixes with roots or verbal stems in the Mapuche verb form. As with all work conducted thus far, the results presented here aim to improve the morphological analyser (Dungupeyum) with all verified findings incorporated into the system. From a theoretical perspective, we also hope to contribute to the recognition and understanding of issues related to the valency of Mapuche verb forms.
[55] Emotion Entanglement and Bayesian Inference for Multi-Dimensional Emotion Understanding
Hemanth Kotaprolu,Kishan Maharaj,Raey Zhao,Abhijit Mishra,Pushpak Bhattacharyya
Main category: cs.CL
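TL;DR: This paper presents EmoScene, a benchmark for evaluating language models on multi-dimensional emotion understanding, and designs a Bayesian-inference post-processing framework to improve the consistency of predictions.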
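The idea of joint inference over an emotion vector using co-occurrence statistics can be sketched as below. This is an illustrative toy under stated assumptions, not the paper's framework: the prior is assumed to be a simple pairwise log-linear bonus, the model's per-emotion probabilities are treated as independent evidence factors, and `map_emotion_vector` and `coupling` are hypothetical names.

```python
import itertools
import math


def map_emotion_vector(probs, coupling):
    """Pick the highest-scoring binary emotion vector under a pairwise prior.

    probs[i]       : the model's independent probability that emotion i is present
    coupling[i][j] : assumed log-scale bonus (or penalty) when emotions i and j
                     co-occur; in practice estimated from co-occurrence counts
    """
    k = len(probs)
    clipped = [min(max(p, 1e-6), 1.0 - 1e-6) for p in probs]  # avoid log(0)
    best, best_score = None, -math.inf
    for y in itertools.product((0, 1), repeat=k):  # 2**8 = 256 cases for 8 emotions
        # Evidence term: each model probability acts as an independent factor.
        score = sum(math.log(p if on else 1.0 - p) for p, on in zip(clipped, y))
        # Prior term: reward or penalise pairs of emotions that are both active.
        score += sum(coupling[i][j]
                     for i in range(k) for j in range(i + 1, k)
                     if y[i] and y[j])
        if score > best_score:
            best, best_score = y, score
    return best


probs = [0.9, 0.45, 0.1]                      # toy three-emotion example
no_prior = [[0.0] * 3 for _ in range(3)]
linked = [[0.0, 1.0, 0.0], [0.0] * 3, [0.0] * 3]  # emotions 0 and 1 tend to co-occur

independent = map_emotion_vector(probs, no_prior)
joint = map_emotion_vector(probs, linked)
print(independent, joint)
```

With no coupling the result equals per-label thresholding at 0.5; with a positive coupling between the first two emotions, the borderline second emotion is switched on because its likely companion is confidently present.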
Details
Motivation: Existing emotion understanding benchmarks mostly rely on short texts and predefined labels and ignore the structured dependencies among emotions, so they cannot support context-aware multi-dimensional emotion reasoning. Method: Build EmoScene, a benchmark grounded in Plutchik's theory (4,731 context-rich scenarios, each annotated with an 8-dimensional emotion vector), and evaluate six instruction-tuned large language models in a zero-shot setting; further propose an entanglement-aware Bayesian inference framework that incorporates emotion co-occurrence statistics to perform joint posterior inference. Result: The best model reaches a Macro F1 of only 0.501 on EmoScene; the proposed Bayesian post-processing yields notable gains for weaker models (e.g., +0.051 Macro F1 for Qwen2.5-7B) and improves the structural consistency of predictions. Conclusion: EmoScene provides a more challenging benchmark for multi-dimensional emotion understanding, exposes the limitations of current large language models on this task, and shows that introducing a structural prior over emotions can improve performance. Abstract: Understanding emotions in natural language is inherently a multi-dimensional reasoning problem, where multiple affective signals interact through context, interpersonal relations, and situational cues. However, most existing emotion understanding benchmarks rely on short texts and predefined emotion labels, reducing this process to independent label prediction and ignoring the structured dependencies among emotions. To address this limitation, we introduce Emotional Scenarios (EmoScene), a theory-grounded benchmark of 4,731 context-rich scenarios annotated with an 8-dimensional emotion vector derived from Plutchik's basic emotions. We evaluate six instruction-tuned large language models in a zero-shot setting and observe modest performance, with the best model achieving a Macro F1 of 0.501, highlighting the difficulty of context-aware multi-label emotion prediction. Motivated by the observation that emotions rarely occur independently, we further propose an entanglement-aware Bayesian inference framework that incorporates emotion co-occurrence statistics to perform joint posterior inference over the emotion vector. This lightweight post-processing improves structural consistency of predictions and yields notable gains for weaker models (e.g., +0.051 Macro F1 for Qwen2.5-7B). EmoScene therefore provides a challenging benchmark for studying multi-dimensional emotion understanding and the limitations of current language models.
[56] Agentic Tool Use in Large Language Models
Jinchao Hu,Meizhi Zhong,Kehai Chen,Xuefeng Bai,Min Zhang
Main category: cs.CL
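TL;DR: This paper surveys research on tool use by large language models acting as autonomous agents. It organizes the work into three paradigms (plug-and-play prompting, supervised tool learning, and reward-driven tool policy learning), analyzes the methods, strengths, failure modes, and evaluation landscape of each, and aims to address the fragmentation of current research with a more structured evolutionary view.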
Details
Motivation: Existing tool-use research is scattered across tasks, tool types, and training settings, and lacks a unified understanding of how methods differ and evolve. Method: Organize the literature into three tool-use paradigms (prompting as plug-and-play, supervised tool learning, and reward-driven tool policy learning) and systematically analyze the methods, strengths and weaknesses, evaluation practice, and challenges of each. Result: A structured taxonomy of tool-use research that clarifies the technical path and applicable scope of each paradigm and identifies key challenges in current evaluation and real-world deployment. Conclusion: Tool-use research is moving from simple prompting toward end-to-end policy learning; unified evaluation standards and cross-paradigm integration are needed for large models to become reliable autonomous agents. Abstract: Large language models are increasingly being deployed as autonomous agents, yet their real-world effectiveness depends on reliable tools for information retrieval, computation, and external action. Existing studies remain fragmented across tasks, tool types, and training settings, lacking a unified view of how tool-use methods differ and evolve. This paper organizes the literature into three paradigms (prompting as plug-and-play, supervised tool learning, and reward-driven tool policy learning), analyzes their methods, strengths, and failure modes, reviews the evaluation landscape, and highlights key challenges, aiming to address this fragmentation and provide a more structured evolutionary view of agentic tool use.
[57] KUET at StanceNakba Shared Task: StanceMoE: Mixture-of-Experts Architecture for Stance Detection
Abdullah Al Shafi,Md. Milon Islam,Sk. Imran Hossain,K. M. Azharul Hasan
Main category: cs.CL
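TL;DR: This paper proposes StanceMoE, a context-enhanced Mixture-of-Experts (MoE) architecture built on BERT for stance detection toward implicit target actors, reaching 94.26% macro-F1 on the StanceNakba 2026 dataset.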
Details
Motivation: Existing transformer-based stance detection models rely on unified representations and struggle to capture heterogeneous linguistic signals such as contrastive discourse structures, framing cues, and salient lexical indicators; an adaptive architecture that explicitly models diverse stance-expressive patterns is needed. Method: StanceMoE uses a fine-tuned BERT encoder and integrates six expert modules, each focused on a different linguistic signal (global semantic orientation, salient lexical cues, clause-level focus, phrase-level patterns, framing indicators, and contrast-driven discourse shifts), with a context-aware gating mechanism that dynamically weights the experts' contributions. Result: On the StanceNakba 2026 Subtask A dataset (1,401 English texts with an implicit target actor), StanceMoE achieves 94.26% macro-F1, outperforming traditional baselines and other BERT variants. Conclusion: A context-enhanced MoE architecture models the heterogeneous linguistic signals of stance expression more effectively and improves stance detection for implicit target actors. Abstract: Actor-level stance detection aims to determine an author's expressed position toward specific geopolitical actors mentioned or implicated in a text. Although transformer-based models have achieved relatively good performance in stance classification, they typically rely on unified representations that may not sufficiently capture heterogeneous linguistic signals, such as contrastive discourse structures, framing cues, and salient lexical indicators. This motivates the need for adaptive architectures that explicitly model diverse stance-expressive patterns. In this paper, we propose StanceMoE, a context-enhanced Mixture-of-Experts (MoE) architecture built upon a fine-tuned BERT encoder for actor-level stance detection. Our model integrates six expert modules designed to capture complementary linguistic signals, including global semantic orientation, salient lexical cues, clause-level focus, phrase-level patterns, framing indicators, and contrast-driven discourse shifts. A context-aware gating mechanism dynamically weights expert contributions, enabling adaptive routing based on input characteristics. Experiments are conducted on the StanceNakba 2026 Subtask A dataset, comprising 1,401 annotated English texts where the target actor is implicit in the text. StanceMoE achieves a macro-F1 score of 94.26%, outperforming traditional baselines and alternative BERT-based variants.
[58] When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
Henry Peng Zou,Chunyu Miao,Wei-Chieh Huang,Yankai Chen,Yue Zhou,Hanrong Zhang,Yaozu Wu,Liancheng Fang,Zhengyao Gu,Zhen Zhang,Kening Zheng,Fangxin Wang,Yi Nian,Shanghao Li,Wenzhe Fan,Langzhou He,Weizhi Zhang,Xue Liu,Philip S. Yu
Main category: cs.CL
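TL;DR: This paper presents InterruptBench, the first benchmark of interruptible agents for long-horizon, environmentally grounded web navigation tasks. It systematically studies how well and how efficiently LLM agents adapt and recover when users add requirements, revise goals, or retract them mid-task.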
Details
Motivation: Existing benchmarks mostly assume uninterrupted agent execution or study interruptions only in short text tasks, whereas deployed LLM agents must handle real-time user intervention in long-horizon, dynamic environments whose state keeps changing. Method: Formalize three realistic interruption types (addition, revision, and retraction), build InterruptBench from WebArena-Lite under strict semantic constraints, and design a unified interruption simulation framework to evaluate six mainstream LLM backbones in single- and multi-turn interruption settings. Result: Even the strongest large-scale LLMs struggle to respond to and recover from user interruptions effectively and efficiently in long-horizon tasks. Conclusion: Interruptibility is a key bottleneck for deploying current LLM agents, and InterruptBench provides the first systematic benchmark and analysis framework for this direction. Abstract: As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirements or revising goals, during mid-task execution is becoming a core requirement for realistic deployment. However, existing benchmarks largely assume uninterrupted agent behavior or study interruptions only in short, unconstrained language tasks. In this paper, we present the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes. We formalize three realistic interruption types, including addition, revision, and retraction, and introduce InterruptBench, a benchmark derived from WebArena-Lite that synthesizes high-quality interruption scenarios under strict semantic constraints. Using a unified interruption simulation framework, we evaluate six strong LLM backbones across single- and multi-turn interruption settings, analyzing both their effectiveness in adapting to updated intents and their efficiency in recovering from mid-task changes. Our results show that handling user interruptions effectively and efficiently during long-horizon agentic tasks remains challenging for powerful large-scale LLMs. Code and dataset are available at https://github.com/HenryPengZou/InterruptBench.
[59] GPT-NL Public Corpus: A Permissively Licensed, Dutch-First Dataset for LLM Pre-training
Jesse van Oort,Frank Brinkkemper,Erik de Graaf,Bram Vanroy,Saskia Lensink
Main category: cs.CL
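TL;DR: This paper introduces the GPT-NL Public Corpus, currently the largest permissively licensed Dutch corpus. It contains 36B preprocessed Dutch tokens and also includes English, code, and German/Danish data; all data is screened for compliance and released under a CC-BY license.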
Details
Motivation: Building lawful, useful, and non-harmful (commercial) language models requires a large, high-quality Dutch corpus with clear licensing. Method: Combine existing and new data: filter large existing corpora (such as Common Crawl and Common Corpus) and, together with partner organisations, collect or synthetically augment Dutch-specific data; assess all data for compliance and reprocess it; release everything under a single CC-BY license. Result: The GPT-NL Public Corpus, containing 36B Dutch, 207B English, 232B code, and 48B German/Danish tokens, all publicly available on the Hugging Face Hub. Conclusion: The corpus fills the gap in high-quality, commercially usable Dutch pre-training resources and provides a solid basis for developing multilingual and localized large models. Abstract: We present the GPT-NL Public Corpus, the biggest permissively licensed corpus of Dutch language resources. The GPT-NL Public Corpus contains 21 Dutch-only collections totalling 36B preprocessed Dutch tokens not present in any other LLM pretraining corpus. Additionally, the corpus includes roughly 207B English, 232B Code, and 48B German/Danish tokens taken from existing sets which we further curated for compliance. This corpus includes curated data from large existing corpora like Common Corpus and Common Crawl, as well as newly created Dutch-specific collections. Most newly created Dutch collections consist of content collected in collaboration with organisations or synthetically augmented content. All data is collected and evaluated with the aim of facilitating the creation of (commercial) language models that are lawful, useful and non-harmful. All data included in the GPT-NL Public Corpus is sourced from datasets with permissive licensing and is curated and redistributed under a CC-BY license. The full dataset is publicly available on the Hugging Face Hub.
[60] Positional Cognitive Specialization: Where Do LLMs Learn To Comprehend and Speak Your Language?
Luis Frentzen Salim,Lun-Wei Ku,Hsing-Kuo Kenneth Pao
Main category: cs.CL
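TL;DR: This paper studies how large language models (LLMs) acquire new languages during training. It proposes an analysis framework based on the functional division between perception (input comprehension) and production (output generation), and from it derives CogSym, a lightweight adaptation method that comes close to full fine-tuning while tuning only the 25% outermost layers.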
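The layer-selection rule behind CogSym, tuning only a few early and late layers, can be sketched as follows. This is a minimal sketch under assumptions: the abstract does not say how the 25% budget is divided, so an even split between the input side and the output side is assumed here, the stack is assumed to have at least two layers, and `outermost_layers` and `trainable_mask` are hypothetical names.

```python
def outermost_layers(num_layers, fraction=0.25):
    """Indices of the early and late layers that together make up `fraction` of the stack."""
    budget = max(2, round(num_layers * fraction))  # at least one layer on each side
    early = budget // 2          # assumed even split between the two ends
    late = budget - early
    return list(range(early)) + list(range(num_layers - late, num_layers))


def trainable_mask(num_layers, fraction=0.25):
    """True for layers to fine-tune, False for layers to keep frozen."""
    chosen = set(outermost_layers(num_layers, fraction))
    return [i in chosen for i in range(num_layers)]


print(outermost_layers(32))
print(trainable_mask(8))
```

In a training script the mask would decide which blocks receive gradient updates; everything in the middle of the stack stays frozen.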
Details
Motivation: Adapting to new languages is expensive and opaque; prior work focuses on how trained models process multilingual instructions and overlooks the mechanisms by which models acquire languages during training. Method: Run layer ablation sweeps (from the input side and from the output side) on decoder-only transformers for low-resource languages, analyze how perception and production abilities are distributed across layers, and derive from this the layer-wise heuristic fine-tuning strategy CogSym. Result: Tuning only the 25% outermost layers keeps downstream task performance within 2-3% of full fine-tuning; CogSym performs consistently with adapter methods such as LoRA, showing that it generalizes beyond full fine-tuning. Conclusion: Language models naturally develop a functional layering of perception and production during training; exploiting this pattern enables efficient, interpretable, low-cost multilingual adaptation and supports inclusive language modeling. Abstract: Adapting large language models (LLMs) to new languages is an expensive and opaque process. Understanding how language models acquire new languages and multilingual abilities is key to achieving efficient adaptation. Prior work on multilingual interpretability research focuses primarily on how trained models process multilingual instructions, leaving unexplored the mechanisms through which they acquire new languages during training. We investigate these training dynamics on decoder-only transformers through the lens of two functional cognitive specializations: language perception (input comprehension) and production (output generation). Through experiments on low-resource languages, we demonstrate how perceptual and productive specialization emerges in different regions of a language model by running layer ablation sweeps from the model's input and output directions. Based on the observed specialization patterns, we propose CogSym, a layer-wise heuristic that enables effective adaptation by exclusively fine-tuning a few early and late layers. We show that tuning only the 25% outermost layers achieves downstream task performance within 2-3% deviation from the full fine-tuning baseline. CogSym yields consistent performance with adapter methods such as LoRA, showcasing generalization beyond full fine-tuning. These findings provide insights to better understand how LLMs learn new languages and push toward accessible and inclusive language modeling.
[61] Phase transition on a context-sensitive random language model with short range interactions
Yuma Toji,Jun Takahashi,Vwani Roychowdhury,Hideyuki Miyahara
Main category: cs.CL
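TL;DR: This paper constructs a random language model with short-range interactions and shows by numerical simulation that a phase transition occurs even when the context length is fixed, indicating that finite-temperature phase transitions in language models arise from the intrinsic nature of language rather than from long-range interactions.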
Details
Motivation: Clarify whether the phase transitions observed in language models originate from genuinely linguistic properties rather than merely from long-range interactions. Method: Construct a short-range random language model that belongs to the class of context-sensitive grammars in the Chomsky hierarchy and can refer to contexts explicitly, then study its statistical properties by numerical simulation. Result: A phase transition occurs even when the context length does not grow with the sentence length. Conclusion: Finite-temperature phase transitions in language models reflect the intrinsic nature of language and are not caused by long-range interactions. Abstract: Since the random language model was proposed by E. DeGiuli [Phys. Rev. Lett. 122, 128301], language models have been investigated intensively from the viewpoint of statistical mechanics. Recently, the existence of a Berezinskii--Kosterlitz--Thouless transition was numerically demonstrated in models with long-range interactions between symbols. In statistical mechanics, it has long been known that long-range interactions can induce phase transitions. Therefore, it has remained unclear whether phase transitions observed in language models originate from genuinely linguistic properties that are absent in conventional spin models. In this study, we construct a random language model with short-range interactions and numerically investigate its statistical properties. Our model belongs to the class of context-sensitive grammars in the Chomsky hierarchy and allows explicit reference to contexts. We find that a phase transition occurs even when the model refers only to contexts whose length remains constant with respect to the sentence length. This result indicates that finite-temperature phase transitions in language models are genuinely induced by the intrinsic nature of language, rather than by long-range interactions.
[62] Dual Optimal: Make Your LLM Peer-like with Dignity
Xiangqi Wang,Yue Huang,Haomin Zhuang,Kehan Guo,Xiangliang Zhang
Main category: cs.CL
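TL;DR: This paper proposes the Dignified Peer framework to address the "Evasive Servant" problem of current aligned language models (sycophantically accepting users' flawed beliefs while deflecting responsibility). It uses anti-sycophancy, trustworthiness, empathy, and creativity to build an AI agent that has dignity and acts as a peer.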
Details
Motivation: Current aligned language models exhibit the dual "Evasive Servant" failure mode: they sycophantically validate flawed user beliefs and, at the same time, deflect responsibility with boilerplate disclaimers. A more respectful, peer-like interaction paradigm is needed. Method: Propose the Dignified Peer framework; build the PersonaKnob dataset with a compositional partial order structure; design a tolerant constrained Lagrangian DPO algorithm that dynamically balances multi-dimensional persona preferences; adopt a psychometrically calibrated evaluation protocol based on Item Response Theory (IRT). Result: Empirical studies show that the approach produces an LLM agent with both dignity and peer-likeness, with clear improvements in anti-sycophancy, trustworthiness, empathy, and creativity, and that the evaluation separates model capability from confounders such as judge bias. Conclusion: The Dignified Peer framework offers a new paradigm for language model alignment in which the model acts as a principled, trustworthy, empathetic peer collaborator rather than a servant that obeys unconditionally or evades responsibility. Abstract: Current aligned language models exhibit a dual failure mode we term the Evasive Servant: they sycophantically validate flawed user beliefs while deflecting responsibility with boilerplate disclaimers. We propose the Dignified Peer framework, which counters servility with anti-sycophancy and trustworthiness, and mitigates evasiveness through empathy and creativity. Realizing this agent requires overcoming significant challenges in data supervision, objective collapse, and evaluation bias. We address these issues by introducing the PersonaKnob dataset, which features a compositional partial order structure of multiple persona preferences. This data is utilized alongside a tolerant constrained Lagrangian DPO algorithm that dynamically balances all persona dimensions to prevent behavioral collapse. Additionally, we employ a psychometrically calibrated Item Response Theory evaluation protocol to disentangle latent model persona capability from confounders like judge biases. Extensive empirical studies demonstrate that our approach successfully builds an LLM agent that is both dignified and peer-like.
[63] Multimodal Analysis of State-Funded News Coverage of the Israel-Hamas War on YouTube Shorts
Daniel Miehling,Sandra Kuebler
Main category: cs.CL
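TL;DR: This paper presents a multimodal analysis pipeline for studying how geopolitical events such as the Israel-Hamas war are covered in YouTube Shorts, combining automatic transcription, aspect-based sentiment analysis (ABSA), and semantic scene classification. Sentiment differs across state-funded outlets while visual scene classifications are consistent with real-world events, and smaller domain-adapted models outperform large transformers and large language models for sentiment analysis.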
Details
Motivation: YouTube Shorts have become an important channel for news consumption, yet research on how geopolitical events are represented in this short-video format remains very limited. Method: Build a multimodal pipeline that combines automatic speech transcription, aspect-based sentiment analysis (ABSA), and semantic scene classification, and apply it to more than 2,300 conflict-related YouTube Shorts and more than 94,000 video frames. Result: Aspect-level sentiment differs across state-funded outlets and changes over time; visual scene classifications are consistent with real-world events; smaller domain-adapted models outperform large transformer models and large language models (LLMs) on sentiment analysis. Conclusion: The pipeline can be transferred to other short-form platforms such as TikTok and Instagram and offers a resource-efficient, interpretable computational approach for humanities and social science research. Abstract: YouTube Shorts have become central to news consumption on the platform, yet research on how geopolitical events are represented in this format remains limited. To address this gap, we present a multimodal pipeline that combines automatic transcription, aspect-based sentiment analysis (ABSA), and semantic scene classification. The pipeline is first assessed for feasibility and then applied to analyze short-form coverage of the Israel-Hamas war by state-funded outlets. Using over 2,300 conflict-related Shorts and more than 94,000 visual frames, we systematically examine war reporting across major international broadcasters. Our findings reveal that the sentiment expressed in transcripts regarding specific aspects differs across outlets and over time, whereas scene-type classifications reflect visual cues consistent with real-world events. Notably, smaller domain-adapted models outperform large transformers and even LLMs for sentiment analysis, underscoring the value of resource-efficient approaches for humanities research. The pipeline serves as a template for other short-form platforms, such as TikTok and Instagram, and demonstrates how multimodal methods, combined with qualitative interpretation, can characterize sentiment patterns and visual cues in algorithmically driven video environments.
[64] Uncertainty-Aware Variational Reward Factorization via Probabilistic Preference Bases for LLM Personalization
Gyuseok Lee,Wonbin Kweon,Zhenrui Yue,SeongKu Kang,Jiawei Han,Dong Wang
Main category: cs.CL
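TL;DR: This paper proposes VRF, an uncertainty-aware reward factorization framework that models user preferences as variational distributions and performs inference and matching in a shared preference space, substantially improving the robustness and generalization of personalized LLMs across scenarios.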
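The step of deriving user weights by Wasserstein distance matching against probabilistic bases can be sketched as follows. This is an illustration under assumptions, not the paper's formulation: user and basis distributions are assumed to be diagonal Gaussians, and turning distances into weights with a softmax over negative distances is a choice made here for the example. `w2_squared`, `basis_weights`, and `temperature` are hypothetical names.

```python
import math


def w2_squared(mean_a, std_a, mean_b, std_b):
    """Squared 2-Wasserstein distance between two diagonal Gaussians."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    std_term = sum((a - b) ** 2 for a, b in zip(std_a, std_b))
    return mean_term + std_term


def basis_weights(user, bases, temperature=1.0):
    """Assumed weighting rule: softmax over negative distances, so closer bases weigh more."""
    user_mean, user_std = user
    logits = [-w2_squared(user_mean, user_std, m, s) / temperature for m, s in bases]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


user = ([0.0, 0.0], [1.0, 1.0])
bases = [([0.0, 0.0], [1.0, 1.0]),   # identical to the user distribution
         ([3.0, 0.0], [1.0, 1.0])]   # shifted mean, same spread
weights = basis_weights(user, bases)
print(weights)
```

For diagonal Gaussians the squared distance separates into a mean term and a standard-deviation term, so two users with the same mean preference but different uncertainty are still told apart.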
Details
Motivation: Existing reward factorization methods estimate deterministic user weights in isolation from scarce data, which leads to inaccurate and unreliable inference. Method: Variational Reward Factorization (VRF) infers each user's preference distribution with a variational encoder, derives weights through Wasserstein distance matching with shared probabilistic bases, and introduces a variance-attenuated loss to downweight uncertain estimates. Result: On three benchmarks, VRF outperforms all baselines across seen and unseen users, few-shot scenarios, and varying uncertainty levels, and also improves downstream alignment. Conclusion: By modeling the uncertainty in user preferences, VRF makes reward factorization more reliable and more general, giving a more robust framework for personalized LLMs. Abstract: Reward factorization personalizes large language models (LLMs) by decomposing rewards into shared basis functions and user-specific weights. Yet, existing methods estimate user weights from scarce data in isolation and as deterministic points, leading to inaccurate and unreliable inference. We introduce Variational Reward Factorization (VRF), an uncertainty-aware framework that represents each user's preferences as a variational distribution in a shared preference space. VRF infers user distributions via a variational encoder, derives weights through Wasserstein distance matching with shared probabilistic bases, and downweights uncertain estimates through a variance-attenuated loss. On three benchmarks, VRF outperforms all baselines across seen and unseen users, few-shot scenarios, and varying uncertainty levels, with gains extending to downstream alignment.
[65] Narrative Fingerprints: Multi-Scale Author Identification via Novelty Curve Dynamics
Fred Zimmerman,Hilmar AI
Main category: cs.CL
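TL;DR: This paper asks whether authors have distinctive "fingerprints" in the information-theoretic novelty curves of their texts. It finds that authorial style can be quantified and identified at both book and chapter level, and that the signal is stable across eras.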
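The chapter-level features rely on SAX (Symbolic Aggregate approXimation) motifs over sliding windows of a novelty curve. The sketch below shows the standard SAX steps (z-normalise, average over equal segments, map to letters) with a four-letter alphabet. The window length, segment count, and alphabet size are illustrative assumptions, not the paper's settings, and `sax_word` and `motif_counts` are hypothetical names.

```python
import statistics
from collections import Counter

# Breakpoints that split a standard normal into four equally likely bins.
BREAKPOINTS = (-0.6745, 0.0, 0.6745)


def sax_word(series, segments, alphabet="abcd"):
    """Convert a numeric series into a short symbolic word."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        std = 1.0  # flat curve: all normalised values become 0
    z = [(x - mean) / std for x in series]
    size = len(z) // segments  # assumes len(series) is a multiple of `segments`
    word = []
    for k in range(segments):
        chunk = z[k * size:(k + 1) * size]
        avg = sum(chunk) / size                       # piecewise aggregate value
        index = sum(avg > b for b in BREAKPOINTS)     # bin the value falls into
        word.append(alphabet[index])
    return "".join(word)


def motif_counts(series, window, segments):
    """Count the SAX words seen in every sliding window of the series."""
    return Counter(sax_word(series[i:i + window], segments)
                   for i in range(len(series) - window + 1))


print(sax_word([1, 2, 3, 4, 5, 6, 7, 8], segments=4))
print(motif_counts([1, 2, 3, 4, 5, 6], window=4, segments=2))
```

A vector of such motif counts per author is the kind of feature a downstream classifier could consume; because each window is normalised separately, the words capture local shape rather than absolute novelty level.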
Details
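Motivation: To investigate whether authors have identifiable individual characteristics ("fingerprints") in the information-theoretic novelty curves of their texts, and so test whether authorial style leaves quantifiable statistical traces. Method: Using two large book corpora, Books3 and PG-19, novelty curves are modeled at book level (scalar dynamics such as mean novelty, speed, volume, and circuitousness) and at chapter level (SAX symbolic motif patterns over sliding windows), and author-identification accuracy is evaluated; genre is controlled as a confound, and classical and modern authors are compared. Result: Book-level scalar features identify 43% of authors significantly above chance; chapter-level SAX motifs achieve attribution 30x above chance, far exceeding the scalar features; the fingerprint is partly confounded with genre but persists within genre for about one-quarter of authors; classical authors (such as Twain, Austen, and Kipling) show fingerprints comparable in strength to modern authors. Conclusion: Authors have a genuine, multi-scale statistical fingerprint in novelty dynamics that is partly independent of genre and spans historical periods, suggesting that authorial style has a deep, quantifiable information-structural basis. Abstract: We test whether authors have characteristic "fingerprints" in the information-theoretic novelty curves of their published works. Working with two corpora -- Books3 (52,796 books, 759 qualifying authors) and PG-19 (28,439 books, 1,821 qualifying authors) -- we find that authorial voice leaves measurable traces in how novelty unfolds across a text. The signal is multi-scale: at book level, scalar dynamics (mean novelty, speed, volume, circuitousness) identify 43% of authors significantly above chance; at chapter level, SAX motif patterns in sliding windows achieve 30x-above-chance attribution, far exceeding the scalar features that dominate at book level. These signals are complementary, not redundant. We show that the fingerprint is partly confounded with genre but persists within-genre for approximately one-quarter of authors. Classical authors (Twain, Austen, Kipling) show fingerprints comparable in strength to modern authors, suggesting the phenomenon is not an artifact of contemporary publishing conventions.
[66] Temporal Dependencies in In-Context Learning: The Role of Induction Heads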
Anooshka Bajaj,Deven Mahesh Mistry,Sahaj Singh Maini,Yash Aggarwal,Billy Dickson,Zoran Tiganj
Main category: cs.CL
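TL;DR: This paper finds that open-source large language models show a serial-recall-like +1 lag bias during in-context learning, and that this bias is closely tied to "induction heads", attention heads specialized for the token that follows a repeated token; ablation experiments confirm the role of these heads in temporal context processing and serial-recall behavior.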
Details
Motivation: To explore how large language models track and retrieve information in context, in particular the mechanism behind the serial behavior they show in free-recall-like tasks. Method: Drawing on the free recall paradigm from cognitive science, the authors analyze the probability-peak patterns of several open-source LLMs after a repeated token in the input, and use systematic ablation experiments to quantify the effect of induction heads on the +1 lag bias and on few-shot serial recall performance. Result: The LLMs consistently show a marked +1 lag bias; removing attention heads with high induction scores clearly reduces both the bias and serial recall performance, whereas removing random heads does not. Conclusion: Induction heads carry out a key temporal-context-processing function in Transformers and are a core mechanism for ordered retrieval and serial-recall-like behavior. Abstract: Large language models (LLMs) exhibit strong in-context learning capabilities, but how they track and retrieve information from context remains underexplored. Drawing on the free recall paradigm in cognitive science (where participants recall list items in any order), we show that several open-source LLMs consistently display a serial-recall-like pattern, assigning peak probability to tokens that immediately follow a repeated token in the input sequence. Through systematic ablation experiments, we show that induction heads, specialized attention heads that attend to the token following a previous occurrence of the current token, play an important role in this phenomenon. Removing heads with a high induction score substantially reduces the +1 lag bias, whereas ablating random heads does not reproduce the same reduction. We also show that removing heads with high induction scores impairs the performance of models prompted to do serial recall using few-shot learning to a larger extent than removing random heads. Our findings highlight a mechanistically specific connection between induction heads and temporal context processing in transformers, suggesting that these heads are especially important for ordered retrieval and serial-recall-like behavior during in-context learning.
[67] CARE: Privacy-Compliant Agentic Reasoning with Evidence Discordance
Haochen Liu,Weien Li,Rui Song,Zeyu Li,Chun Jason Xue,Xiao-Yang Liu,Sam Nallaperuma,Xue Liu,Ye Yuan
Main category: cs.CL
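TL;DR: This paper proposes the CARE framework, in which a remote and a local large language model cooperate on ICU clinical data where symptoms and signs conflict, improving prediction of organ dysfunction worsening while preserving privacy.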
Details
Motivation: Large language models perform poorly in high-stakes decision settings such as healthcare, especially when patient-reported symptoms contradict objective medical signs; such evidence conflicts are common in real ICU settings and call for a robust, privacy-compliant solution. Method: The authors build the MIMIC-DOS dataset (derived from MIMIC-IV and containing only cases with sign-symptom discordance) and propose CARE, a multi-stage, privacy-compliant agentic reasoning framework: a remote LLM generates structured categories and state-transition rules without touching sensitive data, and a local LLM uses this guidance to integrate evidence and make the decision. Result: CARE outperforms single-pass LLMs and several agentic pipeline baselines on all key metrics, handling conflicting evidence more robustly while preserving privacy. Conclusion: Decoupling reasoning into a two-tier architecture of remote guidance and local execution handles conflicting clinical evidence effectively while meeting medical data privacy requirements, offering a new paradigm for deploying LLMs in high-stakes domains. Abstract: Large language model (LLM) systems are increasingly used to support high-stakes decision-making, but they typically perform worse when the available evidence is internally inconsistent. Such a scenario exists in real-world healthcare settings, with patient-reported symptoms contradicting medical signs. To study this problem, we introduce MIMIC-DOS, a dataset for short-horizon organ dysfunction worsening prediction in the intensive care unit (ICU) setting. We derive this dataset from the widely recognized MIMIC-IV, a publicly available electronic health record dataset, and construct it exclusively from cases in which discordance between signs and symptoms exists. This setting poses a substantial challenge for existing LLM-based approaches, with single-pass LLMs and agentic pipelines often struggling to reconcile such conflicting signals. To address this problem, we propose CARE: a multi-stage privacy-compliant agentic reasoning framework in which a remote LLM provides guidance by generating structured categories and transitions without accessing sensitive patient data, while a local LLM uses these categories and transitions to support evidence acquisition and final decision-making. Empirically, CARE achieves stronger performance across all key metrics compared to multiple baseline settings, showing that CARE can more robustly handle conflicting clinical evidence while preserving privacy.
[68] Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers
Atsuyuki Miyai,Mashiro Toyooka,Zaiying Zhao,Kenta Watanabe,Toshihiko Yamasaki,Kiyoharu Aizawa
Main category: cs.CL
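TL;DR: This paper proposes PaperRecon, described as the first systematic framework for quantifying the quality and risks of papers written by AI coding agents, and builds PaperWrite-Bench, a benchmark of 51 papers from top-tier venues; experiments reveal a trade-off between presentation quality and hallucination as model capability improves.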
Details
Motivation: AI-driven paper writing is increasingly common, but rigorous, unified evaluation of its quality and potential risks is lacking, so a reliable evaluation framework is needed. Method: The PaperRecon framework creates an overview.md from an existing paper, has an AI agent rewrite the full paper from that overview and minimal additional resources, and evaluates the result along two orthogonal dimensions: Presentation (scored with a rubric) and Hallucination (agentic evaluation grounded in the original paper). The PaperWrite-Bench benchmark comprises 51 papers from top-tier venues published after 2025. Result: ClaudeCode achieves higher presentation quality but averages more than 10 hallucinations per paper, while Codex produces fewer hallucinations but lower presentation quality, revealing a quality-hallucination trade-off as models advance. Conclusion: This work establishes a first systematic evaluation framework for AI paper writing, giving the research community a basic tool and empirical evidence for understanding its risks and reliability. Abstract: This paper introduces the first systematic evaluation framework for quantifying the quality and risks of papers written by modern coding agents. While AI-driven paper writing has become a growing concern, rigorous evaluation of the quality and potential risks of AI-written papers remains limited, and a unified understanding of their reliability is still lacking. We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal additional resources, and the result is subsequently compared against the original paper. PaperRecon disentangles the evaluation of the AI-written papers into two orthogonal dimensions, Presentation and Hallucination, where Presentation is evaluated using a rubric and Hallucination is assessed via agentic evaluation grounded in the original paper source. For evaluation, we introduce PaperWrite-Bench, a benchmark of 51 papers from top-tier venues across diverse domains published after 2025. Our experiments reveal a clear trade-off: while both ClaudeCode and Codex improve with model advances, ClaudeCode achieves higher presentation quality at the cost of more than 10 hallucinations per paper on average, whereas Codex produces fewer hallucinations but lower presentation quality.
This work takes a first step toward establishing evaluation frameworks for AI-driven paper writing and improving the understanding of its risks within the research community.
[69] Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning
Mohammad R. Abu Ayyash
Main category: cs.CL
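TL;DR: This paper proposes Brainstacks, a modular architecture for continual multi-domain fine-tuning of large language models that packages domain expertise as frozen adapter stacks composed additively on a shared frozen base. It introduces five key techniques (MoE-LoRA, residual boosting, curriculum-dependent training, null-space projection, and an outcome-based meta-router) to enable cross-domain knowledge composition and transfer while avoiding forgetting.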
Details
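Motivation: To address catastrophic forgetting, the difficulty of coupling domains, and the performance degradation caused by adapter stacking in continual multi-domain fine-tuning of large language models, and to improve generalization and composition across tasks from different domains. Method: The Brainstacks architecture has five core components: (1) MoE-LoRA with Shazeer-style noisy top-2 routing, supporting QLoRA 4-bit quantization and rsLoRA scaling; (2) an inner-loop residual boosting mechanism that freezes trained stacks and adds new ones; (3) an outer loop that trains domain-specific stacks in curriculum order; (4) null-space projection via randomized SVD, which keeps new stack directions orthogonal to earlier ones to achieve zero forgetting in isolation; (5) a sigmoid meta-router trained on empirically discovered domain-combination targets, which selectively weights stacks and enables cross-domain composition. Two boundary experiments are also included: PSN pretraining and DPO/GRPO alignment validation. Result: Validated on TinyLlama-1.1B (4 domains, 9 stacks) and Gemma 3 12B IT (5 domains, 10 stacks): MoE-LoRA converges 2.5x faster than a parameter-matched single LoRA; residual boosting breaks through the single-stack performance ceiling; the routed system recovers the generation quality lost to ungated stacking; and the meta-router finds that domain stacks encode transferable cognitive primitives (such as instruction-following clarity and numerical reasoning) rather than domain knowledge, with 97% of medical prompts routed to the chat+math stacks, which never saw medical data. Conclusion: Brainstacks shows that decoupling domain knowledge into composable, transferable cognitive primitives is feasible; its modular, orthogonalized, routing-driven design offers a new paradigm for continual learning and multi-task generalization, reducing forgetting and supporting zero-shot cross-domain transfer. Abstract: We present Brainstacks, a modular architecture for continual multi-domain fine-tuning of large language models that packages domain expertise as frozen adapter stacks composing additively on a shared frozen base at inference. Five interlocking components: (1) MoE-LoRA with Shazeer-style noisy top-2 routing across all seven transformer projections under QLoRA 4-bit quantization with rsLoRA scaling; (2) an inner loop performing residual boosting by freezing trained stacks and adding new ones; (3) an outer loop training sequential domain-specific stacks with curriculum-ordered dependencies; (4) null-space projection via randomized SVD constraining new stacks to subspaces orthogonal to prior directions, achieving zero forgetting in isolation; (5) an outcome-based sigmoid meta-router trained on empirically discovered domain-combination targets that selectively weights stacks, enabling cross-domain composition. Two boundary experiments: (6) PSN pretraining on a randomly initialized model; (7) per-domain RL (DPO/GRPO) validating compatibility with post-SFT alignment.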
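Component (1) names Shazeer-style noisy top-2 routing. As a rough illustration of what that gating step generally looks like, here is a minimal pure-Python sketch. It is not the Brainstacks implementation: the function and parameter names (`noisy_top_k_gate`, `w_gate`, `w_noise`) and the toy weights are assumptions made for this example.

```python
import math
import random


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def noisy_top_k_gate(x, w_gate, w_noise, k=2, rng=None):
    """Return one gate weight per expert; only the top-k are nonzero.

    x:       input features, length d
    w_gate:  d x n_experts matrix (nested lists), hypothetical router weights
    w_noise: d x n_experts matrix controlling per-expert noise scale
    """
    rng = rng or random.Random(0)
    n_experts = len(w_gate[0])
    logits = []
    for e in range(n_experts):
        clean = sum(x[i] * w_gate[i][e] for i in range(len(x)))
        scale = softplus(sum(x[i] * w_noise[i][e] for i in range(len(x))))
        # Noisy logit: clean score plus Gaussian noise with a learned scale.
        logits.append(clean + rng.gauss(0.0, 1.0) * scale)

    # Keep the k largest noisy logits, then softmax over those only.
    top = sorted(range(n_experts), key=lambda e: logits[e], reverse=True)[:k]
    peak = max(logits[e] for e in top)
    exps = {e: math.exp(logits[e] - peak) for e in top}
    total = sum(exps.values())
    return [exps[e] / total if e in exps else 0.0 for e in range(n_experts)]


# Toy example: 3 input features, 4 experts (all values are made up).
x = [0.5, -1.0, 2.0]
w_gate = [[0.1, 0.4, -0.3, 0.2], [0.0, -0.2, 0.5, 0.1], [0.3, 0.1, 0.2, -0.4]]
w_noise = [[0.0] * 4 for _ in range(3)]
gates = noisy_top_k_gate(x, w_gate, w_noise)
```

Only two experts receive nonzero weight per input, which is what keeps the per-token compute of a mixture-of-experts adapter close to that of a couple of single adapters.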
Validated on TinyLlama-1.1B (4 domains, 9 stacks) and Gemma 3 12B IT (5 domains, 10 stacks), MoE-LoRA achieves 2.5x faster convergence than parameter-matched single LoRA, residual boosting breaks through the single-stack ceiling, and the routed system recovers generation quality destroyed by ungated stack accumulation. The central finding: the outcome-based router discovers that domain stacks encode transferable cognitive primitives (instruction-following clarity, numerical reasoning, procedural logic, chain-of-thought structure) rather than domain-specific knowledge, with medical prompts routing to chat+math stacks in 97% of cases despite zero medical data in those stacks.
[70] S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Jack Young
Main category: cs.CL
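TL;DR: This paper proposes S0 tuning, which optimizes only a single initial state matrix per recurrent layer (with all model weights frozen), outperforms LoRA on benchmarks such as HumanEval, and adds zero inference overhead. The method transfers well across domains on hybrid-architecture language models, while a prefix-tuning control on a pure Transformer degrades performance; the results are consistent with a trajectory-steering mechanism.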
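The core idea, training only an initial recurrent state while every weight stays frozen, can be illustrated on a toy problem. The sketch below is not the paper's code: it uses a tiny diagonal linear recurrence with hand-derived gradients, and all names (`run_recurrence`, `tune_initial_state`) and numbers are assumptions chosen for illustration. A real hybrid language model would use full state matrices and automatic differentiation.

```python
def run_recurrence(h0, xs, a, b, c):
    """Frozen diagonal linear recurrence: h_t = a * h_{t-1} + b * x_t."""
    h = list(h0)
    for x in xs:
        h = [a[i] * h[i] + b[i] * x for i in range(len(h))]
    # Scalar readout with frozen weights c.
    return sum(c[i] * h[i] for i in range(len(h)))


def tune_initial_state(xs, target, a, b, c, steps=200, lr=0.1):
    """Gradient descent on the initial state only; a, b, c are never updated."""
    h0 = [0.0] * len(a)
    n_steps = len(xs)
    for _ in range(steps):
        err = run_recurrence(h0, xs, a, b, c) - target
        # For this linear recurrence, d(output)/d(h0[i]) = c[i] * a[i] ** n_steps.
        grad = [2.0 * err * c[i] * a[i] ** n_steps for i in range(len(a))]
        h0 = [h0[i] - lr * grad[i] for i in range(len(a))]
    return h0


# Toy setup (made-up values): two state channels, three input steps.
a, b, c = [0.9, 0.5], [1.0, 1.0], [1.0, -1.0]
xs, target = [1.0, 0.0, 1.0], 2.0
tuned_h0 = tune_initial_state(xs, target, a, b, c)
```

Because the learned object is just a starting state, switching tasks amounts to loading a different `tuned_h0`; the frozen parameters are shared by every task.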
Details
Motivation: To find a parameter-efficient fine-tuning (PEFT) method that has zero inference overhead, is efficient, and suits hybrid architectures (such as Mamba/Transformer) when supervised data is scarce. Method: S0 tuning learns only one initial state matrix per recurrent layer and freezes all other model weights; it adds no inference latency and supports fast task switching and state reuse. Result: On HumanEval, +10.8 pp over LoRA; +23.6 pp on Qwen3.5-4B; on FalconH1-7B, statistically indistinguishable from LoRA but without weight merging; significant cross-domain transfer on MATH-500 and GSM8K; a prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance in every configuration tested; a per-step variant reaches +27.1 pp but incurs inference cost. Conclusion: Recurrent state initialization is an efficient PEFT approach for hybrid language models in low-resource settings, combining strong performance, zero overhead, and easy deployment. Abstract: Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/- 1.7 pp (10 seeds). On FalconH1-7B (Mamba-2 hybrid), S0 reaches 71.8% +/- 1.3 and LoRA reaches 71.4% +/- 2.4 (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance by -13.9 pp under all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, above both S0 and LoRA but with per-step inference cost. Taken together, the results show that recurrent state initialization is a strong zero-inference-overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0-tuning.
[71] Embarrassingly Simple Self-Distillation Improves Code Generation
Ruixiang Zhang,Richard He Bai,Huangjie Zheng,Navdeep Jaitly,Ronan Collobert,Yizhe Zhang
Main category: cs.CL
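TL;DR: This paper proposes simple self-distillation (SSD), which uses only a large language model's own raw outputs, with no verifier, teacher model, or reinforcement learning, and substantially improves code generation.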
Details
Motivation: To explore whether a model can improve its code generation using only its own outputs, without external supervision signals such as a verifier, a teacher model, or reinforcement learning. Method: Simple self-distillation (SSD) samples solutions from the model under specific temperature and truncation settings, then fine-tunes the model on those samples with standard supervised fine-tuning. Result: SSD raises Qwen3-30B-Instruct pass@1 on LiveCodeBench v6 from 42.4% to 55.3%, with gains concentrated on harder problems, and generalizes across Qwen and Llama models at several scales (4B/8B/30B) and across instruct and thinking variants. Conclusion: SSD exposes a precision-exploration conflict in LLM decoding and reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters and preserving diversity where exploration matters, which offers an effective post-training path for LLM code generation. Abstract: Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.
[72] ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget
Nandan Thakur,Zijian Chen,Xueguang Ma,Jimmy Lin
Main category: cs.CL
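TL;DR: This paper presents ORBIT, a synthetic training dataset of 20K reasoning-intensive queries generated with a frugal framework that needs no paid APIs. It covers 15 domains, each question requires 4-5 reasoning steps, and answers are verified both by the model itself and against the external web. A Qwen3-4B model fine-tuned on the dataset shows strong search-agent performance on Wikipedia question answering among comparable small models.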
Details
Motivation: Building high-quality training datasets for deep research tasks (multi-step retrieval and reasoning) is expensive, relying on human annotation or cumbersome prerequisites, so a low-cost, scalable synthetic data generation method is needed. Method: A four-stage frugal synthesis framework: seed creation, question-answer pair generation, self-verification, and external verification by search over the full web. It produces the ORBIT dataset (20K samples, 15 domains, 4-5 reasoning steps, short verifiable answers), on which Qwen3-4B is fine-tuned with GRPO. Result: ORBIT-4B achieves strong performance as a search agent among sub-4B language models on Wikipedia question answering, supporting the usefulness of the synthetic data. Conclusion: A modular synthesis framework that needs no paid APIs can efficiently generate high-quality reasoning-oriented search training data; the ORBIT dataset and training recipe offer a practical, open-source approach for small search-agent models. Abstract: Search agents, which integrate language models (LMs) with web search, are becoming crucial for answering complex user queries. Constructing training datasets for deep research tasks, involving multi-step retrieval and reasoning, remains challenging due to expensive human annotation, or cumbersome prerequisites. In this work, we introduce ORBIT, a training dataset with 20K reasoning-intensive queries with short verifiable answers, generated using a frugal framework without relying on paid API services. The modular framework relies on four stages: seed creation, question-answer pair generation, and two stages of verification: self and external. ORBIT spans 15 domains and each training pair requires 4-5 reasoning steps, with external search verification required from the complete web. We train Qwen3-4B as the base model on ORBIT using GRPO and evaluate it on Wikipedia question answering tasks. Extensive experiment results demonstrate that ORBIT-4B achieves strong performance among sub-4B LLMs as search agents, proving the utility of synthetic datasets. Our framework, code and datasets are open-sourced and available publicly.
[73] LLM REgression with a Latent Iterative State Head
Yiheng Su,Matthew Lease
Main category: cs.CL
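TL;DR: RELISH is a lightweight text regression architecture that iteratively refines a latent state over frozen large language model (LLM) representations and outputs a scalar prediction with a linear regressor; it outperforms existing methods across multiple datasets and model settings with very few trainable parameters (only 3.4-3.7M).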
Details
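Motivation: Existing LLM text regression methods suffer from low efficiency, insufficient accuracy, or high parameter overhead: text decoding is error-prone, aggregating multiple outputs is unstable, and predictive-head designs are limited, so an efficient, accurate, and parameter-economical approach is needed. Method: RELISH keeps the LLM backbone frozen and introduces a learnable latent state that is refined iteratively, repeatedly fusing token-level representations through cross-attention; the final latent state is mapped by a linear layer to a scalar prediction, with no decoding or ensembling. Result: Across 5 datasets, 4 LLM backbones, and 2 training regimes, RELISH outperforms all three major families of baselines (autoregressive decoding, regression-aware inference, and existing predictive heads) while adding only 3.4-3.7M trainable parameters (0.01-0.04% of the LLM's parameters), far fewer than LoRA-based alternatives (0.26-0.42%). Conclusion: RELISH demonstrates the effectiveness of lightweight iterative latent-state modeling for LLM text regression, combining high accuracy with very low parameter overhead in a way that suits regression deployments in resource-constrained settings. Abstract: We present RELISH (REgression with a Latent Iterative State Head), a novel, lightweight architecture designed for text regression with large language models. Rather than decoding numeric targets as text or aggregating multiple generated outputs, RELISH predicts scalar values directly from frozen LLM representations by iteratively refining a learned latent state through cross-attention over token-level representations, and then mapping the final state to a point estimate with a linear regressor. Across five datasets, four LLM backbones, and two LLM training regimes, RELISH consistently outperforms prior baselines from all three major LLM regression families, including autoregressive decoding, regression-aware inference, and existing predictive head methods. Despite these gains, RELISH remains highly parameter-efficient, requiring only 3.4-3.7M trainable parameters across frozen LLM backbones (only 0.01-0.04% additional overhead), far less than LoRA-based alternatives that grow with model size (0.26-0.42%).
[74] YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution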
Muyu He,Adit Jain,Anand Kumar,Vincent Tu,Soumyadeep Bakshi,Sachin Patro,Nazneen Rajani
Main category: cs.CL
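TL;DR: This paper introduces the YC-Bench benchmark for evaluating the strategic coherence of LLM agents over long horizons under uncertainty (planning, learning from delayed feedback, and adapting to mistakes); experiments show that current frontier models still have notable capability gaps, such as over-parallelization and failure to detect adversarial clients.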
Details
Motivation: As LLM agents take on more complex tasks, their ability to maintain strategic coherence over long horizons (planning, learning from delayed feedback, adapting to the compounding effects of early mistakes) becomes a key question, but an effective evaluation benchmark has been lacking. Method: YC-Bench simulates running a startup for one year (hundreds of turns); the agent must manage employees, select task contracts, and stay profitable in a partially observable environment with adversarial clients and growing payroll pressure. Twelve open-source and proprietary models are evaluated (3 random seeds each), and success and failure modes and their drivers (such as scratchpad usage and adversarial client detection) are analyzed. Result: Only three models consistently exceed the $200K starting capital; Claude Opus 4.6 has the highest average final funds ($1.27M), and GLM-5 reaches $1.21M at 11x lower inference cost; scratchpad usage is the strongest predictor of success; 47% of bankruptcies stem from failing to detect adversarial clients; frontier models still show distinct failure modes such as over-parallelization. Conclusion: YC-Bench exposes key capability gaps of current LLM agents in long-horizon strategic decision-making and highlights the need for better state persistence and awareness of adversarial environments; the benchmark is open-source, reproducible, and configurable, giving later research a standardized evaluation tool. Abstract: As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. We introduce YC-Bench, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and growing payroll create compounding consequences for poor decisions. We evaluate 12 models, both proprietary and open source, across 3 seeds each. Only three models consistently surpass the starting capital of $200K, with Claude Opus 4.6 achieving the highest average final funds at $1.27M, followed by GLM-5 at $1.21M at 11x lower inference cost. Scratchpad usage, the sole mechanism for persisting information across context truncation, is the strongest predictor of success, and adversarial client detection is the primary failure mode, accounting for 47% of bankruptcies. Our analysis reveals that frontier models still fail through distinct failure modes such as over-parallelization, demonstrating the capability gaps for long-horizon performance. YC-Bench is open-source, reproducible, and configurable.
[75] Universal YOCO for Efficient Depth Scaling
Yutao Sun,Li Dong,Tianzhu Ye,Shaohan Huang,Jianyong Wang,Furu Wei
Main category: cs.CL
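TL;DR: This paper proposes YOCO-U, which combines the YOCO decoder-decoder architecture with recursive computation to improve the reasoning and agentic capabilities of LLMs while keeping inference efficient.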
Details
Motivation: Standard Transformers scale test-time compute inefficiently, with high computational overhead and a KV cache that grows with depth. Method: Universal YOCO (YOCO-U) builds a Universal Self-Decoder on the YOCO framework that performs multiple iterations via parameter sharing and confines the iteration to shallow, efficient-attention layers; it combines a constant global KV cache, linear pre-filling, and partial recursion to increase representational depth. Result: YOCO-U remains highly competitive on general and long-context benchmarks and improves token utility and scaling behavior, supporting the combination of efficient attention with recursive computation. Conclusion: By combining the YOCO architecture with recursive computation, YOCO-U achieves a better capability-efficiency trade-off and points to a new direction for scalable LLMs. Abstract: The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies suffer from high computational overhead and a KV cache that inflates alongside model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a synergistic effect greater than either alone. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations via parameter sharing, while confining the iterative process to shallow, efficient-attention layers. This combination yields a favorable capability-efficiency tradeoff that neither YOCO nor recursion achieves independently. The YOCO architecture provides a constant global KV cache and linear pre-filling, while partial recursion enhances representational depth with limited overhead. Together, YOCO-U improves token utility and scaling behavior while maintaining efficient inference. Empirical results confirm that YOCO-U remains highly competitive in general and long-context benchmarks, demonstrating that the integration of efficient-attention architectures and recursive computation is a promising direction for scalable LLMs.
cs.CV [Back]
[76] Hierarchical Pre-Training of Vision Encoders with Large Language Models
Eugene Lee,Ting-Yu Chang,Jui-Huang Tsai,Jiajie Diao,Chen-Yi Lee
Main category: cs.CV
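TL;DR: HIVE is a new hierarchical pre-training framework for vision encoders. It introduces hierarchical cross-attention between the vision encoder and the large language model to fuse structured features across multiple layers and improve vision-language alignment, uses a three-stage training strategy to optimize the alignment, and achieves leading performance on several vision-language tasks.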
Details
Motivation: Existing methods treat the vision encoder and the large language model as independent modules, which makes it hard to integrate hierarchical visual features and limits vision-language alignment. Method: The HIVE framework introduces hierarchical cross-attention for multi-layer feature fusion between the vision encoder and the LLM, together with a three-stage progressive training strategy that stabilizes optimization of the alignment. Result: HIVE outperforms self-attention-based methods on vision-language benchmarks such as MME, GQA, OK-VQA, and ScienceQA as well as on image classification. Conclusion: Hierarchical feature fusion can markedly improve the expressiveness and efficiency of vision-language models, opening a path toward stronger multimodal models. Abstract: The field of computer vision has experienced significant advancements through scalable vision encoders and multimodal pre-training frameworks. However, existing approaches often treat vision encoders and large language models (LLMs) as independent modules, limiting the integration of hierarchical visual features. In this work, we propose HIVE (Hierarchical Pre-Training of Vision Encoders), a novel framework that enhances vision-language alignment by introducing hierarchical cross-attention between the vision encoder and LLM. Unlike conventional methods that flatten image embeddings, HIVE enables structured feature fusion across multiple layers, improving gradient flow and representation learning. To optimize this interaction, we introduce a three-stage training strategy that progressively aligns the vision encoder with the LLM, ensuring stable optimization and effective multimodal fusion. Empirical evaluations demonstrate that HIVE achieves superior performance not only in image classification but also on various vision-language tasks, outperforming self-attention-based methods in benchmarks such as MME, GQA, OK-VQA, and ScienceQA. Our results highlight the benefits of hierarchical feature integration, paving the way for more efficient and expressive vision-language models.
[77] RawGen: Learning Camera Raw Image Generation
Dongyoung Kim,Junyong Lee,Abhijith Punnappurath,Mahmoud Afifi,Sangmin Han,Alex Levinshtein,Michael S. Brown
Main category: cs.CV
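TL;DR: This paper presents RawGen, described as the first diffusion-based framework to support both text-to-raw image generation and sRGB-to-raw inversion. By leveraging sRGB diffusion priors and building a many-to-one inverse-ISP dataset, it generates physically consistent linear images for arbitrary target cameras, and it is shown to be effective for data augmentation in low-level vision tasks.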
Details
Motivation: Existing raw image datasets are small and tied to specific hardware, while conventional diffusion models generate only sRGB images and cannot model physically consistent linear representations (such as raw or XYZ), which holds back low-level vision tasks. Method: The RawGen framework (1) builds a many-to-one inverse-ISP dataset that maps multiple sRGB images rendered with different ISP parameters to the same scene-referred target; (2) fine-tunes a conditional denoiser and a specialized decoder on this dataset to invert rendering from sRGB to camera-specific raw/XYZ; (3) supports text-driven raw image generation. Result: RawGen outperforms traditional methods that assume a fixed ISP on sRGB-to-raw inversion, and its synthetic raw data improves performance when used to train downstream low-level vision tasks. Conclusion: RawGen is presented as the first diffusion-based approach to text-to-raw generation and sRGB-to-raw inversion for arbitrary cameras, providing a scalable, physically consistent source of synthetic data for low-level vision. Abstract: Cameras capture scene-referred linear raw images, which are processed by onboard image signal processors (ISPs) into display-referred 8-bit sRGB outputs. Although raw data is more faithful for low-level vision tasks, collecting large-scale raw datasets remains a major bottleneck, as existing datasets are limited and tied to specific camera hardware. Generative models offer a promising way to address this scarcity -- however, existing diffusion frameworks are designed to synthesize photo-finished sRGB images rather than physically consistent linear representations. This paper presents RawGen, to our knowledge the first diffusion-based framework enabling text-to-raw generation for arbitrary target cameras, alongside sRGB-to-raw inversion. RawGen leverages the generative priors of large-scale sRGB diffusion models to synthesize physically meaningful linear outputs, such as CIE XYZ or camera-specific raw representations, via specialized processing in latent and pixel spaces. To handle unknown and diverse ISP pipelines and photo-finishing effects in diffusion-model training data, we build a many-to-one inverse-ISP dataset where multiple sRGB renditions of the same scene generated using diverse ISP parameters are anchored to a common scene-referred target. Fine-tuning a conditional denoiser and specialized decoder on this dataset allows RawGen to obtain camera-centric linear reconstructions that effectively invert the rendering pipeline. We demonstrate RawGen's superior performance over traditional inverse-ISP methods that assume a fixed ISP.
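To make the "many-to-one" idea concrete, the sketch below renders one linear scene color through a toy ISP under several settings and pairs every rendition with the same scene-referred target. This is an illustration only, not RawGen's data pipeline: `toy_isp` (white-balance gains, an exposure multiplier, and the standard sRGB transfer curve) is a deliberately simplified assumption, and real ISPs also apply demosaicing, tone mapping, and other photo-finishing steps.

```python
def srgb_encode(v: float) -> float:
    """Standard sRGB transfer curve: linear [0, 1] -> encoded [0, 1]."""
    return 12.92 * v if v <= 0.0031308 else 1.055 * v ** (1 / 2.4) - 0.055


def srgb_decode(s: float) -> float:
    """Inverse of srgb_encode: encoded [0, 1] -> linear [0, 1]."""
    return s / 12.92 if s <= 0.04045 else ((s + 0.055) / 1.055) ** 2.4


def toy_isp(linear_rgb, wb_gains, exposure):
    """Hypothetical ISP: per-channel gain, exposure, clip, sRGB curve, 8-bit."""
    out = []
    for value, gain in zip(linear_rgb, wb_gains):
        scaled = min(max(value * gain * exposure, 0.0), 1.0)
        out.append(round(srgb_encode(scaled) * 255))
    return tuple(out)


def build_many_to_one_pairs(scene_linear, isp_settings):
    """Pair each sRGB rendition with the same scene-referred linear target."""
    return [(toy_isp(scene_linear, wb, ev), scene_linear)
            for wb, ev in isp_settings]


# One mid-gray scene color rendered under three made-up ISP settings.
scene = (0.18, 0.18, 0.18)
settings = [((1.0, 1.0, 1.0), 1.0), ((1.2, 1.0, 0.8), 1.0), ((1.0, 1.0, 1.0), 2.0)]
pairs = build_many_to_one_pairs(scene, settings)
```

A model trained on such pairs sees different renditions of a scene mapped to one target, so it cannot rely on a single fixed inverse curve such as `srgb_decode`.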
Furthermore, we show that augmenting training pipelines with RawGen's scalable, text-driven synthetic data can benefit downstream low-level vision tasks.
[78] Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
Longwei Xu,Feng Feng,Shaojie Zhang,Xin Chen,Hang Li,Anan Du,Hailong Yu,Pei Fu,Zhenbo Luo,Jian Luan
Main category: cs.CV
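TL;DR: This paper proposes the Q-Mask framework, which achieves fine-grained text-region grounding through a causal query-driven mask decoder (CQMD) and, trained on the TextAnchor-26M dataset, substantially improves the text anchoring ability of OCR models on the TextAnchor-Bench (TABench) benchmark.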
Details
Motivation: Existing vision-language models (VLMs) have OCR capability but lack reliable spatial text anchoring in practical use (that is, accurately locating the image region where queried text appears), which calls for systematic evaluation and improvement. Method: The Q-Mask framework is built on a causal query-driven mask decoder (CQMD) and uses a chain-of-thought-like (visual CoT) mechanism: it first sequentially generates query-conditioned visual masks to locate the text, then recognizes the text content. A large-scale dataset with fine-grained text mask annotations, TextAnchor-26M, is constructed for training. Result: On the newly proposed TextAnchor-Bench (TABench), Q-Mask substantially improves text anchoring accuracy and stability over general-purpose and OCR-specific VLMs, and shows stronger text understanding and localization across diverse real-world scenes. Conclusion: Decoupling localization ("where") from recognition ("what"), together with causal visual decoding and training with strong spatial priors, is an effective way to improve text anchoring in VLMs; Q-Mask offers a new approach and a practical tool for reliable OCR and vision-language understanding. Abstract: Optical Character Recognition (OCR) is increasingly regarded as a foundational capability for modern vision-language models (VLMs), enabling them not only to read text in images but also to support downstream reasoning in real-world visual question answering (VQA). However, practical applications further require reliable text anchors, i.e., accurately grounding queried text to its corresponding spatial region. To systematically evaluate this capability, we introduce TextAnchor-Bench (TABench), a benchmark for fine-grained text-region grounding, which reveals that both general-purpose and OCR-specific VLMs still struggle to establish accurate and stable text anchors. To address this limitation, we propose Q-Mask, a precise OCR framework built upon a causal query-driven mask decoder (CQMD). Inspired by chain-of-thought reasoning, Q-Mask performs causal visual decoding that sequentially generates query-conditioned visual masks before producing the final OCR output. This visual CoT paradigm disentangles where the text is from what the text is, enforcing grounded evidence acquisition prior to recognition and enabling explicit text anchor construction during inference. To train CQMD, we construct TextAnchor-26M, a large-scale dataset of image-text pairs annotated with fine-grained masks corresponding to specific textual elements, encouraging stable text-region correspondences and injecting strong spatial priors into VLM training.
Extensive experiments demonstrate that Q-Mask substantially improves text anchoring and understanding across diverse visual scenes.
[79] Suppressing Non-Semantic Noise in Masked Image Modeling Representations
Martine Hjelkrem-Tan,Marius Aasan,Rwiddhi Chakraborty,Gabriel Y. Arteaga,Changkyu Choi,Adín Ramírez Rivera
Main category: cs.CV
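TL;DR: This paper proposes SOAP, which scores semantic invariance using principal component analysis and suppresses non-semantic information in masked image modeling (MIM) representations, improving zero-shot performance without any additional training.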
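Suppressing a set of unwanted directions in a feature vector is, in its simplest form, an orthogonal projection. The sketch below shows that generic operation in pure Python. It is not the SOAP implementation: how the non-semantic directions are obtained (the paper uses a PCA-based score) is outside this sketch, and the names `orthonormalize` and `project_out` and the toy vectors are assumptions for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions, eps=1e-12):
    """Gram-Schmidt: turn the given directions into an orthonormal basis."""
    basis = []
    for d in directions:
        r = list(d)
        for q in basis:
            coef = dot(r, q)
            r = [ri - coef * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > eps:  # skip directions already spanned by the basis
            basis.append([ri / norm for ri in r])
    return basis


def project_out(x, basis):
    """Remove from x its component in the subspace spanned by the basis."""
    out = list(x)
    for q in basis:
        coef = dot(out, q)
        out = [oi - coef * qi for oi, qi in zip(out, q)]
    return out


# Toy patch feature and two assumed "non-semantic" directions.
directions = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
feature = [3.0, 2.0, 5.0, -1.0]
cleaned = project_out(feature, orthonormalize(directions))
```

Because the projection is a fixed linear map (the identity minus the projector onto the basis), it can in principle be folded into a single linear layer applied after a frozen encoder.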
Details
Motivation: Masked image modeling (MIM) is widely used, but its objective causes the learned representations to retain non-semantic information, which hurts inference performance. Method: A PCA-based semantic invariance score is proposed, along with SOAP, a model-agnostic post-hoc method that directly suppresses non-semantic information in patch representations through orthogonal projection. Result: SOAP consistently improves zero-shot performance across various MIM models, requires no training, and is plug-and-play. Conclusion: Non-semantic information harms MIM representations, and SOAP offers a simple, general way to suppress it, improving the quality of semantic representations and downstream generalization. Abstract: Masked Image Modeling (MIM) has become a ubiquitous self-supervised vision paradigm. In this work, we show that MIM objectives cause the learned representations to retain non-semantic information, which ultimately hurts performance during inference. We introduce a model-agnostic score for semantic invariance using Principal Component Analysis (PCA) on real and synthetic non-semantic images. Based on this score, we propose a simple method, Semantically Orthogonal Artifact Projection (SOAP), to directly suppress non-semantic information in patch representations, leading to consistent improvements in zero-shot performance across various MIM-based models. SOAP is a post-hoc suppression method, requires zero training, and can be attached to any model as a single linear head.
[80] Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
Xinpeng Li,Bolin Lai,Hardy Chen,Shijian Deng,Cihang Xie,Yuyin Zhou,James Matthew Rehg,Yapeng Tian
Main category: cs.CV
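TL;DR: This paper proposes the Omni-MMSI task, which requires comprehensive understanding of social interactions from raw audio, vision, and speech, and designs Omni-MMSI-R, a reference-guided pipeline that improves identity attribution and social reasoning.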
Details
Motivation: Existing AI assistants struggle to perceive and reason about social interactions accurately from raw multimodal data in real-world settings, and in particular lack reliable identity attribution; earlier work mostly relied on preprocessed social cues, which does not match practical needs. Method: The Omni-MMSI-R reference-guided pipeline uses tools to produce identity-attributed social cues and performs social reasoning through chain-of-thought; participant-level reference pairs are constructed and reasoning annotations are curated. Result: Omni-MMSI-R outperforms advanced multimodal large models and other baselines on the Omni-MMSI task. Conclusion: Omni-MMSI-R improves identity attribution and social interaction understanding from raw multimodal data, offering a path toward more capable AI assistants. Abstract: We introduce Omni-MMSI, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech input. The task involves perceiving identity-attributed social cues (e.g., who is saying what) and reasoning about the social interaction (e.g., whom the speaker refers to). This task is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that operate on oracle-preprocessed social cues, Omni-MMSI reflects realistic scenarios where AI assistants must perceive and reason from raw data. However, existing pipelines and multi-modal LLMs perform poorly on Omni-MMSI because they lack reliable identity attribution capabilities, which leads to inaccurate social interaction understanding. To address this challenge, we propose Omni-MMSI-R, a reference-guided pipeline that produces identity-attributed social cues with tools and conducts chain-of-thought social reasoning. To facilitate this pipeline, we construct participant-level reference pairs and curate reasoning annotations on top of the existing datasets. Experiments demonstrate that Omni-MMSI-R outperforms advanced LLMs and counterparts on Omni-MMSI. Project page: https://sampson-lee.github.io/omni-mmsi-project-page.
[81] OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning
Taiting Lu,Kaiyuan Lin,Yuxin Tian,Yubo Wang,Muchuan Wang,Sharique Khatri,Akshit Kartik,Yixi Wang,Amey Santosh Rane,Yida Wang,Yifan Yang,Yi-Chao Chen,Yincheng Jin,Mahanth Gowda
Main category: cs.CV
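TL;DR: This paper introduces the OmniSch benchmark for evaluating large multimodal models (LMMs) on printed circuit board (PCB) schematic understanding and spatially weighted netlist graph construction, revealing notable weaknesses of current LMMs in fine-grained grounding, layout-to-graph parsing, global connectivity reasoning, and visual exploration.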
Details
Motivation: LMMs have advanced rapidly in visual grounding, document understanding, and diagram reasoning, but converting PCB schematics into machine-readable, spatially weighted netlist graphs that jointly capture component attributes, connectivity, and geometry remains little studied, even though such graphs are central to electronic design automation (EDA) workflows. Method: OmniSch is presented as the first comprehensive benchmark for schematic understanding and spatial netlist graph construction, with 1,854 real-world schematics and four tasks: (1) visual grounding of schematic entities; (2) diagram-to-graph understanding of topological relationships; (3) geometry-aware construction of connection weights; (4) tool-augmented agentic visual search. Result: Current LMMs show clear weaknesses in understanding schematic engineering images: unreliable fine-grained grounding, brittle layout-to-graph parsing, inconsistent global connectivity reasoning, and inefficient visual exploration. Conclusion: OmniSch fills a gap in evaluating LMMs on schematic understanding for hardware design, and provides a clear direction and evaluation standard for improving them in professional settings such as EDA. Abstract: Recent large multimodal models (LMMs) have made rapid progress in visual grounding, document understanding, and diagram reasoning tasks. However, their ability to convert Printed Circuit Board (PCB) schematic diagrams into machine-readable spatially weighted netlist graphs, jointly capturing component attributes, connectivity, and geometry, remains largely underexplored, even though such graph representations are the backbone of practical electronic design automation (EDA) workflows. To bridge this gap, we introduce OmniSch, the first comprehensive benchmark designed to assess LMMs on schematic understanding and spatial netlist graph construction. OmniSch contains 1,854 real-world schematic diagrams and includes four tasks: (1) visual grounding for schematic entities, with 109.9K grounded instances aligning 423.4K diagram semantic labels to their visual regions; (2) diagram-to-graph reasoning, understanding topological relationships among diagram elements; (3) geometric reasoning, constructing layout-dependent weights for each connection; and (4) tool-augmented agentic reasoning for visual search, invoking external tools to accomplish (1)-(3). Our results reveal substantial gaps of current LMMs in interpreting schematic engineering artifacts, including unreliable fine-grained grounding, brittle layout-to-graph parsing, inconsistent global connectivity reasoning and inefficient visual exploration.
[82] Excite, Attend and Segment (EASe): Domain-Agnostic Fine-Grained Mask Discovery with Feature Calibration and Self-Supervised Upsampling
Deepank Singh,Anurag Nihal,Vedhus Hoskere
Main category: cs.CV
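TL;DR: This paper proposes the EASe framework, which uses its SAUCE and CAFE modules to perform unsupervised, domain-agnostic fine-grained semantic segmentation on pixel-level features, significantly improving salient object discovery in scenes with complex morphologies.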
Details
Motivation: Existing unsupervised segmentation methods rely on coarse patch-level representations, which struggle to preserve critical fine-grained structural detail in scenes with complex, multi-component morphologies.
Method: The authors propose the EASe framework. Its Semantic-Aware Upsampling with Channel Excitation (SAUCE) module excites low-resolution foundation model (FM) feature channels and attends across spatially encoded image and FM features; a training-free Cue-Attentive Feature Aggregator (CAFE) then uses SAUCE attention scores to generate multi-granularity masks.
Result: EASe significantly outperforms prior state-of-the-art (SOTA) methods on several standard benchmarks and on datasets with complex morphologies.
Conclusion: By operating on pixel-level features, EASe achieves more accurate fine-grained dense semantic mask discovery and is domain-agnostic with strong generalization.
Abstract: Unsupervised segmentation approaches have increasingly leveraged foundation models (FM) to improve salient object discovery. However, these methods often falter in scenes with complex, multi-component morphologies, where fine-grained structural detail is indispensable. Many state-of-the-art unsupervised segmentation pipelines rely on mask discovery approaches that utilize coarse, patch-level representations. These coarse representations inherently suppress the fine-grained detail required to resolve such complex morphologies. To overcome this limitation, we propose Excite, Attend and Segment (EASe), an unsupervised domain-agnostic semantic segmentation framework for easy fine-grained mask discovery across challenging real-world scenes. EASe utilizes novel Semantic-Aware Upsampling with Channel Excitation (SAUCE) to excite low-resolution FM feature channels for selective calibration and attends across spatially-encoded image and FM features to recover full-resolution semantic representations. Finally, EASe segments the aggregated features into multi-granularity masks using a novel training-free Cue-Attentive Feature Aggregator (CAFE), which leverages SAUCE attention scores as a semantic grouping signal. EASe, together with SAUCE and CAFE, operates directly on pixel-level feature representations to enable accurate fine-grained dense semantic mask discovery. Our evaluation demonstrates superior performance of EASe over previous state-of-the-art (SOTA) methods across major standard benchmarks and diverse datasets with complex morphologies.
Code is available at https://ease-project.github.io
[83] The Geometry of Compromise: Unlocking Generative Capabilities via Controllable Modality Alignment
Hongyuan Liu,Qinli Yang,Wen Li,Zhong Zhang,Jiaming Liu,Wei Han,Zhili Qin,Jinxia Guo,Junming Shao
Main category: cs.CV
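TL;DR: This paper proposes the TPC-CMA framework, which uses a three-phase curriculum to explicitly narrow the modality gap in vision-language models (both the centroid gap and the distribution gap), significantly improving cross-modal alignment and yielding clear gains across several tasks.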
Details
Motivation: Existing methods only mitigate the global centroid offset of the modality gap and fail to resolve the underlying distributional mismatch; the authors find that the distribution gap is the key factor affecting cross-modal task quality and therefore needs to be modeled and optimized directly.
Method: The modality gap is decomposed into a centroid gap and a distribution gap. The proposed TPC-CMA fine-tuning framework consists of a cross-modal alignment (CMA) module that jointly corrects centroid offsets and reshapes distributional structure, and a three-phase curriculum with gradient-aware scheduling.
Result: With α_target = 0.05, the modality gap is reduced by 66.6% with only a 4.84% drop in accuracy; with α_target = 0.5, the gap is reduced by 82.3%, clustering ARI rises from 0.318 to 0.516, and captioning CIDEr improves by 57.1%.
Conclusion: The distribution gap predicts cross-modal task performance better than the raw modality gap; TPC-CMA jointly optimizes both the centroid and distribution components, yielding more robust, higher-quality cross-modal alignment.
Abstract: Vision-Language Models (VLMs) such as CLIP learn a shared embedding space for images and text, yet their representations remain geometrically separated, a phenomenon known as the modality gap. This gap limits tasks requiring cross-modal interchangeability, such as captioning and joint clustering. Existing post-processing approaches can partially improve cross-modal compatibility; however, we show through geometric analysis that they primarily reduce the global centroid offset while leaving the underlying distributional mismatch intact. We decompose the modality gap into a Centroid Gap and a Distribution Gap, and demonstrate that the Distribution Gap is the true predictor of cross-modal task quality ($R^2 = 0.986$), whereas the commonly used Raw Gap is misleading ($R^2 = 0.691$). Motivated by this observation, we propose TPC-CMA (Three-Phase Curriculum for Cross-Modal Alignment), a fine-tuning framework that explicitly reduces both components. The proposed CMA jointly mitigates centroid offsets and reshapes the distributional structure, while a three-phase curriculum with gradient-aware scheduling progressively introduces alignment during training to enable stable optimization. Experiments demonstrate that our method significantly improves cross-modal alignment. With $\alpha_{\text{target}} = 0.05$, the modality gap is reduced by 66.6% with only a 4.84% accuracy drop. Under stronger alignment ($\alpha_{\text{target}} = 0.5$), the gap is reduced by 82.3%, clustering ARI improves from 0.318 to 0.516, and captioning CIDEr increases by 57.1% over the original model.
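The distinction this entry draws between a centroid offset and a distributional mismatch can be illustrated with a small, hedged sketch. The abstract does not give the paper's formulas, so everything below is an assumption for illustration: `centroid_gap` is the plain Euclidean distance between modality means, and `spread_mismatch` is a hypothetical stand-in for a distribution-level measure (it is not the paper's Distribution Gap). The toy embeddings are made up.

```python
import math

def centroid(vectors):
    """Mean vector of a list of equal-length embeddings."""
    n, dim = len(vectors), len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def centroid_gap(image_embs, text_embs):
    """Euclidean distance between the image and text centroids."""
    c_img, c_txt = centroid(image_embs), centroid(text_embs)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_img, c_txt)))

def spread_mismatch(image_embs, text_embs):
    """Hypothetical proxy for distributional mismatch (not the paper's metric):
    L2 distance between per-dimension variances, which ignores the centroids."""
    def variances(vectors):
        c = centroid(vectors)
        n = len(vectors)
        return [sum((v[i] - c[i]) ** 2 for v in vectors) / n for i in range(len(c))]
    v_img, v_txt = variances(image_embs), variances(text_embs)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_img, v_txt)))

# Toy 2-D embeddings (made up for illustration).
image_embs = [[1.0, 0.0], [3.0, 0.0]]
text_embs = [[0.0, 2.0], [0.0, 2.0]]

# Post-hoc centroid shift: move the text embeddings onto the image centroid.
shift = [a - b for a, b in zip(centroid(image_embs), centroid(text_embs))]
shifted_text = [[x + s for x, s in zip(v, shift)] for v in text_embs]

print(centroid_gap(image_embs, text_embs), spread_mismatch(image_embs, text_embs))
print(centroid_gap(image_embs, shifted_text), spread_mismatch(image_embs, shifted_text))
```

In this toy case the shift removes the centroid offset entirely while the spread mismatch is unchanged, which mirrors the abstract's observation that post-processing mainly reduces the global centroid offset.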
Our code and pre-trained models will be made publicly available upon acceptance.
[84] SANA I2I: A Text Free Flow Matching Framework for Paired Image to Image Translation with a Case Study in Fetal MRI Artifact Reduction
Italo Felix Santos,Gilson Antonio Giraldi,Heron Werner Junior
Main category: cs.CV
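TL;DR: SANA-I2I is a high-resolution image-to-image generation framework that requires no text prompts; using only paired source-target images, it learns a conditional flow-matching model in latent space and is applied to motion artifact reduction in fetal MRI.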
Details
Motivation: Text-prompt-dependent image-to-image generation methods are ill-suited to medical imaging; especially where high-quality paired data are scarce (as in fetal MRI), a text-free, supervised, and efficient image translation method is needed.
Method: The proposed SANA-I2I framework removes text conditioning entirely and learns a conditional velocity field in latent space from paired images (i.e., a conditional flow-matching model); paired training data of fetal MRI with motion artifacts are generated with the synthesis strategy proposed by Duffy et al.
Result: On fetal MRI motion artifact reduction, SANA-I2I effectively suppresses artifacts while preserving anatomical structure, and achieves competitive performance with only a few inference steps.
Conclusion: Flow-matching-based, text-free generative models are efficient and practical for supervised medical image translation and offer a new paradigm for image generation without a text modality.
Abstract: We propose SANA-I2I, a text-free high-resolution image-to-image generation framework that extends the SANA family by removing textual conditioning entirely. In contrast to SanaControlNet, which combines text and image-based control, SANA-I2I relies exclusively on paired source-target images to learn a conditional flow-matching model in latent space. The model learns a conditional velocity field that maps one image distribution to another, enabling supervised image translation without reliance on language prompts. We evaluate the proposed approach on the challenging task of fetal MRI motion artifact reduction. To enable paired training in this application, where real paired data are difficult to acquire, we adopt a synthetic data generation strategy based on the method proposed by Duffy et al., which simulates realistic motion artifacts in fetal magnetic resonance imaging (MRI). Experimental results demonstrate that SANA-I2I effectively suppresses motion artifacts while preserving anatomical structure, achieving competitive performance with few inference steps. These results highlight the efficiency and suitability of our proposed flow-based, text-free generative models for supervised image-to-image tasks in medical imaging.
[85] Label-efficient underwater species classification with semi-supervised learning on frozen foundation model embeddings
Thomas Manuel Rost
Main category: cs.CV
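TL;DR: This paper proposes a semi-supervised self-training method on frozen foundation model embeddings (DINOv3 ViT-B) that approaches fully supervised performance on the AQUA20 underwater species recognition task with fewer than 5% of the labels; it needs no fine-tuning, domain adaptation, or extra engineering, making it a plug-and-play label-efficient approach.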
Details
Motivation: Species classification in underwater imagery is limited by the high cost of expert annotation, and supervised models generalize poorly across scenes.
Method: A frozen DINOv3 ViT-B model extracts image embeddings, and nearest-neighbor self-training propagates labels from a small set of labeled samples to a large pool of unlabeled data, without any fine-tuning or training.
Result: On the AQUA20 benchmark (20 marine species), using fewer than 5% of the labels closes much of the gap to a fully supervised ConvNeXt baseline; at full supervision, some species even exceed the baseline; per-class ROC-AUC in the embedding space stays high, indicating that the frozen representations are themselves strongly discriminative.
Conclusion: Frozen foundation model embeddings plus simple self-training form a practical new baseline for label-efficient underwater species recognition that requires no training and no adaptation.
Abstract: Automated species classification from underwater imagery is bottlenecked by the cost of expert annotation, and supervised models trained on one dataset rarely transfer to new conditions. We investigate whether semi-supervised methods operating on frozen foundation model embeddings can close this annotation gap with minimal labeling effort. Using DINOv3 ViT-B embeddings with no fine-tuning, we propagate a small set of labeled seeds through unlabeled data via nearest-neighbor-based self-training and evaluate on the AQUA20 benchmark (20 marine species). With fewer than 5% of the training labels, self-training on frozen embeddings closes much of the gap to a fully supervised ConvNeXt baseline trained on the entire labeled dataset; at full supervision, the gap narrows to a few percentage points, with several species exceeding the supervised baseline. Class separability in the embedding space, measured by ROC-AUC, is high even at extreme label scarcity, indicating that the frozen representations capture discriminative structure well before decision boundaries can be reliably estimated. Our approach requires no training, no domain-specific data engineering, and no underwater-adapted models, establishing a practical, immediately deployable baseline for label-efficient marine species recognition. All results are reported on the held-out test set over 100 random seed initializations.
[86] VADMamba++: Efficient Video Anomaly Detection via Hybrid Modeling in Grayscale Space
Jihao Lyu,Minghua Zhao,Jing Hu,Yifei Chen,Shuangli Du,Cheng Shi
Main category: cs.CV
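TL;DR: VADMamba++ is an efficient video anomaly detection method based on a Gray-to-RGB paradigm. It drops auxiliary inputs such as optical flow and reconstructs three-channel RGB frames from single-channel grayscale frames, using dual structural and chromatic inconsistencies to increase anomaly sensitivity under a single proxy task. It models normal patterns with a hybrid of Mamba, CNN, and Transformer modules and scores anomalies by combining explicit prediction errors with implicit quantized feature errors, significantly improving accuracy and efficiency in the single-task setting.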
Details
Motivation: VADMamba performs well but relies heavily on optical flow as auxiliary input and on inter-task fusion scoring, which limits its applicability under a single proxy task; a more general and efficient VAD method that needs no auxiliary inputs and suits the single-task setting is needed.
Method: A Gray-to-RGB paradigm enforces a reconstruction mapping from single-channel grayscale frames to three-channel RGB frames; a hybrid backbone combining Mamba, CNN, and Transformer modules models diverse normal patterns; and an intra-task fusion scoring strategy combines explicit future-frame prediction errors with implicit quantized feature errors.
Result: The method surpasses existing state-of-the-art methods on three benchmark datasets and delivers both high accuracy and high inference efficiency in the strict single-task (frame-level input only) setting.
Conclusion: VADMamba++ validates a single-task reconstruction paradigm without auxiliary inputs for VAD; dual structural-chromatic inconsistency makes anomalies easier to distinguish, offering a new direction for lightweight, general VAD.
Abstract: VADMamba pioneered the introduction of Mamba to Video Anomaly Detection (VAD), achieving high accuracy and fast inference through hybrid proxy tasks. Nevertheless, its heavy reliance on optical flow as auxiliary input and inter-task fusion scoring limits its applicability in single-proxy-task settings. In this paper, we introduce VADMamba++, an efficient VAD method based on the Gray-to-RGB paradigm that enforces a Single-Channel to Three-Channel reconstruction mapping, designed for a single proxy task and operating without auxiliary inputs. This paradigm compels the model to infer color appearances from grayscale structures, allowing anomalies to be more effectively revealed through dual inconsistencies between structure and chromatic cues. Specifically, VADMamba++ reconstructs grayscale frames into the RGB space to simultaneously discriminate structural geometry and chromatic fidelity, thereby enhancing sensitivity to explicit visual anomalies. We further design a hybrid modeling backbone that integrates Mamba, CNN, and Transformer modules to capture diverse normal patterns while suppressing the appearance of anomalies. Furthermore, an intra-task fusion scoring strategy integrates explicit future-frame prediction errors with implicit quantized feature errors, further improving accuracy under a single task setting.
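The fusion of two per-frame error signals into one anomaly score can be sketched as follows. This is only a minimal illustration under stated assumptions: the abstract does not say how the two errors are combined, so the min-max normalization, the weighted sum, the `weight` parameter, and the toy error values are all hypothetical and are not VADMamba++'s actual scoring rule.

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def fuse_scores(pred_err, feat_err, weight=0.5):
    """Per-frame anomaly score as a weighted sum of two normalized errors.

    pred_err: per-frame prediction errors (explicit cue).
    feat_err: per-frame quantized feature errors (implicit cue).
    weight:   hypothetical mixing coefficient, not taken from the paper.
    """
    p, f = min_max(pred_err), min_max(feat_err)
    return [weight * a + (1.0 - weight) * b for a, b in zip(p, f)]

# Toy per-frame errors (made up); frame 2 is the anomalous one.
pred_err = [0.10, 0.20, 0.90, 0.15]
feat_err = [1.0, 1.2, 3.0, 1.1]

scores = fuse_scores(pred_err, feat_err)
most_anomalous = scores.index(max(scores))
print(scores, most_anomalous)
```

Normalizing each error stream before mixing keeps one cue from dominating purely because of its scale; other fusion rules would be equally plausible.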
Extensive experiments on three benchmark datasets demonstrate that VADMamba++ outperforms state-of-the-art methods while meeting both performance and efficiency requirements, especially under a strict single-task setting with only frame-level inputs.
[87] Neural Reconstruction of LiDAR Point Clouds under Jamming Attacks via Full-Waveform Representation and Simultaneous Laser Sensing
Ryo Yoshida,Takami Sato,Wenlun Zhang,Yuki Hayakawa,Shota Nagai,Takahiro Kado,Taro Beppu,Ibuki Fujioka,Yunshan Zhong,Kentaro Yoshioka
Main category: cs.CV
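TL;DR: This paper proposes PULSAR-Net, which uses LiDAR full-waveform data to reconstruct authentic point clouds under jamming attacks. A novel U-Net with axial spatial attention identifies attack signals, and the model achieves high reconstruction rates on real-world measurements despite being trained purely on synthetic data.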
Details
Motivation: LiDAR sensors can be completely blinded by jamming attacks, yet their full-waveform data still retain features that distinguish attack signals from legitimate ones; this intermediate representation should be exploited to improve jamming resistance.
Method: PULSAR-Net is based on a U-Net architecture with axial spatial attention designed to separate jamming signals from authentic object returns in the full-waveform representation; a physics-aware pipeline generates synthetic full-waveform data to compensate for the lack of real annotations.
Result: Trained only on synthetic data, PULSAR-Net achieves point cloud reconstruction rates of 92% and 73% for vehicles obscured by jamming in real-world static and driving scenarios, respectively.
Conclusion: Full-waveform data are a key intermediate representation for improving LiDAR robustness to jamming, and PULSAR-Net demonstrates their effectiveness and feasibility in real scenarios.
Abstract: LiDAR sensors are critical for autonomous driving perception, yet remain vulnerable to spoofing attacks. Jamming attacks inject high-frequency laser pulses that completely blind LiDAR sensors by overwhelming authentic returns with malicious signals. We discover that while point clouds become randomized, the underlying full-waveform data retains distinguishable signatures between attack and legitimate signals. In this work, we propose PULSAR-Net, capable of reconstructing authentic point clouds under jamming attacks by leveraging previously underutilized intermediate full-waveform representations and simultaneous laser sensing in modern LiDAR systems. PULSAR-Net adopts a novel U-Net architecture with axial spatial attention mechanisms specifically designed to distinguish attack-induced signals from authentic object returns in the full-waveform representation. To address the lack of full-waveform representations in existing LiDAR datasets under jamming attacks, we introduce a physics-aware dataset generation pipeline that synthesizes realistic full-waveform representations under jamming attacks. Despite being trained exclusively on synthetic data, PULSAR-Net achieves reconstruction rates of 92% and 73% for vehicles obscured by jamming attacks in real-world static and driving scenarios, respectively.
[88] Dynamic Graph Neural Network with Adaptive Features Selection for RGB-D Based Indoor Scene Recognition
Qiong Liu,Ruofei Xiong,Xingzhen Chen,Muyao Peng,You Yang
Main category: cs.CV
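TL;DR: This paper proposes a dynamic graph model with an adaptive node selection mechanism that effectively exploits key local features from the RGB and depth modalities for indoor scene recognition.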
Details
Motivation: Existing work recognizes that local features of the RGB and depth modalities are crucial for recognition but lacks methods for adaptively selecting and effectively exploiting these key features.
Method: A dynamic graph model is built with an adaptive node selection mechanism that extracts key local features from the RGB and depth modalities; nodes are organized into three levels by near or far relations; the graph structure is updated dynamically using attention weights; finally, the optimized features of both modalities are fused for recognition.
Result: Experiments on the SUN RGB-D and NYU Depth v2 datasets show that the method outperforms existing state-of-the-art methods and effectively exploits key local features from RGB and depth.
Conclusion: The proposed dynamic graph model and its adaptive node selection mechanism model and fuse RGB-D multi-modal local features more effectively, significantly improving indoor scene recognition accuracy.
Abstract: Multi-modality of color and depth, i.e., RGB-D, is of great importance in recent research on indoor scene recognition. In this kind of data representation, the depth map is able to describe the 3D structure of scenes and geometric relations among objects. Previous works showed that local features of both modalities are vital for improving recognition accuracy. However, the problem of adaptive selection and effective exploitation of these key local features remains open in this field. In this paper, a dynamic graph model with an adaptive node selection mechanism is proposed to solve the above problem. In this model, a dynamic graph is built to model the relations among objects and the scene, and a method of adaptive node selection is proposed to take key local features from both the RGB and depth modalities for graph modeling. After that, these nodes are grouped into three different levels, representing near or far relations among objects. Moreover, the graph model is updated dynamically according to attention weights. Finally, the updated and optimized features of the RGB and depth modalities are fused together for indoor scene recognition. Experiments are performed on public datasets including SUN RGB-D and NYU Depth v2. Extensive results demonstrate that our method has superior performance compared to state-of-the-art methods, and show that the proposed method is able to exploit crucial local features from both the RGB and depth modalities.
[89] UCMNet: Uncertainty-Aware Context Memory Network for Under-Display Camera Image Restoration
Daehyun Kim,Youngmin Kim,Yoon Ju Oh,Tae Hyun Kim
Main category: cs.CV
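TL;DR: This paper proposes a lightweight Uncertainty-aware Context-Memory Network (UCMNet) for under-display camera (UDC) image restoration, addressing the complex, spatially varying degradations caused by diffraction and scattering in the display layers; it outperforms existing methods particularly in recovering high-frequency details.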
Details
Motivation: Existing PSF-based physical modeling and frequency-separation methods reconstruct low-frequency structures and maintain color consistency well, but struggle to recover fine details under complex, spatially varying degradation.
Method: The proposed Uncertainty-aware Context-Memory Network (UCMNet) estimates spatial uncertainty maps through an uncertainty-driven loss; these maps guide the Memory Bank to retrieve region-adaptive context from the Context Bank, effectively modeling the non-uniform degradation characteristics of UDC imaging.
Result: UCMNet achieves state-of-the-art performance on multiple benchmarks with 30% fewer parameters than previous models.
Conclusion: By introducing an uncertainty prior, UCMNet performs adaptive image restoration and significantly improves recovery of high-frequency details in UDC images while remaining lightweight, validating uncertainty modeling for spatially varying degradation tasks.
Abstract: Under-display cameras (UDCs) allow for full-screen designs by positioning the imaging sensor underneath the display. Nonetheless, light diffraction and scattering through the various display layers result in spatially varying and complex degradations, which significantly reduce high-frequency details. Current PSF-based physical modeling techniques and frequency-separation networks are effective at reconstructing low-frequency structures and maintaining overall color consistency. However, they still face challenges in recovering fine details when dealing with complex, spatially varying degradation. To solve this problem, we propose a lightweight Uncertainty-aware Context-Memory Network (UCMNet) for UDC image restoration. Unlike previous methods that apply uniform restoration, UCMNet performs uncertainty-aware adaptive processing to restore high-frequency details in regions with varying degradations. The estimated uncertainty maps, learned through an uncertainty-driven loss, quantify spatial uncertainty induced by diffraction and scattering, and guide the Memory Bank to retrieve region-adaptive context from the Context Bank. This process enables effective modeling of the non-uniform degradation characteristics inherent to UDC imaging. Leveraging this uncertainty as a prior, UCMNet achieves state-of-the-art performance on multiple benchmarks with 30% fewer parameters than previous models. Project page: https://kdhrick2222.github.io/projects/UCMNet/
[90] mmAnomaly: Leveraging Visual Context for Robust Anomaly Detection in the Non-Visual World with mmWave Radar
Tarik Reza Toha,Shao-Jung Lu,Mahathir Monjur,Shahriar Nirjon
Main category: cs.CV
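TL;DR: This paper proposes mmAnomaly, a multi-modal anomaly detection framework that combines mmWave radar with RGBD visual input. Visual semantics guide the generation of the expected mmWave spectrum, and comparing the real and generated spectra yields accurate, interpretable anomaly localization.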
Details
Motivation: mmWave radar can sense humans in occluded or privacy-sensitive scenarios, but its signals are easily affected by materials, clutter, and multipath interference, so existing anomaly detection methods lack contextual awareness and suffer from high false-alarm rates.
Method: mmAnomaly fuses mmWave radar with RGBD data: a lightweight ResNet extracts semantic information such as scene geometry and material; a conditional latent diffusion model synthesizes the expected mmWave spectrum from the visual context; and a dual-input comparison module detects spatial deviations between the real and generated spectra to localize anomalies.
Result: Evaluated on two multi-modal datasets and three applications (concealed weapon localization, through-wall intruder localization, and through-wall fall localization), the system reaches an F1 score of up to 94% with localization error below 1 meter and generalizes well across clothing, occlusions, and cluttered environments.
Conclusion: mmAnomaly is an accurate and interpretable framework for context-aware mmWave anomaly detection that significantly improves robustness and practicality.
Abstract: mmWave radar enables human sensing in non-visual scenarios (e.g., through clothing or certain types of walls) where traditional cameras fail due to occlusion or privacy limitations. However, robust anomaly detection with mmWave remains challenging, as signal reflections are influenced by material properties, clutter, and multipath interference, producing complex, non-Gaussian distortions. Existing methods lack contextual awareness and misclassify benign signal variations as anomalies. We present mmAnomaly, a multi-modal anomaly detection framework that combines mmWave radar with RGBD input to incorporate visual context. Our system extracts semantic cues, such as scene geometry and material properties, using a fast ResNet-based classifier, and uses a conditional latent diffusion model to synthesize the expected mmWave spectrum for the given visual context. A dual-input comparison module then identifies spatial deviations between real and generated spectra to localize anomalies. We evaluate mmAnomaly on two multi-modal datasets across three applications: concealed weapon localization, through-wall intruder localization, and through-wall fall localization. The system achieves up to 94% F1 score and sub-meter localization error, demonstrating robust generalization across clothing, occlusions, and cluttered environments. These results establish mmAnomaly as an accurate and interpretable framework for context-aware anomaly detection in mmWave sensing.
[91] Mine-JEPA: In-Domain Self-Supervised Learning for Mine-Like Object Classification in Side-Scan Sonar
Taeyoun Kwon,Youngwon Choi,Hyeonyu Kim,Myeongkyun Cho,Junhyeok Choi,Moon Hwan Kim
Main category: cs.CV
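TL;DR: This paper proposes Mine-JEPA, the first in-domain self-supervised learning (SSL) method for mine classification in side-scan sonar (SSS). Pretrained on only 1,170 unlabeled sonar images, it outperforms the large general-purpose vision foundation model DINOv3 with fewer parameters.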
Details
Motivation: Side-scan sonar mine classification faces extreme data scarcity and a large domain gap from natural images, while self-supervised learning and general-purpose vision foundation models remain largely unexplored in this domain.
Method: Mine-JEPA performs in-domain pretraining on only 1,170 unlabeled SSS images using SIGReg, a regularization-based self-supervised loss; it is combined with synthetic data augmentation and evaluated on binary (mine vs. non-mine) and 3-class (mine-like object) tasks against baselines such as fine-tuned DINOv3.
Result: Mine-JEPA reaches an F1 of 0.935 on the binary task (versus 0.922 for DINOv3) and 0.820 on the 3-class task (versus 0.810 for DINOv3); with ViT-Tiny it uses only one quarter of DINOv3's parameters; applying additional in-domain SSL to the foundation model instead degrades performance by 10 to 13 percentage points.
Conclusion: In data-scarce sonar imagery, carefully designed in-domain self-supervised learning can replace much larger general-purpose vision foundation models without relying on massive cross-domain pretraining data.
Abstract: Side-scan sonar (SSS) mine classification is a challenging maritime vision problem characterized by extreme data scarcity and a large domain gap from natural images. While self-supervised learning (SSL) and general-purpose vision foundation models have shown strong performance in general vision and several specialized domains, their use in SSS remains largely unexplored. We present Mine-JEPA, the first in-domain SSL pipeline for SSS mine classification, using SIGReg, a regularization-based SSL loss, to pretrain on only 1,170 unlabeled sonar images. In the binary mine vs. non-mine setting, Mine-JEPA achieves an F1 score of 0.935, outperforming fine-tuned DINOv3 (0.922), a foundation model pretrained on 1.7B images. For 3-class mine-like object classification, Mine-JEPA reaches 0.820 with synthetic data augmentation, again outperforming fine-tuned DINOv3 (0.810). We further observe that applying in-domain SSL to foundation models degrades performance by 10 to 13 percentage points, suggesting that stronger pretrained models do not always benefit from additional domain adaptation. In addition, Mine-JEPA with a compact ViT-Tiny backbone achieves competitive performance while using 4x fewer parameters than DINOv3. These results suggest that carefully designed in-domain self-supervised learning is a viable alternative to much larger foundation models in data-scarce maritime sonar imagery.
[92] Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompt: The 1st Winner for 5th PVUW MOSE Challenge
Jinrong Zhang,Canyang Wu,Xusheng He,Weili Guan,Jianlong Wu,Liqiang Nie
Main category: cs.CV
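TL;DR: This paper proposes TEP, which introduces tracking-enhanced prompts to improve SAM3's understanding of tiny and semantic-dominated targets in complex video object segmentation. It significantly improves performance without any training and took first place in the PVUW Challenge 2026.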
Details
Motivation: SAM3, the current best method, performs poorly on tiny and semantic-dominated targets because it does not sufficiently understand these specific target types.
Method: The proposed Tracking-Enhanced Prompts (TEP) method uses external tracking models and multimodal large language models to generate tracking-enhanced prompts that strengthen SAM3's understanding of hard targets, with no additional training.
Result: The method took first place on the test set of the PVUW Challenge 2026 Complex Video Object Segmentation Track with a score of 56.91%.
Conclusion: TEP is a training-free, lightweight enhancement that compensates for SAM3's weak understanding of fine-grained and semantically sensitive targets in complex scenes, significantly improving complex video object segmentation.
Abstract: In the Complex Video Object Segmentation task, researchers are required to track and segment specific targets within cluttered environments, which rigorously tests a method's capability for target comprehension and environmental adaptability. Although SAM3, the current state-of-the-art solution, exhibits unparalleled segmentation performance and robustness on conventional targets, it underperforms on tiny and semantic-dominated objects. The root cause of this limitation lies in SAM3's insufficient comprehension of these specific target types. To address this issue, we propose TEP: Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompts. As a training-free approach, TEP leverages external tracking models and Multimodal Large Language Models to introduce tracking-enhanced prompts, thereby alleviating the difficulty SAM3 faces in understanding these challenging targets. Our method achieved first place (56.91%) on the test set of the PVUW Challenge 2026: Complex Video Object Segmentation Track.
[93] VLM-in-the-Loop: A Plug-In Quality Assurance Module for ECG Digitization Pipelines
Jiachen Li,Shihao Li,Soovadeep Bakshi,Wei Li,Dongmei Chen
Main category: cs.CV
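TL;DR: This paper proposes the VLM-in-the-Loop framework, which uses a tool grounding mechanism to integrate closed-loop vision-language model (VLM) feedback into ECG image digitization pipelines without modifying the existing system, significantly improving digitization quality and consistency on real clinical images.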
Details
Motivation: Existing ECG digitization methods perform well on benchmarks but degrade sharply on real-world images, so a robust, pluggable quality assurance mechanism is needed.
Method: VLM-in-the-Loop is a plug-in quality assurance module whose core is tool grounding: domain-specific signal analysis tools produce quantitative evidence that guides the VLM toward verifiable assessments. The module wraps any backend through a standardized interface and does not depend on its internals.
Result: On 200 records with paired ground truth, verdict consistency rises from 71% to 89% and fidelity separation (ΔPCC) doubles (0.03 to 0.08); performance improves on all four backends, e.g., valid leads per image on Open-ECG-Digitizer rise from 2.5 to 5.8; on 428 real clinical HCM images the system reaches 98.0% "Excellent" quality.
Conclusion: VLM-in-the-Loop with tool grounding is a general, plug-and-play quality enhancement paradigm whose design can transfer to other domains and suits any digitization task with objective quality criteria.
Abstract: ECG digitization could unlock billions of archived clinical records, yet existing methods collapse on real-world images despite strong benchmark numbers. We introduce VLM-in-the-Loop, a plug-in quality assurance module that wraps any digitization backend with closed-loop VLM feedback via a standardized interface, requiring no modification to the underlying digitizer. The core mechanism is tool grounding: anchoring VLM assessment in quantitative evidence from domain-specific signal analysis tools. In a controlled ablation on 200 records with paired ground truth, tool grounding raises verdict consistency from 71% to 89% and doubles fidelity separation (ΔPCC from 0.03 to 0.08), with the effect replicating across three VLMs (Claude Opus 4, GPT-4o, Gemini 2.5 Pro), confirming a pattern-level rather than model-specific gain. Deployed across four backends, the module improves every one: 29.4% of borderline leads improved on our pipeline; 41.2% of failed limb leads recovered on ECG-Digitiser; valid leads per image doubled on Open-ECG-Digitizer (2.5 to 5.8). On 428 real clinical HCM images, the integrated system reaches 98.0% Excellent quality. Both the plug-in architecture and tool-grounding mechanism are domain-parametric, suggesting broader applicability wherever quality criteria are objectively measurable.
[94] Improving Generalization of Deep Learning for Brain Metastases Segmentation Across Institutions
Yuchen Yang,Shuangyang Zhong,Haijun Yu,Langcuomu Suo,Hongbin Han,Florian Putz,Yixing Huang
Main category: cs.CV
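TL;DR: This paper proposes a VAE-MMD-based unsupervised domain adaptation method that effectively improves cross-institutional brain metastases segmentation without requiring target-domain labels.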
Details
Motivation: Deep learning models trained at a single center generalize poorly across institutions because of differences in scanners, imaging protocols, and patient populations; a domain adaptation framework that does not depend on target-domain annotations is needed.
Method: A VAE-MMD preprocessing pipeline combines a variational autoencoder with a maximum mean discrepancy loss and incorporates skip connections and self-attention, and is used together with nnU-Net for segmentation; it is validated on 740 patients from four public databases (Stanford, UCSF, UCLM, PKG).
Result: Domain classifier accuracy falls from 0.91 to 0.50, indicating successful feature alignment; reconstructed images have PSNR above 36 dB; mean F1 improves by 11.1%, surface Dice improves by 7.93%, and HD95 decreases by 65.5%.
Conclusion: VAE-MMD effectively mitigates cross-institutional data heterogeneity and improves segmentation generalization at the voxel, detection, and boundary levels without target-domain labels, advancing the clinical adoption of AI-assisted segmentation.
Abstract: Background: Deep learning has demonstrated significant potential for automated brain metastases (BM) segmentation; however, models trained at a single institution often exhibit suboptimal performance at other sites due to disparities in scanner hardware, imaging protocols, and patient demographics. The goal of this work is to create a domain adaptation framework that allows BM segmentation to be used across multiple institutions. Methods: We propose a VAE-MMD preprocessing pipeline that combines variational autoencoders (VAE) with maximum mean discrepancy (MMD) loss, incorporating skip connections and self-attention mechanisms alongside nnU-Net segmentation. The method was tested on 740 patients from four public databases (Stanford, UCSF, UCLM, and PKG) and evaluated by domain classifier accuracy, sensitivity, precision, F1/F2 scores, surface Dice (sDice), and 95th percentile Hausdorff distance (HD95). Results: VAE-MMD reduced domain classifier accuracy from 0.91 to 0.50, indicating successful feature alignment across institutions. Reconstructed volumes attained a PSNR greater than 36 dB, maintaining anatomical accuracy. The combined method raised the mean F1 by 11.1% (0.700 to 0.778), the mean sDice by 7.93% (0.7121 to 0.7686), and reduced the mean HD95 by 65.5% (11.33 to 3.91 mm) across all four centers compared to the baseline nnU-Net.
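The maximum mean discrepancy (MMD) term named in the Methods can be illustrated with a small, self-contained sketch. The abstract does not specify the kernel, the bandwidth, or which features the loss is applied to, so the Gaussian kernel, `sigma`, and the toy feature lists `site_a` and `site_b` below are assumptions for illustration only. The function computes the standard biased estimate of squared MMD between two samples.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between samples xs and ys.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)
    """
    k_xx = sum(rbf_kernel(a, b, sigma) for a in xs for b in xs) / (len(xs) * len(xs))
    k_yy = sum(rbf_kernel(a, b, sigma) for a in ys for b in ys) / (len(ys) * len(ys))
    k_xy = sum(rbf_kernel(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy

# Toy 2-D features from two hypothetical institutions (made up for illustration).
site_a = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
site_b = [[2.0, 2.0], [2.1, 2.0], [2.0, 2.1]]

print(mmd_squared(site_a, site_a))  # identical samples: no discrepancy
print(mmd_squared(site_a, site_b))  # shifted samples: large discrepancy
```

Used as a training loss, a term of this kind penalizes features whose distribution differs between sites, which is consistent with the reported drop of the domain classifier to chance-level accuracy (0.50).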
Conclusions: VAE-MMD effectively diminishes cross-institutional data heterogeneity and enhances BM segmentation generalization across volumetric, detection, and boundary-level metrics without necessitating target-domain labels, thereby overcoming a significant obstacle to the clinical implementation of AI-assisted segmentation.
[95] COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving
Seohyoung Park,Jaeyeol Lim,Seoyoung Ju,Kyeonghun Kim,Nam-Joon Kim,Hyuk-Jae Lee
Main category: cs.CV
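TL;DR: This paper studies how well the Query-Centric Trajectory Prediction (QCNet) model adapts when transferred from U.S. data to Korean road environments. Comparing four training strategies, it finds that freezing the encoder and fine-tuning only the decoder gives the best balance between accuracy and training efficiency, reducing prediction error by over 66%.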
Details
Motivation: Mainstream trajectory prediction datasets (such as Waymo and Argoverse) come mainly from Western road environments and do not reflect the traffic patterns, infrastructure, and driving behaviors of regions such as South Korea, so models lose performance when deployed across domains.
Method: On a Korean autonomous driving dataset, four training strategies are compared (zero-shot transfer, training from scratch, full fine-tuning, and encoder freezing) to assess QCNet's cross-domain adaptability.
Result: Freezing the encoder and fine-tuning only the decoder performs best, reducing prediction error by over 66% compared to training from scratch while balancing accuracy and training efficiency.
Conclusion: Pretrained knowledge is essential for transferring trajectory prediction models across regions, and selective fine-tuning (such as encoder freezing) is a practical strategy for adapting efficiently to a new geographic domain.
Abstract: Developing robust models to accurately predict the trajectories of surrounding agents is fundamental to autonomous driving safety. However, most public datasets, such as the Waymo Open Motion Dataset and Argoverse, are collected in Western road environments and do not reflect the unique traffic patterns, infrastructure, and driving behaviors of other regions, including South Korea. This domain discrepancy leads to performance degradation when state-of-the-art models trained on Western data are deployed in different geographic contexts. In this work, we investigate the adaptability of Query-Centric Trajectory Prediction (QCNet) when transferred from U.S.-based data to Korean road environments. Using a Korean autonomous driving dataset, we compare four training strategies: zero-shot transfer, training from scratch, full fine-tuning, and encoder freezing. Experimental results demonstrate that leveraging pretrained knowledge significantly improves prediction performance. Specifically, selectively fine-tuning the decoder while freezing the encoder yields the best trade-off between accuracy and training efficiency, reducing prediction error by over 66% compared to training from scratch. This study provides practical insights into effective transfer learning strategies for deploying trajectory prediction models in new geographic domains.
[96] The 1st Winner for 5th PVUW MeViS-Text Challenge: Strong MLLMs Meet SAM3 for Referring Video Object Segmentation
Xusheng He,Canyang Wu,Jinrong Zhang,Weili Guan,Jianlong Wu,Liqiang Nie
Main category: cs.CV
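TL;DR: This paper proposes a training-free three-stage video object segmentation method that won the PVUW 2026 MeViS-Text Challenge, combining Gemini-3.1 Pro, SAM3, and Qwen3.5-Plus for accurate motion-semantic understanding and mask generation.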
Details
Motivation: Referring video object segmentation (RVOS) under motion-centric language expressions requires jointly modeling appearance, temporal behavior, and object interactions, but existing methods depend on large amounts of annotation and fine-tuning.
Method: A training-free three-stage pipeline is built: (1) Gemini-3.1 Pro parses the language event, selects a key frame, and generates a discriminative description; (2) SAM3-agent produces a seed mask on that frame, and the official SAM3 tracker propagates it across frames; (3) Qwen3.5-Plus, combined with behavior-level verification, corrects semantically inconsistent predictions.
Result: The method ranks first on the PVUW 2026 MeViS-Text test set with a Final score of 0.909064 and a J&F score of 0.7897, without task-specific fine-tuning.
Conclusion: The collaborative reasoning of strong multimodal large language models and vision foundation models (such as SAM3) is sufficient to replace the traditional supervised learning paradigm in complex, motion-language-driven video segmentation.
Abstract: This report presents our winning solution to the 5th PVUW MeViS-Text Challenge. The track studies referring video object segmentation under motion-centric language expressions, where the model must jointly understand appearance, temporal behavior, and object interactions. To address this problem, we build a fully training-free pipeline that combines strong multimodal large language models with SAM3. Our method contains three stages. First, Gemini-3.1 Pro decomposes each target event into instance-level grounding targets, selects the frame where the target is most clearly visible, and generates a discriminative description. Second, SAM3-agent produces a precise seed mask on the selected frame, and the official SAM3 tracker propagates the mask through the whole video. Third, a refinement stage uses Qwen3.5-Plus and behavior-level verification to correct ambiguous or semantically inconsistent predictions. Without task-specific fine-tuning, our method ranks first on the PVUW 2026 MeViS-Text test set, achieving a Final score of 0.909064 and a J&F score of 0.7897. The code is available at https://github.com/Moujuruo/MeViSv2_Track_Solution_2026.
[97] Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation
Halima Bouzidi,Haoyu Liu,Yonatan Gizachew Achamyeleh,Praneetsai Vasu Iddamsetty,Mohammad Abdullah Al Faruque
Main category: cs.CV
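TL;DR: This paper proposes the FADE attack framework against query-propagation-based multi-object tracking (TBP) methods. Two strategies, temporal query flooding and temporal memory corruption, effectively degrade tracking performance, and differentiable physical simulation is used to improve the real-world feasibility of the attacks.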
Details
Motivation: Existing Tracking-by-Query-Propagation (TBP) methods have improved end-to-end multi-object tracking, but their query propagation mechanism carries unexplored adversarial vulnerabilities.
Method: The proposed FADE attack framework uses two strategies: (i) temporal query flooding, which generates spurious but temporally consistent track queries to exhaust the query budget; and (ii) temporal memory corruption, which attacks the query updater's memory through state de-correlation and erasure of feature identity. A differentiable pipeline incorporating simulations of perception sensor spoofing improves physical-world feasibility.
Result: On the MOT17 and MOT20 datasets, FADE is effective against state-of-the-art TBP trackers, significantly increasing identity switches and track terminations.
Conclusion: TBP trackers have serious security weaknesses in adversarial settings; FADE exposes their architecture-level vulnerabilities and provides a new warning and evaluation baseline for designing robust trackers.
Abstract: Recent Tracking-by-Query-Propagation (TBP) methods have advanced Multi-Object Tracking (MOT) by enabling end-to-end (E2E) pipelines with long-range temporal modeling. However, this reliance on query propagation introduces unexplored architectural vulnerabilities to adversarial attacks. We present FADE, a novel attack framework designed to exploit these specific vulnerabilities. FADE employs two attack strategies targeting core TBP mechanisms: (i) Temporal Query Flooding: Generates spurious temporally consistent track queries to exhaust the tracker's limited query budget, forcing it to terminate valid tracks. (ii) Temporal Memory Corruption: Directly attacks the query updater's memory by severing temporal links via state de-correlation and erasing the learned feature identity of matched tracks. Furthermore, we introduce a differentiable pipeline to optimize these attacks for physical-world realizability by leveraging simulations of advanced perception sensor spoofing. Experiments on MOT17 and MOT20 benchmarks demonstrate that FADE is highly effective against state-of-the-art TBP trackers, causing significant identity switches and track terminations.
[98] First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models
Jiwoo Ha,Jongwoo Baek,Jinhyun So
Main category: cs.CV
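TL;DR: This paper proposes First Logit Boosting (FLB), a simple and effective training-free method that reuses the logit of the first generated token to counter the decay of visual information during generation in large vision-language models (LVLMs). It significantly reduces object hallucination with almost no added inference cost.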
Details
Motivation: LVLMs suffer from object hallucination. Retraining and external grounding methods are costly or structurally complex, while training-free methods such as Contrastive Decoding (CD) still suffer from long-term decay: visual grounding weakens and language priors dominate as generation proceeds.
Method: First Logit Boosting (FLB) stores the logit of the first generated token during decoding and adds it to the logit of every subsequent token prediction, continually reinforcing the initial visual information; stable tokens such as "The" help suppress hallucinated words.
Result: FLB significantly reduces object hallucination across multiple tasks, benchmarks, and backbone models. It keeps the first token's visual information present throughout generation, suppresses hallucinated words through the stabilizing effect of the "The" token, and adds negligible inference overhead, which suits real-time multimodal systems.
Conclusion: FLB is a lightweight, general, training-free technique for strengthening visual grounding. It mitigates long-term decay and object hallucination in LVLMs and is practical to deploy.
Abstract: Recent Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks that require understanding both visual and linguistic inputs. However, object hallucination -- the generation of nonexistent objects in answers -- remains a persistent challenge. Although several approaches such as retraining and external grounding methods have been proposed to mitigate this issue, they still suffer from high data costs or structural complexity. Training-free methods such as Contrastive Decoding (CD) are more cost-effective, avoiding additional training or external models, but still suffer from long-term decay, where visual grounding weakens and language priors dominate as the generation progresses. In this paper, we propose First Logit Boosting (FLB), a simple yet effective training-free technique designed to alleviate long-term decay in LVLMs. FLB stores the logit of the first generated token and adds it to subsequent token predictions, effectively mitigating long-term decay of visual information. We observe that FLB (1) sustains the visual information embedded in the first token throughout generation, and (2) suppresses hallucinated words through the stabilizing effect of the "The" token. Experimental results show that FLB significantly reduces object hallucination across various tasks, benchmarks, and backbone models. Notably, it causes negligible inference overhead, making it highly applicable to real-time multimodal systems.
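The boosting step described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `next_logits` callable, the scaling weight `alpha`, the toy three-token vocabulary, and the use of greedy decoding are all assumptions introduced here. The abstract states only that the first token's logit is stored and added to subsequent token predictions.

```python
from typing import Callable, List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def decode_with_first_logit_boost(
    next_logits: Callable[[List[int]], List[float]],
    num_steps: int,
    alpha: float = 1.0,
) -> List[int]:
    """Greedy decoding that adds the first step's logits to every later step.

    `next_logits` is a hypothetical stand-in for a model forward pass: it maps
    the tokens generated so far to a logit vector over the vocabulary.
    `alpha` is an assumed scaling weight; alpha=0 disables the boost.
    """
    tokens: List[int] = []
    first_logits: List[float] = []
    for step in range(num_steps):
        logits = next_logits(tokens)
        if step == 0:
            # Store the logits of the first generated token.
            first_logits = list(logits)
        else:
            # Re-inject the stored logits into the current prediction.
            logits = [cur + alpha * first for cur, first in zip(logits, first_logits)]
        tokens.append(argmax(logits))
    return tokens


def toy_model(prefix: List[int]) -> List[float]:
    """Toy logits: step 0 favors token 2, later steps drift toward token 0."""
    if not prefix:
        return [0.0, 0.5, 2.0]
    return [1.0, 0.0, 0.5]


baseline = decode_with_first_logit_boost(toy_model, num_steps=3, alpha=0.0)
boosted = decode_with_first_logit_boost(toy_model, num_steps=3, alpha=1.0)
```

In this toy setup the unboosted decoder drifts to token 0 after the first step, while the boosted decoder keeps selecting token 2, which loosely mirrors the idea of carrying first-step evidence through later steps.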
Code is available at https://github.com/jiwooha20/FLB
[99] Automated Detection of Multiple Sclerosis Lesions on 7-tesla MRI Using U-net and Transformer-based Segmentation
Michael Maynord,Minghui Liu,Cornelia Fermüller,Seongjin Choi,Yuxin Zeng,Shishir Dahal,Daniel M. Harrison
Main category: cs.CV
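TL;DR: This paper studies automatic segmentation of multiple sclerosis (MS) white matter lesions (WML) on 7T MRI and finds that tools developed on lower-field MRI, such as LST-LPA and LST-AI, perform poorly on 7T images. The authors build expert-corrected 7T FLAIR reference annotations and train transformer-based 3D UNETR and SegFormer models, which markedly improve small-lesion detection at the native 0.5 mm isotropic resolution compared with classical methods. The trained models are released to support ultra-high-field MS research.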
Details
Motivation: 7T MRI shows MS white matter lesions more clearly, but its contrast and artifacts differ substantially from conventional 1.5-3T images, so existing automated segmentation tools such as the LST family do not apply directly. Methods adapted to 7T data are needed.
Method: Reference 7T FLAIR lesion masks are generated from LST outputs with expert revision. LST-LPA and LST-AI serve as external comparators. Two transformer models, 3D UNETR and SegFormer, are trained and evaluated at three voxel resolutions (0.5x0.5x0.5, 1.0x1.0x1.0, and 1.5x1.5x2.0 mm^3), using voxel-wise and lesion-wise metrics from the BraTS 2023 framework.
Result: On the test set at native 0.5 mm resolution, SegFormer reaches a voxel-wise Dice of 0.61 and a lesion-wise Dice of 0.20, well above LST-LPA (0.39 / 0.02). Models perform better at higher resolution, confirming the value of native 7T data for small-lesion detection. The transformer models detect small lesions that classical methods miss, at the cost of boundary variability and artifact-related false positives.
Conclusion: Transformer models trained specifically on 7T MRI, SegFormer in particular, outperform classical tools for MS white matter lesion segmentation, especially in sensitivity to small lesions. Native high-resolution data matters, and the released models provide a reproducible, ready-to-use resource for automated quantification in ultra-high-field MS research.
Abstract: Ultra-high field 7-tesla (7T) MRI improves visualization of multiple sclerosis (MS) white matter lesions (WML) but differs sufficiently in contrast and artifacts from 1.5-3T imaging, suggesting that widely used automated segmentation tools may not translate directly. We analyzed 7T FLAIR scans and generated reference WML masks from Lesion Segmentation Tool (LST) outputs followed by expert manual revision. As external comparators, we applied LST-LPA and the more recent LST-AI ensemble, both originally developed on lower-field data. We then trained 3D UNETR and SegFormer transformer-based models on 7T FLAIR at multiple resolutions (0.5x0.5x0.5 mm^3, 1.0x1.0x1.0 mm^3, and 1.5x1.5x2.0 mm^3) and evaluated all methods using voxel-wise and lesion-wise metrics from the BraTS 2023 framework. On the held-out test set at native 0.5x0.5x0.5 mm^3 resolution, 7T-trained transformers achieved competitive overlap with LST-AI while recovering additional small lesions that were missed by classical methods, at the cost of some boundary variability and occasional artifact-related false positives. On a held-out 7T test set, our best transformer model (SegFormer) achieved a voxel-wise Dice of 0.61 and lesion-wise Dice of 0.20, improving on the classical LST-LPA tool (Dice 0.39, lesion-wise Dice 0.02). Performance decreased for models trained on downsampled images, underscoring the value of native 7T resolution for small-lesion detection.
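For readers less familiar with the overlap metric reported above, voxel-wise Dice can be computed as in the sketch below. It is a generic illustration of the standard formula, Dice = 2|A ∩ B| / (|A| + |B|), not the BraTS 2023 evaluation code. The function name, the flattened-mask input, and the convention of returning 1.0 when both masks are empty are assumptions made here. The lesion-wise variant, which first matches individual lesions, is not reproduced.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Voxel-wise Dice overlap between two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled as lesion in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total lesion voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


# One shared lesion voxel, two lesion voxels in each mask.
example = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

Because Dice is normalized by total lesion volume, a missed small lesion barely moves the voxel-wise score, which is one reason lesion-wise scores are often reported alongside it.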
By releasing our 7T-trained models, we aim to provide a reproducible, ready-to-use resource for automated lesion quantification in ultra-high field MS research (https://github.com/maynord/7T-MS-lesion-segmentation).
[100] All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models
Xinyu Tian,Shu Zou,Zhaoyuan Yang,Mengqi He,Peter Tu,Jing Zhang
Main category: cs.CV
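TL;DR: This paper shows that reinforcement learning, GRPO in particular, suffers from "diversity collapse" when used to improve the reasoning of vision-language models, and proposes MUPO to encourage divergent thinking along multiple solution paths, improving generalization and scalability.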
Details
Motivation: Prior work shows that RL methods such as GRPO can strengthen VLM reasoning, but why they work and where they fall short remain unclear. The authors observe that RL models reason deeply but narrowly, whereas base models reason more broadly and diversely, which motivates a closer look at the loss of diversity.
Method: Analysis of training dynamics identifies diversity collapse in GRPO. The paper then proposes Multi-Group Policy Optimization (MUPO), a simple and effective optimization method that encourages divergent thinking across multiple solution paths.
Result: MUPO is validated on several standard benchmarks. It substantially mitigates diversity collapse and improves reasoning breadth, generalization, and scalability.
Conclusion: Diversity is a key dimension of VLM reasoning ability. MUPO offers a practical route to more robust and scalable reasoning VLMs.
Abstract: Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite the promise, the underlying mechanisms that drive the effectiveness of RL models as well as their limitations remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models, where the former engages in deeper yet narrow reasoning, while base models, despite being less refined along individual paths, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks. Project page: https://xytian1008.github.io/MUPO/
[101] A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation
Yabin Zhang,Chong Wang,Yunhe Gao,Jiaming Liu,Maya Varma,Justin Xu,Sophie Ostmeier,Jin Long,Sergios Gatidis,Seena Dehkharghani,Arne Michalson,Eun Kyoung Hong,Christian Bluethgen,Haiwei Henry Guo,Alexander Victor Ortiz,Stephan Altmayer,Sandhya Bodapati,Joseph David Janizek,Ken Chang,Jean-Benoit Delbrouck,Akshay S. Chaudhari,Curtis P. Langlotz
Main category: cs.CV
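TL;DR: This paper presents CheXOne, a reasoning-enabled vision-language model for chest X-ray (CXR) interpretation that jointly generates diagnostic predictions and clinically grounded reasoning. It outperforms existing models across multiple zero-shot tasks, and in a clinical evaluation its reports are comparable to or better than resident-written reports.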
Details
Motivation: Most current AI systems for CXR interpretation output only a final prediction, with no interpretable reasoning from visual evidence to diagnostic conclusion. This invites misdiagnosis and makes clinical trust hard to earn.
Method: CheXOne is trained with a two-stage framework: instruction tuning followed by reinforcement learning to improve reasoning quality. The training data comprise 14.7 million instruction and reasoning samples drawn from 30 public datasets and covering 36 CXR tasks.
Result: CheXOne outperforms existing medical and general-domain foundation models across 17 zero-shot evaluation settings spanning visual question answering, report generation, visual grounding, and reasoning assessment. A clinical reader study finds its drafted reports comparable to or better than resident-written reports in 55% of cases. Its reasoning traces show high clinical factuality and provide causal support for predictions.
Conclusion: Explicit reasoning improves not only model performance but also interpretability and clinical utility, pointing to a new paradigm for AI-assisted CXR interpretation.
Abstract: Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency.
Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.
[102] ARGS: Auto-Regressive Gaussian Splatting via Parallel Progressive Next-Scale Prediction
Quanyuan Ruan,Kewei Shi,Jiabao Lei,Xifeng Gao,Xiaoguang Han
Main category: cs.CV
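TL;DR: This paper proposes auto-regressive Gaussian splatting (ARGS), a framework for multi-scale prediction in 3D object generation. A Gaussian simplification strategy, its reversal, and a tree-based transformer together enable efficient, controllable generation of multi-scale Gaussian representations.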
Details
Motivation: Auto-regressive frameworks have succeeded in 2D image generation, but extending them to 3D object generation is still largely unexplored. A method that effectively models multi-scale 3D structure is needed.
Method: ARGS combines a Gaussian simplification strategy with its reversal, and builds a tree-based transformer on hierarchical trees in which leaf nodes attend to their ancestors to improve structural consistency. Generation takes O(log n) steps.
Result: Experiments show that the method efficiently generates multi-scale Gaussian representations with controllable levels of detail, high visual fidelity, and reasonable computational cost.
Conclusion: ARGS offers a new, efficient, and structurally consistent auto-regressive paradigm for multi-scale 3D content generation.
Abstract: Auto-regressive frameworks for next-scale prediction of 2D images have demonstrated strong potential for producing diverse and sophisticated content by progressively refining a coarse input. However, extending this paradigm to 3D object generation remains largely unexplored. In this paper, we introduce auto-regressive Gaussian splatting (ARGS), a framework for making next-scale predictions in parallel for generation according to levels of detail. We propose a Gaussian simplification strategy and reverse the simplification to guide next-scale generation. Benefiting from the use of hierarchical trees, the generation process requires only \(\mathcal{O}(\log n)\) steps, where \(n\) is the number of points. Furthermore, we propose a tree-based transformer to predict the tree structure auto-regressively, allowing leaf nodes to attend to their internal ancestors to enhance structural consistency. Extensive experiments demonstrate that our approach effectively generates multi-scale Gaussian representations with controllable levels of detail, visual fidelity, and a manageable time consumption budget.
[103] PC-SAM: Patch-Constrained Fine-Grained Interactive Road Segmentation in High-Resolution Remote Sensing Images
Chengcheng Lv,Rushi Li,Mincheng Wu,Xiufang Shi,Zhenyu Wen,Shibo He
Main category: cs.CV
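TL;DR: This paper proposes PC-SAM, a unified framework that combines fully automatic and interactive segmentation. By constraining the influence of each point prompt to its corresponding image patch, it enables fine-grained local correction and segmentation of road masks in remote sensing imagery, and clearly outperforms existing fully automatic methods.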
Details
Motivation: Fully automatic road segmentation methods struggle with difficult road segments, produce false positives and false negatives, and do not support local refinement. SAM works well on natural images but performs poorly on remote sensing road segmentation and cannot support fine-grained local correction.
Method: PC-SAM uses a carefully designed fine-tuning strategy that restricts the influence of each point prompt to its corresponding image patch, unifying fully automatic segmentation with point-prompted interactive local refinement.
Result: On several remote sensing road segmentation datasets, PC-SAM with point prompts significantly outperforms state-of-the-art fully automatic models and supports flexible local mask refinement and local road segmentation.
Conclusion: PC-SAM addresses both the limited robustness of fully automatic methods and the low precision of interactive methods in remote sensing road segmentation, offering a new paradigm for accurate, interactive road extraction.
Abstract: Road masks obtained from remote sensing images effectively support a wide range of downstream tasks. In recent years, most studies have focused on improving the performance of fully automatic segmentation models for this task, achieving significant gains. However, current fully automatic methods are still insufficient for identifying certain challenging road segments and often produce false positive and false negative regions. Moreover, fully automatic segmentation does not support local segmentation of regions of interest or refinement of existing masks. Although the SAM model is widely used as an interactive segmentation model and performs well on natural images, it shows poor performance in remote sensing road segmentation and cannot support fine-grained local refinement. To address these limitations, we propose PC-SAM, which integrates fully automatic road segmentation and interactive segmentation within a unified framework. By carefully designing a fine-tuning strategy, the influence of point prompts is constrained to their corresponding patches, overcoming the inability of the original SAM to perform fine local corrections and enabling fine-grained interactive mask refinement. Extensive experiments on several representative remote sensing road segmentation datasets demonstrate that, when combined with point prompts, PC-SAM significantly outperforms state-of-the-art fully automatic models in road mask segmentation, while also providing flexible local mask refinement and local road segmentation.
The code will be available at https://github.com/Cyber-CCOrange/PC-SAM.
[104] PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
Weifu Fu,Jinyang Li,Bin-Bin Gao,Jialin Li,Yuhuan Lin,Hanqiu Deng,Wenbing Tao,Yong Liu,Chengjie Wang
Main category: cs.CV
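TL;DR: This paper proposes PET-DINO, a universal open-set object detector that supports both text and visual prompts. An AFVPG module and two prompt-enriched training strategies (IBP and DMD) improve zero-shot detection of rare categories and complex scenes.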
Details
Motivation: Existing open-set object detection methods fall short in text-visual alignment, suffer from scarce image-text pairs for rare categories, rely on complex multi-modal designs, and lack effective data-driven training strategies.
Method: The PET-DINO framework includes an Alignment-Friendly Visual Prompt Generation (AFVPG) module and two prompt-enriched training strategies: Intra-Batch Parallel Prompting (IBP) and Dynamic Memory-Driven Prompting (DMD).
Result: PET-DINO shows competitive zero-shot detection performance across various prompt-based detection protocols, with particular gains in specialized domains and on complex objects.
Conclusion: An inheritance-based design philosophy together with prompt-enriched training is key to building an effective generic object detector. PET-DINO offers a simpler, more general, and more practical solution for open-set detection.
Abstract: Open-Set Object Detection (OSOD) enables recognition of novel categories beyond fixed classes but faces challenges in aligning text representations with complex visual concepts and the scarcity of image-text pairs for rare categories. This results in suboptimal performance in specialized domains or with complex objects. Recent visual-prompted methods partially address these issues but often involve complex multi-modal designs and multi-stage optimizations, prolonging the development cycle. Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal detector supporting both text and visual prompts. Our Alignment-Friendly Visual Prompt Generation (AFVPG) module builds upon an advanced text-prompted detector, addressing the limitations of text representation guidance and reducing the development cycle. We introduce two prompt-enriched training strategies: Intra-Batch Parallel Prompting (IBP) at the iteration level and Dynamic Memory-Driven Prompting (DMD) at the overall training level. These strategies enable simultaneous modeling of multiple prompt routes, facilitating parallel alignment with diverse real-world usage scenarios. Comprehensive experiments demonstrate that PET-DINO exhibits competitive zero-shot object detection capabilities across various prompt-based detection protocols. These strengths can be attributed to inheritance-based philosophy and prompt-enriched training strategies, which play a critical role in building an effective generic object detector.
Project page: https://fuweifuvtoo.github.io/pet-dino.
[105] RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection
Jihwan Park,Chanhyeong Yang,Jinyoung Park,Taehoon Song,Hyunwoo J. Kim
Main category: cs.CV
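TL;DR: This paper proposes Relational Grounding Transformer (RegFormer), a new method for efficient and accurate human-object interaction (HOI) detection under image-level supervision. Spatially grounded signals guide the reasoning, enabling direct transfer from image-level to instance-level HOI reasoning.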
Details
Motivation: Existing weakly-supervised HOI detection methods rely on an external detector to generate candidate pairs and then reason over each pair. This is computationally expensive, prone to false positives from non-interactive combinations, and hard to scale or to make accurate at the instance level.
Method: The RegFormer module uses spatially grounded signals to guide reasoning and promote locality-aware interaction learning, so that image-level interaction reasoning transfers directly to the instance level without additional training.
Result: RegFormer effectively learns spatial cues for instance-level interaction reasoning, runs efficiently, and achieves performance comparable to fully supervised models.
Conclusion: RegFormer provides an efficient, accurate, and scalable paradigm for weakly-supervised HOI detection, overcoming the computational and false-positive bottlenecks of conventional two-stage methods.
Abstract: Weakly-supervised Human-Object Interaction (HOI) detection is essential for scalable scene understanding, as it learns interactions from only image-level annotations. Due to the lack of localization signals, prior works typically rely on an external object detector to generate candidate pairs and then infer their interactions through pairwise reasoning. However, this framework often struggles to scale due to the substantial computational cost incurred by enumerating numerous instance pairs. In addition, it suffers from false positives arising from non-interactive combinations, which hinder accurate instance-level HOI reasoning. To address these issues, we introduce Relational Grounding Transformer (RegFormer), a versatile interaction recognition module for efficient and accurate HOI reasoning. Under image-level supervision, RegFormer leverages spatially grounded signals as guidance for the reasoning process and promotes locality-aware interaction learning. By learning localized interaction cues, our module distinguishes humans, objects, and their interactions, enabling direct transfer from image-level interaction reasoning to precise and efficient instance-level reasoning without additional training. Our extensive experiments and analyses demonstrate that RegFormer effectively learns spatial cues for instance-level interaction reasoning, operates with high efficiency, and even achieves performance comparable to fully supervised models. Our code is available at https://github.com/mlvlab/RegFormer.
[106] MAESIL: Masked Autoencoder for Enhanced Self-supervised Medical Image Learning
Kyeonghun Kim,Hyeonseok Jung,Youngung Han,Junsu Lim,YeonJu Jean,Seongbin Park,Eunseob Choi,Hyunsu Go,SeoYoung Ju,Seohyoung Park,Gyeongmin Kim,MinJu Kwon,KyungSeok Yuh,Soo Yong Kim,Ken Ying-Kai Liao,Nam-Joon Kim,Hyuk-Jae Lee
Main category: cs.CV
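TL;DR: This paper proposes MAESIL, a self-supervised learning framework designed for 3D medical imaging. By introducing the "superpatch" and a dual-masking 3D masked autoencoder strategy, it models 3D structural information effectively and clearly outperforms baselines such as AE and VAE on CT data.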
Details
Motivation: Existing self-supervised learning methods mostly treat 3D CT scans as independent 2D slices, losing axial coherence and 3D structural context. Pre-training on natural images, meanwhile, suffers from domain shift and adapts poorly to 3D medical imaging.
Method: The MAESIL framework uses 3D "superpatches" as input units and combines a 3D masked autoencoder with a dual-masking strategy, performing chunk-wise reconstruction-based self-supervised learning on CT volumes to preserve 3D structure while keeping computation manageable.
Result: On three large-scale public CT datasets, MAESIL significantly outperforms AE, VAE, and VQ-VAE on reconstruction metrics such as PSNR and SSIM.
Conclusion: MAESIL is a robust and practical pre-training approach for 3D medical imaging. It addresses the neglect of 3D structure in conventional SSL and provides high-quality feature representations for downstream tasks.
Abstract: Training deep learning models for three-dimensional (3D) medical imaging, such as Computed Tomography (CT), is fundamentally challenged by the scarcity of labeled data. While pre-training on natural images is common, it results in a significant domain shift, limiting performance. Self-Supervised Learning (SSL) on unlabeled medical data has emerged as a powerful solution, but prominent frameworks often fail to exploit the inherent 3D nature of CT scans. These methods typically process 3D scans as a collection of independent 2D slices, an approach that fundamentally discards critical axial coherence and the 3D structural context. To address this limitation, we propose the Masked Autoencoder for Enhanced Self-supervised Medical Image Learning (MAESIL), a novel self-supervised learning framework designed to capture 3D structural information efficiently. The core innovation is the 'superpatch', a 3D chunk-based input unit that balances 3D context preservation with computational efficiency. Our framework partitions the volume into superpatches and employs a 3D masked autoencoder with a dual-masking strategy to learn comprehensive spatial representations. We validated our approach on three diverse large-scale public CT datasets. Our experimental results show that MAESIL demonstrates significant improvements over existing methods such as AE, VAE and VQ-VAE in key reconstruction metrics such as PSNR and SSIM. This establishes MAESIL as a robust and practical pre-training solution for 3D medical imaging tasks.
[107] Toward Optimal Sampling Rate Selection and Unbiased Classification for Precise Animal Activity Recognition
Axiu Mao,Meilu Zhu,Lei Shen,Xiaoshuai Wang,Tomas Norton,Kai Liu
Main category: cs.CV
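TL;DR: This paper proposes IBA-Net, an individual-behavior-aware network. Its Mixture-of-Experts-based Feature Customization (MFC) module, which fuses multiple sampling rates, and its Neural Collapse-driven Classifier Calibration (NC3) module address low accuracy on specific behaviors and class imbalance in farm animal activity recognition. It achieves the best performance on three public datasets covering goats, cattle, and horses.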
Details
Motivation: Existing wearable-sensor-based animal activity recognition research focuses too heavily on overall accuracy and overlooks poor recognition of specific behavior categories, which stems mainly from poorly chosen sampling rates and class imbalance.
Method: IBA-Net consists of (1) a Mixture-of-Experts (MoE)-based Feature Customization (MFC) module that adaptively fuses data from multiple sampling rates to extract behavior-specific features; and (2) a Neural Collapse-driven Classifier Calibration (NC3) module that introduces a fixed equiangular tight frame (ETF) classifier, enlarging the angles between class classifier vectors to reduce bias from class imbalance.
Result: On three public datasets for goat, cattle, and horse activity recognition, IBA-Net consistently outperforms existing methods.
Conclusion: By jointly optimizing behavior-aware feature customization and classifier calibration, IBA-Net improves recognition across categories, minority behaviors in particular, and provides reliable technical support for precision livestock management and animal welfare monitoring.
Abstract: With the rapid advancements in deep learning techniques, wearable sensor-aided animal activity recognition (AAR) has demonstrated promising performance, thereby improving livestock management efficiency as well as animal health and welfare monitoring. However, existing research often prioritizes overall performance, overlooking the fact that classification accuracies for specific animal behavioral categories may remain unsatisfactory. This issue typically stems from suboptimal sampling rates or class imbalance problems. To address these challenges and achieve high classification accuracy across all individual behaviors in farm animals, we propose a novel Individual-Behavior-Aware Network (IBA-Net). This network enhances the recognition of each specific behavior by simultaneously customizing features and calibrating the classifier. Specifically, considering that different behaviors require varying sampling rates to achieve optimal performance, we design a Mixture-of-Experts (MoE)-based Feature Customization (MFC) module. This module adaptively fuses data from multiple sampling rates, capturing customized features tailored to various animal behaviors. Additionally, to mitigate classifier bias toward majority classes caused by class imbalance, we develop a Neural Collapse-driven Classifier Calibration (NC3) module.
This module introduces a fixed equiangular tight frame (ETF) classifier during the classification stage, maximizing the angles between pair-wise classifier vectors and thereby improving the classification performance for minority classes. To validate the effectiveness of IBA-Net, we conducted experiments on three public datasets covering goat, cattle, and horse activity recognition. The results demonstrate that our method consistently outperforms existing approaches across all datasets.
[108] Learnability-Guided Diffusion for Dataset Distillation
Jeffrey A. Chan-Santiago,Mubarak Shah
Main category: cs.CV
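TL;DR: This paper proposes a learnability-driven dataset distillation method that generates synthetic samples progressively, reducing redundancy and increasing complementarity, and reaches state-of-the-art performance on several image datasets.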
Details
Motivation: Existing diffusion-based dataset distillation methods produce samples with heavily overlapping information (redundancy), because they optimize only visual diversity or average training dynamics and ignore informational similarity across samples.
Method: A learnability-driven distillation framework that (1) builds the synthetic dataset incrementally; (2) at each stage uses the current model to score the learnability of new samples; and (3) introduces Learnability-Guided Diffusion (LGD), which balances training utility for the current model against validity under a reference model.
Result: Accuracy of 60.1%, 87.2%, and 72.9% on ImageNet-1K, ImageNette, and ImageWoof respectively; redundancy falls by 39.1% and samples become more specialized across stages.
Conclusion: Learnability-guided progressive distillation mitigates sample redundancy and yields synthetic datasets that are more complementary and more efficient to learn from.
Abstract: Training machine learning models on massive datasets is expensive and time-consuming. Dataset distillation addresses this by creating a small synthetic dataset that achieves the same performance as the full dataset. Recent methods use diffusion models to generate distilled data, either by promoting diversity or matching training gradients. However, existing approaches produce redundant training signals, where samples convey overlapping information. Empirically, disjoint subsets of distilled datasets capture 80-90% overlapping signals. This redundancy stems from optimizing visual diversity or average training dynamics without accounting for similarity across samples, leading to datasets where multiple samples share similar information rather than complementary knowledge. We propose learnability-driven dataset distillation, which constructs synthetic datasets incrementally through successive stages. Starting from a small set, we train a model and generate new samples guided by learnability scores that identify what the current model can learn from, creating an adaptive curriculum. We introduce Learnability-Guided Diffusion (LGD), which balances training utility for the current model with validity under a reference model to generate curriculum-aligned samples. Our approach reduces redundancy by 39.1%, promotes specialization across training stages, and achieves state-of-the-art results on ImageNet-1K (60.1%), ImageNette (87.2%), and ImageWoof (72.9%).
Our code is available on our project page https://jachansantiago.github.io/learnability-guided-distillation/.
[109] Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding
Haibo Wang,Zihao Lin,Zhiyang Xu,Lifu Huang
Main category: cs.CV
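TL;DR: This paper proposes "Think, Act, Build (TAB)", a dynamic agentic framework that recasts 3D visual grounding (3D-VG) as a generative 2D-to-3D reconstruction task on raw RGB-D streams. By decoupling semantic understanding (handled by a 2D VLM) from geometric reconstruction (based on multi-view geometry), it achieves zero-shot 3D-VG and surpasses existing zero-shot methods and some supervised methods on ScanRefer and Nr3D.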
Details
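Motivation: Existing VLM-based 3D-VG methods depend on preprocessed point clouds and follow a static workflow, which reduces grounding to proposal matching. This work decouples the task: a 2D VLM resolves complex spatial semantics, and deterministic multi-view geometry builds the 3D structure, removing the reliance on static point clouds.
Method: The TAB framework: (1) a VLM agent invokes a specialized 3D-VG skill and visual tools to dynamically track and reconstruct the target in an RGB-D video stream; (2) Semantic-Anchored Geometric Expansion first anchors the target in a reference video clip and then uses multi-view geometry to propagate its location to unobserved frames; (3) multi-view features are aggregated via camera parameters, mapping 2D cues directly to 3D coordinates.
Result: On ScanRefer and Nr3D, TAB, using only open-source models, significantly outperforms all zero-shot methods and even surpasses some fully supervised baselines. The authors also manually correct evaluation flaws in existing benchmarks, such as reference ambiguity and category errors.
Conclusion: Decoupling semantics from geometry and driving 2D-to-3D reconstruction with a dynamic agent is an effective paradigm for high-performing zero-shot 3D-VG. TAB demonstrates the potential of purely open-source models on this task and pushes toward more rigorous benchmarks.
Abstract: 3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes via natural language descriptions. While recent advancements leveraging Vision-Language Models (VLMs) have explored zero-shot possibilities, they typically suffer from a static workflow relying on preprocessed 3D point clouds, essentially degrading grounding into proposal matching. To bypass this reliance, our core motivation is to decouple the task: leveraging 2D VLMs to resolve complex spatial semantics, while relying on deterministic multi-view geometry to instantiate the 3D structure. Driven by this insight, we propose "Think, Act, Build (TAB)", a dynamic agentic framework that reformulates 3D-VG tasks as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, our VLM agent dynamically invokes visual tools to track and reconstruct the target across 2D frames. Crucially, to overcome the multi-view coverage deficit caused by strict VLM semantic tracking, we introduce the Semantic-Anchored Geometric Expansion, a mechanism that first anchors the target in a reference video clip and then leverages multi-view geometry to propagate its spatial location across unobserved frames. This enables the agent to "Build" the target's 3D representation by aggregating these multi-view features via camera parameters, directly mapping 2D visual cues to 3D coordinates.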
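Mapping a 2D cue to 3D coordinates through camera parameters is commonly done with pinhole back-projection, sketched below. This is a generic illustration and not the TAB implementation: the function name, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), metric depth, and a camera-to-world pose given as a 3x3 rotation plus a translation are all assumptions made here.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject_pixel(
    u: float,
    v: float,
    depth: float,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> Vec3:
    """Lift pixel (u, v) with metric depth into world coordinates.

    Assumes a pinhole camera and a camera-to-world pose: world = R @ cam + t.
    """
    # Pixel -> camera frame, using the depth along the optical axis.
    cam = (
        (u - cx) * depth / fx,
        (v - cy) * depth / fy,
        depth,
    )
    # Camera frame -> world frame.
    wx, wy, wz = (
        sum(rotation[r][c] * cam[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    return (wx, wy, wz)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# The principal point at depth 2 lies on the optical axis.
center = backproject_pixel(320, 240, 2.0, 500, 500, 320, 240, identity, [0.0, 0.0, 0.0])
# A pixel 250 px to the right, seen from a camera shifted 1 unit along x.
shifted = backproject_pixel(570, 240, 2.0, 500, 500, 320, 240, identity, [1.0, 0.0, 0.0])
```

Applying the same function to a target's mask pixels in several frames, each with its own pose, yields points in one shared world frame that can then be merged.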
Furthermore, to ensure rigorous assessment, we identify flaws such as reference ambiguity and category errors in existing benchmarks and manually refine the incorrect queries. Extensive experiments on ScanRefer and Nr3D demonstrate that our framework, relying entirely on open-source models, significantly outperforms previous zero-shot methods and even surpasses fully supervised baselines.
[110] AceTone: Bridging Words and Colors for Conditional Image Grading
Tianren Ma,Mingxiang Liao,Xijin Zhang,Qixiang Ye
Main category: cs.CV
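TL;DR: This paper proposes AceTone, the first unified color grading framework driven by multimodal conditions (text or reference images). It quantizes 3D-LUTs with a VQ-VAE, builds a large-scale dataset, and uses reinforcement learning to align outputs with perceptual fidelity and aesthetics, markedly improving both metrics and human preference.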
Details
Motivation: Existing color grading methods rely on patch-wise recoloring or fixed filter banks, generalize poorly across creative intents, and do not match human aesthetic preferences.
Method: AceTone treats color grading as a generative color transformation task and directly produces 3D-LUTs conditioned on text or reference images. A VQ-VAE tokenizer compresses a 3x32^3 LUT into 64 discrete tokens (ΔE < 2). The authors build the AceTone-800K dataset, train a vision-language model to predict LUT tokens, and apply reinforcement learning to align outputs with perceptual fidelity and aesthetics.
Result: State-of-the-art performance on text-guided and reference-guided grading, with LPIPS improved by up to 50%. Human evaluation confirms that the results are visually pleasing and stylistically coherent.
Conclusion: AceTone points to a new pathway toward language-driven, aesthetics-aligned color grading.
Abstract: Color affects how we interpret image style and emotion. Previous color grading methods rely on patch-wise recoloring or fixed filter banks, struggling to generalize across creative intents or align with human aesthetic preferences. In this study, we propose AceTone, the first approach that supports multimodal conditioned color grading within a unified framework. AceTone formulates grading as a generative color transformation task, where a model directly produces 3D-LUTs conditioned on text prompts or reference images. We develop a VQ-VAE based tokenizer which compresses a $3\times32^3$ LUT vector to 64 discrete tokens with $\Delta E<2$ fidelity. We further build a large-scale dataset, AceTone-800K, and train a vision-language model to predict LUT tokens, followed by reinforcement learning to align outputs with perceptual fidelity and aesthetics. Experiments show that AceTone achieves state-of-the-art performance on both text-guided and reference-guided grading tasks, improving LPIPS by up to 50% over existing methods. Human evaluations confirm that AceTone's results are visually pleasing and stylistically coherent, demonstrating a new pathway toward language-driven, aesthetic-aligned color grading.
[111] FreqPhys: Repurposing Implicit Physiological Frequency Prior for Robust Remote Photoplethysmography
Wei Qian,Dan Guo,Jinxing Zhou,Bochao Zou,Zitong Yu,Meng Wang
Main category: cs.CV
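TL;DR: FreqPhys is a frequency-guided remote photoplethysmography (rPPG) framework. Physiological bandpass filtering, spectrum modulation with adaptive selection, cross-domain representation learning, and a frequency-aware conditional diffusion process together make contactless physiological signal recovery more robust to motion and illumination interference.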
Details
Motivation: Existing rPPG methods rely mostly on time-domain modeling and are vulnerable to motion artifacts and illumination changes, so weak physiological signals are easily drowned in noise.
Method: The FreqPhys framework includes a Physiological Bandpass Filtering module that suppresses out-of-band interference; Physiological Spectrum Modulation with adaptive spectral selection, which emphasizes pulse-related bands and suppresses residual in-band noise; Cross-domain Representation Learning, which fuses spectral priors with deep time-domain features; and a frequency-aware conditional diffusion process that progressively reconstructs high-fidelity rPPG signals.
Result: FreqPhys significantly outperforms state-of-the-art methods on six benchmark datasets, particularly under strong motion.
Conclusion: Explicitly modeling physiological frequency priors is essential for robust rPPG.
Abstract: Remote photoplethysmography (rPPG) enables contactless physiological monitoring by capturing subtle skin-color variations from facial videos. However, most existing methods predominantly rely on time-domain modeling, making them vulnerable to motion artifacts and illumination fluctuations, where weak physiological clues are easily overwhelmed by noise. To address these challenges, we propose FreqPhys, a frequency-guided rPPG framework that explicitly leverages physiological frequency priors for robust signal recovery. Specifically, FreqPhys first applies a Physiological Bandpass Filtering module to suppress out-of-band interference, and then performs Physiological Spectrum Modulation together with adaptive spectral selection to emphasize pulse-related frequency components while suppressing residual in-band noise. A Cross-domain Representation Learning module further fuses these spectral priors with deep time-domain features to capture informative spatial-temporal dependencies. Finally, a frequency-aware conditional diffusion process progressively reconstructs high-fidelity rPPG signals. Extensive experiments on six benchmarks demonstrate that FreqPhys yields significant improvements over state-of-the-art approaches, particularly under challenging motion conditions. It highlights the importance of explicitly modeling physiological frequency priors. The source code will be released.
[112] MATHENA: Mamba-based Architectural Tooth Hierarchical Estimator and Holistic Evaluation Network for Anatomy
Kyeonghun Kim,Jaehyung Park,Youngung Han,Anna Jung,Seongbin Park,Sumin Lee,Jiwon Yang,Jiyoon Han,Subeen Lee,Junsu Lim,Hyunsu Go,Eunseob Choi,Hyeonseok Jung,Soo Yong Kim,Woo Kyoung Jeong,Won Jae Lee,Pa Hong,Hyuk-Jae Lee,Ken Ying-Kai Liao,Nam-Joon Kim
Main category: cs.CV
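TL;DR: This paper proposes MATHENA, a framework that uses Mamba's linear-complexity State Space Models to handle four tasks on dental OPG images in a unified way: tooth detection, caries segmentation, anomaly detection, and developmental staging. It also introduces the PARTHENON benchmark, comprising 15,062 annotated instances.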
Details
Motivation: Dental OPG diagnosis requires coordinating four tasks: tooth detection, caries segmentation, anomaly detection, and developmental staging. Existing methods are mostly single-task and lack a unified, efficient framework.
Method: The Mamba-based MATHENA framework comprises MATHE (a multi-resolution SSM-driven detector) and HENA (a lightweight Mamba-UNet with a triple-head architecture). The upstream task is trained first; its representations are then frozen and reused for downstream fine-tuning and linear probing. The PARTHENON benchmark dataset is also constructed.
Result: MATHENA performs strongly on tooth detection (mAP@50 = 93.78%), caries segmentation (Dice = 90.11%), anomaly detection (88.35%), and developmental staging (ACC = 72.40%).
Conclusion: MATHENA demonstrates that Mamba-based state space models are effective and efficient for multi-task dental image analysis, offering a new paradigm for unified dental diagnosis.
Abstract: Dental diagnosis from Orthopantomograms (OPGs) requires coordination of tooth detection, caries segmentation (CarSeg), anomaly detection (AD), and dental developmental staging (DDS). We propose Mamba-based Architectural Tooth Hierarchical Estimator and Holistic Evaluation Network for Anatomy (MATHENA), a unified framework leveraging Mamba's linear-complexity State Space Models (SSM) to address all four tasks. MATHENA integrates MATHE, a multi-resolution SSM-driven detector with four-directional Vision State Space (VSS) blocks for O(N) global context modeling, generating per-tooth crops. These crops are processed by HENA, a lightweight Mamba-UNet with a triple-head architecture and Global Context State Token (GCST). In the triple-head architecture, CarSeg is first trained as an upstream task to establish shared representations, which are then frozen and reused for downstream AD fine-tuning and DDS classification via linear probing, enabling stable, efficient learning. We also curate PARTHENON, a benchmark comprising 15,062 annotated instances from ten datasets. MATHENA achieves 93.78% mAP@50 in tooth detection, 90.11% Dice for CarSeg, 88.35% for AD, and 72.40% ACC for DDS.
[113] TRiGS: Temporal Rigid-Body Motion for Scalable 4D Gaussian Splatting
Suwoong Yeom,Joonsik Nam,Seunggyu Choi,Lucas Yunkyu Lee,Sangmin Kim,Jaesik Park,Joonsoo Kim,Kugjin Yun,Kyeongbo Kong,Sukju Kang
Main category: cs.CV
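TL;DR: This paper proposes TRiGS, which models the rigid motion of Gaussian primitives in dynamic scenes with unified, continuous geometric transformations (combining SE(3) transformations, hierarchical Bezier residuals, and learnable local anchors). It addresses the temporal fragmentation and Gaussian proliferation that piecewise linear velocity approximations cause in existing 4D Gaussian Splatting methods, and significantly improves rendering quality and temporal stability on long video sequences.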
Details
Motivation: Existing 4D Gaussian Splatting methods rely on piecewise linear velocity approximations and short temporal windows. This causes temporal fragmentation, repeated creation and elimination of Gaussians, loss of long-term temporal identity, and uncontrolled memory growth, which makes it hard to scale to long video sequences.
Method: Proposes TRiGS, a 4D representation based on unified, continuous geometric transformations. It combines SE(3) transformations to model rigid-body motion, hierarchical Bezier residuals to express nonlinear deformation, and learnable local anchors to model geometrically consistent motion for individual Gaussians.
Result: Achieves high-fidelity rendering on standard benchmarks and, for the first time, scales stably to long video sequences of 600 to 1,200 frames, significantly outperforming prior methods while easing memory bottlenecks and improving temporal stability.
Conclusion: Through continuous geometric modeling, TRiGS preserves the temporal identity and geometric consistency of Gaussians, providing a scalable, stable, and efficient 4D representation for long-duration dynamic scene reconstruction.
Abstract: Recent 4D Gaussian Splatting (4DGS) methods achieve impressive dynamic scene reconstruction but often rely on piecewise linear velocity approximations and short temporal windows. This disjointed modeling leads to severe temporal fragmentation, forcing primitives to be repeatedly eliminated and regenerated to track complex nonlinear dynamics. This makeshift approximation eliminates the long-term temporal identity of objects and causes an inevitable proliferation of Gaussians, hindering scalability to extended video sequences. To address this, we propose TRiGS, a novel 4D representation that utilizes unified, continuous geometric transformations. By integrating $SE(3)$ transformations, hierarchical Bezier residuals, and learnable local anchors, TRiGS models geometrically consistent rigid motions for individual primitives. This continuous formulation preserves temporal identity and effectively mitigates unbounded memory growth. Extensive experiments demonstrate that TRiGS achieves high fidelity rendering on standard benchmarks while uniquely scaling to extended video sequences (e.g., 600 to 1200 frames) without severe memory bottlenecks, significantly outperforming prior works in temporal stability.
[114] Neuropsychiatric Deviations From Normative Profiles: An MRI-Derived Marker for Early Alzheimer's Disease Detection
Synne Hjertager Osenbroch,Lisa Ramona Rosvold,Yao Lu,Alvaro Fernandez-Quilez
Main category: cs.CV
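TL;DR: This paper proposes a deep learning-based normative modeling framework that predicts neuropsychiatric symptom (NPS) burden from structural MRI. The deviation between predicted and observed NPIQ scores (DNPI) flags early Alzheimer's disease (AD) risk; DNPI predicts future AD conversion with performance comparable to cerebrospinal fluid Aβ42.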
Details
Motivation: Current tools cannot tell whether neuropsychiatric symptoms (NPS) reflect normal aging or are early signs of Alzheimer's disease (AD), which limits their value as early biomarkers.
Method: Builds a normative model based on a 3D convolutional neural network. Using structural MRI from cognitively stable participants in the ADNI cohort, it learns the mapping between brain anatomy and NPIQ scores, and defines the difference between predicted and observed values as DNPI (Divergence from NPIQ scores).
Result: DNPI is significantly associated with future AD conversion (adjusted OR = 2.5, p < 0.01) and reaches an AUC of 0.74, comparable to the gold-standard cerebrospinal fluid Aβ42 (AUC = 0.75).
Conclusion: This non-invasive, scalable imaging-behavior normative modeling approach is a promising strategy for early AD detection.
Abstract: Neuropsychiatric symptoms (NPS) such as depression and apathy are common in Alzheimer's disease (AD) and often precede cognitive decline. NPS assessments hold promise as early detection markers due to their correlation with disease progression and their non-invasive nature. Yet current tools cannot distinguish whether NPS are part of aging or early signs of AD, limiting their utility. We present a deep learning-based normative modelling framework to identify atypical NPS burden from structural MRI. A 3D convolutional neural network was trained on cognitively stable participants from the Alzheimer's Disease Neuroimaging Initiative, learning the mapping between brain anatomy and Neuropsychiatric Inventory Questionnaire (NPIQ) scores. Deviations between predicted and observed scores defined the Divergence from NPIQ scores (DNPI). Higher DNPI was associated with future AD conversion (adjusted OR=2.5; p < 0.01) and achieved predictive accuracy comparable to cerebrospinal fluid Aβ42 (AUC=0.74 vs 0.75). Our approach supports scalable, non-invasive strategies for early AD detection.
[115] Reliev3R: Relieving Feed-forward Reconstruction from Multi-View Geometric Annotations
Youyu Chen,Junjun Jiang,Yueru Luo,Kui Jiang,Xianming Liu,Xu Yan,Dave Zhenyu Chen
Main category: cs.CV
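TL;DR: This paper proposes Reliev3R, a weakly supervised paradigm for training feed-forward reconstruction models (FFRMs) from scratch without expensive multi-view geometric annotations. It uses monocular relative depths and sparse image correspondences provided by pretrained models, and introduces an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss, reaching performance comparable to fully supervised methods.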
Details
Motivation: Existing feed-forward reconstruction models (FFRMs) rely heavily on multi-view geometric annotations (such as 3D point maps and camera poses), which makes fully supervised training hard to scale.
Method: Proposes the Reliev3R weakly supervised training paradigm, which uses monocular relative depths and sparse image correspondences output by pretrained models. An ambiguity-aware relative depth loss and a trigonometry-based reprojection loss provide supervision for multi-view geometric consistency.
Result: Trained from scratch on less, cheaper data, Reliev3R performs on par with fully supervised counterparts.
Conclusion: Reliev3R advances low-cost supervision for 3D reconstruction and scalable FFRMs.
Abstract: With recent advances, Feed-forward Reconstruction Models (FFRMs) have demonstrated great potential in reconstruction quality and adaptiveness to multiple downstream tasks. However, the excessive reliance on multi-view geometric annotations, e.g. 3D point maps and camera poses, makes the fully-supervised training scheme of FFRMs difficult to scale up. In this paper, we propose Reliev3R, a weakly-supervised paradigm for training FFRMs from scratch without cost-prohibitive multi-view geometric annotations. Relieving the reliance on geometric sensory data and compute-exhaustive structure-from-motion preprocessing, our method draws 3D knowledge directly from monocular relative depths and image sparse correspondences given by zero-shot predictions of pretrained models. At the core of Reliev3R, we design an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss to facilitate supervision for multi-view geometric consistency. Training from scratch with less data, Reliev3R catches up with its fully-supervised sibling models, taking a step towards low-cost 3D reconstruction supervision and scalable FFRMs.
[116] TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object Detection
Zhijin He,Shuo Jin,Siyue Yu,Shuwei Wu,Bingfeng Zhang,Li Yu,Jimin Xiao
Main category: cs.CV
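TL;DR: This paper proposes TF-SSD, a training-free co-salient object detection (CoSOD) method. It uses SAM to generate candidate masks, DINO attention maps to filter for intra-image saliency, and an inter-image prototype selector to model saliency consistency across the image group, significantly outperforming existing methods on multiple datasets.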
Details
Motivation: Training-based CoSOD methods are constrained by closed-set datasets and generalize poorly. Vision Foundation Models (VFMs) offer strong generalization and robust saliency understanding but remain underexplored for CoSOD.
Method: Proposes the training-free method TF-SSD: 1) SAM generates a pool of raw mask candidates; 2) a SAM-based quality mask generator filters out redundant masks; 3) an intra-image saliency filter uses DINO attention maps; 4) an inter-image prototype selector computes similarity among cross-image prototypes and selects the highest-scoring masks as the final predictions.
Result: Significantly outperforms existing methods on multiple standard datasets, with gains of up to 13.7% over the most recent training-free method; the code is open source.
Conclusion: VFMs, in particular SAM and DINO working together, can effectively support training-free CoSOD. TF-SSD demonstrates strong performance and generalization and offers a new paradigm for CoSOD.
Abstract: Co-salient Object Detection (CoSOD) aims to segment salient objects that consistently appear across a group of related images. Despite the notable progress achieved by recent training-based approaches, they still remain constrained by the closed-set datasets and exhibit limited generalization. However, few studies explore the potential of Vision Foundation Models (VFMs) to address CoSOD, which demonstrate a strong generalized ability and robust saliency understanding. In this paper, we investigate and leverage VFMs for CoSOD, and further propose a novel training-free method, TF-SSD, through the synergy between SAM and DINO. Specifically, we first utilize SAM to generate comprehensive raw proposals, which serve as a candidate mask pool. Then, we introduce a quality mask generator to filter out redundant masks, thereby acquiring a refined mask set. Since this generator is built upon SAM, it inherently lacks semantic understanding of saliency. To this end, we adopt an intra-image saliency filter that employs DINO's attention maps to identify visually salient masks within individual images. Moreover, to extend saliency understanding across group images, we propose an inter-image prototype selector, which computes similarity scores among cross-image prototypes to select masks with the highest score. These selected masks serve as final predictions for CoSOD. Extensive experiments show that our TF-SSD outperforms existing methods (e.g., 13.7% gains over the recent training-free method). Codes are available at https://github.com/hzz-yy/TF-SSD.
[117] STAR: Mitigating Cascading Errors in Spatial Reasoning via Turn-point Alignment and Segment-level DPO
Pukun Zhao,Longxiang Wang,Chen Chen,Peicheng Wang,Fanqing Zhou,Runze Li,Haojian Huang
Main category: cs.CV
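TL;DR: This paper proposes STAR, a two-stage framework that improves large language models on structured spatial navigation tasks, and introduces the RedMaze-23K dataset.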
Details
Motivation: Existing spatial reasoning paradigms such as VoT are prone to cascading errors in complex topologies, so a more robust approach to structured navigation is needed.
Method: STAR is a two-stage framework. Stage one uses supervised fine-tuning so the model internalizes spatial semantics and prunes redundant paths. Stage two applies Spatial-aware Segment-level Direct Preference Optimization (SDPO) to improve self-correction in long-horizon navigation. The RedMaze-23K dataset, with human-inspired turn-point annotations, is built alongside.
Result: STAR reaches state of the art among open-source models: its 32B variant attains 29.27% accuracy, beating DeepSeek-V3 (25.00%) and reaching 82.4% of GPT-4's performance.
Conclusion: Grounded on topological anchors, STAR substantially improves the robustness and accuracy of LLMs on complex spatial navigation tasks, supporting the value of staged, structured training.
Abstract: Structured spatial navigation is a core benchmark for Large Language Models (LLMs) spatial reasoning. Existing paradigms like Visualization-of-Thought (VoT) are prone to cascading errors in complex topologies. To solve this, we propose STAR, a two-stage framework grounded on topological anchors, and introduce the RedMaze-23K dataset with human-inspired turnpoint annotations. The first stage uses supervised fine-tuning to help models internalize spatial semantics and prune redundant paths. The second adopts Spatial-aware Segment-level Direct Preference Optimization (SDPO) to refine self-correction in long-horizon navigation. Experiments show STAR achieves state-of-the-art performance among open-source models: its 32B variant outperforms DeepSeek-V3 (29.27% vs. 25.00%) and reaches 82.4% of GPT-4's performance.
[118] FecalFed: Privacy-Preserving Poultry Disease Detection via Federated Learning
Tien-Yu Chi
Main category: cs.CV
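TL;DR: This paper proposes FecalFed, a privacy-preserving federated learning framework for poultry disease classification, and releases poultry-fecal-fl, a deduplicated fecal image dataset. Under non-IID conditions, the framework substantially improves accuracy while avoiding centralization of sensitive data.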
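The non-IID setting in this entry is parameterized by a Dirichlet concentration (α = 0.5). A common way to simulate such a split, which may or may not match the authors' exact procedure, is to draw per-class client proportions from a Dirichlet distribution. The sketch below uses only the Python standard library; `dirichlet_partition` and its parameters are hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict


def dirichlet_partition(labels, num_clients, alpha=0.5, seed=0):
    """Split sample indices across clients with per-class Dirichlet(alpha) proportions.

    Illustrative only: smaller alpha gives more skewed (more non-IID) clients.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A Dirichlet sample is a vector of Gamma(alpha, 1) draws, normalised.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        # Convert cumulative proportions into cut points over this class.
        cuts, cumulative = [], 0.0
        for g in gammas[:-1]:
            cumulative += g / total
            cuts.append(int(round(cumulative * len(indices))))
        bounds = [0] + cuts + [len(indices)]
        for client, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
            clients[client].extend(indices[lo:hi])
    return clients


# Toy usage: 400 samples, four balanced classes, five simulated farms.
toy_labels = [i % 4 for i in range(400)]
parts = dirichlet_partition(toy_labels, num_clients=5, alpha=0.5, seed=1)
print([len(p) for p in parts])
```

Every sample lands on exactly one client, and the fixed seed makes the split reproducible; the toy label list stands in for real class labels.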
Details
Motivation: Addresses farm data privacy concerns, institutional data silos, and the severe, undocumented data contamination found in existing public agricultural datasets.
Method: Builds the privacy-preserving federated learning framework FecalFed; releases poultry-fecal-fl, a rigorously deduplicated dataset of 8,770 images; evaluates under a strongly non-IID setting (Dirichlet α = 0.5); uses server-side adaptive optimization (FedAdam) with Swin-Small and Swin-Tiny models.
Result: Single-farm training reaches only 64.86% accuracy, whereas FecalFed with FedAdam and Swin-Small reaches 90.31%, close to the centralized upper bound of 95.10%; the edge-optimized Swin-Tiny reaches 89.74%.
Conclusion: FecalFed offers an efficient, privacy-first approach to farm-level poultry disease monitoring, achieving high classification performance while keeping data private.
Abstract: Early detection of highly pathogenic avian influenza (HPAI) and endemic poultry diseases is critical for global food security. While computer vision models excel at classifying diseases from fecal imaging, deploying these systems at scale is bottlenecked by farm data privacy concerns and institutional data silos. Furthermore, existing open-source agricultural datasets frequently suffer from severe, undocumented data contamination. In this paper, we introduce FecalFed, a privacy-preserving federated learning framework for poultry disease classification. We first curate and release poultry-fecal-fl, a rigorously deduplicated dataset of 8,770 unique images across four disease classes, revealing and eliminating a 46.89% duplication rate in popular public repositories. To simulate realistic agricultural environments, we evaluate FecalFed under highly heterogeneous, non-IID conditions (Dirichlet α = 0.5). While isolated single-farm training collapses under this data heterogeneity, yielding only 64.86% accuracy, our federated approach recovers performance without centralizing sensitive data. Specifically, utilizing server-side adaptive optimization (FedAdam) with a Swin-Small architecture achieves 90.31% accuracy, closely approaching the centralized upper bound of 95.10%. Furthermore, we demonstrate that an edge-optimized Swin-Tiny model maintains highly competitive performance at 89.74%, establishing a highly efficient, privacy-first blueprint for on-farm avian disease monitoring.
[119] HarassGuard: Detecting Harassment Behaviors in Social Virtual Reality with Vision-Language Models
Junhee Lee,Minseok Kim,Hwanjo Heo,Seungwon Woo,Jinwoo Kim
Main category: cs.CV
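TL;DR: This paper proposes HarassGuard, a vision-language model (VLM) based system that detects physical harassment in social VR from visual input alone, enabling efficient, context-aware, proactive protection while preserving privacy.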
Details
Motivation: Safety measures on existing social VR platforms are mostly reactive, while proactive detection methods often depend on sensitive biometric data, raising privacy concerns.
Method: Constructs an IRB-approved harassment vision dataset and combines prompt engineering with fine-tuned VLMs, using visual input together with social VR contextual information to detect harassment behavior.
Result: HarassGuard reaches 88.09% accuracy in binary classification and 68.85% in multi-class classification, on par with baselines such as LSTM/CNN and Transformer, while needing only 200 fine-tuning samples (versus 1,115 for the baselines).
Conclusion: HarassGuard has clear advantages in privacy preservation, few-sample learning, and contextual reasoning, offering a new approach to proactive, non-intrusive harassment detection in social VR.
Abstract: Social Virtual Reality (VR) platforms provide immersive social experiences but also expose users to serious risks of online harassment. Existing safety measures are largely reactive, while proactive solutions that detect harassment behavior during an incident often depend on sensitive biometric data, raising privacy concerns. In this paper, we present HarassGuard, a vision-language model (VLM) based system that detects physical harassment in social VR using only visual input. We construct an IRB-approved harassment vision dataset, apply prompt engineering, and fine-tune VLMs to detect harassment behavior by considering contextual information in social VR. Experimental results demonstrate that HarassGuard achieves competitive performance compared to state-of-the-art baselines (i.e., LSTM/CNN, Transformer), reaching an accuracy of up to 88.09% in binary classification and 68.85% in multi-class classification. Notably, HarassGuard matches these baselines while using significantly fewer fine-tuning samples (200 vs. 1,115), offering unique advantages in contextual reasoning and privacy-preserving detection.
[120] Towards Viewpoint-Robust End-to-End Autonomous Driving with 3D Foundation Model Priors
Hiroki Hashimoto,Hiromichi Goto,Hiroyuki Sugai,Hiroshi Kera,Kazuhiko Kawamoto
Main category: cs.CV
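TL;DR: This paper proposes an augmentation-free trajectory planning method that uses geometric priors from a 3D foundation model (such as per-pixel 3D positions) as positional embeddings and fuses geometric features through cross-attention, improving robustness to camera viewpoint changes, especially in pitch and height.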
Details
Motivation: Existing end-to-end autonomous driving models degrade sharply under camera viewpoint changes and generalize poorly to viewpoints unseen in training, so viewpoint robustness needs to improve.
Method: Instead of data augmentation, the method derives per-pixel 3D positions from depth estimates as positional embeddings and fuses intermediate geometric features from a 3D foundation model through cross-attention.
Result: On the VR-Drive viewpoint perturbation benchmark, the method clearly reduces performance degradation under pitch and height perturbations; gains under longitudinal translation are smaller.
Conclusion: Explicit geometric priors improve viewpoint robustness, but more viewpoint-agnostic feature integration is needed to handle all kinds of camera pose changes.
Abstract: Robust trajectory planning under camera viewpoint changes is important for scalable end-to-end autonomous driving. However, existing models often depend heavily on the camera viewpoints seen during training. We investigate an augmentation-free approach that leverages geometric priors from a 3D foundation model. The method injects per-pixel 3D positions derived from depth estimates as positional embeddings and fuses intermediate geometric features through cross-attention. Experiments on the VR-Drive camera viewpoint perturbation benchmark show reduced performance degradation under most perturbation conditions, with clear improvements under pitch and height perturbations. Gains under longitudinal translation are smaller, suggesting that more viewpoint-agnostic integration is needed for robustness to camera viewpoint changes.
[121] KG-CMI: Knowledge graph enhanced cross-Mamba interaction for medical visual question answering
Xianyao Zheng,Hong Yu,Hui Cui,Changming Sun,Xiangyu Li,Ran Su,Leyi Wei,Jia Zhou,Junbo Wang,Qiangguo Jin
Main category: cs.CV
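TL;DR: This paper proposes a knowledge graph enhanced cross-Mamba interaction framework (KG-CMI) for medical visual question answering (Med-VQA). It integrates a medical knowledge graph with multimodal feature alignment and supports free-form answer generation, reaching state of the art on multiple datasets.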
Details
Motivation: Existing Med-VQA methods fail to make full use of domain expertise and struggle to associate lesion features with diagnostic criteria. Classification-based methods are also limited to predefined answer sets, so they cannot accommodate the diversity of free-form answers and lose semantic detail.
Method: Proposes the KG-CMI framework with four modules: fine-grained cross-modal feature alignment (FCFA), knowledge graph embedding (KGE), cross-modal interaction representation (CMIR), and free-form answer enhanced multi-task learning (FAMT). It integrates a medical knowledge graph into image-text cross-modal modeling and uses open-ended questions to assist training.
Result: Outperforms existing state-of-the-art methods on three Med-VQA datasets (VQA-RAD, SLAKE, and OVQA), with interpretability experiments further validating its effectiveness.
Conclusion: By using structured medical knowledge to guide cross-modal understanding and generation, KG-CMI improves the accuracy, generalization, and interpretability of Med-VQA, offering a more robust multimodal AI approach for clinical decision support.
Abstract: Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent methods fail to fully leverage domain-specific medical knowledge, making it difficult to accurately associate lesion features in medical images with key diagnostic criteria. Additionally, classification-based approaches typically rely on predefined answer sets. Treating Med-VQA as a simple classification problem limits its ability to adapt to the diversity of free-form answers and may overlook detailed semantic information in those answers. To address these challenges, we propose a knowledge graph enhanced cross-Mamba interaction (KG-CMI) framework, which consists of a fine-grained cross-modal feature alignment (FCFA) module, a knowledge graph embedding (KGE) module, a cross-modal interaction representation (CMIR) module, and a free-form answer enhanced multi-task learning (FAMT) module. The KG-CMI learns cross-modal feature representations for images and texts by effectively integrating professional medical knowledge through a graph, establishing associations between lesion features and disease knowledge. Moreover, FAMT leverages auxiliary knowledge from open-ended questions, improving the model's capability for open-ended Med-VQA. Experimental results demonstrate that KG-CMI outperforms existing state-of-the-art methods on three Med-VQA datasets, i.e., VQA-RAD, SLAKE, and OVQA. Additionally, we conduct interpretability experiments to further validate the framework's effectiveness.
[122] Fluently Lying: Adversarial Robustness Can Be Substrate-Dependent
Daye Kang,Hyeongboo Baek
Main category: cs.CV
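TL;DR: This paper identifies a new failure mode under adversarial attack, Quality Corruption (QC), in which the detection count is largely preserved while accuracy collapses. QC is observed only in EMS-YOLO, a spiking neural network (SNN) detector, and none of five mainstream defense methods detects or mitigates it, suggesting that current defenses may rest on assumptions tailored to conventional ANN models and are not adapted to new computational substrates such as SNNs.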
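To make the distinction between Quality Corruption and ordinary detection suppression concrete, the following Python sketch flags a run in which most detections survive an attack while mAP collapses. The function name, both thresholds, and the detection counts in the usage lines are assumptions made for illustration; only the mAP values (0.528 and 0.042) come from this entry, and the paper does not necessarily define QC with these cut-offs.

```python
def looks_like_quality_corruption(clean_count, adv_count, clean_map, adv_map,
                                  min_count_retention=0.7, max_map_retention=0.5):
    """Flag a count-preserving accuracy collapse between a clean and an attacked run.

    Both thresholds are hypothetical defaults, not values defined by the paper.
    """
    if clean_count <= 0 or clean_map <= 0:
        raise ValueError("the clean run needs detections and a positive mAP")
    count_retention = adv_count / clean_count  # fraction of detections that survive
    map_retention = adv_map / clean_map        # fraction of accuracy that survives
    return (count_retention >= min_count_retention
            and map_retention <= max_map_retention)


# Detection counts below are made up; the mAP values are the ones quoted in the entry.
print(looks_like_quality_corruption(1000, 720, 0.528, 0.042))  # count kept, mAP collapsed
print(looks_like_quality_corruption(1000, 150, 0.528, 0.042))  # classic suppression
```

A monitor that watches only detection count would miss the first case, which is the gap the entry describes.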
Details
Motivation: Tools for monitoring and defending against adversarial attacks generally assume that detection count drops together with accuracy, but this assumption has not been empirically tested. The authors examine whether it holds on new computational substrates such as spiking neural networks (SNNs).
Method: Across four SNN object detectors and two threat models (l-infinity and l-2), systematically evaluates how standard adversarial attacks such as PGD affect detection count and mAP, and tests whether five mainstream defense components can detect or mitigate QC.
Result: Only EMS-YOLO shows pronounced QC (detection count stays above 70% while mAP drops from 0.528 to 0.042); the other three SNN models show no QC; none of the five standard defense components detects or mitigates it.
Conclusion: Adversarial failure modes can depend on the computational substrate; to the authors' knowledge, QC is the first evidence of this. The current defense ecosystem may embed ANN-centric assumptions, so robustness evaluation and defenses need to be reconsidered for new hardware and model substrates.
Abstract: The primary tools used to monitor and defend object detectors under adversarial attack assume that when accuracy degrades, detection count drops in tandem. This coupling was assumed, not measured. We report a counterexample observed on a single model: under standard PGD, EMS-YOLO, a spiking neural network (SNN) object detector, retains more than 70% of its detections while mAP collapses from 0.528 to 0.042. We term this count-preserving accuracy collapse Quality Corruption (QC), to distinguish it from the suppression that dominates untargeted evaluation. Across four SNN architectures and two threat models (l-infinity and l-2), QC appears only in one of the four detectors tested (EMS-YOLO). On this model, all five standard defense components fail to detect or mitigate QC, suggesting the defense ecosystem may rely on a shared assumption calibrated on a single substrate. These results provide, to our knowledge, the first evidence that adversarial failure modes can be substrate-dependent.
[123] TALENT: Target-aware Efficient Tuning for Referring Image Segmentation
Shuo Jin,Siyue Yu,Bingfeng Zhang,Chao Yao,Meiqin Liu,Jimin Xiao
Main category: cs.CV
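TL;DR: This paper proposes TALENT, a framework that uses target-aware efficient tuning to address the non-target activation (NTA) problem in referring image segmentation, significantly improving segmentation accuracy.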
Details
Motivation: Existing parameter-efficient tuning (PET) methods for referring image segmentation suffer from non-target activation (NTA): visual features wrongly activate objects of the same category that are unrelated to the text description, which hurts segmentation accuracy.
Method: Proposes TALENT: 1) a Rectified Cost Aggregator (RCA) efficiently aggregates text-referred features; 2) a Target-aware Learning Mechanism (TLM) combines contextual pairwise consistency learning (building a text-referred affinity map) and target-centric contrastive learning (strengthening target localization and suppressing unrelated associations).
Result: Improves mIoU by 2.5% on the G-Ref validation set and outperforms existing methods across metrics.
Conclusion: By jointly optimizing semantic association and target localization, TALENT effectively mitigates NTA and offers a more robust and accurate solution for PET-based RIS.
Abstract: Referring image segmentation aims to segment specific targets based on a natural text expression. Recently, parameter-efficient tuning (PET) has emerged as a promising paradigm. However, existing PET-based methods often suffer from the fact that visual features can't emphasize the text-referred target instance but activate co-category yet unrelated objects. We analyze and quantify this problem, terming it the 'non-target activation' (NTA) issue. To address this, we propose a novel framework, TALENT, which utilizes target-aware efficient tuning for PET-based RIS. Specifically, we first propose a Rectified Cost Aggregator (RCA) to efficiently aggregate text-referred features. Then, to calibrate 'NTA' into accurate target activation, we adopt a Target-aware Learning Mechanism (TLM), including contextual pairwise consistency learning and target-centric contrastive learning. The former uses the sentence-level text feature to achieve a holistic understanding of the referent and constructs a text-referred affinity map to optimize the semantic association of visual features. The latter further enhances target localization to discover the distinct instance while suppressing associations with other unrelated ones. The two objectives work in concert and address 'NTA' effectively. Extensive evaluations show that TALENT outperforms existing methods across various metrics (e.g., 2.5% mIoU gains on G-Ref val set). Our codes will be released at: https://github.com/Kimsure/TALENT.
[124] DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization
Zhengxian Yang,Fei Xie,Xutao Xue,Rui Zhang,Taicheng Huang,Yang Liu,Mengqi Ji,Tao Yu
Main category: cs.CV
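TL;DR: This paper proposes DirectFisheye-GS, which integrates the fisheye camera model natively into the 3D Gaussian Splatting (3DGS) framework, avoiding the information loss and detail dilution caused by undistortion preprocessing. It also introduces a feature-overlap-driven cross-view joint optimization strategy that reduces floaters at image edges and improves reconstruction quality.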
Details
Motivation: Existing 3DGS methods for fisheye images first undistort the images, which produces black borders (information loss) and stretch-and-interpolate resampling (diluted detail and overfitting to low-frequency regions), undermining the fisheye's wide field of view.
Method: 1) Embeds the fisheye camera model directly in the 3DGS rendering pipeline so native fisheye images can be used as input; 2) proposes a feature-overlap-driven cross-view joint optimization strategy that applies consistent geometric and photometric constraints across views to suppress abnormal shapes (such as oversized or elongated Gaussians) near image edges.
Result: DirectFisheye-GS matches or surpasses state-of-the-art performance on public datasets, effectively removing black borders, blur, and floating artifacts while keeping the fisheye's wide field of view.
Conclusion: Native fisheye input combined with cross-view joint optimization substantially improves the fidelity and robustness of 3DGS reconstruction for wide-angle imaging; the idea also carries over to pinhole-camera pipelines.
Abstract: 3D Gaussian Splatting (3DGS) has enabled efficient 3D scene reconstruction from everyday images with real-time, high-fidelity rendering, greatly advancing VR/AR applications. Fisheye cameras, with their wider field of view (FOV), promise high-quality reconstructions from fewer inputs and have recently attracted much attention. However, since 3DGS relies on rasterization, most subsequent works involving fisheye camera inputs first undistort images before training, which introduces two problems: 1) Black borders at image edges cause information loss and negate the fisheye's large FOV advantage; 2) Undistortion's stretch-and-interpolate resampling spreads each pixel's value over a larger area, diluting detail density, which causes 3DGS to overfit these low-frequency zones, producing blur and floating artifacts. In this work, we integrate the fisheye camera model into the original 3DGS framework, enabling native fisheye image input for training without preprocessing. Despite correct modeling, we observed that the reconstructed scenes still exhibit floaters at image edges: Distortion increases toward the periphery, and 3DGS's original per-iteration random-selecting-view optimization ignores the cross-view correlations of a Gaussian, leading to extreme shapes (e.g., oversized or elongated) that degrade reconstruction quality. To address this, we introduce a feature-overlap-driven cross-view joint optimization strategy that establishes consistent geometric and photometric constraints across views, a technique equally applicable to existing pinhole-camera-based pipelines. Our DirectFisheye-GS matches or surpasses state-of-the-art performance on public datasets.
[125] When AI and Experts Agree on Error: Intrinsic Ambiguity in Dermatoscopic Images
Loris Cino,Pier Luigi Mazzeo,Alessandro Martella,Giulia Radi,Renato Rossi,Cosimo Distante
Main category: cs.CV
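TL;DR: This study finds that dermatoscopic images on which AI fails are also hard for human experts to diagnose, indicating that these images are intrinsically ambiguous rather than exposing an algorithmic flaw; image quality is the main cause of the simultaneous drop in AI and human diagnostic performance.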
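Agreement in this entry is reported as Cohen's kappa (0.61 on control images versus 0.08 on the difficult ones). As a rough illustration of what that statistic measures, here is a minimal Python sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e). The function name and the toy labels are hypothetical and are not taken from the paper's released code.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items on which both raters give the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Degenerate case (both raters used one identical label throughout);
        # kappa is undefined, and returning 1.0 here is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy usage with made-up labels: one disagreement out of four cases.
truth = ["melanoma", "nevus", "melanoma", "nevus"]
expert = ["melanoma", "nevus", "nevus", "nevus"]
print(cohens_kappa(truth, expert))
```

In this toy case the observed agreement is 0.75 and the chance agreement is 0.5, so kappa is 0.5; values near zero, like the 0.08 quoted above, mean agreement is barely better than chance.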
Details
Motivation: Prior work mostly compares AI performance against human experts. This study instead examines the intrinsic complexity of dermatoscopic images to tell whether AI failures stem from algorithmic bias or from inherent visual ambiguity.
Method: Runs experiments with multiple CNN architectures to identify the subset of images that all models systematically misclassify. Dermatologists then independently diagnose these difficult images and a control group, and Cohen's kappa and Fleiss' kappa quantify agreement with ground truth and inter-rater consensus.
Result: On AI-misclassified images, expert agreement with ground-truth labels (Cohen's kappa) falls from 0.61 to 0.08, and inter-expert consensus (Fleiss' kappa) falls from 0.456 to 0.275; image quality is identified as a key factor behind the joint failure of AI and humans.
Conclusion: The shared failure of AI and humans on specific dermatoscopic images shows that intrinsic image ambiguity, especially poor image quality, is the core challenge, suggesting that future work should pay more attention to data quality than to algorithm tuning alone.
Abstract: The integration of artificial intelligence (AI), particularly Convolutional Neural Networks (CNNs), into dermatological diagnosis demonstrates substantial clinical potential. While existing literature predominantly benchmarks algorithmic performance against human experts, our study adopts a novel perspective by investigating the intrinsic complexity of dermatoscopic images. Through rigorous experimentation with multiple CNN architectures, we isolated a subset of images systematically misclassified across all models, a phenomenon statistically proven to exceed random chance. To determine if these failures stem from algorithmic biases or inherent visual ambiguity, expert dermatologists independently evaluated these challenging cases alongside a control group. The results revealed a collapse in human diagnostic performance on the AI-misclassified images. First, agreement with ground-truth labels plummeted, with Cohen's kappa dropping to a mere 0.08 for the difficult images, compared to 0.61 for the control group. Second, we observed a severe deterioration in expert consensus; inter-rater reliability among physicians fell from moderate concordance (Fleiss kappa = 0.456) on control images to only modest agreement (Fleiss kappa = 0.275) on difficult cases. We identified image quality as a primary driver of these dual systematic failures. To promote transparency and reproducibility, all data, code, and trained models have been made publicly available.
[126] CL-VISTA: Benchmarking Continual Learning in Video Large Language Models
Haiyang Guo,Yichen Shi,Fei Zhu,Wenzhuo Liu,Hongbo Zhao,Fanhu Zeng,Shijie Ma,Da-Han Wang,Xu-Yao Zhang
Main category: cs.CV
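TL;DR: This paper proposes CL-VISTA, a benchmark for evaluating continual learning in video large language models (Video-LLMs). It covers 8 diverse tasks and 6 evaluation protocols and reveals a fundamental trade-off among performance, computational efficiency, and memory overhead in existing continual learning methods.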
Details
Motivation: Existing continual learning benchmarks cannot properly evaluate modern pretrained Video-LLMs: task redundancy is high and forgetting is barely noticeable.
Method: Builds the CL-VISTA benchmark with 8 diverse video tasks spanning perception, understanding, and reasoning, and designs 6 evaluation protocols across three dimensions (performance, computational efficiency, and memory footprint). The performance dimension adds a general video understanding test to separate generalization from overfitting.
Result: A systematic evaluation of 10 mainstream continual learning methods shows that no method is best on all dimensions; methods that mitigate catastrophic forgetting tend to sacrifice generalization or incur excessive compute and memory costs.
Conclusion: CL-VISTA provides a more realistic and comprehensive evaluation standard for continual learning in Video-LLMs and exposes key trade-offs, which may help advance continual learning for multimodal foundation models.
Abstract: Video Large Language Models (Video-LLMs) require continual learning to adapt to non-stationary real-world data. However, existing benchmarks fall short of evaluating modern foundation models: many still rely on models without large-scale pre-training, and prevailing benchmarks typically partition a single dataset into sub-tasks, resulting in high task redundancy and negligible forgetting on pre-trained Video-LLMs. To address these limitations, we propose CL-VISTA, a benchmark tailored for continual video understanding of Video-LLMs. By curating 8 diverse tasks spanning perception, understanding, and reasoning, CL-VISTA induces substantial distribution shifts that effectively expose catastrophic forgetting. To systematically assess CL methods, we establish a comprehensive evaluation framework comprising 6 distinct protocols across 3 critical dimensions: performance, computational efficiency, and memory footprint. Notably, the performance dimension incorporates a general video understanding assessment to determine whether CL methods genuinely enhance foundational intelligence or merely induce task-specific overfitting. Extensive benchmarking of 10 mainstream CL methods reveals a fundamental trade-off: no single approach achieves universal superiority across all dimensions. Methods that successfully mitigate catastrophic forgetting tend to compromise generalization or incur prohibitive computational and memory overheads. We hope CL-VISTA provides critical insights for advancing continual learning in multimodal foundation models.
[127] MoonAnything: A Vision Benchmark with Large-Scale Lunar Supervised Data
Clémentine Grethen,Yuang Shi,Simone Gasparini,Géraldine Morin
Main category: cs.CV
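TL;DR: This paper presents MoonAnything, a unified benchmark built on real lunar topography with physically based rendering that provides geometric and photometric supervision. It comprises two subsets, LunarGeo (stereo images with depth maps) and LunarPhoto (photorealistic images under multiple illuminations), totaling over 130K samples; it supports tasks such as lunar surface perception and reflectance estimation, and the data and generation tools are open source.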
Details
Motivation: Existing lunar datasets lack geometric ground truth, photometric realism, illumination diversity, or large-scale coverage, which holds back learning-based lunar perception systems.
Method: Builds the MoonAnything benchmark on real lunar topography with physically based rendering. It has two subsets: LunarGeo (stereo images, dense depth maps, and camera calibration) and LunarPhoto (photorealistic multi-illumination images generated with a spatially-varying BRDF model), with full supervision signals.
Result: Releases a comprehensive lunar benchmark with more than 130K samples covering geometric and photometric supervision, establishes baselines with state-of-the-art methods, and open-sources all data and generation tools.
Conclusion: MoonAnything fills the gap in joint geometric and photometric supervision for lunar perception datasets. Beyond lunar exploration, it offers a general testbed for vision algorithms in low-texture, high-contrast scenes and on other airless celestial bodies.
Abstract: Accurate perception of lunar surfaces is critical for modern lunar exploration missions. However, developing robust learning-based perception systems is hindered by the lack of datasets that provide both geometric and photometric supervision. Existing lunar datasets typically lack either geometric ground truth, photometric realism, illumination diversity, or large-scale coverage. In this paper, we introduce MoonAnything, a unified benchmark built on real lunar topography with physically-based rendering, providing the first comprehensive geometric and photometric supervision under diverse illumination with large scale. The benchmark comprises two complementary sub-datasets: i) LunarGeo provides stereo images with corresponding dense depth maps and camera calibration enabling 3D reconstruction and pose estimation; ii) LunarPhoto provides photorealistic images using a spatially-varying BRDF model, along with multi-illumination renderings under real solar configurations, enabling reflectance estimation and illumination-robust perception. Together, these datasets offer over 130K samples with comprehensive supervision. Beyond lunar applications, MoonAnything offers a unique setting and challenging testbed for algorithms under low-textured, high-contrast conditions and applies to other airless celestial bodies and could generalize beyond. We establish baselines using state-of-the-art methods and release the complete dataset along with generation tools to support community extension: https://github.com/clementinegrethen/MoonAnything.
[128] TP-Seg: Task-Prototype Framework for Unified Medical Lesion Segmentation
Jiawei Xu,Qiangqiang Zhou,Dandan Zhu,Yong Chen,Yugen Yi,Xiaoqi Zhao
Main category: cs.CV
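TL;DR: This paper proposes TP-Seg, a framework that uses a task-conditioned adapter and a prototype-guided decoder for unified medical lesion segmentation, outperforming existing methods on 8 different tasks.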
Details
Motivation: Existing unified segmentation methods share an encoder, which leads to feature entanglement, gradient interference, and weak lesion discrimination, so a more effective unified modeling approach is needed.
Method: Proposes TP-Seg: 1) a task-conditioned adapter with a dual-path expert structure balances shared and task-specific representations; 2) a prototype-guided decoder introduces learnable task prototypes as semantic anchors and uses cross-attention for fine-grained modeling of foreground and background semantics.
Result: Across 8 medical lesion segmentation tasks covering multiple imaging modalities, TP-Seg consistently outperforms specialized, general, and other unified segmentation methods, showing strong generalization, scalability, and clinical applicability.
Conclusion: TP-Seg offers a new paradigm for unified medical lesion segmentation, mitigating feature entanglement and gradient interference and improving cross-task, cross-modality performance.
Abstract: Building a unified model with a single set of parameters to efficiently handle diverse types of medical lesion segmentation has become a crucial objective for AI-assisted diagnosis. Existing unified segmentation approaches typically rely on shared encoders across heterogeneous tasks and modalities, which often leads to feature entanglement, gradient interference, and suboptimal lesion discrimination. In this work, we propose TP-Seg, a task-prototype framework for unified medical lesion segmentation. On one hand, the task-conditioned adapter effectively balances shared and task-specific representations through a dual-path expert structure, enabling adaptive feature extraction across diverse medical imaging modalities and lesion types. On the other hand, the prototype-guided task decoder introduces learnable task prototypes as semantic anchors and employs a cross-attention mechanism to achieve fine-grained modeling of task-specific foreground and background semantics. Without bells and whistles, TP-Seg consistently outperforms specialized, general and unified segmentation methods across 8 different medical lesion segmentation tasks covering multiple imaging modalities, demonstrating strong generalization, scalability and clinical applicability.
[129] TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning
Soumya Shamarao Jahagirdar,Edson Araujo,Anna Kukleva,M. Jehanzeb Mirza,Saurabhchand Bhati,Samuel Thomas,Brian Kingsbury,Rogerio Feris,James R. Glass,Hilde Kuehne
Main category: cs.CV
TL;DR: 本文提出TTA-Vid方法,利用测试时强化学习实现视频语言模型的零样本自适应,无需标注数据即可在推理阶段对单个或少量视频样本进行实时优化,并通过频次奖励和多臂老虎机帧选择策略提升跨数据集泛化能力。
Details
Motivation: Existing video reasoning models depend on large-scale supervised data and multi-stage training, which makes them costly and hard to transfer; a lightweight, label-free method that adapts quickly to new domains is needed.
Method: Proposes test-time adaptation for video (TTA-Vid): (1) step-by-step reasoning over multiple frame subsets, with test-time parameter updates driven by a batch-aware frequency-based reward; (2) a multi-armed bandit strategy for adaptive frame selection guided by the same reward.
Result: TTA-Vid improves consistently across video reasoning tasks and can outperform state-of-the-art methods; adapting on a single sample or a single batch generalizes to the whole dataset and even across datasets, with no ground-truth labels or dedicated training splits.
Conclusion: Test-time reinforcement learning offers an efficient, general, label-free approach to temporal multimodal video understanding and lowers the barrier to deploying and adapting models.
Abstract: Recent video reasoning models have shown strong results on temporal and multimodal understanding, yet they depend on large-scale supervised data and multi-stage training pipelines, making them costly to train and difficult to adapt to new domains. In this work, we leverage the paradigm of Test-Time Reinforcement Learning on video-language data to allow for adapting a pretrained model to incoming video samples at test-time without explicit labels. The proposed test-time adaptation for video approach (TTA-Vid) combines two components that work simultaneously: (1) a test-time adaptation that performs step-by-step reasoning at inference time on multiple frame subsets. We then use a batch-aware frequency-based reward computed across different frame subsets as pseudo ground truth to update the model. It shows that the resulting model, trained on a single batch or even a single sample from a dataset, is able to generalize at test-time to the whole dataset and even across datasets. Because the adaptation occurs entirely at test time, our method requires no ground-truth annotations or dedicated training splits. Additionally, we propose a multi-armed bandit strategy for adaptive frame selection that learns to prioritize informative frames, guided by the same reward formulation. Our evaluation shows that TTA-Vid yields consistent improvements across various video reasoning tasks and is able to outperform current state-of-the-art methods trained on large-scale data. This highlights the potential of test-time reinforcement learning for temporal multimodal understanding.
[130] A Benchmark of State-Space Models vs. Transformers and BiLSTM-based Models for Historical Newspaper OCR
Merveilles Agbeti-messan,Thierry Paquet,Clément Chatelain,Pierrick Tranouez,Stéphane Nicolas
Main category: cs.CV
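TL;DR: This paper proposes the first OCR architecture based on state-space models (SSMs), combining a CNN with bidirectional and autoregressive Mamba. A large-scale evaluation on historical newspaper OCR shows accuracy close to SOTA models such as DAN, with about 2x faster inference and better memory scaling. Code and high-quality annotated data are released.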
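For readers unfamiliar with the character error rate (CER) figures quoted in this entry, the following is a minimal sketch of the usual definition: Levenshtein edit distance divided by reference length. It is an illustration only; the paper's evaluation protocol may normalize text differently, and the function names are chosen here for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, and substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of hypothesis `hyp` against reference `ref`."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(cer("kitten", "sitting"))
```

Under this definition a CER of 2% means roughly two character edits per hundred reference characters.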
Details
Motivation: Transformer-based OCR models have quadratic complexity, which makes long text (for example, paragraph level) and large-scale deployment inefficient. Historical newspaper OCR must also cope with degraded print and complex layouts.
Method: Designs an OCR architecture based on Mamba (a linear-time SSM): a CNN visual encoder plus bidirectional and autoregressive Mamba sequence modeling. CTC, autoregressive, and non-autoregressive decoding are compared under identical training conditions. Baselines cover Transformer and BiLSTM recognizers and widely used OCR engines (Tesseract, TrOCR, and others).
Result: On a historical newspaper dataset with high-confidence annotations, all neural models reach about 2% CER. The Mamba model reaches 6.07% CER at paragraph level (slightly above DAN's 5.24%) while running 2.05x faster, with memory growth of only 1.26x (versus 2.30x for the Transformer). Cross-script tests on Fraktur and Antiqua are robust.
Conclusion: SSMs, Mamba in particular, are an efficient and scalable alternative to Transformers for OCR, with a markedly better accuracy-efficiency trade-off than existing methods, and are suited to large-scale cultural heritage digitization. The open code, models, and evaluation protocols support reproducible research.
Abstract: End-to-end OCR for historical newspapers remains challenging, as models must handle long text sequences, degraded print quality, and complex layouts. While Transformer-based recognizers dominate current research, their quadratic complexity limits efficient paragraph-level transcription and large-scale deployment. We investigate linear-time State-Space Models (SSMs), specifically Mamba, as a scalable alternative to Transformer-based sequence modeling for OCR. We present, to our knowledge, the first OCR architecture based on SSMs, combining a CNN visual encoder with bi-directional and autoregressive Mamba sequence modeling, and conduct a large-scale benchmark comparing SSMs with Transformer- and BiLSTM-based recognizers. Multiple decoding strategies (CTC, autoregressive, and non-autoregressive) are evaluated under identical training conditions alongside strong neural baselines (VAN, DAN, DANIEL) and widely used off-the-shelf OCR engines (PERO-OCR, Tesseract OCR, TrOCR, Gemini). Experiments on historical newspapers from the Bibliothèque nationale du Luxembourg, with newly released >99% verified gold-standard annotations, and cross-dataset tests on Fraktur and Antiqua lines, show that all neural models achieve low error rates (~2% CER), making computational efficiency the main differentiator. Mamba-based models maintain competitive accuracy while halving inference time and exhibiting superior memory scaling (1.26x vs 2.30x growth at 1000 chars), reaching 6.07% CER at the severely degraded paragraph level compared to 5.24% for DAN, while remaining 2.05x faster. We release code, trained models, and standardized evaluation protocols to enable reproducible research and guide practitioners in large-scale cultural heritage OCR.
[131] IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models
Dong-Jae Lee,Sunghyun Baek,Junmo Kim
Main category: cs.CV
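TL;DR: This paper proposes a training-free visual token pruning framework. From the dual-form view of attention, it recasts attention as an implicit linear layer and prunes tokens by selecting the best subset of rank-1 updates. It introduces a new metric for a token's information magnitude and redundancy, and a Progressive Chunked Maximal Marginal Relevance algorithm for efficient selection, achieving a better balance between performance and efficiency in large vision-language models.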
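The "implicit linear layer" view in this entry can be checked numerically in the simplest setting. The sketch below assumes unnormalized linear attention (no softmax), where the output for a query q is the sum over tokens of (q · k_i) v_i, and shows that this equals W q with W the sum of rank-1 outer products v_i k_i^T. The paper extends the idea to softmax attention, which this sketch does not cover; all function names are illustrative.

```python
def outer(v, k):
    """Rank-1 outer product v k^T as a list of rows."""
    return [[vi * kj for kj in k] for vi in v]


def dual_weight(keys, values):
    """Sum of rank-1 updates, one per token's key-value pair."""
    w = [[0.0] * len(keys[0]) for _ in range(len(values[0]))]
    for k, v in zip(keys, values):
        update = outer(v, k)
        w = [[a + b for a, b in zip(rw, ru)] for rw, ru in zip(w, update)]
    return w


def mat_vec(m, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def linear_attention(q, keys, values):
    """Unnormalized linear attention: sum_i (q . k_i) * v_i."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = sum(a * b for a, b in zip(q, k))
        out = [o + score * vi for o, vi in zip(out, v)]
    return out


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 2.0]]
    values = [[1.0, 1.0], [3.0, 0.0]]
    q = [2.0, 1.0]
    print(linear_attention(q, keys, values))
    print(mat_vec(dual_weight(keys, values), q))
```

In this view, dropping a token removes exactly one rank-1 term from W, which is why token pruning can be read as approximating the full weight matrix with a subset of its updates.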
Details
Motivation: Existing visual token pruning methods are mostly empirical and overlook the internal mechanism of attention, which limits pruning quality and leaves it without theoretical grounding.
Method: From the dual-form view of attention, models attention as an implicit linear layer whose weights are the sum of rank-1 outer products generated by each token's key-value pair. On that basis it defines a new pruning metric that captures each token's information magnitude and redundancy, and proposes the Progressive Chunked Maximal Marginal Relevance algorithm for efficient subset selection.
Result: Validated on multiple image and video understanding tasks. Compared with existing pruning methods, it significantly reduces compute while maintaining or even improving model performance, giving a better performance-efficiency trade-off.
Conclusion: The work provides an efficient, training-free paradigm for visual token pruning and, through the dual view, deepens understanding of the relationship between attention and token importance, offering new directions for pruning and efficient LVLM research.
Abstract: Large Vision Language Models show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through empirical approaches while overlooking the internal mechanism of attention. In this paper, we propose a novel training-free token pruning framework grounded in the dual-form perspective of attention. We reformulate attention as an implicit linear layer whose weight matrix is the sum of rank-1 outer products, each generated by a single token's key-value pair. Token pruning thus reduces to selecting an optimal subset of these rank-1 updates that best approximates the original dual weight matrix. Extending this perspective to standard softmax attention in LVLMs, we derive a novel metric quantifying both a token's information magnitude and information duplication. To efficiently select the subset with the proposed metric, we introduce Progressive Chunked Maximal Marginal Relevance. Extensive experiments demonstrate that our method achieves a better trade-off between performance and efficiency, while providing another perspective on existing pruning approaches.
[132] PrivHAR-Bench: A Graduated Privacy Benchmark Dataset for Video-Based Action Recognition
Samar Ansari
Main category: cs.CV
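TL;DR: This paper presents PrivHAR-Bench, a multi-tier benchmark dataset for evaluating the privacy-utility trade-off in video action recognition. It covers privacy transformations ranging from lightweight spatial obfuscation to cryptographic block permutation, and ships with annotations, splits, and an evaluation toolkit.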
Details
Motivation: Existing research on privacy-preserving Human Activity Recognition (HAR) is limited to binary comparisons (clear video versus a single privacy transformation), which makes cross-method comparison difficult and hides the fine-grained relationship between privacy strength and recognition performance.
Method: Builds the multi-tier PrivHAR-Bench benchmark: applies 9 tiers of visual transformations of increasing privacy strength to 1,932 source videos (plus background-removed variants); curates 15 action classes with high articulation diversity; provides lossless frame sequences, per-frame bounding boxes, confidence-aware pose keypoints, standardized group-based splits, and a unified evaluation toolkit.
Result: Validated with R3D-18: within-tier accuracy falls from 88.8% (clear) to 53.5% (encrypted, background-removed) as privacy strength increases, and cross-domain accuracy collapses to 4.8%, giving an interpretable degradation curve.
Conclusion: PrivHAR-Bench provides a controlled, standardized evaluation platform for privacy-preserving HAR methods and supports quantitative study and fair comparison of the privacy-utility trade-off.
Abstract: Existing research on privacy-preserving Human Activity Recognition (HAR) typically evaluates methods against a binary paradigm: clear video versus a single privacy transformation. This limits cross-method comparability and obscures the nuanced relationship between privacy strength and recognition utility. We introduce PrivHAR-Bench, a multi-tier benchmark dataset designed to standardize the evaluation of the Privacy-Utility Trade-off in video-based action recognition. PrivHAR-Bench applies a graduated spectrum of visual privacy transformations, from lightweight spatial obfuscation to cryptographic block permutation, to a curated subset of 15 activity classes selected for human articulation diversity. Each of the 1,932 source videos is distributed across 9 parallel tiers of increasing privacy strength, with additional background-removed variants to isolate the contribution of human motion features from contextual scene bias. We provide lossless frame sequences, per-frame bounding boxes, estimated pose keypoints with joint-level confidence scores, standardized group-based train/test splits, and an evaluation toolkit computing recognition accuracy and privacy metrics. Empirical validation using R3D-18 demonstrates a measurable and interpretable degradation curve across tiers, with within-tier accuracy declining from 88.8% (clear) to 53.5% (encrypted, background-removed) and cross-domain accuracy collapsing to 4.8%, establishing PrivHAR-Bench as a controlled benchmark for comparing privacy-preserving HAR methods under standardized conditions. The dataset, generation pipeline, and evaluation code are publicly available.
[133] An Approach to Enriching Surgical Video Datasets for Fine-Grained Spatial-Temporal Understanding of Vision-Language Models
Lennart Maack,Alexander Schlaefer
Main category: cs.CV
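TL;DR: This paper proposes SurgSTU-Pipeline, used to generate SurgSTU, a surgical video dataset with high-quality, fine-grained spatial-temporal annotations (7,515 video clips and 150k question-answer samples), which substantially improves vision-language models on spatial-temporal understanding of surgical video.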
Details
Motivation: Existing surgical vision-language datasets struggle to capture and evaluate complex, interleaved spatial-temporal dynamics. Manual annotation is costly, and automatic generation with large models is error-prone.
Method: Designs the deterministic SurgSTU-Pipeline with temporal and spatial continuity filtering, builds the SurgSTU dataset from public surgical videos, and evaluates VLM spatial-temporal understanding in zero-shot, in-context learning, and fine-tuning settings.
Result: SurgSTU contains 7,515 video clips and 150k fine-grained spatial-temporal question-answer samples. The fine-tuned VLM performs best across the spatial-temporal tasks. Generalist VLMs perform poorly zero-shot but improve with in-context learning.
Conclusion: The SurgSTU dataset improves VLM understanding of fine-grained spatial-temporal relationships in surgical video, and the generation pipeline provides a reliable data foundation for surgical multimodal understanding.
Abstract: Surgical video understanding is a crucial prerequisite for advancing Computer-Assisted Surgery. While vision-language models (VLMs) have recently been applied to the surgical domain, existing surgical vision-language datasets fall short in capturing and evaluating complex, interleaved spatial-temporal dynamics. Creating large-scale datasets that accurately represent fine-grained spatial-temporal relationships in surgical videos is challenging due to costly manual annotations or error-prone generation using large language models. To address this gap, we introduce the SurgSTU-Pipeline, a deterministic generation pipeline featuring temporal and spatial continuity filtering to reliably create surgical datasets for fine-grained spatial-temporal multimodal understanding. Applying this pipeline to publicly available surgical datasets, we create the SurgSTU dataset, comprising 7515 video clips densely extended with 150k fine-grained spatial-temporal question-answer samples. Our comprehensive evaluation shows that while state-of-the-art generalist VLMs struggle in zero-shot settings, their spatial-temporal capabilities can be improved through in-context learning. A VLM fine-tuned on the SurgSTU training dataset achieves the highest performance across all spatial-temporal tasks, validating the dataset's efficacy in improving spatial-temporal understanding of VLMs in surgical videos. Code will be made publicly available.
[134] HICT: High-precision 3D CBCT reconstruction from a single X-ray
Wen Ma,Jiaxiang Liu,Zikai Xiao,Ziyang Wang,Feng Yang,Zuozhu Liu
Main category: cs.CV
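TL;DR: HiCT is a two-stage framework that uses a video diffusion model to generate multi-view projections from a single panoramic X-ray, then reconstructs high-fidelity CBCT with a ray-based dynamic attention network, substantially reducing radiation dose and cost.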
Details
Motivation: CBCT provides accurate 3D dental imaging, but its high radiation dose and cost limit clinical adoption. Reconstructing 3D volumes from a single low-dose panoramic X-ray still faces geometric inconsistency and limited accuracy.
Method: Proposes the two-stage HiCT framework. Stage one uses a video diffusion model to generate geometrically consistent multi-view projections from a single panoramic X-ray. Stage two uses a ray-based dynamic attention network and an X-ray sampling strategy to reconstruct high-fidelity CBCT from the projections. The authors also build the large-scale XCT dataset, which includes 500 paired PX-CBCT samples.
Result: HiCT reaches state-of-the-art performance in multiple experiments, delivering accurate, geometrically consistent 3D reconstruction with potential for clinical use.
Conclusion: HiCT offers a feasible route to low-dose, low-cost 3D dental imaging and advances alternatives to CBCT.
Abstract: Accurate 3D dental imaging is vital for diagnosis and treatment planning, yet CBCT's high radiation dose and cost limit its accessibility. Reconstructing 3D volumes from a single low-dose panoramic X-ray is a promising alternative but remains challenging due to geometric inconsistencies and limited accuracy. We propose HiCT, a two-stage framework that first generates geometrically consistent multi-view projections from a single panoramic image using a video diffusion model, and then reconstructs high-fidelity CBCT from the projections using a ray-based dynamic attention network and an X-ray sampling strategy. To support this, we built XCT, a large-scale dataset combining public CBCT data with 500 paired PX-CBCT cases. Extensive experiments show that HiCT achieves state-of-the-art performance, delivering accurate and geometrically consistent reconstructions for clinical use.
[135] Multimodal Language Models Cannot Spot Spatial Inconsistencies
Om Khangaonkar,Hadi J. Rad,Hamed Pirsiavash
Main category: cs.CV
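TL;DR: This paper introduces a new task, identifying the object that violates 3D motion consistency in a two-view scene, together with a data generation method for evaluating spatial-consistency reasoning in multimodal large language models (MLLMs). Experiments show that current MLLMs fall far short of humans on the task, revealing that their understanding of 3D structure remains fragile and incomplete.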
Details
Motivation: Although MLLMs have advanced in multimodal understanding, they perform poorly at cross-view 3D geometric reasoning and lack reliable modeling of spatial consistency. Deeper evaluation of physical-world modeling is needed.
Method: Proposes a simple, scalable method that automatically generates spatially inconsistent image pairs from real multi-view scenes, used to systematically evaluate a model's ability to judge 3D motion consistency.
Result: State-of-the-art MLLMs score significantly below human level on the task, and performance fluctuates widely with scene attributes, indicating a fragile and incomplete understanding of 3D structure.
Conclusion: Current MLLMs have not yet built a solid spatial understanding of the physical world. Future work needs deeper, more grounded 3D geometric modeling.
Abstract: Spatial consistency is a fundamental property of the visual world and a key requirement for models that aim to understand physical reality. Despite recent advances, multimodal large language models (MLLMs) often struggle to reason about 3D geometry across multiple views. Rather than asking models to describe scene attributes, we introduce a more challenging task: given two views of the same scene, identify the object that violates 3D motion consistency. We propose a simple and scalable method for generating realistic, spatially inconsistent image pairs from multi-view scenes, enabling systematic evaluation of this capability. Our results show that state-of-the-art MLLMs significantly underperform human observers and exhibit substantial variability across different scene attributes, revealing a fragile and incomplete understanding of 3D structure. We hope our findings underscore the need for approaches that develop a more deeply grounded understanding of the physical world.
[136] Revisiting Human-in-the-Loop Object Retrieval with Pre-Trained Vision Transformers
Kawtar Zaher,Olivier Buisson,Alexis Joly
Main category: cs.CV
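TL;DR: This paper revisits human-in-the-loop object retrieval using pre-trained ViT representations. For multi-object datasets it examines how to select object instances, what form annotations should take, which active selection strategy to use, and which feature representation works best, comparing representation strategies on the trade-off between capturing global context and capturing local detail.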
Details
Motivation: Existing methods are limited on multi-object datasets, where the target occupies only a small region of a complex scene, and need better-adapted local feature representations. The key design questions of human-in-the-loop retrieval (instance selection, annotation form, active sampling, representation strategy) have also not been examined systematically.
Method: Builds a human-in-the-loop active learning retrieval framework on pre-trained ViT features. Retrieval is formulated as binary classification, and the model is updated iteratively from user relevance feedback. The study focuses on how object instances are selected within an image, the form of annotation (such as bounding boxes or masks), the active learning sample selection strategy, and the effect of global versus local feature representations on performance.
Result: Several representation strategies are compared systematically on multi-object datasets, exposing the trade-off between modeling global context and extracting fine-grained local object features, confirming the importance of adapted local representations for retrieval accuracy, and yielding practical design guidance for interactive retrieval pipelines.
Conclusion: Pre-trained vision models such as ViT can effectively support human-in-the-loop object retrieval. For complex multi-object scenes, representation strategies that focus on local object regions should be preferred, combined with sensible active learning sampling and lightweight user annotation, to achieve efficient and accurate interactive retrieval.
Abstract: Building on existing approaches, we revisit Human-in-the-Loop Object Retrieval, a task that consists of iteratively retrieving images containing objects of a class-of-interest, specified by a user-provided query. Starting from a large unlabeled image collection, the aim is to rapidly identify diverse instances of an object category relying solely on the initial query and the user's Relevance Feedback, with no prior labels. The retrieval process is formulated as a binary classification task, where the system continuously learns to distinguish between images that are relevant and non-relevant to the query, through iterative user interaction. This interaction is guided by an Active Learning loop: at each iteration, the system selects informative samples for user annotation, thereby refining the retrieval performance. This task is particularly challenging in multi-object datasets, where the object of interest may occupy only a small region of the image within a complex, cluttered scene. Unlike object-centered settings where global descriptors often suffice, multi-object images require more adapted, localized descriptors. In this work, we formulate and revisit the Human-in-the-Loop Object Retrieval task by leveraging pre-trained ViT representations, and addressing key design questions, including which object instances to consider in an image, what form the annotations should take, how Active Selection should be applied, and which representation strategies best capture the object's features. We compare several representation strategies across multi-object datasets, highlighting trade-offs between capturing the global context and focusing on fine-grained local object details. Our results offer practical insights for the design of effective interactive retrieval pipelines based on Active Learning for object class retrieval.
[137] DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
Sicheng Zuo,Zixun Xie,Wenzhao Zheng,Shaoqing Xu,Fang Li,Hanbing Li,Long Chen,Zhi-Xin Yang,Jiwen Lu
Main category: cs.CV
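TL;DR: This paper proposes the Vision-Geometry-Action (VGA) paradigm, which uses dense 3D geometry rather than language descriptions as the key cue, and designs a streaming Driving Visual Geometry Transformer (DVGT-2) that supports online inference. DVGT-2 combines efficiency with accurate geometry reconstruction and generalizes planning across camera configurations without fine-tuning.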
Details
Motivation: Conventional end-to-end autonomous driving relies on sparse perception or language assistance, yet vehicles operate in 3D space, and dense 3D geometry provides more comprehensive information for decision-making. Existing geometry reconstruction methods (such as DVGT) are computationally expensive and do not support online planning.
Method: Proposes the streaming Driving Visual Geometry Transformer (DVGT-2), which uses temporal causal attention and a cache of historical features, combined with a sliding-window strategy, to jointly output dense geometry reconstruction and trajectory planning from real-time single-frame input.
Result: DVGT-2 outperforms existing methods at geometry reconstruction on multiple datasets while running significantly faster. Without fine-tuning, it applies directly to different camera configurations, and planning is validated on the NAVSIM (closed-loop) and nuScenes (open-loop) benchmarks.
Conclusion: Dense 3D geometry is a more fundamental representation for autonomous driving than language. DVGT-2 demonstrates that an efficient, online, generalizable VGA paradigm is feasible and offers a new technical route for end-to-end driving.
Abstract: End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As vehicles operate in a 3D world, we think dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further enhance efficiency, we propose a sliding-window streaming strategy and use historical caches within a certain interval to avoid repetitive computations. Despite the faster speed, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be directly applied to planning across diverse camera configurations without fine-tuning, including closed-loop NAVSIM and open-loop nuScenes benchmarks.
[138] Multicentric thrombus segmentation using an attention-based recurrent network with gradual modality dropout
Sofia Vargas-Ibarra,Vincent Vigneron,Hichem Maaref,Sonia Garcia-Salicetti
Main category: cs.CV
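TL;DR: This paper proposes a progressive multimodal learning approach that combines attention with a recurrent network (UpAttLLSTM) to robustly detect and segment tiny, low-contrast lesions in 3D brain scans (such as thrombi in ischemic stroke), while handling missing modalities, anisotropy, and domain shift in multi-center data.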
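The Dice scores reported in this entry (0.65 and about 0.35) can be read against the standard overlap definition, sketched below for flattened binary masks. This is the textbook formula, not code from the paper; the convention of returning 1.0 when both masks are empty is an assumption made here, and the function name is illustrative.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return 2 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 1, 0]
    print(dice_score(predicted, reference))
```

Because the denominator counts only foreground voxels, a few misplaced voxels move the score sharply when the target is tiny, which is one reason small-lesion Dice values tend to be low.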
Details
Motivation: Detecting and segmenting tiny targets such as thrombi in 3D brain imaging is hard: targets are small and low-contrast, their appearance is inconsistent across modalities, and real multi-center data bring domain shift, anisotropy, and missing sequences.
Method: Proposes UpAttLLSTM, an attention-based recurrent segmentation network that models cross-slice context with 2.5D recurrence and fuses the available modalities through attention gates. It is paired with a progressive training strategy that gradually increases the difficulty of hetero-modal learning and introduces random modality dropout to simulate the heterogeneity of clinical data, acting as both data augmentation and regularization.
Result: On single-center data, the thrombus detection rate exceeds 90% with a Dice score of 0.65. In the multi-center setting with missing modalities, the detection rate is about 80% with a Dice score of about 0.35. The method transfers to other segmentation tasks with sparse, subtle, modality-dependent small lesions.
Conclusion: The method significantly improves robust detection and segmentation of tiny lesions and is especially suited to real-world multi-center medical imaging with incomplete modalities and uneven quality.
Abstract: Detecting and delineating tiny targets in 3D brain scans is a central yet under-addressed challenge in medical imaging. In ischemic stroke, for instance, the culprit thrombus is small, low-contrast, and variably expressed across modalities (e.g., susceptibility-weighted T2 blooming, diffusion restriction on DWI/ADC), while real-world multi-center data introduce domain shifts, anisotropy, and frequent missing sequences. We introduce a methodology that couples an attention-based recurrent segmentation network (UpAttLLSTM) with a training schedule that progressively increases the difficulty of hetero-modal learning through gradual modality dropout. UpAttLLSTM aggregates context across slices via recurrent units (2.5D) and uses attention gates to fuse complementary cues across available sequences, making it robust to anisotropy and class imbalance. Gradual modality dropout systematically simulates site heterogeneity, noise, and missing modalities during training, acting as both augmentation and regularization to improve multi-center generalization. On a monocentric cohort, our approach detects thrombi in >90% of cases with a Dice score of 0.65. In a multi-center setting with missing modalities, it achieves ~80% detection with a Dice score around 0.35. Beyond stroke, the proposed methodology directly transfers to other small-lesion tasks in 3D medical imaging where targets are scarce, subtle, and modality-dependent.
[139] Continual Vision-Language Learning for Remote Sensing: Benchmarking and Analysis
Xingxing Weng,Ruifeng Ni,Chao Pang,XiangYu Hao,Yishan Wang,Xiaokang Zhang,Wei Xu,Gui-Song Xia
Main category: cs.CV
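TL;DR: This paper presents CLeaRS, a benchmark for evaluating the continual learning ability of remote sensing vision-language models (RS VLMs), and shows that existing methods suffer severe catastrophic forgetting in task-, instruction-, and modality-incremental scenarios.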
Details
Motivation: Existing RS VLMs rely on static training data and struggle to adapt to continually emerging sensing modalities and downstream tasks. Their continual learning ability has not been studied systematically, and no dedicated benchmark exists.
Method: Builds the CLeaRS benchmark with 10 subsets and over 207k image-text pairs, covering multiple remote sensing interpretation tasks, modalities, and application scenarios. Defines three evaluation protocols (long-horizon, modality-incremental, and task-incremental) and systematically evaluates a range of vision-language models and continual learning methods.
Result: All models show pronounced catastrophic forgetting across all continual learning settings. Existing continual learning methods, when adapted to RS VLMs, are of limited effectiveness in handling task, instruction, and modality transitions.
Conclusion: Continual learning methods tailored to RS VLMs are urgently needed.
Abstract: Current remote sensing vision-language models (RS VLMs) demonstrate impressive performance in image interpretation but rely on static training data, limiting their ability to accommodate continuously emerging sensing modalities and downstream tasks. This exposes a fundamental challenge: enabling RS VLMs to continually adapt without catastrophic forgetting. Despite its practical importance, the continual learning capability of RS VLMs remains underexplored, and no dedicated benchmark currently exists. In this work, we present CLeaRS, a comprehensive benchmark for continual vision-language learning in remote sensing. CLeaRS comprises 10 curated subsets with over 207k image-text pairs, spanning diverse interpretation tasks, sensing modalities, and application scenarios. We further define three evaluation protocols: long-horizon, modality-incremental, and task-incremental settings, to systematically assess continual adaptation. Extensive benchmarking of diverse vision-language models reveals catastrophic forgetting across all settings. Moreover, representative continual learning methods, when adapted to RS VLMs, exhibit limited effectiveness in handling task, instruction, and modality transitions. Our findings underscore the need for developing continual learning methods tailored to RS VLMs.
[140] Video Patch Pruning: Efficient Video Instance Segmentation via Early Token Reduction
Patrick Glandorf,Thomas Norrenbrock,Bodo Rosenhahn
Main category: cs.CV
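TL;DR: This paper proposes a Video Patch Pruning (VPP) framework that uses temporal prior knowledge to achieve efficient sparsity in the early layers of a ViT, significantly improving computational efficiency while maintaining strong performance even at high sparsity.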
Details
Motivation: Existing patch pruning methods compress only the deeper layers and overlook the potential of early-stage compression, which limits overall efficiency gains. Deep features show strong foreground selectivity and can guide the early layers.
Method: Proposes a fully differentiable temporal mapping module that uses prior features extracted from deeper layers to select the most relevant video patches in early network stages, enabling efficient pruning in the early layers.
Result: Achieves patch pruning of up to 60% in dense prediction tasks, well above conventional image-based pruning (about 30%). In the high-sparsity regime, with patch usage below 55%, performance on the Youtube-VIS 2021 dataset drops by no more than 0.6%.
Conclusion: By introducing temporal priors and a differentiable temporal mapping mechanism, VPP extends ViT pruning to the early layers and significantly improves the efficiency and practicality of video understanding models.
Abstract: Vision Transformers (ViTs) have demonstrated state-of-the-art performance in several benchmarks, yet their high computational cost hinders their practical deployment. Patch Pruning offers significant savings, but existing approaches restrict token reduction to deeper layers, leaving early-stage compression unexplored. This limits their potential for holistic efficiency. In this work, we present a novel Video Patch Pruning framework (VPP) that integrates temporal prior knowledge to enable efficient sparsity within early ViT layers. Our approach is motivated by the observation that prior features extracted from deeper layers exhibit strong foreground selectivity. Therefore, we propose a fully differentiable module for temporal mapping to accurately select the most relevant patches in early network stages. Notably, the proposed method enables a patch reduction of up to 60% in dense prediction tasks, exceeding the capabilities of conventional image-based patch pruning, which typically operates around a 30% patch sparsity. VPP excels in the high-sparsity regime, sustaining remarkable performance even when patch usage is reduced below 55%. Specifically, it preserves stable results with a maximal performance drop of 0.6% on the Youtube-VIS 2021 dataset.
[141] LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation
Patrick Amadeus Irawan,Erland Hilman Fuadi,Shanu Kumar,Alham Fikri Aji,Yova Kementchedjhieva
Main category: cs.CV
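TL;DR: This paper proposes LinguDistill, which adds no extra modules. It uses the frozen original language model as a teacher and enables vision-conditioned knowledge distillation through layer-wise KV-cache sharing, recovering the language-task performance lost by multimodal models while preserving their visual ability.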
Details
Motivation: When pretrained language models are adapted into vision-language models, representation shift and cross-modal interference often degrade their native linguistic ability, which is hard to recover through ordinary fine-tuning. Existing recovery methods introduce extra modules, bringing architectural complexity, more parameters, and limited generalization.
Method: Proposes LinguDistill, an adapter-free distillation method. The frozen original LM serves as the teacher, layer-wise KV-cache sharing lets the teacher see the student's multimodal representations, and linguistic ability is selectively distilled on language-intensive data while the student's visual grounding on multimodal tasks is preserved.
Result: LinguDistill recovers about 10% of the performance lost on language and knowledge benchmarks while maintaining comparable performance on vision-heavy tasks.
Conclusion: The linguistic ability of multimodal models can be recovered effectively without extra modules, providing an efficient and practical solution to modality-specific degradation.
Abstract: Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers ~10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.
[142] Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation
Shuang Li,Chao Deng,Hang Chen,Liqun Liu,Zhenyu Hu,Te Cao,Mengge Xue,Yuan Chen,Peng Shu,Huan Yu,Jie Jiang
Main category: cs.CV
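TL;DR: This paper proposes the DisCo framework, which resolves the similarity-controllability paradox in subject-driven text-to-image generation by decoupling and then re-coupling visual and textual information, achieving both high-fidelity subject preservation and precise textual control.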
Details
Motivation: To resolve the paradox in subject-driven text-to-image generation whereby similarity and controllability are hard to achieve together, because the text prompt describes both the subject and the modification.
Method: Proposes the DisCo framework. A textual-visual decoupling module first separates subject identity (taken from the reference image and the entity word) from the modification instruction (a simplified prompt containing only the modification, with the subject replaced by a generic pronoun). A dedicated reward signal combined with reinforcement learning then re-couples the visual subject and the textual context naturally.
Result: Achieves the best performance on multiple benchmarks. Generated images combine high subject fidelity with strong textual controllability and are more realistic and coherent.
Conclusion: DisCo effectively resolves the similarity-controllability paradox and offers a new paradigm for high-quality subject-driven T2I generation.
Abstract: Subject-Driven Text-to-Image (T2I) Generation aims to preserve a subject's identity while editing its context based on a text prompt. A core challenge in this task is the "similarity-controllability paradox", where enhancing textual control often degrades the subject's fidelity, and vice versa. We argue this paradox stems from the ambiguous role of text prompts, which are often tasked with describing both the subject and the desired modifications, leading to conflicting signals for the model. To resolve this, we propose DisCo, a novel framework that first Disentangles and then re-Couples visual and textual information. First, our textual-visual decoupling module isolates the sources of information: subject identity is extracted exclusively from the reference image with the entity word of the subject, while the text prompt is simplified to contain only the modification command, in which the subject is referred to by a generic pronoun, eliminating descriptive ambiguity. However, this strict separation can lead to unnatural compositions between the subject and its contexts. We address this by designing a dedicated reward signal and using reinforcement learning to seamlessly recouple the visually-defined subject and the textually-generated context. Our approach effectively resolves the paradox, enabling simultaneous high-fidelity subject preservation and precise textual control. Extensive experiments demonstrate that our method achieves state-of-the-art performance, producing highly realistic and coherent images.
[143] MotionGrounder: Grounded Multi-Object Motion Transfer via Diffusion Transformer
Samuel Teodoro,Yun Chen,Agus Gunawan,Soo Ye Kim,Jihyong Oh,Munchurl Kim
Main category: cs.CV
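TL;DR: This paper proposes MotionGrounder, a Diffusion Transformer (DiT)-based framework for controllable multi-object motion transfer. A Flow-based Motion Signal (FMS) provides a stable motion prior, an Object-Caption Alignment Loss (OCAL) spatially aligns objects with text, and an Object Grounding Score (OGS) jointly evaluates spatial alignment and semantic consistency.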
Details
Motivation: Existing DiT-based motion transfer methods support only single-object videos and struggle to provide fine-grained control in real scenes with multiple objects.
Method: Proposes the MotionGrounder framework, comprising a Flow-based Motion Signal (FMS) as a motion prior, an Object-Caption Alignment Loss (OCAL) for object-text spatial alignment, and an Object Grounding Score (OGS) that jointly evaluates spatial alignment and semantic consistency.
Result: MotionGrounder consistently outperforms recent baselines in quantitative, qualitative, and human evaluations.
Conclusion: MotionGrounder is the first to achieve controllable multi-object motion transfer within a DiT framework, significantly improving controllability and generation quality for motion transfer in complex scenes.
Abstract: Motion transfer enables controllable video generation by transferring temporal dynamics from a reference video to synthesize a new video conditioned on a target caption. However, existing Diffusion Transformer (DiT)-based methods are limited to single-object videos, restricting fine-grained control in real-world scenes with multiple objects. In this work, we introduce MotionGrounder, a DiT-based framework that is the first to handle motion transfer with multi-object controllability. Our Flow-based Motion Signal (FMS) in MotionGrounder provides a stable motion prior for target video generation, while our Object-Caption Alignment Loss (OCAL) grounds object captions to their corresponding spatial regions. We further propose a new Object Grounding Score (OGS), which jointly evaluates (i) spatial alignment between source video objects and their generated counterparts and (ii) semantic consistency between each generated object and its target caption. Our experiments show that MotionGrounder consistently outperforms recent baselines across quantitative, qualitative, and human evaluations.
[144] Perturb-and-Restore: Simulation-driven Structural Augmentation Framework for Imbalance Chromosomal Anomaly Detection
Yilan Zhang,Hanbiao Chen,Changchun Yang,Yuetan Chu,Siyuan Chen,Jing Wu,Jingdong Hu,Na Li,Junkai Su,Yuxuan Chen,Ao Xu,Xin Gao,Aihua Yin
Main category: cs.CV
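TL;DR: This paper proposes Perturb-and-Restore (P&R), a simulation-driven structural augmentation framework that mitigates the loss of deep learning performance caused by data scarcity and class imbalance in chromosomal anomaly detection. The framework synthesizes abnormal chromosome images through perturbation and restoration and combines them with an energy-guided adaptive sampling strategy, significantly improving detection performance.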
Details
Motivation: Sufficient and diverse data on structurally abnormal chromosomes is hard to collect in clinical practice, so deep learning models for chromosomal anomaly detection degrade under data scarcity and severe class imbalance.
Method: Proposes the Perturb-and-Restore (P&R) framework, with two parts: (1) structure perturbation and restoration simulation, which perturbs the banding patterns of normal chromosomes and uses a diffusion network to reconstruct continuous chromosome content and edges; (2) energy-guided adaptive sampling, which dynamically selects high-quality synthetic samples based on the energy distribution of real samples.
Result: On a comprehensive dataset of over 260,000 chromosome images (including 4,242 abnormal samples in 24 categories), P&R reaches SOTA performance, improving average sensitivity, precision, and F1-score by 8.92%, 8.89%, and 13.79% respectively.
Conclusion: The P&R framework effectively alleviates data scarcity and imbalance in structural chromosomal anomaly detection. It generates high-quality synthetic data without relying on rare abnormal samples, significantly improves model performance, and shows potential for clinical use.
Abstract: Detecting structural chromosomal abnormalities is crucial for accurate diagnosis and management of genetic disorders. However, collecting sufficient structural abnormality data is extremely challenging and costly in clinical practice, and not all abnormal types can be readily collected. As a result, deep learning approaches face significant performance degradation due to the severe imbalance and scarcity of abnormal chromosome data. To address this challenge, we propose Perturb-and-Restore (P&R), a simulation-driven structural augmentation framework that effectively alleviates data imbalance in chromosome anomaly detection. The P&R framework comprises two key components: (1) Structure Perturbation and Restoration Simulation, which generates synthetic abnormal chromosomes by perturbing chromosomal banding patterns of normal chromosomes followed by a restoration diffusion network that reconstructs continuous chromosome content and edges, thus eliminating reliance on rare abnormal samples; and (2) Energy-guided Adaptive Sampling, an energy score-based online selection strategy that dynamically prioritizes high-quality synthetic samples by referencing the energy distribution of real samples. To evaluate our method, we construct a comprehensive structural anomaly dataset consisting of over 260,000 chromosome images, including 4,242 abnormal samples spanning 24 categories. Experimental results demonstrate that the P&R framework achieves state-of-the-art (SOTA) performance, surpassing existing methods with an average improvement of 8.92% in sensitivity, 8.89% in precision, and 13.79% in F1-score across all categories.
[145] Sparkle: A Robust and Versatile Representation for Point Cloud based Human Motion Capture
Yiming Ren,Yujing Sun,Aoru Xue,Kwok-Yan Lam,Yuexin Ma
Main category: cs.CV
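TL;DR: This paper proposes Sparkle, a structured representation that combines skeletal joints with surface anchors, and the SparkleMotion framework for motion capture, balancing expressiveness and robustness.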
Details
Motivation: Point cloud motion capture offers rich geometry and privacy protection, but noisy, unstructured point clouds make robust representations hard to learn. Existing methods struggle to balance point-based approaches (detailed but noisy) against skeleton-based ones (robust but oversimplified).
Method: Proposes Sparkle, a structured representation that unifies skeletal joints and surface anchors with explicit kinematic-geometric factorization, and designs the SparkleMotion framework, which embeds geometric continuity and kinematic constraints through hierarchical modules.
Result: Reaches SOTA in accuracy, robustness, and cross-domain generalization (for example, under severe noise, occlusion, and sensor differences), with experiments covering multiple sensor types and real-world scenes.
Conclusion: Explicitly disentangling internal kinematic structure from external surface geometry effectively improves the expressiveness and robustness of point cloud motion capture, and Sparkle offers a new paradigm for the task.
Abstract: Point cloud-based motion capture leverages rich spatial geometry and privacy-preserving sensing, but learning robust representations from noisy, unstructured point clouds remains challenging. Existing approaches face a difficult trade-off between point-based methods (geometrically detailed but noisy) and skeleton-based ones (robust but oversimplified). We address the fundamental challenge: how to construct an effective representation for human motion capture that can balance expressiveness and robustness. In this paper, we propose Sparkle, a structured representation unifying skeletal joints and surface anchors with explicit kinematic-geometric factorization. Our framework, SparkleMotion, learns this representation through hierarchical modules embedding geometric continuity and kinematic constraints. By explicitly disentangling internal kinematic structure from external surface geometry, SparkleMotion achieves state-of-the-art performance not only in accuracy but crucially in robustness and generalization under severe domain shifts, noise, and occlusion. Extensive experiments demonstrate our superiority across diverse sensor types and challenging real-world scenarios.
[146] Shape Representation using Gaussian Process mixture models
Panagiotis Sapoutzoglou,George Terzakis,Georgios Floros,Maria Pateraki
Main category: cs.CV
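TL;DR: This paper proposes a lightweight, object-specific functional 3D shape representation based on Gaussian Process (GP) mixture models, which learns continuous directional distance fields from sparse point clouds and represents complex geometries efficiently and accurately.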
Details
Motivation: Traditional explicit 3D representations (e.g., point clouds and meshes) incur high storage costs and require complex indexing, whereas functional representations are compact, continuous, and efficient. Method: Models surface geometry with Gaussian Process mixture models, anchoring local GP priors at strategic reference points (e.g., obtained by skeletonization or clustering) to learn continuous directional distance fields from sparse point clouds, without heavy neural networks. Result: Extensive experiments on the ShapeNetCore and IndustryShapes datasets show that the method represents complex geometries efficiently and accurately. Conclusion: The proposed GP mixture model is a lightweight, flexible, and effective object-specific functional 3D shape representation that balances accuracy, efficiency, and the ability to express complex topologies. Abstract: Traditional explicit 3D representations, such as point clouds and meshes, demand significant storage to capture fine geometric details and require complex indexing systems for surface lookups, making functional representations an efficient, compact, and continuous alternative. In this work, we propose a novel, object-specific functional shape representation that models surface geometry with Gaussian Process (GP) mixture models. Rather than relying on computationally heavy neural architectures, our method is lightweight, leveraging GPs to learn continuous directional distance fields from sparsely sampled point clouds. We capture complex topologies by anchoring local GP priors at strategic reference points, which can be flexibly extracted using any structural decomposition method (e.g. skeletonization, distance-based clustering). Extensive evaluations on the ShapeNetCore and IndustryShapes datasets demonstrate that our method can efficiently and accurately represent complex geometries.
[147] A 4D Representation for Training-Free Agentic Reasoning from Monocular Laparoscopic Video
Maximilian Fehrentz,Nicolas Stellwag,Robert Wiebe,Nicole Thorisch,Fabian Grob,Patrick Remerscheid,Ken-Joel Simmoteit,Benjamin D. Killeen,Christian Heiliger,Nassir Navab
Main category: cs.CV
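TL;DR: This paper proposes a surgical agent framework built on an explicit 4D representation: it combines vision models for point tracking, depth, and segmentation into a spatiotemporally consistent 4D representation, and uses a Multimodal Large Language Model (MLLM) without fine-tuning for natural language reasoning, significantly improving spatiotemporal understanding and 4D grounding in soft tissue surgery.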
Details
Motivation: Spatiotemporal reasoning in soft tissue surgery is a foundation for intelligent assistive systems and autonomous robotics, but existing 2D vision-language models struggle with the spatial complexity of surgical scenes, motivating explicit 4D representations that strengthen spatiotemporal perception and reasoning. Method: Builds an explicit 4D representation from point tracking, depth estimation, and segmentation, yielding spatiotemporally consistent tool and tissue semantics; a general-purpose MLLM then acts as a reasoning agent over tools derived from the 4D representation (e.g., trajectories), with no fine-tuning at any stage. Result: On a new dataset of 134 clinically relevant questions, the method significantly improves spatiotemporal understanding and enables 4D grounding; it shows that spatiotemporal intelligence can be "assembled" from general-purpose 2D MLLMs and 3D vision models without additional training. Conclusion: An explicit 4D representation bridges 2D language reasoning and 3D surgical scenes, giving surgical AI interpretable, groundable spatiotemporal reasoning with the practical advantages of modularity and no training. Abstract: Spatiotemporal reasoning is a fundamental capability for artificial intelligence (AI) in soft tissue surgery, paving the way for intelligent assistive systems and autonomous robotics. While 2D vision-language models show increasing promise at understanding surgical video, the spatial complexity of surgical scenes suggests that reasoning systems may benefit from explicit 4D representations. Here, we propose a framework for equipping surgical agents with spatiotemporal tools based on an explicit 4D representation, enabling AI systems to ground their natural language reasoning in both time and 3D space. Leveraging models for point tracking, depth, and segmentation, we develop a coherent 4D model with spatiotemporally consistent tool and tissue semantics. A Multimodal Large Language Model (MLLM) then acts as an agent on tools derived from the explicit 4D representation (e.g., trajectories) without any fine-tuning. We evaluate our method on a new dataset of 134 clinically relevant questions and find that the combination of a general purpose reasoning backbone and our 4D representation significantly improves spatiotemporal understanding and allows for 4D grounding. We demonstrate that spatiotemporal intelligence can be "assembled" from 2D MLLMs and 3D computer vision models without additional training. Code, data, and examples are available at https://tum-ai.github.io/surg4d/
[148] PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding
Nan Wang,Zhiwei Jin,Chen Chen,Haonan Lu
Main category: cs.CV
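TL;DR: This paper proposes PixelPrune, which uses predictive coding to compress pixel-level redundancy and removes duplicate image patches before the ViT encoder, providing training-free, parameter-free pixel-lossless or controlled lossy compression that significantly accelerates inference and training for document and GUI understanding.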
Details
Motivation: High-value vision-language tasks such as document understanding and GUI interaction need high-resolution inputs, which produce large numbers of visual tokens and heavy computation; yet only 22-71% of image patches are pixel-unique, the rest being exact duplicates, so much of this computation is wasted. Method: PixelPrune uses predictive coding to identify and prune duplicate image patches in pixel space, compressing the input before the ViT encoder; it supports lossless (τ = 0) and controlled lossy (τ > 0) modes and needs no training or learnable parameters. Result: Across three model scales and document and GUI benchmarks, PixelPrune maintains competitive task accuracy while delivering up to 4.2× inference speedup and 1.9× training acceleration. Conclusion: PixelPrune is an efficient, general, plug-and-play preprocessing compression method that eases the computational bottleneck of VLMs on high-resolution visual understanding tasks. Abstract: Document understanding and GUI interaction are among the highest-value applications of Vision-Language Models (VLMs), yet they impose an exceptionally heavy computational burden: fine-grained text and small UI elements demand high-resolution inputs that produce tens of thousands of visual tokens. We observe that this cost is largely wasteful: across document and GUI benchmarks, only 22-71% of image patches are pixel-unique, the rest being exact duplicates of another patch in the same image. We propose PixelPrune, which exploits this pixel-level redundancy through predictive-coding-based compression, pruning redundant patches before the Vision Transformer (ViT) encoder. Because it operates in pixel space prior to any neural computation, PixelPrune accelerates both the ViT encoder and the downstream LLM, covering the full inference pipeline. The method is training-free, requires no learnable parameters, and supports pixel-lossless compression (τ = 0) as well as controlled lossy compression (τ > 0). Experiments across three model scales and document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to 4.2× inference speedup and 1.9× training acceleration. Code is available at https://github.com/OPPO-Mente-Lab/PixelPrune.
[149] Adversarial Attenuation Patch Attack for SAR Object Detection
Yiming Zhang,Weibo Qin,Feng Wang
Main category: cs.CV
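TL;DR: This paper proposes the Adversarial Attenuation Patch (AAP), a physically realizable adversarial attack on SAR object detection that balances attack effectiveness and stealthiness through energy-constrained optimization and an attenuation-based deployment framework, and that aligns with electronic jamming mechanisms, giving it potential for practical deployment.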
Details
Motivation: Existing SAR-specific adversarial attacks introduce noticeable perturbations, are confined to the digital domain, and neglect the physical implementation constraints of attacking SAR systems. Method: Proposes the Adversarial Attenuation Patch (AAP), which combines an energy-constrained optimization strategy with an attenuation-based deployment framework aligned with signal-level electronic jamming mechanisms. Result: AAP markedly degrades detection performance while remaining highly imperceptible, and transfers well across different models. Conclusion: The study offers a physically grounded perspective on adversarial attacks against SAR target detection systems and supports the design of more covert, practically deployable attack strategies. Abstract: Deep neural networks have demonstrated excellent performance in SAR target detection tasks but remain susceptible to adversarial attacks. Existing SAR-specific attack methods can effectively deceive detectors; however, they often introduce noticeable perturbations and are largely confined to the digital domain, neglecting physical implementation constraints for attacking SAR systems. In this paper, a novel Adversarial Attenuation Patch (AAP) method is proposed that employs an energy-constrained optimization strategy coupled with an attenuation-based deployment framework to achieve a seamless balance between attack effectiveness and stealthiness. More importantly, AAP exhibits strong potential for physical realization by aligning with signal-level electronic jamming mechanisms. Experimental results show that AAP effectively degrades detection performance while preserving high imperceptibility, and shows favorable transferability across different models. This study provides a physically grounded perspective for adversarial attacks on SAR target detection systems and facilitates the design of more covert and practically deployable attack strategies. The source code is made available at https://github.com/boremycin/SAAP.
[150] IDDM: Identity-Decoupled Personalized Diffusion Models with a Tunable Privacy-Utility Trade-off
Linyan Dai,Xinwei Zhang,Haoyang Li,Qingqing Ye,Haibo Hu
Main category: cs.CV
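TL;DR: This paper proposes IDDM, which achieves model-side output immunization for personalized text-to-image diffusion models: identity decoupling reduces the identity linkability of generated images while preserving high-quality generation.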
Details
Motivation: Existing defenses focus only on preventing unauthorized personalization and do not address the case where personalization is authorized but the publicly shared outputs still leak identity information. Method: Proposes Identity-Decoupled personalized Diffusion Models (IDDM), which integrate identity decoupling into the personalization pipeline by alternating short personalization updates with identity-decoupled data optimization under a two-stage schedule. Result: Experiments across multiple datasets, diverse prompts, and state-of-the-art face recognition systems show that IDDM consistently reduces identity linkability while preserving high-fidelity personalized generation. Conclusion: IDDM offers a new paradigm for privacy protection under authorized personalization, with a tunable privacy-utility trade-off. Abstract: Personalized text-to-image diffusion models (e.g., DreamBooth, LoRA) enable users to synthesize high-fidelity avatars from a few reference photos for social expression. However, once these generations are shared on social media platforms (e.g., Instagram, Facebook), they can be linked to the real user via face recognition systems, enabling identity tracking and profiling. Existing defenses mainly follow an anti-personalization strategy that protects publicly released reference photos by disrupting model fine-tuning. While effective against unauthorized personalization, they do not address another practical setting in which personalization is authorized, but the resulting public outputs still leak identity information. To address this problem, we introduce a new defense setting, termed model-side output immunization, whose goal is to produce a personalized model that supports authorized personalization while reducing the identity linkability of public generations, with tunable control over the privacy-utility trade-off to accommodate diverse privacy needs. To this end, we propose Identity-Decoupled personalized Diffusion Models (IDDM), a model-side defense that integrates identity decoupling into the personalization pipeline. Concretely, IDDM follows an alternating procedure that interleaves short personalization updates with identity-decoupled data optimization, using a two-stage schedule to balance identity linkability suppression and generation utility.
Extensive experiments across multiple datasets, diverse prompts, and state-of-the-art face recognition systems show that IDDM consistently reduces identity linkability while preserving high-quality personalized generation.
[151] JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation
Issa Sugiura,Koki Maeda,Shuhei Kurita,Yusuke Oda,Daisuke Kawahara,Naoaki Okazaki
Main category: cs.CV
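TL;DR: This paper presents JAMMEval, a collection of Japanese benchmarks for vision-language model (VLM) evaluation, systematically refined through human annotation to address reliability flaws in existing Japanese VQA datasets, such as ambiguous questions, incorrect answers, and questions answerable without the image. Experiments show that it improves evaluation accuracy, stability, and the ability to distinguish between models; the data and code are released.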
Details
Motivation: Existing Japanese VQA benchmarks have seen little iterative refinement and contain ambiguous questions, incorrect answers, and instances solvable without visual grounding, which makes evaluation unreliable and distorts model comparisons. Method: Systematically refines seven existing Japanese benchmark datasets through two rounds of human annotation to build the JAMMEval collection. Result: When used to evaluate open-weight and proprietary VLMs, JAMMEval yields scores that better reflect model capability, lower run-to-run variance, and better separation between models of different capability levels. Conclusion: JAMMEval improves the reliability and validity of Japanese VLM evaluation and gives follow-up research a more trustworthy basis for assessment. Abstract: Reliable evaluation is essential for the development of vision-language models (VLMs). However, Japanese VQA benchmarks have undergone far less iterative refinement than their English counterparts. As a result, many existing benchmarks contain issues such as ambiguous questions, incorrect answers, and instances that can be solved without visual grounding, undermining evaluation reliability and leading to misleading conclusions in model comparisons. To address these limitations, we introduce JAMMEval, a refined collection of Japanese benchmarks for reliable VLM evaluation. It is constructed by systematically refining seven existing Japanese benchmark datasets through two rounds of human annotation, improving both data quality and evaluation reliability. In our experiments, we evaluate open-weight and proprietary VLMs on JAMMEval and analyze the capabilities of recent models on Japanese VQA. We further demonstrate the effectiveness of our refinement by showing that the resulting benchmarks yield evaluation scores that better reflect model capability, exhibit lower run-to-run variance, and improve the ability to distinguish between models of different capability levels. We release our dataset and code to advance reliable evaluation of VLMs.
[152] Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment
Zhuchenyang Liu,Yao Zhang,Yu Xiao
Main category: cs.CV
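TL;DR: This paper builds IKEA-Bench, a benchmark that systematically evaluates vision-language models (VLMs) on aligning 2D assembly diagrams with real-world video. It finds that visual encoding is the main bottleneck for cross-depiction robustness, that text aids instruction understanding but degrades diagram-to-video alignment, and that architecture family predicts alignment performance better than parameter count.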
Details
Motivation: 2D assembly diagrams are abstract and hard to follow, creating a need for intelligent assistants that monitor assembly progress, detect errors, and give guidance in mixed reality; existing VLMs, however, face a "depiction gap" between assembly diagrams and video frames that has not been systematically evaluated. Method: Builds IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products; evaluates 19 VLMs of different sizes (2B-38B) under three alignment strategies; and applies a three-level mechanistic analysis (behavior, representation, reasoning) to study how diagrams, video, and text interact. Result: (1) Text recovers instruction understanding but harms diagram-to-video alignment; (2) architecture family determines alignment accuracy more than parameter count; (3) video understanding remains a hard bottleneck that no strategy relieves; (4) diagrams and video occupy separate regions of ViT feature space, and adding text shifts models toward text-driven reasoning. Conclusion: Improving cross-depiction robustness depends on better visual encoders rather than on simply scaling up models or relying on text prompts. Abstract: 2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/
[153] ProCap: Projection-Aware Captioning for Spatial Augmented Reality
Zimo Cao,Yuchen Deng,Haibin Ling,Bingyao Huang
Main category: cs.CV
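TL;DR: This paper proposes ProCap, a framework that visually separates virtual projections from the physical scene and uses region-aware retrieval to resolve the semantic ambiguity caused by projection distortion. It also builds RGBP, the first large-scale semantic benchmark dataset for spatial augmented reality (SAR), and designs a dual-captioning evaluation protocol, providing a semantic foundation for intelligent SAR.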
Details
Motivation: In spatial augmented reality, standard vision-language models struggle to tell the physical scene from projected content, which causes virtual-physical semantic confusion and limits intelligent interaction. Method: ProCap uses a two-stage pipeline: it first visually separates the virtual and physical layers via automated segmentation, then applies region-aware retrieval to avoid semantic ambiguity caused by projection distortion. The work also builds the RGBP dataset (65 scenes, over 180,000 projections, and dense decoupled annotations) and proposes a dual-captioning evaluation protocol based on task-specific tokens. Result: Experiments show that ProCap provides a robust semantic foundation for SAR research; the code, pre-trained models, and RGBP dataset are publicly available. Conclusion: ProCap addresses the semantic decoupling of virtual and physical content in SAR and advances SAR systems capable of semantic understanding. Abstract: Spatial augmented reality (SAR) directly projects digital content onto physical scenes using projectors, creating an immersive experience without head-mounted displays. However, for SAR to support intelligent interaction, such as reasoning about the scene or answering user queries, it must semantically distinguish between the physical scene and the projected content. Standard Vision Language Models (VLMs) struggle with this virtual-physical ambiguity, often confusing the two contexts. To address this issue, we introduce ProCap, a novel framework that explicitly decouples projected content from physical scenes. ProCap employs a two-stage pipeline: first it visually isolates virtual and physical layers via automated segmentation; then it uses region-aware retrieval to avoid ambiguous semantic context due to projection distortion. To support this, we present RGBP (RGB + Projections), the first large-scale SAR semantic benchmark dataset, featuring 65 diverse physical scenes and over 180,000 projections with dense, decoupled annotations. Finally, we establish a dual-captioning evaluation protocol using task-specific tokens to assess physical scene and projection descriptions independently. Our experiments show that ProCap provides a robust semantic foundation for future SAR research. The source code, pre-trained models and the RGBP dataset are available on the project page: https://ZimoCao.github.io/ProCap/.
[154] Representation Selection via Cross-Model Agreement using Canonical Correlation Analysis
Dylan B. Lewis,Jens Gregor,Hector Santos-Villalobos
Main category: cs.CV
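TL;DR: This paper proposes a training-free, post-hoc CCA operator that exploits the shared structure between the representations of two pretrained image encoders to find linear projections for representation selection and dimensionality reduction, improving downstream performance while substantially reducing dimensionality.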
Details
Motivation: Representations from pretrained image encoders are often overcomplete and model-specific, calling for more efficient and general ways to compress and refine them. Method: Uses post-hoc canonical correlation analysis (CCA) as a training-free operator: based on cross-model agreement between the representations of two pretrained encoders, it finds linear projections that retain shared semantic content and discard redundant dimensions. Result: On ImageNet-1k, CIFAR-100, MNIST, and other benchmarks, accuracy improves by up to 12.6% over baseline and PCA-projected representations; dimensionality can be cut by more than 75% with improved performance, or representations can be enhanced at fixed dimensionality by transfer from larger or fine-tuned models. Conclusion: As a simple, training-free, cross-model-guided form of representation distillation, CCA improves the efficiency of image representations and downstream performance, outperforming single-model dimensionality reduction such as PCA. Abstract: Modern vision pipelines increasingly rely on pretrained image encoders whose representations are reused across tasks and models, yet these representations are often overcomplete and model-specific. We propose a simple, training-free method to improve the efficiency of image representations via a post-hoc canonical correlation analysis (CCA) operator. By leveraging the shared structure between representations produced by two pre-trained image encoders, our method finds linear projections that serve as a principled form of representation selection and dimensionality reduction, retaining shared semantic content while discarding redundant dimensions. Unlike standard dimensionality reduction techniques such as PCA, which operate on a single embedding space, our approach leverages cross-model agreement to guide representation distillation and refinement. The technique allows representations to be reduced by more than 75% in dimensionality with improved downstream performance, or enhanced at fixed dimensionality via post-hoc representation transfer from larger or fine-tuned models. Empirical results on ImageNet-1k, CIFAR-100, MNIST, and additional benchmarks show consistent improvements over both baseline and PCA-projected representations, with accuracy gains of up to 12.6%.
[155] Learning Quantised Structure-Preserving Motion Representations for Dance Fingerprinting
Arina Kharlamova,Bowei He,Chen Ma,Xue Liu
Main category: cs.CV
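TL;DR: DANCEMATCH is an end-to-end framework for motion-based dance retrieval that uses discrete motion signatures for efficient, interpretable, and scalable dance fingerprinting.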
Details
Motivation: Existing motion analysis and retrieval methods rely on continuous embeddings that are hard to index, interpret, and scale, whereas dance retrieval needs large-scale, semantically consistent, and interpretable matching. Method: Proposes the DANCEMATCH framework, which combines Skeleton Motion Quantisation (SMQ) with Spatio-Temporal Transformers (STT) to produce discrete motion signatures; designs the DANCE RETRIEVAL ENGINE (DRE), which achieves sub-linear retrieval with a histogram-based index followed by re-ranking; extracts poses with Apple CoMotion; and releases DANCETYPESBENCHMARK, a dataset annotated with quantised motion tokens. Result: Experiments show robust retrieval across diverse dance styles and strong generalisation to unseen choreographies, supporting large-scale motion fingerprinting and quantitative choreographic analysis. Conclusion: DANCEMATCH lays a foundation for scalable motion fingerprinting and quantitative choreographic analysis, addressing the indexing, interpretability, and scalability limits of continuous embeddings. Abstract: We present DANCEMATCH, an end-to-end framework for motion-based dance retrieval, the task of identifying semantically similar choreographies directly from raw video, defined as DANCE FINGERPRINTING. While existing motion analysis and retrieval methods can compare pose sequences, they rely on continuous embeddings that are difficult to index, interpret, or scale. In contrast, DANCEMATCH constructs compact, discrete motion signatures that capture the spatio-temporal structure of dance while enabling efficient large-scale retrieval. Our system integrates Skeleton Motion Quantisation (SMQ) with Spatio-Temporal Transformers (STT) to encode human poses, extracted via Apple CoMotion, into a structured motion vocabulary. We further design the DANCE RETRIEVAL ENGINE (DRE), which performs sub-linear retrieval using a histogram-based index followed by re-ranking for refined matching. To facilitate reproducible research, we release DANCETYPESBENCHMARK, a pose-aligned dataset annotated with quantised motion tokens. Experiments demonstrate robust retrieval across diverse dance styles and strong generalisation to unseen choreographies, establishing a foundation for scalable motion fingerprinting and quantitative choreographic analysis.
[156] Autoregressive Appearance Prediction for 3D Gaussian Avatars
Michael Steiner,Zhang Chen,Alexander Richard,Vasu Agrawal,Markus Steinberger,Michael Zollhöfer
Main category: cs.CV
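TL;DR: This paper proposes a conditioned avatar model based on 3D Gaussian Splatting and a spatial MLP; by introducing a learned appearance latent alongside pose conditioning, it improves reconstruction quality and driving stability.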
Details
Motivation: With large, high-quality datasets, very similar poses can correspond to different appearances, which introduces ambiguities and spurious correlations; existing models then overfit and produce abrupt appearance changes for novel poses. Method: Proposes a 3D Gaussian Splatting avatar model with a spatial MLP backbone conditioned on pose and an appearance latent; the latent is learned by an encoder during training, and at driving time an autoregressive predictor infers it so that appearance evolves smoothly over time. Result: Improves reconstruction quality, appearance consistency, and temporal stability, enabling high-fidelity and robust avatar driving. Conclusion: The method offers a practical and robust route to high-fidelity avatar modeling and stable driving. Abstract: A photorealistic and immersive human avatar experience demands capturing fine, person-specific details such as cloth and hair dynamics, subtle facial expressions, and characteristic motion patterns. Achieving this requires large, high-quality datasets, which often introduce ambiguities and spurious correlations when very similar poses correspond to different appearances. Models that fit these details during training can overfit and produce unstable, abrupt appearance changes for novel poses. We propose a 3D Gaussian Splatting avatar model with a spatial MLP backbone that is conditioned on both pose and an appearance latent. The latent is learned during training by an encoder, yielding a compact representation that improves reconstruction quality and helps disambiguate pose-driven renderings. At driving time, our predictor autoregressively infers the latent, producing temporally smooth appearance evolution and improved stability. Overall, our method delivers a robust and practical path to high-fidelity, stable avatar driving.
[157] EmoScene: A Dual-space Dataset for Controllable Affective Image Generation
Li He,Longtai Zhang,Wenqiang Zhang,Yan Wang,Lizhe Qi
Main category: cs.CV
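TL;DR: This paper presents EmoScene, a dataset that jointly encodes affective dimensions (valence, arousal, dominance; VAD) and perceptual attributes to give text-to-image diffusion models finer control over scene semantics and emotional intent.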
Details
Motivation: Existing text-to-image diffusion models struggle to control scene semantics and fine-grained affective tone precisely because they do not model affective factors and perceptual cues in a unified way. Method: Builds EmoScene, a large-scale dual-space emotion dataset of 1.2M images annotated with discrete emotion labels, continuous VAD values, perceptual descriptors, and textual captions; designs a lightweight baseline that injects dual-space controls into a frozen diffusion backbone via shallow cross-attention modulation. Result: EmoScene reveals how discrete emotions are distributed in VAD space and how affect systematically correlates with scene-level perceptual factors; the baseline indicates that dual-space supervision improves affect controllability. Conclusion: EmoScene provides a new benchmark and data foundation for affect-controllable image generation, moving text-to-image models closer to human visual affect. Abstract: Text-to-image diffusion models have achieved high visual fidelity, yet precise control over scene semantics and fine-grained affective tone remains challenging. Human visual affect arises from the rapid integration of contextual meaning, including valence, arousal, and dominance, with perceptual cues such as color harmony, luminance contrast, texture variation, curvature, and spatial layout. However, current text-to-image models rarely represent affective and perceptual factors within a unified representation, which limits their ability to synthesize scenes with coherent and nuanced emotional intent. To address this gap, we construct EmoScene, a large-scale dual-space emotion dataset that jointly encodes affective dimensions and perceptual attributes, with contextual semantics provided as supporting annotations. EmoScene contains 1.2M images across more than three hundred real-world scene categories, each annotated with discrete emotion labels, continuous VAD values, perceptual descriptors and textual captions. Multi-space analyses reveal how discrete emotions occupy the VAD space and how affect systematically correlates with scene-level perceptual factors. To benchmark EmoScene, we provide a lightweight reference baseline that injects dual-space controls into a frozen diffusion backbone via shallow cross-attention modulation, serving as a reproducible probe of affect controllability enabled by dual-space supervision.
[158] YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction
Miro Miranda,Deepak Pathak,Patrick Helber,Benjamin Bischke,Hiba Najjar,Francisco Mena,Cristhian Sanchez,Akshay Pai,Diego Arenas,Matias Valdenegro-Toro,Marcela Charfuelan,Marlon Nuske,Andreas Dengel
Main category: cs.CV
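TL;DR: This paper releases YieldSAT, a large, high-quality, multimodal dataset for high-resolution crop yield prediction covering multiple countries, climate zones, and major crop types; it frames yield prediction as pixel-wise regression with deep learning models and explores a domain-informed Deep Ensemble to handle distribution shifts under real-world conditions.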
Details
Motivation: Building crop yield prediction datasets is constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations, so existing datasets are scarce, low in quality, or limited to single regions or crop types, which hinders scalable data-driven solutions. Method: Builds the multimodal YieldSAT dataset (over 12.2 million yield samples at 10 m resolution, 113,555 labeled multispectral satellite images, and auxiliary environmental data), formulates yield prediction as a pixel regression task, compares various deep learning models and data fusion architectures, and explores a domain-informed Deep Ensemble to handle distribution shifts. Result: Demonstrates the potential of large-scale, high-resolution yield prediction; the domain-informed Deep Ensemble yields significant performance gains; and the study exposes open challenges from severe distribution shifts in ground truth data under real-world conditions. Conclusion: YieldSAT provides a large-scale, cross-regional, multi-crop, high-resolution benchmark dataset for crop yield prediction and supports research on scalable, robust data-driven agricultural modeling. Abstract: Crop yield prediction requires substantial data to train scalable models. However, creating yield prediction datasets is constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations. Consequently, existing datasets are scarce, low in quality, or limited to regional levels or single crop types, hindering the development of scalable data-driven solutions. In this work, we release YieldSAT, a large, high-quality, and multimodal dataset for high-resolution crop yield prediction. YieldSAT spans various climate zones across multiple countries, including Argentina, Brazil, Uruguay, and Germany, and covers major crop types such as corn, rapeseed, soybeans, and wheat, across 2,173 expert-curated fields. In total, over 12.2 million yield samples are available, each with a spatial resolution of 10 m. Each field is paired with multispectral satellite imagery, resulting in 113,555 labeled satellite images, complemented by auxiliary environmental data. We demonstrate the potential of large-scale and high-resolution crop yield prediction as a pixel regression task by comparing various deep learning models and data fusion architectures. Furthermore, we highlight open challenges arising from severe distribution shifts in the ground truth data under real-world conditions. To mitigate this, we explore a domain-informed Deep Ensemble approach that exhibits significant performance gains. The dataset is available at https://yieldsat.github.io/.
[159] Enhancing Gradient Inversion Attacks in Federated Learning via Hierarchical Feature Optimization
Hao Fang,Wenbo Yu,Bin Chen,Xuan Wang,Shu-Tao Xia,Qing Liao,Ke Xu
Main category: cs.CV
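TL;DR: This paper proposes GIFD, which performs gradient inversion over the feature domains of a GAN to improve the fidelity and generalizability of private-data reconstruction in federated learning, and extends to out-of-distribution (OOD) settings and label inconsistency.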
Details
Motivation: Existing gradient inversion attacks in federated learning optimize only in the GAN latent space, which limits their expressiveness and generalizability; they also assume that the GAN training data and the FL task data share the same distribution, which often fails in practice. Method: Proposes GIFD, which disassembles the GAN and searches the intermediate features layer by layer, moving the optimization progressively from the latent space to layers closer to the output; adds an $l_1$ ball constraint as a regularizer to avoid unrealistic images; extends the attack to the OOD setting; and designs a label mapping technique for label inconsistency. Result: Experiments show that GIFD achieves pixel-level reconstruction across a variety of FL scenarios and outperforms competitive baselines. Conclusion: Through hierarchical feature optimization and regularization, GIFD improves the effectiveness, robustness, and applicability of gradient inversion attacks, and remains effective under realistic challenges such as OOD data and label inconsistency. Abstract: Federated Learning (FL) has emerged as a compelling paradigm for privacy-preserving distributed machine learning, allowing multiple clients to collaboratively train a global model by transmitting locally computed gradients to a central server without exposing their private data. Nonetheless, recent studies find that the gradients exchanged in the FL system are also vulnerable to privacy leakage, e.g., an attacker can invert shared gradients to reconstruct sensitive data by leveraging pre-trained generative adversarial networks (GANs) as prior knowledge. However, existing attacks simply perform gradient inversion in the latent space of the GAN model, which limits their expression ability and generalizability. To tackle these challenges, we propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the hierarchical features of the intermediate layers. Instead of optimizing only over the initial latent code, we progressively change the optimized layer, from the initial latent space to intermediate layers closer to the output images. In addition, we design a regularizer to avoid unrealistic image generation by adding a small $l_1$ ball constraint to the search range. We also extend GIFD to the out-of-distribution (OOD) setting, which weakens the assumption that the training sets of GANs and FL tasks obey the same data distribution. Furthermore, we consider the challenging OOD scenario of label inconsistency and propose a label mapping technique as an effective solution.
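As a rough illustration of the $l_1$ ball constraint mentioned above, the sketch below projects a feature perturbation onto an $l_1$ ball centred on a reference feature. It is a minimal sketch and not the authors' implementation: the function names `project_l1_ball` and `constrain_to_l1_ball`, the choice of centre, and the radius are assumptions made for illustration, and the projection uses the standard sorting-based algorithm rather than anything stated in the abstract.

```python
import numpy as np


def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the set {x : ||x||_1 <= radius}."""
    if radius <= 0:
        raise ValueError("radius must be positive")
    v = np.asarray(v, dtype=float)
    flat = v.ravel()
    abs_v = np.abs(flat)
    # Already inside the ball: nothing to do.
    if abs_v.sum() <= radius:
        return v.copy()
    # Sort magnitudes in descending order and find the soft threshold theta.
    u = np.sort(abs_v)[::-1]
    cssv = np.cumsum(u)
    ks = np.arange(1, u.size + 1)
    rho = np.nonzero(u * ks > (cssv - radius))[0][-1]
    theta = (cssv[rho] - radius) / (rho + 1.0)
    # Soft-threshold the magnitudes and restore the signs.
    w = np.sign(flat) * np.maximum(abs_v - theta, 0.0)
    return w.reshape(v.shape)


def constrain_to_l1_ball(feature, center, radius):
    """Keep `feature` within an l1 ball of the given radius around `center`.

    Hypothetical helper: `center` stands for whatever reference feature the
    search is meant to stay close to (an assumption, not from the paper).
    """
    return center + project_l1_ball(feature - center, radius)
```

In a layer-wise search, a step like this could be applied after each gradient update so that the optimized intermediate feature stays close to its reference; whether GIFD enforces the constraint by projection or by another mechanism is not stated in the abstract.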
Extensive experiments demonstrate that our method can achieve pixel-level reconstruction and outperform competitive baselines across a variety of FL scenarios.
[160] DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving
Yiyao Zhu,Ying Xue,Haiming Zhang,Guangfeng Jiang,Wending Zhou,Xu Yan,Jiantao Gao,Yingjie Cai,Bingbing Liu,Zhen Li,Shaojie Shen
Main category: cs.CV
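TL;DR: This paper proposes DLWM, a new paradigm built on Dual Latent World Models for Gaussian-centric pre-training in autonomous driving; its two-stage self-supervised learning yields significant gains in 3D occupancy perception, 4D occupancy forecasting, and motion planning.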
Details
Motivation: Vision-based autonomous driving has attracted attention for its low cost and strong performance, but existing dense BEV and sparse query models have limitations, calling for a representation that is both comprehensive and sparse: the Gaussian-centric approach. Method: DLWM uses a two-stage design. The first stage predicts 3D Gaussians by reconstructing multi-view semantic and depth images in a self-supervised manner. The second stage trains two latent world models separately: one with Gaussian-flow-guided latent prediction for downstream occupancy perception and forecasting, and one with ego-planning-guided latent prediction for motion planning. Result: Extensive experiments on the SurroundOcc and nuScenes benchmarks show significant gains in Gaussian-centric 3D occupancy perception, 4D occupancy forecasting, and motion planning. Conclusion: DLWM offers a new paradigm for Gaussian-centric pre-training in autonomous driving and demonstrates the effectiveness and potential of dual latent world models across multiple tasks. Abstract: Vision-based autonomous driving has gained much attention due to its low cost and excellent performance. Compared with dense BEV (Bird's Eye View) or sparse query models, the Gaussian-centric method offers a comprehensive yet sparse representation by describing the scene with 3D semantic Gaussians. In this paper, we introduce DLWM, a novel paradigm with Dual Latent World Models specifically designed to enable holistic Gaussian-centric pre-training in autonomous driving using two stages. In the first stage, DLWM predicts 3D Gaussians from queries by reconstructing multi-view semantic and depth images in a self-supervised manner. Equipped with fine-grained contextual features, in the second stage, two latent world models are trained separately for temporal feature learning, including Gaussian-flow-guided latent prediction for downstream occupancy perception and forecasting tasks, and ego-planning-guided latent prediction for motion planning. Extensive experiments on the SurroundOcc and nuScenes benchmarks demonstrate that DLWM shows significant performance gains across Gaussian-centric 3D occupancy perception, 4D occupancy forecasting and motion planning tasks.
[161] ACT Now: Preempting LVLM Hallucinations via Adaptive Context Integration
Bei Yan,Yuecong Min,Jie Zhang,Shiguang Shan,Xilin Chen
Main category: cs.CV
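TL;DR: This paper proposes ACT, a training-free inference-time intervention that mitigates hallucination in Large Vision-Language Models (LVLMs) through adaptive context integration, built on two core mechanisms: visual context exploration and semantic context aggregation.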
Details
Motivation: Existing hallucination mitigation methods for LVLMs mostly rely on static, single-step states, ignore how context changes during generation, and struggle to correct information loss. Method: Proposes ACT, which has two parts: (1) visual context exploration, which uses spatio-temporal profiling to adaptively amplify the attention heads responsible for visual exploration; and (2) semantic context aggregation, which marginalizes over potential semantic queries to aggregate visual evidence and relieve the information loss caused by discrete token prediction. Result: Experiments across diverse LVLMs show that ACT significantly reduces hallucinations and achieves competitive results on both discriminative and generative benchmarks without harming fundamental generation capabilities. Conclusion: ACT is a robust, highly adaptable, training-free inference intervention that offers a new approach to mitigating LVLM hallucinations. Abstract: Large Vision-Language Models (LVLMs) frequently suffer from severe hallucination issues. Existing mitigation strategies predominantly rely on isolated, single-step states to enhance visual focus or suppress strong linguistic priors. However, these static approaches neglect dynamic context changes across the generation process and struggle to correct inherited information loss. To address this limitation, we propose Adaptive Context inTegration (ACT), a training-free inference intervention method that mitigates hallucination through the adaptive integration of contextual information. Specifically, we first propose visual context exploration, which leverages spatio-temporal profiling to adaptively amplify attention heads responsible for visual exploration. To further facilitate vision-language alignment, we propose semantic context aggregation that marginalizes potential semantic queries to effectively aggregate visual evidence, thereby resolving the information loss caused by the discrete nature of token prediction. Extensive experiments across diverse LVLMs demonstrate that ACT significantly reduces hallucinations and achieves competitive results on both discriminative and generative benchmarks, acting as a robust and highly adaptable solution without compromising fundamental generation capabilities.
[162] Maximizing T2-Only Prostate Cancer Localization from Expected Diffusion Weighted Imaging
Weixi Yi,Yipei Wang,Wen Yan,Hanyuan Zhang,Natasha Thorley,Alexander Ng,Shonit Punwani,Fernando Bianco,Mark Emberton,Veeru Kasivisvanathan,Dean C. Barratt,Shaheer U. Saeed,Yipeng Hu
Main category: cs.CV
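TL;DR: This paper proposes a method for localizing prostate cancer from T2-weighted MRI alone. It treats diffusion-weighted imaging (DWI), available only during training, as a latent privileged modality, and jointly optimizes a flow matching generative model and a cancer localizer within an expectation-maximization framework, significantly improving cancer localization against histopathology labels.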
Details
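Motivation: Existing methods mostly rely on multiparametric MRI (e.g., T2w+DWI) to detect prostate cancer. Using T2w images alone has cost and workflow advantages in the clinic, but localizing cancer against independent histopathology labels remains challenging; this work explores how to improve localization when only T2w is needed at inference and DWI serves as privileged information during training. Method: Proposes an expectation-maximization (EM) framework: in the E-step, a flow matching generative model approximates the posterior distribution of the latent DWI image; in the M-step, the cancer localizer and the generative model are optimized together to maximize the expected likelihood of cancer presence. DWI is treated as a privileged modality that is available for training but not at inference. Result: In histopathology-verified experiments on 4,133 patients (internal and external datasets), the T2-only method improves the patient-level F1 score by 14.4% and the zone-level quadratic weighted kappa (QWK) by 5.3% over the T2w+DWI baseline, and outperforms approaches trained without DWI as well as existing privileged learning methods. Conclusion: The work provides a new theoretical framework for learning from a privileged modality and shows that using DWI only during training can substantially strengthen cancer localization from T2w images alone, with potential for clinical adoption. Abstract: Multiparametric MRI is increasingly recommended as a first-line noninvasive approach to detect and localize prostate cancer, requiring at minimum diffusion-weighted (DWI) and T2-weighted (T2w) MR sequences. Early machine learning attempts using only T2w images have shown promising diagnostic performance in segmenting radiologist-annotated lesions. Such uni-modal T2-only approaches deliver substantial clinical benefits by reducing costs and expertise required to acquire other sequences. This work investigates an arguably more challenging application using only T2w at inference, but to localize individual cancers based on independent histopathology labels. We formulate DWI images as a latent modality (readily available during training) to classify cancer presence at local Barzell zones, given only T2w images as input. In the resulting expectation-maximization algorithm, a latent modality generator (implemented using a flow matching-based generative model) approximates the latent DWI image posterior distribution in the E-steps, while in M-steps a cancer localizer is simultaneously optimized with the generative model to maximize the expected likelihood of cancer presence. The proposed approach provides a novel theoretical framework for learning from a privileged DWI modality, yielding superior cancer localization performance compared to approaches that lack training DWI images or existing frameworks for privileged learning and incomplete modalities.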
The proposed T2-only methods perform competitively or better than baseline methods using multiple input sequences (e.g., improving the patient-level F1 score by 14.4\% and zone-level QWK by 5.3\% over the T2w+DWI baseline). We present quantitative evaluations using internal and external datasets from 4,133 prostate cancer patients with histopathology-verified labels.[163] Customizing Large Vision Model-Guided Low-Rank Approximation for Ground-Roll Denoise
Jiacheng Liao,Feng Qian,Ziyin Fan,Yongjian Guo
Main category: cs.CV
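TL;DR: This paper proposes a training-free, semantic-guided framework for seismic ground-roll attenuation. A promptable large vision model extracts semantic priors to generate a soft mask, which is embedded in a low-rank inverse model for adaptive suppression and reflection-preserving reconstruction.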
Details
Motivation: Conventional ground-roll attenuation methods (transform-domain filtering, sparse representation, deep learning) suffer from limited adaptability, signal leakage, or dependence on labeled data, and perform poorly under strong signal-noise overlap.
Method: Ground-roll attenuation is reformulated as a semantic-guided signal separation problem. A promptable large vision model converts seismic gathers into visual representations and localizes ground-roll-dominant regions; the response is turned into a continuous soft mask and embedded in a mask-conditioned low-rank inverse model, which is solved with ADMM.
Result: Experiments on synthetic and field VSP data show that the method outperforms representative transform-domain filtering and implicit neural representation methods in ground-roll attenuation, reflection continuity, and waveform fidelity.
Conclusion: The proposed training-free, annotation-free semantic-guided framework combines physical consistency with strong adaptability and offers a new paradigm for seismic data denoising.
Abstract: Ground-roll is a dominant source of coherent noise in land and vertical seismic profiling (VSP) data, severely masking reflection events and degrading subsequent imaging and interpretation. Conventional attenuation methods, including transform-domain filtering, sparse representation, and deep learning, often suffer from limited adaptability, signal leakage, or dependence on labeled training data, especially under strong signal-noise overlap. To address these challenges, we propose a training-free framework that reformulates ground-roll attenuation as a semantic-guided signal separation problem. Specifically, a promptable large vision model is employed to extract high-level semantic priors by converting seismic gathers into visual representations and localizing ground-roll-dominant regions via text or image prompts. The resulting semantic response is transformed into a continuous soft mask, which is embedded into a mask-conditioned low-rank inverse formulation to enable spatially adaptive suppression and reflection-preserving reconstruction. An efficient alternating direction method of multipliers (ADMM)-based solver is further developed to solve the proposed inverse problem, enabling stable and physically consistent signal recovery without requiring task-specific training or manual annotation. Extensive experiments on both synthetic and field VSP datasets demonstrate that the proposed method achieves superior ground-roll attenuation while preserving reflection continuity and waveform fidelity, consistently outperforming representative transform-domain filtering and implicit neural representation methods.
[164] EgoSim: Egocentric World Simulator for Embodied Interaction Generation
Jinkun Hao,Mingda Jia,Ruiyan Wang,Xihui Liu,Ran Yi,Lizhuang Ma,Jiangmiao Pang,Xudong Xu
Main category: cs.CV
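TL;DR: EgoSim is a closed-loop egocentric world simulator. By modeling an updatable 3D world state, it generates spatially consistent interaction videos and persistently updates the scene state, addressing the lack of explicit 3D grounding and the static-scene assumption in existing methods.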
Details
Motivation: Existing egocentric simulators either lack explicit 3D grounding, which causes structural drift under viewpoint changes, or treat the scene as static and cannot update the world state across multi-stage interactions.
Method: EgoSim models the 3D scene as an updatable world state. A Geometry-action-aware Observation Simulation model generates embodied interactions, and an Interaction-aware State Updating module maintains spatial consistency. A scalable data extraction pipeline and the low-cost EgoCap capture system supply real-world training data.
Result: Experiments show that EgoSim significantly outperforms existing methods in visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, and supports cross-embodiment transfer to robotic manipulation.
Conclusion: By jointly modeling a dynamic 3D world state and embodied interaction, EgoSim overcomes the limitations of existing egocentric simulators in structural consistency and scene evolution, providing a more realistic and persistent simulation basis for embodied AI.
Abstract: We introduce EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. Existing egocentric simulators either lack explicit 3D grounding, causing structural drift under viewpoint changes, or treat the scene as static, failing to update world states across multi-stage interactions. EgoSim addresses both limitations by modeling 3D scenes as updatable world states. We generate embodiment interactions via a Geometry-action-aware Observation Simulation model, with spatial consistency from an Interaction-aware State Updating module. To overcome the critical data bottleneck posed by the difficulty in acquiring densely aligned scene-interaction training pairs, we design a scalable pipeline that extracts static point clouds, camera trajectories, and embodiment actions from in-the-wild large-scale monocular egocentric videos. We further introduce EgoCap, a capture system that enables low-cost real-world data collection with uncalibrated smartphones. Extensive experiments demonstrate that EgoSim significantly outperforms existing methods in terms of visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, while supporting cross-embodiment transfer to robotic manipulation. Code and datasets will be released soon. The project page is at egosimulator.github.io.
[165] Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding
Yiheng Wang,Lichen Zhu,Yueqian Lin,Yudong Liu,Jingyang Zhang,Hai "Helen" Li,Yiran Chen
Main category: cs.CV
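TL;DR: This paper proposes an evidence-driven keyframe sampling framework grounded in information bottleneck theory to improve multimodal large language models (MLLMs) on long-form video understanding. By maximizing the conditional mutual information between the selected frames and the query, keyframe selection becomes a decomposable frame-level scoring problem. A query-conditioned evidence scoring network trained with a contrastive objective clearly outperforms existing methods under strict token budgets and trains more efficiently.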
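Once subset selection has been reduced to independent frame-level scores, picking keyframes under a budget amounts to a top-k selection. The following is a minimal sketch of that final step only, assuming the scores have already been produced by some scoring network; the function name `select_keyframes` and the example scores are hypothetical and do not come from the paper.

```python
import numpy as np

def select_keyframes(scores, budget):
    """Return indices of the `budget` highest-scoring frames, in temporal order.

    `scores` is a 1-D sequence of per-frame evidence scores (hypothetical input).
    """
    scores = np.asarray(scores, dtype=float)
    k = min(budget, scores.size)
    # argsort is ascending, so the last k entries are the top-k scores
    top = np.argsort(scores, kind="stable")[scores.size - k:]
    # restore temporal order so frames reach the model chronologically
    return np.sort(top).tolist()

picked = select_keyframes([0.1, 0.9, 0.3, 0.8, 0.2], budget=2)
```

Sorting the selected indices afterwards keeps the frames in playback order, which is usually what a video-language model expects as input.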
Details
Motivation: Existing keyframe sampling methods are limited for long-form video understanding: semantic-relevance methods struggle to capture evidential clues, while reinforcement-learning methods suffer from inefficient optimization and hard combinatorial search. MLLMs are also constrained by context length and compute cost, so a more efficient and principled keyframe selection mechanism is needed.
Method: A keyframe sampling framework grounded in information bottleneck theory formulates keyframe selection as maximizing conditional mutual information (CMI). A decomposed form is derived that enables independent frame-level scoring, and a query-conditioned evidence scoring network is trained with a contrastive objective.
Result: On multiple long-form video understanding benchmarks, the method consistently outperforms prior sampling strategies under strict token budgets and significantly improves training efficiency.
Conclusion: Evidence-driven, information-theoretic keyframe sampling is a more effective and interpretable way to extract key information from long videos, and offers a new direction for MLLMs on long-form video.
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational cost, making keyframe sampling essential. Existing approaches typically rely on semantic relevance or reinforcement learning, which either fail to capture evidential clues or suffer from inefficient combinatorial optimization. In this work, we propose an evidence-driven keyframe sampling framework grounded in information bottleneck theory. We formulate keyframe selection as maximizing the conditional mutual information between selected frames and the query, providing a principled objective that reflects each frame's contribution to answering the question. To make this objective tractable, we exploit its structure to derive a decomposed optimization that reduces subset selection to independent frame-level scoring. We further introduce a query-conditioned evidence scoring network trained with a contrastive objective to estimate evidential importance efficiently. Experiments on long-form video understanding benchmarks show that our method consistently outperforms prior sampling strategies under strict token budgets, while significantly improving training efficiency.
[166] PDA: Text-Augmented Defense Framework for Robust Vision-Language Models against Adversarial Image Attacks
Jingning Xu,Haochen Luo,Chen Liu
Main category: cs.CV
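TL;DR: This paper proposes PDA, a training-free defense framework for vision-language model (VLM) robustness. It applies test-time text augmentation (prompt paraphrasing, question decomposition, and consistency aggregation) to improve robustness against a range of adversarial image attacks without modifying the model, balancing efficiency and performance.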
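To make the consistency-aggregation idea concrete, here is a minimal sketch that takes the answers a model gave to several paraphrased prompts and returns the most frequent one. The majority-vote rule, the tie-breaking, and the name `aggregate_answers` are assumptions made for illustration; the summary does not state which aggregation rule PDA actually uses.

```python
from collections import Counter

def aggregate_answers(answers):
    """Majority vote over normalized answers; ties go to the earliest answer.

    `answers` holds one model response per paraphrased prompt (hypothetical input).
    """
    if not answers:
        raise ValueError("at least one answer is required")
    # normalize case and whitespace so trivially different strings agree
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    best = max(counts.values())
    # first answer, in input order, that reaches the top count
    return next(a for a in normalized if counts[a] == best)

final = aggregate_answers(["Cat", "cat ", "dog"])
```

An adversarial image that flips the answer for one phrasing of the question is outvoted as long as most paraphrases still yield the original answer.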
Details
Motivation: Existing VLM defenses based on adversarial training are computationally expensive and generalize poorly to unseen attack types.
Method: The Paraphrase-Decomposition-Aggregation (PDA) framework performs semantic prompt paraphrasing, structural question decomposition, and consistency aggregation over multiple response paths at test time; lightweight invariants are further designed to reduce inference cost.
Result: Across multiple VLM architectures and tasks including VQA, classification, and captioning, PDA significantly improves robustness to various adversarial image perturbations while maintaining competitive clean accuracy.
Conclusion: PDA is a generic, strong, and practical training-free defense framework for inference time, offering a new direction for VLM robustness.
Abstract: Vision-language models (VLMs) are vulnerable to adversarial image perturbations. Existing works based on adversarial training against task-specific adversarial examples are computationally expensive and often fail to generalize to unseen attack types. To address these limitations, we introduce Paraphrase-Decomposition-Aggregation (PDA), a training-free defense framework that leverages text augmentation to enhance VLM robustness under diverse adversarial image attacks. PDA performs prompt paraphrasing, question decomposition, and consistency aggregation entirely at test time, thus requiring no modification to the underlying models. To balance robustness and efficiency, we instantiate PDA as invariants that reduce the inference cost while retaining most of its robustness gains. Experiments on multiple VLM architectures and benchmarks for visual question answering, classification, and captioning show that PDA achieves consistent robustness gains against various adversarial perturbations while maintaining competitive clean accuracy, establishing a generic, strong and practical defense framework for VLMs during inference.
[167] Forecasting Motion in the Wild
Neerja Thakkar,Shiry Ginosar,Jacob Walker,Jitendra Malik,Joao Carreira,Carl Doersch
Main category: cs.CV
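TL;DR: This paper proposes dense point trajectories as a visual token representation for modeling and forecasting the motion of non-rigid animals, designs a diffusion transformer that handles occlusion, and validates generalization and forecasting performance on large-scale in-the-wild animal video.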
Details
Motivation: Visual intelligence requires anticipating the future behavior of agents, but existing vision systems lack a general representation for motion and behavior.
Method: Dense point trajectories serve as visual tokens for behavior. A diffusion transformer models unordered sets of trajectories and explicitly reasons about occlusion. A dataset of 300 hours of in-the-wild animal video is curated with shot detection and camera-motion compensation.
Result: Forecasting trajectory tokens yields category-agnostic, data-efficient prediction that outperforms state-of-the-art baselines and generalizes to rare species and morphologies.
Conclusion: Dense point trajectories are an effective mid-level representation of behavior and provide a new foundation for predictive visual intelligence in the wild.
Abstract: Visual intelligence requires anticipating the future behavior of agents, yet vision systems lack a general representation for motion and behavior. We propose dense point trajectories as visual tokens for behavior, a structured mid-level representation that disentangles motion from appearance and generalizes across diverse non-rigid agents, such as animals in-the-wild. Building on this abstraction, we design a diffusion transformer that models unordered sets of trajectories and explicitly reasons about occlusion, enabling coherent forecasts of complex motion patterns. To evaluate at scale, we curate 300 hours of unconstrained animal video with robust shot detection and camera-motion compensation. Experiments show that forecasting trajectory tokens achieves category-agnostic, data-efficient prediction, outperforms state-of-the-art baselines, and generalizes to rare species and morphologies, providing a foundation for predictive visual intelligence in the wild.
[168] Diff3R: Feed-forward 3D Gaussian Splatting with Uncertainty-aware Differentiable Optimization
Yueh-Cheng Liu,Jozef Hladký,Matthias Nießner,Angela Dai
Main category: cs.CV
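TL;DR: This paper proposes Diff3R, which embeds a differentiable 3D Gaussian Splatting (3DGS) optimization layer in training so that a feed-forward network learns to predict a better initialization for test-time optimization. Gradients are backpropagated efficiently via the Implicit Function Theorem and a PCG solver, and data-driven uncertainty modeling improves robustness and generalization.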
Details
Motivation: Among existing 3DGS methods, feed-forward models infer quickly but with lower quality, while per-scene optimization gives high quality at high computational cost; a way to combine the strengths of both is needed.
Method: The Diff3R framework (1) embeds a differentiable 3DGS optimization layer in the training loop, (2) computes gradients efficiently with the Implicit Function Theorem and a matrix-free PCG solver, and (3) introduces a data-driven uncertainty model that adaptively constrains how much parameters may change.
Result: The method significantly improves the convergence speed and rendering quality of test-time optimization and increases robustness to sparse views, pose errors, and input outliers. The optimization layer can be plugged into a range of existing feed-forward 3DGS architectures, both pose-given and pose-free.
Conclusion: Diff3R bridges feed-forward prediction and test-time optimization, approaching per-scene optimization quality while remaining efficient, and offers a more practical, robust, and general paradigm for 3D reconstruction.
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) present two main directions: feed-forward models offer fast inference in sparse-view settings, while per-scene optimization yields high-quality renderings but is computationally expensive. To combine the benefits of both, we introduce Diff3R, a novel framework that explicitly bridges feed-forward prediction and test-time optimization. By incorporating a differentiable 3DGS optimization layer directly into the training loop, our network learns to predict an optimal initialization for test-time optimization rather than a conventional zero-shot result. To overcome the computational cost of backpropagating through the optimization steps, we propose computing gradients via the Implicit Function Theorem and a scalable, matrix-free PCG solver tailored for 3DGS optimization. Additionally, we incorporate a data-driven uncertainty model into the optimization process by adaptively controlling how much the parameters are allowed to change during optimization. This approach effectively mitigates overfitting in under-constrained regions and increases robustness against input outliers. Since our proposed optimization layer is model-agnostic, we show that it can be seamlessly integrated into existing feed-forward 3DGS architectures for both pose-given and pose-free methods, providing improvements for test-time optimization.
[169] Sub-metre Lunar DEM Generation and Validation from Chandrayaan-2 OHRC Multi-View Imagery Using Open-Source Photogrammetry
Aaranay Aadi,Jai Singla,Nitant Dube,Oleg Alexandrov
Main category: cs.CV
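TL;DR: This paper is the first to use multi-view imagery from the Chandrayaan-2 Orbiter High Resolution Camera (OHRC) to generate sub-metre lunar digital elevation models (DEMs) with a fully open-source pipeline. It achieves 24-54 cm spatial resolution at five lunar sites; after ICP alignment and validation against NAC reference data, the vertical RMSE is 5.85 m and the horizontal accuracy is better than 30 cm.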
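Stereo-pair screening of this kind rests on two geometric quantities, the baseline-to-height (B/H) ratio and the convergence angle. The sketch below computes both for two camera positions and one ground point. It assumes a local Cartesian frame in metres with z pointing up, and the function name `stereo_geometry` and the sample coordinates are illustrative; the paper's actual metadata handling and thresholds are not described in this digest.

```python
import math

def stereo_geometry(cam_a, cam_b, ground):
    """Return (B/H ratio, convergence angle in degrees).

    cam_a, cam_b, ground are (x, y, z) points in one local frame, z up (assumed).
    """
    baseline = math.dist(cam_a, cam_b)
    # height taken as the mean camera altitude above the ground point (assumption)
    height = ((cam_a[2] - ground[2]) + (cam_b[2] - ground[2])) / 2.0
    # rays from the ground point to each camera
    ray_a = [c - g for c, g in zip(cam_a, ground)]
    ray_b = [c - g for c, g in zip(cam_b, ground)]
    dot = sum(a * b for a, b in zip(ray_a, ray_b))
    cos_angle = dot / (math.hypot(*ray_a) * math.hypot(*ray_b))
    # clamp against rounding error before acos
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return baseline / height, math.degrees(math.acos(cos_angle))

bh, angle = stereo_geometry((-50.0, 0.0, 100.0), (50.0, 0.0, 100.0), (0.0, 0.0, 0.0))
```

For this symmetric example the baseline equals the height, so B/H is 1 and the convergence angle is 2·atan(0.5), about 53.13 degrees.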
Details
Motivation: High-resolution lunar digital elevation models (DEMs) are essential for surface mobility planning, landing site assessment, and planetary science. OHRC offers the finest ground sampling of any lunar orbital imager currently in use (20-30 cm per pixel), but no method for generating sub-metre DEMs from its imagery has been published.
Method: Candidate stereo pairs are selected from non-paired archival OHRC images by geometric analysis of image metadata (B/H ratio and convergence angle estimation). Dense stereo matching and ray triangulation produce point clouds, which are gridded into DEMs. The DEMs are aligned to LRO NAC digital terrain models with the Iterative Closest Point (ICP) algorithm, and a constant-bias correction is applied to ensure absolute elevation consistency.
Result: Sub-metre DEMs with 24-54 cm spatial resolution are generated for five geographically distributed lunar sites. The vertical RMSE is 5.85 m (at native OHRC resolution), and the horizontal accuracy is better than 30 cm (assessed by planimetric feature matching).
Conclusion: This work delivers the first fully open-source pipeline for sub-metre lunar DEM generation from OHRC imagery, validates its geometric accuracy and reliability, and provides high-quality terrain data and a reproducible workflow for future lunar missions.
Abstract: High-resolution digital elevation models (DEMs) of the lunar surface are essential for surface mobility planning, landing site characterization, and planetary science. The Orbiter High Resolution Camera (OHRC) on board Chandrayaan-2 offers the finest ground sampling of any lunar orbital imager currently in use, acquiring panchromatic imagery at a resolution of roughly 20-30 cm per pixel. This work presents, for the first time, the generation of sub-metre DEMs from OHRC multi-view imagery using an exclusively open-source pipeline. Candidate stereo pairs are identified from non-paired OHRC archives through geometric analysis of image metadata, employing baseline-to-height (B/H) ratio computation and convergence angle estimation. Dense stereo correspondence and ray triangulation are then applied to generate point clouds, which are gridded into DEMs at effective spatial resolutions between approximately 24 and 54 cm across five geographically distributed lunar sites. Absolute elevation consistency is established through Iterative Closest Point (ICP) alignment against Lunar Reconnaissance Orbiter Narrow Angle Camera (NAC) Digital Terrain Models, followed by constant-bias offset correction. Validation against NAC reference terrain yields a vertical RMSE of 5.85 m (at native OHRC resolution), and a horizontal accuracy better than 30 cm assessed by planimetric feature matching.
[170] Foundation Model-guided Iteratively Prompting and Pseudo-Labeling for Partially Labeled Medical Image Segmentation
Qiaochu Zhao,Wei Wei,David Horowitz,Richard Bakst,Yading Yuan
Main category: cs.CV
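TL;DR: This paper proposes IPnP, a framework that uses iterative prompting and pseudo-label generation to address missing organ annotations in medical image segmentation. A trainable segmentation network collaborates with a frozen foundation model to progressively recover full-organ supervision, yielding significant gains on public and private datasets.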
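One step that any pseudo-labeling scheme for partially labeled scans needs is merging manual labels with pseudo-labels without overwriting the former. The sketch below shows that merge only. It assumes integer label maps in which 0 means background and a known set of manually annotated organ IDs per scan; the function name `merge_labels` and the merge rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def merge_labels(partial_gt, pseudo, labeled_ids):
    """Keep manual labels; fill empty voxels with pseudo-labels of unlabeled organs.

    partial_gt, pseudo: integer arrays of equal shape (0 = background, assumed).
    labeled_ids: organ IDs that were manually annotated in this scan.
    """
    partial_gt = np.asarray(partial_gt)
    pseudo = np.asarray(pseudo)
    merged = partial_gt.copy()
    # a pseudo-label is usable only for organs that were NOT manually annotated
    usable = (pseudo != 0) & ~np.isin(pseudo, list(labeled_ids))
    # never overwrite voxels that already carry a manual label
    fill = usable & (partial_gt == 0)
    merged[fill] = pseudo[fill]
    return merged

merged = merge_labels([[1, 0], [0, 0]], [[2, 2], [1, 0]], labeled_ids={1})
```

In the example, organ 1 is manually labeled, so the pseudo-label for organ 1 in the lower-left voxel is discarded, while the pseudo-label for organ 2 fills the one voxel that had no manual label.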
Details
Motivation: In clinical practice, site-specific needs and high annotation costs often leave medical scans with only a subset of organs labeled. This partial-label problem seriously degrades segmentation performance.
Method: The IPnP (Iteratively Prompting and Pseudo-labeling) framework has a trainable segmentation network (specialist) collaborate with a frozen foundation model (generalist) to iteratively generate and refine pseudo-labels for unlabeled organs, progressively rebuilding full-organ supervision.
Result: On the public AMOS dataset with a simulated partial-label setting, IPnP consistently outperforms prior methods and approaches the fully labeled reference. Its practical effectiveness is also confirmed on a private, partially labeled clinical dataset of 210 head-and-neck cancer patients.
Conclusion: IPnP offers an efficient and feasible solution for partially labeled medical image segmentation, combining strong generalization with clinical practicality and advancing weakly supervised medical image analysis.
Abstract: Automated medical image segmentation has achieved remarkable progress with fully labeled data. However, site-specific clinical priorities and the high cost of manual annotation often yield scans with only a subset of organs labeled, leading to the partially labeled problem that degrades performance. To address this issue, we propose IPnP, an Iteratively Prompting and Pseudo-labeling framework, for partially labeled medical image segmentation. IPnP iteratively generates and refines pseudo-labels for unlabeled organs through collaboration between a trainable segmentation network (specialist) and a frozen foundation model (generalist), progressively recovering full-organ supervision. On the public dataset AMOS with the simulated partial-label setting, IPnP consistently improves segmentation performance over prior methods and approaches the performance of the fully labeled reference. We further evaluate on a private, partially labeled dataset of 210 head-and-neck cancer patients and demonstrate our effectiveness in real-world clinical settings.
[171] ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration
Fengyuan Yang,Luying Huang,Jiazhi Guan,Quanwei Yang,Dongwei Pan,Jianglin Fu,Haocheng Feng,Wei He,Kaisiyuan Wang,Hang Zhou,Angela Yao
Main category: cs.CV
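TL;DR: This paper proposes ONE-SHOT, a framework that factorizes the generative signals and introduces a canonical-space injection mechanism and Dynamic-Grounded-RoPE positional embedding, enabling fine-grained, flexibly controllable human-environment video synthesis without heavy 3D pre-processing.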
Details
Motivation: Existing video foundation models struggle with fine-grained, independent editing of humans and environments. Methods based on rigid 3D geometric modeling often sacrifice generative flexibility, and their 3D pre-processing limits scalability.
Method: The parameter-efficient ONE-SHOT framework comprises (1) a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention, (2) Dynamic-Grounded-RoPE positional embedding, which establishes correspondences between disparate spatial domains without heuristic 3D alignment, and (3) a Hybrid Context Integration mechanism that maintains subject and scene consistency in long-horizon generation.
Result: Experiments show that the method significantly outperforms the state of the art, improving both structural control and creative diversity.
Conclusion: ONE-SHOT offers a new paradigm for parameter-efficient, lightweight, and highly controllable human-environment video synthesis, balancing generation quality and practicality.
Abstract: Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, yet fine-grained and independent editing of subjects and scenes remains a critical challenge. Recent attempts to incorporate richer environment control through rigid 3D geometric compositions often encounter a stark trade-off between precise control and generative flexibility. Furthermore, the heavy 3D pre-processing still limits practical scalability. In this paper, we propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation. Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We also propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes spatial correspondences between disparate spatial domains without any heuristic 3D alignments. To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism to maintain subject and scene consistency across minute-level generations. Experiments demonstrate that our method significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis. Our project page is available at https://martayang.github.io/ONE-SHOT/.
[172] A global dataset of continuous urban dashcam driving
Md Shadab Alam,Olena Bazilinska,Pavlo Bazilinskyy
Main category: cs.CV
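TL;DR: CROWD is a dataset of front-facing urban dashcam footage manually curated from YouTube videos. It emphasizes everyday driving, cross-domain robustness, and interaction analysis, covers 238 countries and territories, and includes time-of-day and vehicle-type labels plus YOLOv11x detections and BoT-SORT tracks.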
Details
Motivation: The dataset aims to support cross-domain robustness modeling and interaction analysis in real urban driving, avoiding crash or edited content and focusing on routine, continuous, unedited everyday driving segments.
Method: Minute-scale, front-facing urban driving segments are manually screened, segmented, and labeled from public YouTube videos. Segment-level labels give time of day (day or night) and vehicle type. YOLOv11x detections for the 80 COCO classes and BoT-SORT multi-object tracks are included. The dataset is distributed as video IDs with timestamps rather than raw video.
Result: The release contains 51,753 segment records totalling more than 20,000 hours, covering 7,103 named inhabited places in 238 countries and territories across six continents, with CSV files of machine-generated detection and tracking annotations.
Conclusion: CROWD fills a gap in large-scale, global, everyday, unedited urban driving data and provides a reproducible, low-barrier benchmark resource for research on robustness, generalization, and interaction understanding in autonomous driving.
Abstract: We introduce CROWD (City Road Observations With Dashcams), a manually curated dataset of ordinary, minute-scale, temporally contiguous, unedited, front-facing urban dashcam segments screened and segmented from publicly available YouTube videos. CROWD is designed to support cross-domain robustness and interaction analysis by prioritising routine driving and explicitly excluding crashes, crash aftermath, and other edited or incident-focused content. The release contains 51,753 segment records spanning 20,275.56 hours (42,032 videos), covering 7,103 named inhabited places in 238 countries and territories across all six inhabited continents (Africa, Asia, Europe, North America, South America and Oceania), with segment-level manual labels for time of day (day or night) and vehicle type. To lower the barrier for benchmarking, we provide per-segment CSV files of machine-generated detections for all 80 MS-COCO classes (e.g., person, bicycle, motorcycle, car, bus, truck, traffic light, stop sign) produced with YOLOv11x, together with segment-local multi-object tracks (BoT-SORT). CROWD is distributed as video identifiers with segment boundaries and derived annotations, enabling reproducible research without redistributing the underlying videos.
[173] PHASOR: Anatomy- and Phase-Consistent Volumetric Diffusion for CT Virtual Contrast Enhancement
Zilong Li,Dongyang Li,Chenglong Ma,Zhan Feng,Dakai Jin,Junping Zhang,Hao Luo,Fan Wang,Hongming Shan
Main category: cs.CV
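TL;DR: This paper proposes PHASOR, a virtual contrast enhancement (VCE) method based on a volumetric diffusion model. An anatomy-routed mixture-of-experts (AR-MoE) module and an intensity-phase aware representation alignment (IP-REPA) module improve anatomical consistency and contrast fidelity when synthesizing contrast-enhanced CT from non-contrast CT.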
Details
Motivation: Existing virtual contrast enhancement (VCE) methods are limited by anatomical heterogeneity and spatial misalignment, which lead to inconsistent enhancement patterns and incorrect details. Clinical CECT also carries the invasiveness of contrast agents and radiation risk.
Method: The PHASOR framework (1) uses a video diffusion model that treats CT volumes as coherent sequences to improve structural and volumetric consistency, (2) adds an anatomy-routed mixture-of-experts (AR-MoE) module that ties enhancement patterns to anatomical semantics and introduces organ-specific memory, and (3) adds an intensity-phase aware representation alignment (IP-REPA) module that strengthens subtle contrast signals and mitigates the effect of spatial misalignment.
Result: Experiments on three datasets show that PHASOR significantly outperforms current state-of-the-art methods in synthesis quality and enhancement accuracy.
Conclusion: By combining anatomical priors with diffusion modeling, PHASOR achieves high-fidelity, anatomically consistent virtual contrast enhancement and offers a reliable technical route to reducing reliance on clinical CECT.
Abstract: Contrast-enhanced computed tomography (CECT) is pivotal for highlighting tissue perfusion and vascularity, yet its clinical ubiquity is impeded by the invasive nature of contrast agents and radiation risks. While virtual contrast enhancement (VCE) offers an alternative to synthesizing CECT from non-contrast CT (NCCT), existing methods struggle with anatomical heterogeneity and spatial misalignment, leading to inconsistent enhancement patterns and incorrect details. This paper introduces PHASOR, a volumetric diffusion framework for high-fidelity CT VCE. By treating CT volumes as coherent sequences, we leverage a video diffusion model to enhance structural coherence and volumetric accuracy. To ensure anatomy-phase consistent synthesis, we introduce two complementary modules. First, anatomy-routed mixture-of-experts (AR-MoE) anchors distinct enhancement patterns to anatomical semantics, with organ-specific memory to capture salient details. Second, intensity-phase aware representation alignment (IP-REPA) highlights intricate contrast signals while mitigating the impact of imperfect spatial alignment. Extensive experiments across three datasets demonstrate that PHASOR significantly outperforms state-of-the-art methods in both synthesis quality and enhancement accuracy.
[174] ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction
Yuheng Zhang,Mengfei Duan,Kunyu Peng,Yuhang Wang,Di Wen,Danda Pani Paudel,Luc Van Gool,Kailun Yang
Main category: cs.CV
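TL;DR: This paper proposes ProOOD, which combines prototype-guided semantic imputation and tail mining with a training-free OOD scoring mechanism (EchoOOD), significantly improving the robustness and accuracy of 3D semantic occupancy prediction under long-tailed distributions and for OOD detection.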
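Prototype matching, one ingredient of this kind of training-free OOD scoring, can be illustrated with a single cosine-similarity term: a voxel feature that is far from every class prototype receives a high score. This is a deliberately simplified sketch. EchoOOD as described also fuses local logit coherence and both local and global prototypes, none of which is modeled here, and the function name `prototype_ood_score` is hypothetical.

```python
import numpy as np

def prototype_ood_score(features, prototypes, eps=1e-8):
    """Return one score per voxel; higher means further from every prototype.

    features: (N, D) voxel features; prototypes: (C, D) class prototypes.
    """
    f = np.asarray(features, dtype=float)
    p = np.asarray(prototypes, dtype=float)
    # L2-normalize so the dot product is a cosine similarity
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + eps)
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + eps)
    sim = f @ p.T  # (N, C) similarity of each voxel to each class
    # distance to the best-matching prototype
    return 1.0 - sim.max(axis=1)

scores = prototype_ood_score([[2.0, 0.0], [1.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

The first feature is aligned with a prototype and scores close to 0; the second sits between the two prototypes and scores about 0.29.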
Details
Motivation: Existing 3D semantic occupancy prediction methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, and often assign anomalies to rare classes with high confidence.
Method: ProOOD has three parts: (i) prototype-guided semantic imputation, which fills occluded regions; (ii) prototype-guided tail mining, which strengthens rare-class representations to curb OOD absorption; and (iii) the EchoOOD module, which fuses local logit coherence with local and global prototype matching to produce voxel-level OOD scores.
Result: State-of-the-art results are reported on five datasets. On SemanticKITTI, overall mIoU improves by +3.57% and tail-class mIoU by +24.80%; on VAA-KITTI, AuPRCr improves by +19.34 points. Occupancy calibration and OOD detection reliability improve markedly.
Conclusion: ProOOD is a lightweight, plug-and-play method that mitigates long-tail bias and OOD sensitivity, with practical value for safety-critical autonomous driving.
Abstract: 3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug-and-play method that couples prototype-guided refinement with training-free OOD scoring. ProOOD comprises (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features, (ii) prototype-guided tail mining that strengthens rare-class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel-level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state-of-the-art performance on both in-distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it surpasses baselines by +3.57% mIoU overall and +24.80% tail-class mIoU; on VAA-KITTI, it improves AuPRCr by +19.34 points, with consistent gains across benchmarks. These improvements yield more calibrated occupancy estimates and more reliable OOD detection in safety-critical urban driving. The source code is publicly available at https://github.com/7uHeng/ProOOD.
[175] ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data
Yaoqin Ye,Yiteng Xu,Qin Sun,Xinge Zhu,Yujing Sun,Yuexin Ma
Main category: cs.CV
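TL;DR: This paper proposes ReMoGen, a framework for generating human reaction motion in interactive environments in real time, using modular learning to address data scarcity and the need for low-latency, high-fidelity responses.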
Details
Motivation: Human behavior in the real world is interactive, with motion shaped by surrounding people and the scene. Existing methods struggle to generate high-quality reaction motion when data are limited and heterogeneous and low latency is required.
Method: ReMoGen learns a universal motion prior from large-scale single-person motion data and adapts it to different interaction domains through independently trained Meta-Interaction modules. Segment-level generation combined with a lightweight frame-level refinement module enables efficient online response.
Result: On human-human, human-scene, and mixed-modality interaction tasks, ReMoGen generates high-quality, coherent, and responsive motion and generalizes across scenarios.
Conclusion: ReMoGen is a modular framework that supports real-time, robust, and generalizable generation of human reaction motion, suited to virtual avatars, interactive animation, and human-robot collaboration.
Abstract: Human behaviors in real-world environments are inherently interactive, with an individual's motion shaped by surrounding agents and the scene. Modeling such behavior is essential for applications in virtual avatars, interactive animation, and human-robot collaboration. We target real-time human interaction-to-reaction generation, which generates the ego's future motion from dynamic multi-source cues, including others' actions, scene geometry, and optional high-level semantic inputs. This task is fundamentally challenging due to (i) limited and fragmented interaction data distributed across heterogeneous single-person, human-human, and human-scene domains, and (ii) the need to produce low-latency yet high-fidelity motion responses during continuous online interaction. To address these challenges, we propose ReMoGen (Reaction Motion Generation), a modular learning framework for real-time interaction-to-reaction generation. ReMoGen leverages a universal motion prior learned from large-scale single-person motion datasets and adapts it to target interaction domains through independently trained Meta-Interaction modules, enabling robust generalization under data-scarce and heterogeneous supervision. To support responsive online interaction, ReMoGen performs segment-level generation together with a lightweight Frame-wise Segment Refinement module that incorporates newly observed cues at the frame level, improving both responsiveness and temporal coherence without expensive full-sequence inference. Extensive experiments across human-human, human-scene, and mixed-modality interaction settings show that ReMoGen produces high-quality, coherent, and responsive reactions, while generalizing effectively across diverse interaction scenarios.
[176] ProTPS: Prototype-Guided Text Prompt Selection for Continual Learning
Jie Mei,Li-Leng Peng,Keith Fuller,Jenq-Neng Hwang
Main category: cs.CV
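TL;DR: This paper proposes Prototype-guided Text Prompt Selection (ProTPS), which learns class-specific vision prototypes to guide the selection and learning of text prompts, mitigating catastrophic forgetting in continual learning; its effectiveness is validated under several settings.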
Details
Motivation: Existing text-prompt-based continual learning methods struggle to learn unique text prompts that separate new classes from old ones, which leads to overlapping semantic features and catastrophic forgetting.
Method: ProTPS jointly learns class-specific vision prototypes and text prompts, using the prototypes to guide prompt selection and optimization. It is evaluated under class-incremental (CI), cross-datasets continual (CDC), and class-and-domain-incremental (CDI) settings, and a real-world long-tailed marine species dataset, Marine112, is built.
Result: ProTPS outperforms recent state-of-the-art methods under the CI, CDC, and CDI settings, and shows strong generalization and robustness on the new Marine112 dataset.
Conclusion: Through prototype guidance, ProTPS makes text prompts more discriminative and distinct and significantly mitigates catastrophic forgetting in continual learning; Marine112 gives the community a more realistic and challenging benchmark.
Abstract: For continual learning, text-prompt-based methods leverage text encoders and learnable prompts to encode semantic features for classes that arrive sequentially over time. A common challenge encountered by existing works is how to learn unique text prompts, which implicitly carry semantic information of new classes, so that the semantic features of newly arrived classes do not overlap with those of trained classes, thereby mitigating the catastrophic forgetting problem. To address this challenge, we propose a novel approach, Prototype-guided Text Prompt Selection (ProTPS), which intentionally increases the training flexibility and thus encourages the learning of unique text prompts. Specifically, our ProTPS learns class-specific vision prototypes and text prompts. Vision prototypes guide the selection and learning of text prompts for each class. We first evaluate our ProTPS in both the class incremental (CI) setting and the cross-datasets continual (CDC) learning setting. Because our ProTPS achieves performance close to the upper bounds, we further collect a real-world dataset with 112 marine species collected over a span of six years, named Marine112, to bring new challenges to the community. Marine112 is authentically suited for the class and domain incremental (CDI) learning setting and follows a natural long-tail distribution. The results under three settings show that our ProTPS performs favorably against the recent state-of-the-art methods. The implementation code and Marine112 dataset will be released upon the acceptance of our paper.
[177] Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation
Reyhaneh Ahani Manghotay,Jie Liang
Main category: cs.CV
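TL;DR: This paper proposes MoA-DepthCLIP, a parameter-efficient framework that adapts a pretrained CLIP model to monocular depth estimation through a lightweight Mixture-of-Adapters module and selective fine-tuning. On NYU Depth V2 it significantly outperforms the DepthCLIP baseline while requiring far fewer trainable parameters.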
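The two metrics reported for this entry, δ₁ accuracy and RMSE, have standard definitions in monocular depth estimation: δ₁ is the fraction of pixels whose ratio max(pred/gt, gt/pred) is below 1.25, and RMSE is the root mean squared error in depth units. A minimal sketch follows, assuming strictly positive predictions and treating pixels with non-positive ground truth as invalid; the paper's exact evaluation protocol (cropping, depth cap) is not given in this digest, and the function name `depth_metrics` is illustrative.

```python
import numpy as np

def depth_metrics(pred, gt, threshold=1.25):
    """Return (delta_1 accuracy, RMSE) over pixels with valid ground truth."""
    pred = np.asarray(pred, dtype=float).ravel()
    gt = np.asarray(gt, dtype=float).ravel()
    valid = gt > 0  # ignore pixels without ground-truth depth (assumed convention)
    pred, gt = pred[valid], gt[valid]
    # symmetric ratio: penalizes over- and under-estimation alike
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = float(np.mean(ratio < threshold))
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    return delta1, rmse

d1, rmse = depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])
```

In the example two of the three pixels are within the 1.25 ratio, so δ₁ is 2/3, and the single 2 m error gives an RMSE of sqrt(4/3).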
Details
Motivation: Using the rich semantic features of vision-language models such as CLIP for monocular depth estimation is promising, but existing methods often need extensive fine-tuning or lack geometric precision.
Method: A lightweight Mixture-of-Adapters (MoA) module is inserted into the pretrained ViT-B/32 backbone and the final layers are selectively fine-tuned. A global semantic context vector enables spatially aware adaptation, a hybrid architecture combines classification and regression, and a composite loss includes geometric constraints.
Result: On NYU Depth V2, δ₁ accuracy rises from 0.390 to 0.745 and RMSE falls from 1.176 to 0.520, significantly outperforming the DepthCLIP baseline with far fewer trainable parameters.
Conclusion: A lightweight, prompt-guided MoA is an efficient strategy for transferring vision-language model knowledge to fine-grained monocular depth estimation.
Abstract: Leveraging the rich semantic features of vision-language models (VLMs) like CLIP for monocular depth estimation tasks is a promising direction, yet often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone combined with selective fine-tuning of the final layers. This design enables spatially-aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that synergizes depth bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results, significantly outperforming the DepthCLIP baseline by improving the $\delta_1$ accuracy from 0.390 to 0.745 and reducing the RMSE from 1.176 to 0.520. These results are achieved while requiring substantially fewer trainable parameters, demonstrating that lightweight, prompt-guided MoA is a highly effective strategy for transferring VLM knowledge to fine-grained monocular depth estimation tasks.
[178] ReinDriveGen: Reinforcement Post-Training for Out-of-Distribution Driving Scene Generation
Hao Zhang,Lue Fan,Weikang Bian,Zehuan Wu,Lewei Lu,Zhaoxiang Zhang,Hongsheng Li
Main category: cs.CV
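TL;DR: ReinDriveGen is a framework for controllable editing of dynamic driving scenes. It supports trajectory editing to generate safety-critical corner cases and improves video generation quality through RL post-training.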
Details
Motivation: Existing methods struggle to generate safety-critical driving corner cases (collisions, loss of control, and so on), and edited scenes often fall outside the training distribution, which degrades generation quality.
Method: A scene is built from multi-frame LiDAR point clouds, a vehicle completion module recovers full 360° geometry, and the scene is rendered into 2D condition images that guide a video diffusion model. An RL-based post-training strategy with a pairwise preference model and reward mechanism improves generation quality in out-of-distribution (OOD) scenes.
Result: The framework outperforms existing methods on edited driving scenes and on novel ego viewpoint synthesis, reaching state-of-the-art performance.
Conclusion: ReinDriveGen enables high-fidelity, controllable editing and video synthesis of dynamic driving scenes, and is particularly robust and practical in out-of-distribution safety-critical scenarios.
Abstract: We present ReinDriveGen, a framework that enables full controllability over dynamic driving scenes, allowing users to freely edit actor trajectories to simulate safety-critical corner cases such as front-vehicle collisions, drifting cars, vehicles spinning out of control, pedestrians jaywalking, and cyclists cutting across lanes. Our approach constructs a dynamic 3D point cloud scene from multi-frame LiDAR data, introduces a vehicle completion module to reconstruct full 360° geometry from partial observations, and renders the edited scene into 2D condition images that guide a video diffusion model to synthesize realistic driving videos. Since such edited scenarios inevitably fall outside the training distribution, we further propose an RL-based post-training strategy with a pairwise preference model and a pairwise reward mechanism, enabling robust quality improvement under out-of-distribution conditions without ground-truth supervision. Extensive experiments demonstrate that ReinDriveGen outperforms existing approaches on edited driving scenarios and achieves state-of-the-art results on novel ego viewpoint synthesis.
[179] Looking into a Pixel by Nonlinear Unmixing -- A Generative Approach
Maofeng Tang,Hairong Qi
Main category: cs.CV
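TL;DR: This paper proposes LCGU net, a hyperspectral nonlinear unmixing method that needs no explicit mixing model; built on a bi-directional GAN framework and constrained by cycle consistency and the linkage between linear and nonlinear mixtures, it achieves stable and competitive unmixing performance.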
Details
Motivation: Traditional hyperspectral nonlinear unmixing methods rely on a prior mixing model, which limits their performance and generalization; a general unmixing method that does not depend on an explicit model is needed. Method: The linearly-constrained CycleGAN unmixing net (LCGU net) builds an invertible mixing-unmixing process in a bi-directional GAN framework and introduces a cycle consistency loss and a linear-nonlinear mixture linkage constraint, avoiding dependence on any specific mixing model. Result: Experiments on multiple datasets show that LCGU net performs stably and is competitive with current state-of-the-art model-based methods. Conclusion: LCGU net achieves hyperspectral nonlinear unmixing without an explicit mixing model, validating the effectiveness of data-driven constraints (cycle consistency plus linear linkage) and offering a new paradigm for model-free unmixing. Abstract: Due to the large footprint of pixels in remote sensing imagery, hyperspectral unmixing (HU) has become an important and necessary procedure in hyperspectral image analysis. Traditional HU methods rely on a prior spectral mixing model, especially for nonlinear mixtures, which has largely limited the performance and generalization capacity of the unmixing approach. In this paper, we address the challenging problem of hyperspectral nonlinear unmixing (HNU) without explicit knowledge of the mixing model. Inspired by the principle of generative models, where images following the same distribution as the training images can be generated without knowing the exact probability distribution function of the images, we develop an invertible mixing-unmixing process via a bi-directional GAN framework, constrained by both the cycle consistency and the linkage between linear and nonlinear mixtures. The combination of cycle consistency and linear linkage provides powerful constraints without requiring an explicit mixing model. We refer to the proposed approach as the linearly-constrained CycleGAN unmixing net, or LCGU net. Experimental results indicate that the proposed LCGU net exhibits stable and competitive performance across different datasets compared with other state-of-the-art model-based HNU methods.
[180] Open-Set Supervised 3D Anomaly Detection: An Industrial Dataset and a Generalisable Framework for Unknown Defects
Hanzhe Liang,Luocheng Zhang,Junyang Xia,HanLiang Zhou,Bingyang Guo,Yingxi Xie,Can Gao,Ruiyun Yu,Jinbao Wang,Pan Li
Main category: cs.CV
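TL;DR: This paper proposes Open3D-AD, an open-set supervised 3D anomaly detection method for industrial settings, and builds Open-Industry, a high-quality dataset of 15 categories with five real anomaly types each; by modeling the probability density distributions of normal and anomalous point clouds and reducing their overlap, it significantly improves detection of unknown anomalies.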
Details
Motivation: Traditional self-supervised 3D anomaly detection assumes that acquiring high-precision point clouds is costly, yet in real manufacturing a small number of real anomalous samples can often be collected; the authors therefore propose the more realistic setting of open-set supervised 3D anomaly detection. Method: Open3D-AD uses normal samples, simulated anomalies, and partially observed real anomalies to model the probability density distributions of normal and anomalous data, and introduces Correspondence Distributions Subsampling to reduce the overlap between the two distributions; general open-set anomaly detection methods are also adapted to point cloud inputs. Result: The effectiveness of Open3D-AD is validated on the self-built Open-Industry dataset as well as Real3D-AD and Anomaly-ShapeNet; benchmark results and ablation studies show superior performance and reveal the potential of open-set supervised 3D anomaly detection. Conclusion: The open-set supervised setting better matches industrial practice; by jointly using limited real anomalies, simulated anomalies, and normal samples, Open3D-AD detects unknown anomalies effectively and provides a new paradigm and a practical benchmark for 3D anomaly detection. Abstract: Although self-supervised 3D anomaly detection assumes that acquiring high-precision point clouds is computationally expensive, in real manufacturing scenarios it is often feasible to collect a limited number of anomalous samples. Therefore, we study open-set supervised 3D anomaly detection, where the model is trained with only normal samples and a small number of known anomalous samples, aiming to identify unknown anomalies at test time. We present Open-Industry, a high-quality industrial dataset containing 15 categories, each with five real anomaly types collected from production lines. We first adapt general open-set anomaly detection methods to better accommodate 3D point cloud inputs. Building upon this, we propose Open3D-AD, a point-cloud-oriented approach that leverages normal samples, simulated anomalies, and partially observed real anomalies to model the probability density distributions of normal and anomalous data. Then, we introduce a simple Correspondence Distributions Subsampling to reduce the overlap between normal and non-normal distributions, enabling stronger dual-distribution modeling. Based on these contributions, we establish a comprehensive benchmark and evaluate the proposed method extensively on Open-Industry as well as established datasets including Real3D-AD and Anomaly-ShapeNet. Benchmark results and ablation studies demonstrate the effectiveness of Open3D-AD and further reveal the potential of open-set supervised 3D anomaly detection.
[181] Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction
Jorge Condor,Nicolas Moenne-Loccoz,Merlin Nimier-David,Piotr Didyk,Zan Gojcic,Qi Wu
Main category: cs.CV
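TL;DR: This paper proposes Neural Harmonic Textures, a neural representation that anchors latent feature vectors on a virtual scaffold surrounding each primitive and applies periodic activations to model high-frequency detail efficiently; it retains the flexibility of primitive-based methods while significantly improving reconstruction quality and real-time rendering performance.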
Details
Motivation: Existing primitive-based methods (such as 3D Gaussian Splatting) are flexible and scale well, but the limited expressivity of individual primitives restricts their ability to model high-frequency detail. Method: A virtual scaffold is built around each primitive and latent feature vectors are anchored on it; these features are interpolated at ray intersection points; periodic activations inspired by Fourier analysis turn alpha blending into a weighted sum of harmonic components; the result is decoded in a single deferred pass by a small neural network. Result: The method reaches state-of-the-art performance in real-time novel view synthesis while bridging the representational gap between primitive-based and neural-field-based methods; it is compatible with mainstream pipelines such as 3DGUT, Triangle Splatting, and 2DGS, and extends to 2D image fitting and semantic reconstruction. Conclusion: Neural Harmonic Textures is an efficient, general, and easily integrated neural texture representation that balances expressivity and computational cost, providing an important enhancement to primitive-based reconstruction. Abstract: Primitive-based methods such as 3D Gaussian Splatting have recently become the state of the art for novel-view synthesis and related reconstruction tasks. Compared to neural fields, these representations are more flexible, adaptive, and scale better to large scenes. However, the limited expressivity of individual primitives makes modeling high-frequency detail challenging. We introduce Neural Harmonic Textures, a neural representation approach that anchors latent feature vectors on a virtual scaffold surrounding each primitive. These features are interpolated within the primitive at ray intersection points. Inspired by Fourier analysis, we apply periodic activations to the interpolated features, turning alpha blending into a weighted sum of harmonic components. The resulting signal is then decoded in a single deferred pass using a small neural network, significantly reducing computational cost. Neural Harmonic Textures yield state-of-the-art results in real-time novel view synthesis while bridging the gap between primitive- and neural-field-based reconstruction. Our method integrates seamlessly into existing primitive-based pipelines such as 3DGUT, Triangle Splatting, and 2DGS. We further demonstrate its generality with applications to 2D image fitting and semantic reconstruction.
[182] TRACE: High-Fidelity 3D Scene Editing via Tangible Reconstruction and Geometry-Aligned Contextual Video Masking
Jiyuan Hu,Zechuan Zhang,Zongxin Yang,Yi Yang
Main category: cs.CV
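TL;DR: TRACE is a mesh-guided 3D Gaussian Splatting (3DGS) editing framework that aligns a video diffusion model with explicit 3D geometry to achieve automated, high-fidelity scene transformation, supporting fine-grained part-level editing while preserving the structural integrity of the subject.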