Skip to content

Table of Contents

cs.CL [Back]

[1] Decomposing Attention To Find Context-Sensitive Neurons

Alex Gibson

Main category: cs.CL

TL;DR: 提出一种基于稳定注意力头的线性近似方法,从GPT2-Small权重中发现响应高层上下文属性的第一层神经元。

Details Motivation: 研究注意力分布广泛且对内容依赖较弱的注意力头,探索其softmax分母在固定词元分布下的稳定性。 Method: 通过在“校准文本”上采样softmax分母,将GPT2-Small第一层多个稳定注意力头的输出组合近似为上下文的线性摘要。 Result: 实现了仅从模型权重和单个校准文本出发,识别出数百个响应高层上下文特征的第一层神经元,包括在校准文本上未激活的神经元。 Conclusion: 该方法揭示了稳定注意力头的组合可解释性,为理解Transformer底层机制提供了新工具。 Abstract: We study transformer language models, analyzing attention heads whose attention patterns are spread out, and whose attention scores depend weakly on content. We argue that the softmax denominators of these heads are stable when the underlying token distribution is fixed. By sampling softmax denominators from a "calibration text", we can combine together the outputs of multiple such stable heads in the first layer of GPT2-Small, approximating their combined output by a linear summary of the surrounding text. This approximation enables a procedure where from the weights alone - and a single calibration text - we can uncover hundreds of first layer neurons that respond to high-level contextual properties of the surrounding text, including neurons that didn't activate on the calibration text.

[2] Graph-S3: Enhancing Agentic textual Graph Retrieval with Synthetic Stepwise Supervision

Ge Chang,Jinbo Su,Jiacheng Liu,Pengfei Yang,Yuhao Shang,Huiwen Zheng,Hongli Ma,Yan Liang,Yuanchun Li,Yunxin Liu

Main category: cs.CL

TL;DR: 提出Graph-S$^3$,一种基于LLM的文本图推理框架,通过合成的逐步监督信号训练检索器,提升图检索效果。

Details Motivation: 现有基于大语言模型的文本图问答系统在图检索方面表现不佳,主要依赖浅层嵌入相似性或需要大量标注数据的交互式策略,难以兼顾信息丰富性和上下文紧凑性。 Method: 设计了一种数据合成管道来提取黄金子图作为训练信号,并采用两阶段训练方案,利用离线提取的黄金子图对每一步检索过程进行细粒度评估,从而训练LLM-based检索器。 Result: 在三个常用数据集上相比七个强基线平均提升了8.1%的准确率和9.7%的F1分数,尤其在多跳复杂推理任务中优势更明显。 Conclusion: Graph-S$^3$通过合成的逐步监督有效提升了文本图问答中的检索性能,具备高实用性和可扩展性,代码将开源。 Abstract: A significant portion of real-world data is inherently represented as textual graphs, and integrating these graphs into large language models (LLMs) is promising to enable complex graph-based question answering. However, a key challenge in LLM-based textual graph QA systems lies in graph retrieval, i.e., how to retrieve relevant content from large graphs that is sufficiently informative while remaining compact for the LLM context. Existing retrievers suffer from poor performance since they either rely on shallow embedding similarity or employ interactive retrieving policies that demand excessive data labeling and training cost. To address these issues, we present Graph-$S^3$, an agentic textual graph reasoning framework that employs an LLM-based retriever trained with synthetic stepwise supervision. Instead of rewarding the agent based on the final answers, which may lead to sparse and unstable training signals, we propose to closely evaluate each step of the retriever based on offline-extracted golden subgraphs. Our main techniques include a data synthesis pipeline to extract the golden subgraphs for reward generation and a two-stage training scheme to learn the interactive graph exploration policy based on the synthesized rewards. Based on extensive experiments on three common datasets in comparison with seven strong baselines, our approach achieves an average improvement of 8.1\% in accuracy and 9.7\% in F$_1$ score. The advantage is even higher in more complicated multi-hop reasoning tasks. Our code will be open-sourced.

[3] Implicit Values Embedded in How Humans and LLMs Complete Subjective Everyday Tasks

Arjun Arunasalam,Madison Pickering,Z. Berkay Celik,Blase Ur

Main category: cs.CL

TL;DR: 研究了六种流行的大语言模型在完成日常任务时所体现的隐含价值观,并与100名美国众包工作者进行比较,发现大语言模型在隐含价值观上往往与人类及其他模型不一致。

Details Motivation: 探讨大语言模型在执行主观日常任务时所体现的隐含价值观,并评估其与人类价值观的一致性。 Method: 通过审计六种流行的大语言模型完成30项日常任务的情况,将其表现与100名美国众包工作者进行对比。 Result: 大语言模型在完成日常任务时表现出的隐含价值观常常与人类不一致,且不同模型之间也存在差异。 Conclusion: 当前的大语言模型在隐含价值观方面缺乏与人类的一致性,也缺乏跨模型的一致性,表明在AI助手的价值对齐方面仍有改进空间。 Abstract: Large language models (LLMs) can underpin AI assistants that help users with everyday tasks, such as by making recommendations or performing basic computation. Despite AI assistants' promise, little is known about the implicit values these assistants display while completing subjective everyday tasks. Humans may consider values like environmentalism, charity, and diversity. To what extent do LLMs exhibit these values in completing everyday tasks? How do they compare with humans? We answer these questions by auditing how six popular LLMs complete 30 everyday tasks, comparing LLMs to each other and to 100 human crowdworkers from the US. We find LLMs often do not align with humans, nor with other LLMs, in the implicit values exhibited.

[4] Morpheme Induction for Emergent Language

Brendon Boldt,David Mortensen

Main category: cs.CL

TL;DR: 提出了一种名为CSAR的贪心算法,用于从具有并行语句和意义的新兴语言语料库中诱导词素,通过计算形式与意义之间的互信息来加权并逐步提取词素。

Details Motivation: 为了从新兴语言中自动识别和提取有意义的词素,从而理解其语言结构和演化过程。 Method: 采用贪心策略,基于形式与意义之间的互信息对候选词素进行加权,选择权重最高的词素对,移除后重复该过程(Count, Select, Ablate, Repeat)。 Result: 在合成数据上验证了CSAR的有效性,并与基线方法进行了比较;在人类语言数据上也表现出合理预测能力;进一步分析了多种新兴语言中的同义性和多义性等语言特征。 Conclusion: CSAR能有效诱导新兴语言中的词素,并可用于量化分析其语言特性,具有跨领域应用潜力。 Abstract: We introduce CSAR, an algorithm for inducing morphemes from emergent language corpora of parallel utterances and meanings. It is a greedy algorithm that (1) weights morphemes based on mutual information between forms and meanings, (2) selects the highest-weighted pair, (3) removes it from the corpus, and (4) repeats the process to induce further morphemes (i.e., Count, Select, Ablate, Repeat). The effectiveness of CSAR is first validated on procedurally generated datasets and compared against baselines for related tasks. Second, we validate CSAR's performance on human language data to show that the algorithm makes reasonable predictions in adjacent domains. Finally, we analyze a handful of emergent languages, quantifying linguistic characteristics like degree of synonymy and polysemy.

[5] Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video

Mengyao Xu,Wenfei Zhou,Yauhen Babakhin,Gabriel Moreira,Ronay Ak,Radek Osmulski,Bo Liu,Even Oldridge,Benedikt Schifferer

Main category: cs.CL

TL;DR: Omni-Embed-Nemotron 是一个统一的多模态检索嵌入模型,支持文本、图像、音频和视频的跨模态与联合模态检索,提升了复杂真实场景下的信息检索能力。

Details Motivation: 现有基于文本的检索器难以处理包含丰富视觉和语义信息的真实文档(如PDF、幻灯片、视频),且对多模态内容支持不足,因此需要更强大的统一检索模型。 Method: 基于ColPali和Qwen2.5-Omni等模型的启发,构建一个支持文本、图像、音频和视频的统一嵌入模型,实现跨模态和联合模态检索,并通过特定架构设计和训练策略优化性能。 Result: 模型在文本、图像和视频检索任务中均表现出有效性,能够准确实现跨模态和联合模态检索。 Conclusion: Omni-Embed-Nemotron 为复杂多模态信息检索提供了一个统一且高效的解决方案,推动了RAG系统在真实场景中的应用。 Abstract: We present Omni-Embed-Nemotron, a unified multimodal retrieval embedding model developed to handle the increasing complexity of real-world information needs. While Retrieval-Augmented Generation (RAG) has significantly advanced language models by incorporating external knowledge, existing text-based retrievers rely on clean, structured input and struggle with the visually and semantically rich content found in real-world documents such as PDFs, slides, or videos. Recent work such as ColPali has shown that preserving document layout using image-based representations can improve retrieval quality. Building on this, and inspired by the capabilities of recent multimodal models such as Qwen2.5-Omni, we extend retrieval beyond text and images to also support audio and video modalities. Omni-Embed-Nemotron enables both cross-modal (e.g., text - video) and joint-modal (e.g., text - video+audio) retrieval using a single model. We describe the architecture, training setup, and evaluation results of Omni-Embed-Nemotron, and demonstrate its effectiveness in text, image, and video retrieval.

[6] Searching for the Most Human-like Emergent Language

Brendon Boldt,David Mortensen

Main category: cs.CL

TL;DR: 本文设计了一个基于信号博弈的新兴通信环境,通过超参数优化生成与人类语言相似性高的新兴语言,并使用XferBench评估其在深度迁移学习中的表现,验证了熵对迁移性能的预测能力及新兴语言系统熵最小化的特性。

Details Motivation: 提高新兴语言与人类语言的相似性,以增强其在自然语言处理任务中的可迁移性和实用性。 Method: 采用基于信号博弈的建模方法,结合XferBench作为目标函数进行超参数优化,利用熵指标分析语言结构特性。 Result: 成功生成了在统计上更接近人类语言的新兴语言,验证了熵与迁移性能之间的相关性,并发现特定超参数组合能提升语言的现实性与迁移效果。 Conclusion: 通过超参数优化和熵分析,可以有效生成更具人类语言特征的新兴语言,为构建可解释、可迁移的智能通信系统提供了可行路径。 Abstract: In this paper, we design a signalling game-based emergent communication environment to generate state-of-the-art emergent languages in terms of similarity to human language. This is done with hyperparameter optimization, using XferBench as the objective function. XferBench quantifies the statistical similarity of emergent language to human language by measuring its suitability for deep transfer learning to human language. Additionally, we demonstrate the predictive power of entropy on the transfer learning performance of emergent language as well as corroborate previous results on the entropy-minimization properties of emergent communication systems. Finally, we report generalizations regarding what hyperparameters produce more realistic emergent languages, that is, ones which transfer better to human language.

[7] SEER: The Span-based Emotion Evidence Retrieval Benchmark

Aneesha Sampath,Oya Aran,Emily Mower Provost

Main category: cs.CL

TL;DR: 本文提出了SEER基准,用于评估大语言模型在文本中识别情感表达片段的能力,强调了跨句子和段落的情感证据检测任务,并通过新标注数据评估了14个开源模型的表现。

Details Motivation: 传统的情感识别任务通常只对整个句子打一个标签,而忽略了具体表达情感的文本片段。为了更深入理解情感表达方式,需要一种能够精确定位情感证据的方法,这在共情对话和临床支持等应用中尤为重要。 Method: 构建了一个名为SEER的新基准,包含两个任务:单句内情感证据识别和五句连续段落中的跨句情感证据识别。该基准基于1200个真实世界句子的新标注数据,评估了14个开源大语言模型的表现。 Result: 实验结果显示,部分模型在单句任务上接近人类平均水平,但在长段落中的准确率下降。错误分析揭示了模型存在过度依赖情感关键词和在中性文本中产生误报等问题。 Conclusion: SEER为评估大语言模型在细粒度情感理解方面提供了新工具,突出了当前模型在处理复杂上下文时的局限性,为进一步改进提供了方向。 Abstract: We introduce the SEER (Span-based Emotion Evidence Retrieval) Benchmark to test Large Language Models' (LLMs) ability to identify the specific spans of text that express emotion. Unlike traditional emotion recognition tasks that assign a single label to an entire sentence, SEER targets the underexplored task of emotion evidence detection: pinpointing which exact phrases convey emotion. This span-level approach is crucial for applications like empathetic dialogue and clinical support, which need to know how emotion is expressed, not just what the emotion is. SEER includes two tasks: identifying emotion evidence within a single sentence, and identifying evidence across a short passage of five consecutive sentences. It contains new annotations for both emotion and emotion evidence on 1200 real-world sentences. We evaluate 14 open-source LLMs and find that, while some models approach average human performance on single-sentence inputs, their accuracy degrades in longer passages. Our error analysis reveals key failure modes, including overreliance on emotion keywords and false positives in neutral text.

[8] ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic LLM-Generated Text Detection

Ali Khairallah,Arkaitz Zubiaga

Main category: cs.CL

TL;DR: ALHD是首个大规模综合阿拉伯语数据集,旨在区分人类与大语言模型生成的文本,涵盖多种体裁和阿拉伯语变体,包含40多万个平衡样本,并提供基准实验,揭示跨体裁泛化挑战,特别是在新闻领域。

Details Motivation: 现有阿拉伯语中缺乏高质量、多样化的数据集来有效区分人类与大语言模型生成的文本,限制了检测技术的发展和评估。 Method: 构建包含新闻、社交媒体和评论三类文本的大规模平衡数据集ALHD,覆盖标准阿拉伯语和方言,使用三种主流大模型生成文本并结合多个人类来源;进行严格预处理、标注和标准化划分;在传统分类器、BERT类模型和大语言模型(零样本与少样本)上开展基准测试。 Result: 微调后的BERT模型表现最佳,优于基于大模型的零样本和少样本方法;但所有模型在跨体裁泛化时均表现不稳定,尤其在新闻文本中难以区分人写与机器生成内容,因两者风格高度相似。 Conclusion: ALHD为阿拉伯语生成文本检测提供了可靠基础,揭示了当前方法在跨域场景下的局限性,推动未来对泛化能力、风格模仿及风险防控的研究。 Abstract: We introduce ALHD, the first large-scale comprehensive Arabic dataset explicitly designed to distinguish between human- and LLM-generated texts. ALHD spans three genres (news, social media, reviews), covering both MSA and dialectal Arabic, and contains over 400K balanced samples generated by three leading LLMs and originated from multiple human sources, which enables studying generalizability in Arabic LLM-genearted text detection. We provide rigorous preprocessing, rich annotations, and standardized balanced splits to support reproducibility. In addition, we present, analyze and discuss benchmark experiments using our new dataset, in turn identifying gaps and proposing future research directions. Benchmarking across traditional classifiers, BERT-based models, and LLMs (zero-shot and few-shot) demonstrates that fine-tuned BERT models achieve competitive performance, outperforming LLM-based models. Results are however not always consistent, as we observe challenges when generalizing across genres; indeed, models struggle to generalize when they need to deal with unseen patterns in cross-genre settings, and these challenges are particularly prominent when dealing with news articles, where LLM-generated texts resemble human texts in style, which opens up avenues for future research. ALHD establishes a foundation for research related to Arabic LLM-detection and mitigating risks of misinformation, academic dishonesty, and cyber threats.

[9] TS-Reasoner: Aligning Time Series Foundation Models with LLM Reasoning

Fangxu Yu,Hongyu Zhao,Tianyi Zhou

Main category: cs.CL

TL;DR: 本文提出了TS-Reasoner,一种将时间序列基础模型(TSFM)与大语言模型(LLM)对齐用于时间序列理解与推理任务的方法。通过构建合成的时间序列-文本配对数据,并采用两阶段训练策略,在无需微调TSFM的情况下实现高效模态对齐。

Details Motivation: 现有时间序列模型缺乏复杂推理能力,而大语言模型难以理解数值型时间序列数据;如何有效融合二者优势仍是一个挑战。 Method: 提出TS-Reasoner,利用预训练TSFM提取时间序列表征,并通过合成多样化的时间序列-文本对进行对齐训练;采用两阶段训练:先对齐预训练,再指令微调,且冻结TSFM参数。 Result: 在多个基准测试上,TS-Reasoner优于主流LLM、视觉语言模型和时间序列LLM,且训练数据需求不到一半,展现出卓越的数据效率。 Conclusion: TS-Reasoner成功实现了TSFM与LLM的有效对齐,在保持TSFM固定的同时提升了时间序列的复杂推理能力,兼具高性能与高数据效率。 Abstract: Time series reasoning is crucial to decision-making in diverse domains, including finance, energy usage, traffic, weather, and scientific discovery. While existing time series foundation models (TSFMs) can capture low-level dynamic patterns and provide accurate forecasting, further analysis usually requires additional background knowledge and sophisticated reasoning, which are lacking in most TSFMs but can be achieved through large language models (LLMs). On the other hand, without expensive post-training, LLMs often struggle with the numerical understanding of time series data. Although it is intuitive to integrate the two types of models, developing effective training recipes that align the two modalities for reasoning tasks is still an open challenge. To this end, we propose TS-Reasoner that aligns the latent representations of TSFMs with the textual inputs of LLMs for downstream understanding/reasoning tasks. Specifically, we propose a simple yet effective method to curate diverse, synthetic pairs of time series and textual captions for alignment training. We then develop a two-stage training recipe that applies instruction finetuning after the alignment pretraining. Unlike existing works that train an LLM to take time series as inputs, we leverage a pretrained TSFM and freeze it during training. Extensive experiments on several benchmarks demonstrate that TS-Reasoner not only outperforms a wide range of prevailing LLMs, Vision Language Models (VLMs), and Time Series LLMs, but also achieves this with remarkable data efficiency, e.g., using less than half the training data.

[10] Identifying Financial Risk Information Using RAG with a Contrastive Insight

Ali Elahi

Main category: cs.CL

TL;DR: 提出一种基于RAG的对比推理层,用于在专业领域(如金融)中生成更具上下文相关性和对比性的分析结果,优于传统RAG方法。

Details Motivation: RAG在提取事实信息方面有效,但在专业推理任务中输出往往过于泛化,缺乏对相似案例或问题的比较分析能力,尤其在金融领域导致风险分析缺乏特异性。 Method: 在RAG基础上引入一种具有同伴感知能力的对比推理层,通过检索和比较相似案例进行推理,增强输出的上下文相关性和差异性分析能力。 Result: 该方法在ROUGE和BERTScore等文本生成指标上优于基线RAG模型,生成的分析更接近人工撰写的股票研究和风险评估。 Conclusion: 所提出的对比推理层能有效提升RAG在专业领域中的推理能力,生成更具针对性和洞察力的内容。 Abstract: In specialized domains, humans often compare new problems against similar examples, highlight nuances, and draw conclusions instead of analyzing information in isolation. When applying reasoning in specialized contexts with LLMs on top of a RAG, the pipeline can capture contextually relevant information, but it is not designed to retrieve comparable cases or related problems. While RAG is effective at extracting factual information, its outputs in specialized reasoning tasks often remain generic, reflecting broad facts rather than context-specific insights. In finance, it results in generic risks that are true for the majority of companies. To address this limitation, we propose a peer-aware comparative inference layer on top of RAG. Our contrastive approach outperforms baseline RAG in text generation metrics such as ROUGE and BERTScore in comparison with human-generated equity research and risk.

[11] Sample, Align, Synthesize: Graph-Based Response Synthesis with ConGrs

Sayan Ghosh,Shahzaib Saqib Warraich,Dhruv Tarsadiya,Gregory Yauney,Swabha Swayamdipta

Main category: cs.CL

TL;DR: 本文提出了Consensus Graphs (ConGrs),一种基于有向无环图的数据结构,用于整合大语言模型多次采样生成的长文本响应中的共识与语义差异,通过轻量级对齐算法和辅助判断模型构建,并设计任务相关的解码方法生成最终响应。实验表明该方法在事实准确性、拒绝不可回答问题和数学推理任务上均优于基线方法。

Details Motivation: 现有方法难以高效整合大语言模型多次采样生成的长文本响应中的认知信号(如共识与差异),缺乏有效表示和利用响应间语义变化的结构化方法。 Method: 提出Consensus Graphs(ConGrs),使用生物信息学中的轻量级词法序列对齐算法构建响应间的共享信息与语义变异的DAG结构,并结合辅助LM判断器进行优化;设计任务依赖的解码策略从ConGr中生成最终输出。 Result: 在两个传记生成任务上事实准确率最高提升31%,相比其他方法减少80%以上的LM判断器使用;在三个拒绝响应任务上拒答率提高56%;在MATH和AIME推理任务上比自验证和多数投票基线最高提升6个百分点。 Conclusion: ConGrs提供了一种灵活且高效的方法来捕捉和利用语言模型响应中的变异信息,显著提升了生成结果的事实性、可靠性和推理能力。 Abstract: Language models can be sampled multiple times to access the distribution underlying their responses, but existing methods cannot efficiently synthesize rich epistemic signals across different long-form responses. We introduce Consensus Graphs (ConGrs), a flexible DAG-based data structure that represents shared information, as well as semantic variation in a set of sampled LM responses to the same prompt. We construct ConGrs using a light-weight lexical sequence alignment algorithm from bioinformatics, supplemented by the targeted usage of a secondary LM judge. Further, we design task-dependent decoding methods to synthesize a single, final response from our ConGr data structure. Our experiments show that synthesizing responses from ConGrs improves factual precision on two biography generation tasks by up to 31% over an average response and reduces reliance on LM judges by more than 80% compared to other methods. We also use ConGrs for three refusal-based tasks requiring abstention on unanswerable queries and find that abstention rate is increased by up to 56%. We apply our approach to the MATH and AIME reasoning tasks and find an improvement over self-verification and majority vote baselines by up to 6 points of accuracy. We show that ConGrs provide a flexible method for capturing variation in LM responses and using the epistemic signals provided by response variation to synthesize more effective responses.

[12] Fine-Tuning on Noisy Instructions: Effects on Generalization and Performance

Ahmed Alajrami,Xingwei Tan,Nikolaos Aletras

Main category: cs.CL

TL;DR: 本文研究了在指令调优中引入扰动(如去除停用词或打乱词语顺序)对大语言模型性能的影响,发现这种方法不仅能提高模型对噪声指令的鲁棒性,在某些情况下还能提升其在原始和扰动基准任务上的下游性能。

Details Motivation: 大语言模型在指令调优后虽能更好完成任务,但对指令表述的小变化敏感,因此需要提升其对噪声指令的鲁棒性。 Method: 在指令调优过程中引入多种扰动方式(如去除停用词、词语打乱),并在MMLU、BBH、GSM8K等基准上评估模型在原始和扰动指令下的表现,分析学习动态和行为变化。 Result: 指令调优中使用扰动指令能在某些情况下提升模型在原始和扰动测试集上的性能,并增强其对噪声输入的抵抗能力。 Conclusion: 在指令调优中加入扰动指令有助于提升大语言模型的鲁棒性和实际应用中的稳定性,建议在训练数据中包含扰动版本的指令。 Abstract: Instruction-tuning plays a vital role in enhancing the task-solving abilities of large language models (LLMs), improving their usability in generating helpful responses on various tasks. However, previous work has demonstrated that they are sensitive to minor variations in instruction phrasing. In this paper, we explore whether introducing perturbations in instruction-tuning data can enhance LLMs' resistance against noisy instructions. We focus on how instruction-tuning with perturbations, such as removing stop words or shuffling words, affects LLMs' performance on the original and perturbed versions of widely-used benchmarks (MMLU, BBH, GSM8K). We further assess learning dynamics and potential shifts in model behavior. Surprisingly, our results suggest that instruction-tuning on perturbed instructions can, in some cases, improve downstream performance. These findings highlight the importance of including perturbed instructions in instruction-tuning, which can make LLMs more resilient to noisy user inputs.

[13] TriMediQ: A Triplet-Structured Approach for Interactive Medical Question Answering

Zhaohan Meng,Zaiqiao Meng,Siwei Liu,Iadh Ounis

Main category: cs.CL

TL;DR: TriMediQ是一种基于三元组结构的知识图谱方法,通过将患者对话转化为结构化三元组并支持多跳推理,显著提升了大语言模型在交互式医疗问答中的准确性。

Details Motivation: 现有大语言模型在静态单轮医疗问答中表现良好,但在模拟真实临床问诊的多轮交互场景中表现下降,因难以从无结构的对话日志中进行连贯的临床推理。 Method: 提出TriMediQ框架:使用冻结的三元组生成器从患者回应中提取临床相关三元组,并构建知识图谱;通过可训练的图编码器和投影模块捕捉关系信息,在保持LLM权重冻结的情况下进行微调,指导推理过程中的多跳推理。 Result: 在两个交互式问答基准上评估,TriMediQ在iMedQA数据集上比五个基线模型最高提升10.4%的准确率。 Conclusion: 将患者回应转化为结构化的三元组知识图谱能有效提升大语言模型在多轮临床对话中的推理准确性,为部署可靠的LLM医疗助手提供了可行方案。 Abstract: Large Language Models (LLMs) perform strongly in static and single-turn medical Question Answer (QA) benchmarks, yet such settings diverge from the iterative information gathering process required in practical clinical consultations. The MEDIQ framework addresses this mismatch by recasting the diagnosis as an interactive dialogue between a patient and an expert system, but the reliability of LLMs drops dramatically when forced to reason with dialogue logs, where clinical facts appear in sentences without clear links. To bridge this gap, we introduce TriMediQ, a triplet-structured approach that summarises patient responses into triplets and integrates them into a Knowledge Graph (KG), enabling multi-hop reasoning. We introduce a frozen triplet generator that extracts clinically relevant triplets, using prompts designed to ensure factual consistency. In parallel, a trainable projection module, comprising a graph encoder and a projector, captures relational information from the KG to enhance expert reasoning. TriMediQ operates in two steps: (i) the projection module fine-tuning with all LLM weights frozen; and (ii) using the fine-tuned module to guide multi-hop reasoning during inference. We evaluate TriMediQ on two interactive QA benchmarks, showing that it achieves up to 10.4\% improvement in accuracy over five baselines on the iMedQA dataset. These results demonstrate that converting patient responses into structured triplet-based graphs enables more accurate clinical reasoning in multi-turn settings, providing a solution for the deployment of LLM-based medical assistants.

[14] What is a protest anyway? Codebook conceptualization is still a first-order concern in LLM-era classification

Andrew Halterman,Katherine A. Keith

Main category: cs.CL

TL;DR: 本文探讨了在计算社会科学中使用大语言模型进行文本分类时,概念化过程的重要性及其对下游统计推断的影响。

Details Motivation: 强调在大语言模型时代,概念化步骤常被忽视,可能导致偏差。 Method: 通过模拟研究展示概念化引起的偏差无法仅通过提高模型准确率或事后偏差校正方法来纠正。 Result: 发现概念化错误会导致下游估计的偏倚,且这种偏倚难以通过现有手段完全消除。 Conclusion: 提醒研究者在使用大语言模型时仍需重视概念化,并提供了实现低成本、无偏、低方差估计的具体建议。 Abstract: Generative large language models (LLMs) are now used extensively for text classification in computational social science (CSS). In this work, focus on the steps before and after LLM prompting -- conceptualization of concepts to be classified and using LLM predictions in downstream statistical inference -- which we argue have been overlooked in much of LLM-era CSS. We claim LLMs can tempt analysts to skip the conceptualization step, creating conceptualization errors that bias downstream estimates. Using simulations, we show that this conceptualization-induced bias cannot be corrected for solely by increasing LLM accuracy or post-hoc bias correction methods. We conclude by reminding CSS analysts that conceptualization is still a first-order concern in the LLM-era and provide concrete advice on how to pursue low-cost, unbiased, low-variance downstream estimates.

[15] CCD-Bench: Probing Cultural Conflict in Large Language Model Decision-Making

Hasibur Rahman,Hanan Salam

Main category: cs.CL

TL;DR: CCD-Bench是一个新基准,用于评估大语言模型在跨文化价值观冲突下的决策能力,揭示了现有模型偏好特定文化、缺乏深层多元价值整合的问题。

Details Motivation: 现有基准未能评估大语言模型在多种文化价值观直接冲突时的决策能力,因此需要一个专门针对跨文化价值冲突的新评估工具。 Method: 构建包含2182个开放式困境的CCD-Bench数据集,覆盖七个领域,每个困境对应十个GLOBE文化集群的匿名选项,并采用分层拉丁方设计减少顺序效应;评估17个非推理型大语言模型的响应偏好和推理过程。 Result: 模型显著偏好北欧和德语区欧洲文化选项(分别平均20.2%和12.4%),而东欧和中东及北非选项被严重低估(5.6%-5.8%);尽管87.9%的推理提及多个GLOBE维度,但多为表面多元,常重组未来导向与绩效导向,极少涉及决断性或性别平等(均低于3%);顺序效应可忽略,模型响应模式按开发者 lineage 聚类而非地理分布。 Conclusion: 当前对齐机制倾向于推广一种共识导向的世界观,难以应对需权力协商、基于权利的推理或性别敏感分析的场景;CCD-Bench推动评估从单一偏见检测转向多元决策,并强调需实质性融入多样世界观的对齐策略。 Abstract: Although large language models (LLMs) are increasingly implicated in interpersonal and societal decision-making, their ability to navigate explicit conflicts between legitimately different cultural value systems remains largely unexamined. Existing benchmarks predominantly target cultural knowledge (CulturalBench), value prediction (WorldValuesBench), or single-axis bias diagnostics (CDEval); none evaluate how LLMs adjudicate when multiple culturally grounded values directly clash. We address this gap with CCD-Bench, a benchmark that assesses LLM decision-making under cross-cultural value conflict. CCD-Bench comprises 2,182 open-ended dilemmas spanning seven domains, each paired with ten anonymized response options corresponding to the ten GLOBE cultural clusters. These dilemmas are presented using a stratified Latin square to mitigate ordering effects. We evaluate 17 non-reasoning LLMs. Models disproportionately prefer Nordic Europe (mean 20.2 percent) and Germanic Europe (12.4 percent), while options for Eastern Europe and the Middle East and North Africa are underrepresented (5.6 to 5.8 percent). Although 87.9 percent of rationales reference multiple GLOBE dimensions, this pluralism is superficial: models recombine Future Orientation and Performance Orientation, and rarely ground choices in Assertiveness or Gender Egalitarianism (both under 3 percent). Ordering effects are negligible (Cramer's V less than 0.10), and symmetrized KL divergence shows clustering by developer lineage rather than geography. These patterns suggest that current alignment pipelines promote a consensus-oriented worldview that underserves scenarios demanding power negotiation, rights-based reasoning, or gender-aware analysis. CCD-Bench shifts evaluation beyond isolated bias detection toward pluralistic decision making and highlights the need for alignment strategies that substantively engage diverse worldviews.

[16] Reactive Transformer (RxT) -- Stateful Real-Time Processing for Event-Driven Reactive Language Models

Adam Filipek

Main category: cs.CL

TL;DR: 本文提出了一种名为Reactive Transformer (RxT)的新架构,旨在解决传统Transformer在对话AI中因无状态性和二次计算复杂度导致的效率问题。RxT通过引入固定大小的短期记忆(STM)系统,将对话轮次作为离散事件实时处理,实现了从数据驱动到事件驱动的范式转变。

Details Motivation: 传统Transformer模型在长对话中由于需重复处理不断增长的历史记录,导致计算成本和延迟过高,难以支持高效、实时的交互。因此需要一种更高效的架构来支持经济可行的长对话应用。 Method: RxT采用生成-解码器基于当前查询和先前记忆状态生成响应,随后由记忆编码器和专用的记忆注意力网络异步更新短期记忆(STM),实现响应生成与记忆更新的解耦,并保持固定大小的上下文存储。 Result: 实验表明,RxT相比同规模的无状态基线模型,在合成数据上表现出更优性能,并实现恒定的推理延迟,总用户端成本从二次降低为线性复杂度。 Conclusion: RxT通过事件驱动机制和分离式记忆更新,有效解决了Transformer在长对话中的扩展性与延迟问题,为实现低延迟、有状态且经济高效的长文本对话提供了可行方案。 Abstract: The Transformer architecture has become the de facto standard for Large Language Models (LLMs), demonstrating remarkable capabilities in language understanding and generation. However, its application in conversational AI is fundamentally constrained by its stateless nature and the quadratic computational complexity ($O(L^2)$) with respect to sequence length $L$. Current models emulate memory by reprocessing an ever-expanding conversation history with each turn, leading to prohibitive costs and latency in long dialogues. This paper introduces the Reactive Transformer (RxT), a novel architecture designed to overcome these limitations by shifting from a data-driven to an event-driven paradigm. RxT processes each conversational turn as a discrete event in real-time, maintaining context in an integrated, fixed-size Short-Term Memory (STM) system. The architecture features a distinct operational cycle where a generator-decoder produces a response based on the current query and the previous memory state, after which a memory-encoder and a dedicated Memory Attention network asynchronously update the STM with a representation of the complete interaction. This design fundamentally alters the scaling dynamics, reducing the total user-facing cost of a conversation from quadratic ($O(N^2 \cdot T)$) to linear ($O(N \cdot T)$) with respect to the number of interactions $N$. By decoupling response generation from memory updates, RxT achieves low latency, enabling truly real-time, stateful, and economically viable long-form conversations. We validated our architecture with a series of proof-of-concept experiments on synthetic data, demonstrating superior performance and constant-time inference latency compared to a baseline stateless model of comparable size.

[17] LLM, Reporting In! Medical Information Extraction Across Prompting, Fine-tuning and Post-correction

Ikram Belmadani,Parisa Nazari Hashemi,Thomas Sebbag,Benoit Favre,Guillaume Fortier,Solen Quiniou,Emmanuel Morin,Richard Dufour

Main category: cs.CL

TL;DR: 本文提出了三种结合大语言模型、标注指南、合成数据和后处理的方法,参与了EvalLLM 2025法语生物医学命名实体识别与健康事件抽取挑战赛(少样本设置)。GPT-4.1通过精心设计的提示在极低资源场景下表现最佳。

Details Motivation: 在法语生物医学领域少样本条件下,如何有效利用大语言模型提升命名实体识别和事件抽取性能。 Method: 采用三种方法:基于GPT-4.1的上下文学习(含自动示例选择与标注指南摘要)、GLiNER模型在合成数据上微调并由LLM后处理验证、LLaMA-3.1-8B-Instruct在相同合成数据上微调;事件抽取复用GPT-4.1的上下文学习策略。 Result: GPT-4.1在NER任务上取得61.53%的macro-F1,在事件抽取上达到15.02%,表现最优,凸显了提示工程在低资源场景中的关键作用。 Conclusion: 精心设计的提示结合大语言模型在极低资源的法语生物医学信息抽取任务中具有显著优势。 Abstract: This work presents our participation in the EvalLLM 2025 challenge on biomedical Named Entity Recognition (NER) and health event extraction in French (few-shot setting). For NER, we propose three approaches combining large language models (LLMs), annotation guidelines, synthetic data, and post-processing: (1) in-context learning (ICL) with GPT-4.1, incorporating automatic selection of 10 examples and a summary of the annotation guidelines into the prompt, (2) the universal NER system GLiNER, fine-tuned on a synthetic corpus and then verified by an LLM in post-processing, and (3) the open LLM LLaMA-3.1-8B-Instruct, fine-tuned on the same synthetic corpus. Event extraction uses the same ICL strategy with GPT-4.1, reusing the guideline summary in the prompt. Results show GPT-4.1 leads with a macro-F1 of 61.53% for NER and 15.02% for event extraction, highlighting the importance of well-crafted prompting to maximize performance in very low-resource scenarios.

[18] Decoupling Task-Solving and Output Formatting in LLM Generation

Haikang Deng,Po-Nien Kung,Nanyun Peng

Main category: cs.CL

TL;DR: 本文提出了一种名为Deco-G的解码框架,通过将格式遵循与任务求解显式分离来提升大语言模型在复杂指令下的表现。

Details Motivation: 随着提示词复杂度增加,大模型难以同时满足推理要求和严格格式,需要解耦二者以提升性能。 Method: 引入Deco-G框架,使用独立的可追踪概率模型(TPM)处理格式合规性,LLM仅关注任务求解,并在每一步解码中结合两者的概率输出。采用指令感知蒸馏、灵活的trie构建算法和HMM状态剪枝三项创新提升效率和可扩展性。 Result: 在数学推理、LLM-as-a-Judge和事件论元抽取等多个任务上实现了1.0%到6.0%的相对性能提升,并保证格式完全合规。 Conclusion: Deco-G通过解耦任务求解与格式遵循,有效提升了大模型在复杂指令下的性能和可靠性,具有广泛适用性和实用价值。 Abstract: Large language models (LLMs) are increasingly adept at following instructions containing task descriptions to solve complex problems, such as mathematical reasoning and automatic evaluation (LLM-as-a-Judge). However, as prompts grow more complex, models often struggle to adhere to all instructions. This difficulty is especially common when instructive prompts intertwine reasoning directives -- specifying what the model should solve -- with rigid formatting requirements that dictate how the solution must be presented. The entanglement creates competing goals for the model, suggesting that more explicit separation of these two aspects could lead to improved performance. To this front, we introduce Deco-G, a decoding framework that explicitly decouples format adherence from task solving. Deco-G handles format compliance with a separate tractable probabilistic model (TPM), while prompts LLMs with only task instructions. At each decoding step, Deco-G combines next token probabilities from the LLM with the TPM calculated format compliance likelihood to form the output probability. To make this approach both practical and scalable for modern instruction-tuned LLMs, we introduce three key innovations: instruction-aware distillation, a flexible trie-building algorithm, and HMM state pruning for computational efficiency. We demonstrate the effectiveness of Deco-G across a wide range of tasks with diverse format requirements, including mathematical reasoning, LLM-as-a-judge, and event argument extraction. Overall, our approach yields 1.0% to 6.0% relative gain over regular prompting practice with guaranteed format compliance.

[19] Can an LLM Induce a Graph? Investigating Memory Drift and Context Length

Raquib Bin Yousuf,Aadyant Khatri,Shengzhe Xu,Mandar Sharma,Naren Ramakrishnan

Main category: cs.CL

TL;DR: 本文提出了一种新的评估大语言模型(LLM)的方法,强调在信息密集场景中进行关系推理任务的重要性,而非简单的‘针尖检索’任务。研究发现,当涉及从噪声文本中提取结构化知识时,LLM在较短的上下文长度内就开始出现记忆漂移和上下文遗忘现象,这比现有基准测试所显示的结果更为严重。即使是专为推理设计的模型(如OpenAI o1)也难以避免早期记忆漂移。结果表明,当前模型在从非结构化输入中抽象出结构化知识方面存在显著局限性,并呼吁对模型架构进行改进以增强长距离推理能力。

Details Motivation: 现有的评估基准多依赖于简单的‘针尖检索’或续写任务,无法真实反映大语言模型在复杂、信息密集场景下的表现。因此,需要一种更贴近实际应用的评估方式,特别是在需要模型从自然语言中推导结构化关系知识的任务上。 Method: 作者设计了基于图结构的关系推理任务,要求模型从分布广泛、夹杂无关信息的长上下文中归纳出隐含的结构化知识。通过对比不同模型在该任务上的表现,分析其记忆保持能力和上下文利用效率。 Result: 实验结果显示,大语言模型在执行此类关系推理任务时,有效上下文长度显著缩短,表现出明显的记忆漂移和上下文遗忘。即使是专门优化用于推理的模型(如OpenAI o1)也无法避免这一问题。 Conclusion: 当前的大语言模型在从非结构化文本中提取结构化知识方面存在根本性限制,现有基准可能高估了其实际推理能力。未来需要在模型架构层面进行改进,以支持更有效的长程依赖建模和复杂推理。 Abstract: Recently proposed evaluation benchmarks aim to characterize the effective context length and the forgetting tendencies of large language models (LLMs). However, these benchmarks often rely on simplistic 'needle in a haystack' retrieval or continuation tasks that may not accurately reflect the performance of these models in information-dense scenarios. Thus, rather than simple next token prediction, we argue for evaluating these models on more complex reasoning tasks that requires them to induce structured relational knowledge from the text - such as graphs from potentially noisy natural language content. While the input text can be viewed as generated in terms of a graph, its structure is not made explicit and connections must be induced from distributed textual cues, separated by long contexts and interspersed with irrelevant information. Our findings reveal that LLMs begin to exhibit memory drift and contextual forgetting at much shorter effective lengths when tasked with this form of relational reasoning, compared to what existing benchmarks suggest. With these findings, we offer recommendations for the optimal use of popular LLMs for complex reasoning tasks. We further show that even models specialized for reasoning, such as OpenAI o1, remain vulnerable to early memory drift in these settings. These results point to significant limitations in the models' ability to abstract structured knowledge from unstructured input and highlight the need for architectural adaptations to improve long-range reasoning.

[20] Towards Unsupervised Speech Recognition at the Syllable-Level

Liming Wang,Junrui Ni,Kai-Wei Chang,Saurabhchand Bhati,David Harwath,Mark Hasegawa-Johnson,James R. Glass

Main category: cs.CL

TL;DR: 本文提出了一种基于音节级别的掩码语言建模的无监督语音识别(UASR)框架,避免了对图音转换工具(G2P)和GAN方法的依赖,在LibriSpeech上实现了最高40%的字符错误率(CER)相对降低,并有效推广到此前难以处理的中文语言。

Details Motivation: 现有的基于音素的无监督语音识别方法依赖昂贵资源(如G2P),且在音素边界模糊的语言中因训练不稳定而泛化能力差,因此需要一种更鲁棒、资源需求更低的新方法。 Method: 提出一种基于音节级别的UASR框架,采用掩码语言建模方法,无需G2P转换,避免了GAN-based方法带来的训练不稳定问题。 Result: 在LibriSpeech数据集上实现了最高40%的相对CER降低,且在中文上表现出良好的泛化能力,显著优于先前方法。 Conclusion: 该音节级掩码语言建模框架为无监督语音识别提供了一种高效、稳定的解决方案,特别适用于低资源及音素边界模糊的语言,推动了多模态非平行数据学习的发展。 Abstract: Training speech recognizers with unpaired speech and text -- known as unsupervised speech recognition (UASR) -- is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning from non-parallel data. However, existing approaches based on phones often rely on costly resources such as grapheme-to-phoneme converters (G2Ps) and struggle to generalize to languages with ambiguous phoneme boundaries due to training instability. In this paper, we address both challenges by introducing a syllable-level UASR framework based on masked language modeling, which avoids the need for G2P and the instability of GAN-based methods. Our approach achieves up to a 40\% relative reduction in character error rate (CER) on LibriSpeech and generalizes effectively to Mandarin, a language that has remained particularly difficult for prior methods. Code will be released upon acceptance.

[21] UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG

Xiangyu Peng,Cab Qin,Zeyuan Chen,Ran Xu,Caiming Xiong,Chien-Sheng Wu

Main category: cs.CL

TL;DR: 本文提出了UniDoc-Bench,首个大规模、真实场景下的多模态检索增强生成(MM-RAG)基准,基于70,000个真实PDF页面构建,包含1,600个多模态问答对,涵盖多种查询类型,并支持四种范式间的统一比较。

Details Motivation: 现有MM-RAG评估碎片化,无法反映以文档为中心的多模态应用场景,缺乏统一、真实的评测基准。 Method: 从真实PDF中提取文本、表格和图像证据,构建包含1,600个QA对的数据集;设计统一协议支持文本、图像及融合模式的对比;20%样本经多人标注与专家仲裁确保质量。 Result: 实验表明,多模态文本-图像融合RAG系统在性能上持续优于单模态及联合多模态嵌入检索方法;揭示了视觉上下文如何补充文本证据,并发现当前多模态嵌入的不足。 Conclusion: UniDoc-Bench为MM-RAG提供了可靠评测平台,证明融合式多模态检索更有效,且指出了当前技术的局限性与改进方向。 Abstract: Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models (LLMs) and agents to real-world knowledge bases, yet current evaluations are fragmented, focusing on either text or images in isolation or on simplified multimodal setups that fail to capture document-centric multimodal use cases. In this paper, we introduce UniDoc-Bench, the first large-scale, realistic benchmark for MM-RAG built from 70k real-world PDF pages across eight domains. Our pipeline extracts and links evidence from text, tables, and figures, then generates 1,600 multimodal QA pairs spanning factual retrieval, comparison, summarization, and logical reasoning queries. To ensure reliability, 20% of QA pairs are validated by multiple annotators and expert adjudication. UniDoc-Bench supports apples-to-apples comparison across four paradigms: (1) text-only, (2) image-only, (3) multimodal text-image fusion, and (4) multimodal joint retrieval -- under a unified protocol with standardized candidate pools, prompts, and evaluation metrics. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval, indicating that neither text nor images alone are sufficient and that current multimodal embeddings remain inadequate. Beyond benchmarking, our analysis reveals when and how visual context complements textual evidence, uncovers systematic failure modes, and offers actionable guidance for developing more robust MM-RAG pipelines.

[22] Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text

Nisar Hussain,Amna Qasim,Gull Mehak,Muhammad Zain,Momina Hafeez,Grigori Sidorov

Main category: cs.CL

TL;DR: 本文提出了一种基于QLoRA的微调框架,用于提升罗马乌尔都语-英语混杂文本中的冒犯性语言检测性能。通过将代码混合数据翻译成英文以利用英文大模型,尽管牺牲了部分混合特征,但在低资源环境下取得了优异的分类效果。

Details Motivation: 由于语法不明确、拼写不一致和标注数据稀缺,罗马乌尔都语等代码混合语言中的冒犯性语言检测对NLP系统构成挑战。 Method: 采用Google Translate将罗马乌尔都语-英语混合数据集翻译为英文,并使用QLoRA技术对多种大模型(如LLaMA 3 8B、Mistral 7B等)进行高效微调,以适应低资源场景下的分类任务。 Result: 在人工标注的数据集上评估显示,Meta LLaMA 3 8B取得了91.45的最高F1分数,Mistral 7B达到89.66,均优于传统模型。 Conclusion: QLoRA在低资源代码混合语言的冒犯性检测中表现出色,验证了大语言模型在此类任务中的潜力,为多语言冒犯性内容检测系统的构建提供了可扩展的路径。 Abstract: The use of derogatory terms in languages that employ code mixing, such as Roman Urdu, presents challenges for Natural Language Processing systems due to unstated grammar, inconsistent spelling, and a scarcity of labeled data. In this work, we propose a QLoRA based fine tuning framework to improve offensive language detection in Roman Urdu-English text. We translated the Roman Urdu-English code mixed dataset into English using Google Translate to leverage English LLMs, while acknowledging that this translation reduces direct engagement with code mixing features. Our focus is on classification performance using English translated low resource inputs. We fine tuned several transformers and large language models, including Meta LLaMA 3 8B, Mistral 7B v0.1, LLaMA 2 7B, ModernBERT, and RoBERTa, with QLoRA for memory efficient adaptation. Models were trained and evaluated on a manually annotated Roman Urdu dataset for offensive vs non offensive content. Of all tested models, the highest F1 score of 91.45 was attained by Meta LLaMA 3 8B, followed by Mistral 7B at 89.66, surpassing traditional transformer baselines. These results demonstrate the efficacy of QLoRA in fine tuning high performing models for low resource environments such as code mixed offensive language detection, and confirm the potential of LLMs for this task. This work advances a scalable approach to Roman Urdu moderation and paves the way for future multilingual offensive detection systems based on LLMs.

[23] MedReflect: Teaching Medical LLMs to Self-Improve via Reflective Correction

Yue Huang,Yanyuan Chen,Dexuan Xu,Weihua Yue,Huamin Zhang,Meikang Qiu,Yu Huang

Main category: cs.CL

TL;DR: 本文提出了一种名为MedReflect的通用框架,通过激发大语言模型的自我反思能力来解决医学问题,无需外部检索或大量标注数据。

Details Motivation: 现有的大语言模型在医学问题解决中依赖外部知识检索或大量标注数据,存在开销大、成本高的问题,且性能受限。因此需要一种更高效、低成本的方法来提升模型在医学领域的推理能力。 Method: MedReflect框架通过生成单次反射链实现医学问题求解,包括初始假设生成、自我提问、自我回答和决策优化四个步骤,利用模型自身的反思能力进行自我验证和改进,无需外部检索或大量标注数据。此外,仅使用2000个随机采样的训练样本进行轻量微调即可构建高效的医学数据集。 Result: MedReflect在多个医学基准测试中实现了显著的绝对准确率提升,同时大幅减少了对标注数据的需求,证明了该方法在降低外部监督依赖和减少任务特定微调数据方面的有效性。 Conclusion: 大语言模型可以通过自我反思机制学习并解决专业医学问题,MedReflect为减少对外部资源依赖提供了可行路径,具有成本效益高、泛化能力强的优点。 Abstract: Medical problem solving demands expert knowledge and intricate reasoning. Recent studies of large language models (LLMs) attempt to ease this complexity by introducing external knowledge verification through retrieval-augmented generation or by training on reasoning datasets. However, these approaches suffer from drawbacks such as retrieval overhead and high annotation costs, and they heavily rely on substituted external assistants to reach limited performance in medical field. In this paper, we introduce MedReflect, a generalizable framework designed to inspire LLMs with a physician-like reflective thinking mode. MedReflect generates a single-pass reflection chain that includes initial hypothesis generation, self-questioning, self-answering and decision refinement. This self-verified and self-reflective nature releases large language model's latent capability in medical problem-solving without external retrieval or heavy annotation. We demonstrate that MedReflect enables cost-efficient medical dataset construction: with merely 2,000 randomly sampled training examples and a light fine-tuning, this approach achieves notable absolute accuracy improvements across a series of medical benchmarks while cutting annotation requirements. Our results provide evidence that LLMs can learn to solve specialized medical problems via self-reflection and self-improve, reducing reliance on external supervision and extensive task-specific fine-tuning data.

[24] TreePrompt: Leveraging Hierarchical Few-Shot Example Selection for Improved English-Persian and English-German Translation

Ramtin Kakavand,Ebrahim Ansari

Main category: cs.CL

TL;DR: 提出了一种名为TreePrompt的新颖示例选择方法,通过学习大语言模型偏好,在树形结构框架内识别高质量且上下文相关的示例,结合AFSP或随机选择可提升翻译性能。

Details Motivation: 现有少样本提示方法多关注查询与示例的相似性,忽视示例质量,影响机器翻译效果。 Method: 提出TreePrompt方法,利用树形结构框架学习大语言模型对示例的偏好,并结合K-NN和AFSP探索相似性与质量之间的平衡。 Result: 在英-波斯(MIZAN)和英-德(WMT19)两个语言对上的实验表明,TreePrompt与AFSP或随机选择结合能提高翻译性能。 Conclusion: TreePrompt能有效提升少样本提示下的机器翻译质量,验证了兼顾示例质量与相似性的重要性。 Abstract: Large Language Models (LLMs) have consistently demonstrated strong performance in machine translation, especially when guided by high-quality prompts. Few-shot prompting is an effective technique to improve translation quality; however, most existing example selection methods focus solely on query-to-example similarity and do not account for the quality of the examples. In this work, we propose TreePrompt, a novel example selection approach that learns LLM preferences to identify high-quality, contextually relevant examples within a tree-structured framework. To further explore the balance between similarity and quality, we combine TreePrompt with K-Nearest Neighbors (K-NN) and Adaptive Few-Shot Prompting (AFSP). Evaluations on two language pairs - English-Persian (MIZAN) and English-German (WMT19) - show that integrating TreePrompt with AFSP or Random selection leads to improved translation performance.

[25] Cross-Lingual Multi-Granularity Framework for Interpretable Parkinson's Disease Diagnosis from Speech

Ilias Tougui,Mehdi Zakroum,Mounir Ghogho

Main category: cs.CL

TL;DR: 提出一种基于音素、音节和词级别的多语言帕金森病语音检测方法,采用双向LSTM与多头注意力机制,发现音素级别分析性能最优,且关键语音特征与临床常用指标一致。

Details Motivation: 现有基于语音的帕金森病检测通常分析整段语句,可能忽略具有诊断价值的特定语音单元,因此需要更细粒度的分析方法以提升检测效果。 Method: 构建自动化管道提取语音中的音素、音节和词等时间对齐的语音单元,使用意大利语、西班牙语和英语数据集,采用双向LSTM结合多头注意力机制,在不同粒度级别上比较帕金森病检测性能。 Result: 音素级别分析表现最佳,AUROC为93.78%±2.34%,准确率为92.17%±2.43%;注意力分析显示最具信息量的特征包括持续元音(/a/, /e/, /o/, /i/)、交替音节(/ta/, /pa/, /la/, /ka/)和/pataka/序列,与临床指标一致。 Conclusion: 细粒度语音分析(尤其是音素级别)能显著提升多语言帕金森病语音检测的性能,并验证了模型关注的特征与临床实践的一致性,具有实际应用潜力。 Abstract: Parkinson's Disease (PD) affects over 10 million people worldwide, with speech impairments in up to 89% of patients. Current speech-based detection systems analyze entire utterances, potentially overlooking the diagnostic value of specific phonetic elements. We developed a granularity-aware approach for multilingual PD detection using an automated pipeline that extracts time-aligned phonemes, syllables, and words from recordings. Using Italian, Spanish, and English datasets, we implemented a bidirectional LSTM with multi-head attention to compare diagnostic performance across the different granularity levels. Phoneme-level analysis achieved superior performance with AUROC of 93.78% +- 2.34% and accuracy of 92.17% +- 2.43%. This demonstrates enhanced diagnostic capability for cross-linguistic PD detection. Importantly, attention analysis revealed that the most informative speech features align with those used in established clinical protocols: sustained vowels (/a/, /e/, /o/, /i/) at phoneme level, diadochokinetic syllables (/ta/, /pa/, /la/, /ka/) at syllable level, and /pataka/ sequences at word level. Source code will be available at https://github.com/jetliqs/clearpd.

[26] Prompt Balance Matters: Understanding How Imbalanced Few-Shot Learning Affects Multilingual Sense Disambiguation in LLMs

Deshan Sumanathilaka,Nicholas Micallef,Julian Hough

Main category: cs.CL

TL;DR: 本研究探讨了在多语言环境下,不平衡的少样本示例如何影响基于大语言模型的词义消歧(WSD)任务,发现非英语语言中存在因样本分布不均导致的错误预测问题,而英语则不受此影响。

Details Motivation: 由于少样本提示在自然语言处理中的广泛应用,研究其在多语言词义消歧任务中的潜在偏差具有重要意义,特别是样本分布不平衡可能引发的模型误判问题。 Method: 采用GLOSSGPT提示方法,在五种语言(英语、德语、西班牙语、法语和意大利语)上评估GPT-4o和LLaMA-3.1-70B模型的表现,分析不同样本分布对WSD性能的影响。 Result: 实验结果表明,不平衡的少样本示例会导致多语言WSD中的错误预测,但该现象在英语中未出现;模型对样本分布高度敏感。 Conclusion: 在多语言少样本WSD任务中,需采用平衡且具代表性的提示策略,以避免因样本偏差导致的性能下降。 Abstract: Recent advances in Large Language Models (LLMs) have significantly reshaped the landscape of Natural Language Processing (NLP). Among the various prompting techniques, few-shot prompting has gained considerable attention for its practicality and effectiveness. This study investigates how few-shot prompting strategies impact the Word Sense Disambiguation (WSD) task, particularly focusing on the biases introduced by imbalanced sample distributions. We use the GLOSSGPT prompting method, an advanced approach for English WSD, to test its effectiveness across five languages: English, German, Spanish, French, and Italian. Our results show that imbalanced few-shot examples can cause incorrect sense predictions in multilingual languages, but this issue does not appear in English. To assess model behavior, we evaluate both the GPT-4o and LLaMA-3.1-70B models and the results highlight the sensitivity of multilingual WSD to sample distribution in few-shot settings, emphasizing the need for balanced and representative prompting strategies.

[27] Rezwan: Leveraging Large Language Models for Comprehensive Hadith Text Processing: A 1.2M Corpus Development

Majid Asgari-Bidhendi,Muhammad Amin Ghaseminia,Alireza Shahbazi,Sayyed Ali Hossayni,Najmeh Torabian,Behrouz Minaei-Bidgoli

Main category: cs.CL

TL;DR: 本文介绍了Rezwan,一个通过全自动管道构建的大型AI辅助圣训语料库,包含超过120万条叙述,具有多语言翻译、智能标音、摘要生成和主题标注等丰富注释,并展示了AI在宗教文本处理中的高效性与高质量。

Details Motivation: 为了提升伊斯兰研究和数字人文领域的研究效率,需要将大量原始圣训文本转化为结构化、多语言且语义丰富的研究就绪资源,同时降低人工整理的时间与成本。 Method: 基于Maktabat Ahl al-Bayt等数字资源,采用大语言模型(LLMs)实现文本分割、传述链与正文分离、验证及多层次增强,包括机器翻译、智能标音、抽象摘要、主题标记和跨文本语义分析。 Result: 在1,213条随机抽样的叙述中,专家评估显示链文分离和摘要的准确率接近人类水平(均为9.33/10),整体质量显著优于Noor语料库(8.46/10 vs 3.66/10),且成本远低于人工方式,原需22.9万小时的工作在数月内完成。 Conclusion: 该研究证明了AI可在保持高准确性的同时大规模处理宗教文本,为伊斯兰文化遗产提供了可扩展、多语言、语义增强的新范式,显著提升了研究效率与可访问性。 Abstract: This paper presents the development of Rezwan, a large-scale AI-assisted Hadith corpus comprising over 1.2M narrations, extracted and structured through a fully automated pipeline. Building on digital repositories such as Maktabat Ahl al-Bayt, the pipeline employs Large Language Models (LLMs) for segmentation, chain--text separation, validation, and multi-layer enrichment. Each narration is enhanced with machine translation into twelve languages, intelligent diacritization, abstractive summarization, thematic tagging, and cross-text semantic analysis. This multi-step process transforms raw text into a richly annotated research-ready infrastructure for digital humanities and Islamic studies. A rigorous evaluation was conducted on 1,213 randomly sampled narrations, assessed by six domain experts. Results show near-human accuracy in structured tasks such as chain--text separation (9.33/10) and summarization (9.33/10), while highlighting ongoing challenges in diacritization and semantic similarity detection. Comparative analysis against the manually curated Noor Corpus demonstrates the superiority of Najm in both scale and quality, with a mean overall score of 8.46/10 versus 3.66/10. Furthermore, cost analysis confirms the economic feasibility of the AI approach: tasks requiring over 229,000 hours of expert labor were completed within months at a fraction of the cost. The work introduces a new paradigm in religious text processing by showing how AI can augment human expertise, enabling large-scale, multilingual, and semantically enriched access to Islamic heritage.

[28] Mechanistic Interpretability of Socio-Political Frames in Language Models

Hadi Asghari,Sami Nenno

Main category: cs.CL

TL;DR: 本文研究了大语言模型在生成和识别社会政治语境中深层认知框架(如“严父”和“关爱父母”框架)方面的能力,发现这些框架可在模型的隐层表示中被定位到特定维度,且与人类概念表达密切相关。

Details Motivation: 理解大语言模型如何捕捉和表达人类深层认知结构,尤其是在社会政治语境中的意识形态框架。 Method: 利用机械可解释性方法,在零样本设置下分析大语言模型生成和识别‘严父’与‘关爱父母’认知框架的能力,并探测其在模型隐藏层中的表征位置。 Result: 发现存在与特定认知框架强相关的单一维度,表明这些抽象概念在模型内部有可识别的线性表征。 Conclusion: 大语言模型不仅能流畅生成和识别深层认知框架,而且这些概念在模型内部具有结构性表征,有助于理解模型如何编码复杂的人类价值观和意识形态。 Abstract: This paper explores the ability of large language models to generate and recognize deep cognitive frames, particularly in socio-political contexts. We demonstrate that LLMs are highly fluent in generating texts that evoke specific frames and can recognize these frames in zero-shot settings. Inspired by mechanistic interpretability research, we investigate the location of the `strict father' and `nurturing parent' frames within the model's hidden representation, identifying singular dimensions that correlate strongly with their presence. Our findings contribute to understanding how LLMs capture and express meaningful human concepts.

[29] Beyond Token Length: Step Pruner for Efficient and Accurate Reasoning in Large Language Models

Canhui Wu,Qiong Cao,Chang Li,Zhenfang Wang,Chao Xue,Yuwei Fan,Wei Xi,Xiaodong He

Main category: cs.CL

TL;DR: 本文提出了一种名为Step Pruner(SP)的强化学习框架,旨在解决大推理模型在生成简洁响应时出现的“过度思考”问题,通过步骤感知奖励和动态停止机制,在保持高准确率的同时显著减少输出长度。

Details Motivation: 大推理模型虽然性能强大,但常因冗长的推理过程导致效率低下,现有方法难以区分推理步数与生成token数量,且易引发模型通过跳过关键步骤来作弊,因此需要一种更精细的控制机制。 Method: 提出Step Pruner(SP),采用步骤感知的奖励函数,在保证正确性的前提下惩罚冗余推理步骤,并对错误回答不给予奖励;引入动态停止机制,当某推理步骤超过长度上限时停止参数更新,防止模型合并步骤以规避惩罚。 Result: 在四个推理基准上的实验表明,SP在显著减少响应长度的同时达到最先进的准确性;例如在AIME24上,token使用量减少了69.7%。 Conclusion: Step Pruner有效引导大推理模型实现更紧凑、高效的推理过程,解决了传统RL方法中因仅惩罚token而导致的优化偏差和作弊行为,为构建高效推理系统提供了新思路。 Abstract: Large Reasoning Models (LRMs) demonstrate strong performance on complex tasks but often suffer from excessive verbosity, known as "overthinking." Existing solutions via reinforcement learning (RL) typically penalize generated tokens to promote conciseness. However, these methods encounter two challenges: responses with fewer tokens do not always correspond to fewer reasoning steps, and models may develop hacking behavior in later stages of training by discarding reasoning steps to minimize token usage. In this work, we introduce \textbf{Step Pruner (SP)}, an RL framework that steers LRMs toward more efficient reasoning by favoring compact reasoning steps. Our step-aware reward function prioritizes correctness while imposing penalties for redundant steps, and withholds rewards for incorrect responses to prevent the reinforcement of erroneous reasoning. Moreover, we propose a dynamic stopping mechanism: when the length of any output step exceeds the upper limit, we halt updates to prevent hacking behavior caused by merging steps. Extensive experiments across four reasoning benchmarks demonstrate that SP achieves state-of-the-art accuracy while significantly reducing response length. For instance, on AIME24, SP reduces token usage by \textbf{69.7\%}.

[30] Annotate Rhetorical Relations with INCEpTION: A Comparison with Automatic Approaches

Mehedi Hasan Emon

Main category: cs.CL

TL;DR: 该研究使用INCEpTION工具对体育新闻(特别是板球报道)中的修辞关系进行标注,并比较了基于BERT、DistilBERT和逻辑回归模型的手动与自动标注方法,结果显示DistilBERT在分类准确率上表现最佳。

Details Motivation: 旨在评估大型语言模型在修辞关系自动标注中的有效性,推动话语分析与基于Transformer的自然语言处理技术的融合。 Method: 采用INCEpTION工具进行人工标注,训练并比较BERT、DistilBERT和逻辑回归模型在修辞关系分类任务上的性能。 Result: DistilBERT在准确率上表现最优,优于BERT和逻辑回归模型,显示出其在高效话语关系预测中的潜力。 Conclusion: DistilBERT在修辞关系自动识别任务中表现优异,支持其在话语解析中的应用,为NLP中 discourse parsing 提供了有效工具。 Abstract: This research explores the annotation of rhetorical relations in discourse using the INCEpTION tool and compares manual annotation with automatic approaches based on large language models. The study focuses on sports reports (specifically cricket news) and evaluates the performance of BERT, DistilBERT, and Logistic Regression models in classifying rhetorical relations such as elaboration, contrast, background, and cause-effect. The results show that DistilBERT achieved the highest accuracy, highlighting its potential for efficient discourse relation prediction. This work contributes to the growing intersection of discourse parsing and transformer-based NLP. (This paper was conducted as part of an academic requirement under the supervision of Prof. Dr. Ralf Klabunde, Linguistic Data Science Lab, Ruhr University Bochum.) Keywords: Rhetorical Structure Theory, INCEpTION, BERT, DistilBERT, Discourse Parsing, NLP.

[31] Read Between the Lines: A Benchmark for Uncovering Political Bias in Bangla News Articles

Nusrat Jahan Lia,Shubhashis Roy Dipta,Abdullah Khan Zehady,Naymul Islam,Madhusodan Chakraborty,Abdullah Al Wasif

Main category: cs.CL

TL;DR: 本文介绍了首个用于孟加拉语新闻政治立场检测的基准数据集,包含200篇标注文章,分为政府倾向、批评政府和中立三类,并对28种大语言模型进行了评估,发现模型在识别批评内容上表现较好,但在中立立场识别上存在显著困难,且倾向于误判为政府倾向。

Details Motivation: 由于缺乏标注数据集和相关研究,孟加拉语媒体中的政治立场检测面临挑战,尤其需要理解语言特征、文化背景和隐含偏见等因素。 Method: 构建了一个包含200篇孟加拉语新闻文章的数据集,标注其政治立场,并对28个专有和开源大语言模型进行系统评估,分析其在不同类别上的表现及偏差。 Result: 模型在检测政府批评立场时F1最高达0.83,但对中立立场的F1低至0.00,普遍存在过度预测政府倾向的问题。 Conclusion: 该数据集为低资源语言下的媒体偏见研究提供了基础,揭示了当前大语言模型在处理模糊和中立文本时的局限性,有助于推动相关改进。 Abstract: Detecting media bias is crucial, specifically in the South Asian region. Despite this, annotated datasets and computational studies for Bangla political bias research remain scarce. Crucially because, political stance detection in Bangla news requires understanding of linguistic cues, cultural context, subtle biases, rhetorical strategies, code-switching, implicit sentiment, and socio-political background. To address this, we introduce the first benchmark dataset of 200 politically significant and highly debated Bangla news articles, labeled for government-leaning, government-critique, and neutral stances, alongside diagnostic analyses for evaluating large language models (LLMs). Our comprehensive evaluation of 28 proprietary and open-source LLMs shows strong performance in detecting government-critique content (F1 up to 0.83) but substantial difficulty with neutral articles (F1 as low as 0.00). Models also tend to over-predict government-leaning stances, often misinterpreting ambiguous narratives. This dataset and its associated diagnostics provide a foundation for advancing stance detection in Bangla media research and offer insights for improving LLM performance in low-resource languages.

[32] PsycholexTherapy: Simulating Reasoning in Psychotherapy with Small Language Models in Persian

Mohammad Amin Abbasi,Hassan Naderi

Main category: cs.CL

TL;DR: 本研究提出了PsychoLexTherapy,一个用于在波斯语中使用小型语言模型(SLMs)模拟心理治疗推理的框架,强调文化适应性、隐私保护和多轮对话中的结构化记忆。

Details Motivation: 为波斯语等代表性不足的语言开发具备文化敏感性和治疗连贯性的对话系统面临数据稀缺和隐私问题,现有大模型难以在本地部署,因此需要轻量、高效且符合文化背景的解决方案。 Method: 采用三阶段方法:首先通过PsychoLexEval评估SLM的心理学知识;然后设计面向推理的PsychoLexTherapy框架并支持本地部署;最后构建两个评估数据集(PsychoLexQuery和PsychoLexDialogue)以与多种基线进行比较,实验涵盖简单提示、多智能体辩论和结构化推理路径。 Result: 在PsychoLexQuery上,该框架在自动评估和人类偏好测试中均优于所有基线;在多轮对话测试中,长期记忆模块显著提升共情、连贯性、文化契合度和个人化表现,而简单的上下文拼接导致信息丢失和不连贯。 Conclusion: PsychoLexTherapy为波斯语心理治疗对话系统提供了实用、隐私保护且文化对齐的基础,贡献了新数据集、可复现的评估流程以及关于结构化记忆在治疗推理中作用的实证见解。 Abstract: This study presents PsychoLexTherapy, a framework for simulating psychotherapeutic reasoning in Persian using small language models (SLMs). The framework tackles the challenge of developing culturally grounded, therapeutically coherent dialogue systems with structured memory for multi-turn interactions in underrepresented languages. To ensure privacy and feasibility, PsychoLexTherapy is optimized for on-device deployment, enabling use without external servers. Development followed a three-stage process: (i) assessing SLMs psychological knowledge with PsychoLexEval; (ii) designing and implementing the reasoning-oriented PsychoLexTherapy framework; and (iii) constructing two evaluation datasets-PsychoLexQuery (real Persian user questions) and PsychoLexDialogue (hybrid simulated sessions)-to benchmark against multiple baselines. Experiments compared simple prompting, multi-agent debate, and structured therapeutic reasoning paths. Results showed that deliberate model selection balanced accuracy, efficiency, and privacy. On PsychoLexQuery, PsychoLexTherapy outperformed all baselines in automatic LLM-as-a-judge evaluation and was ranked highest by human evaluators in a single-turn preference study. In multi-turn tests with PsychoLexDialogue, the long-term memory module proved essential: while naive history concatenation caused incoherence and information loss, the full framework achieved the highest ratings in empathy, coherence, cultural fit, and personalization. Overall, PsychoLexTherapy establishes a practical, privacy-preserving, and culturally aligned foundation for Persian psychotherapy simulation, contributing novel datasets, a reproducible evaluation pipeline, and empirical insights into structured memory for therapeutic reasoning.

[33] Mapping Patient-Perceived Physician Traits from Nationwide Online Reviews with LLMs

Junjie Luo,Rui Han,Arshana Welivita,Zeleikun Di,Jingfu Wu,Xuzhe Zhi,Ritu Agarwal,Gordon Gao

Main category: cs.CL

TL;DR: 该研究利用大语言模型(LLM)从410万条患者评论中自动提取医生的“大五人格”特质和患者主观评价,验证了LLM评估与人类判断的高度一致性,并揭示了医生特质与患者满意度之间的系统性关联。

Details Motivation: 理解患者如何感知医生对于提升医患信任、沟通和满意度至关重要,但传统方法难以大规模分析患者叙述。因此,需要一种自动化、可扩展的方法来量化这些感知。 Method: 研究采用基于大语言模型(LLM)的分析流程,对来自美国22.7万名医生的410万条在线评论进行处理,推断其大五人格特质及五项患者导向的主观判断;通过多模型比较和人类专家评分对比验证方法有效性,并进行全国范围内的模式分析和聚类以识别医生类型。 Result: LLM评估与人类专家评分高度一致(相关系数0.72–0.89),并与患者满意度显著相关(r = 0.41–0.81,均p<0.001);发现男性医生在所有特质上评分更高,儿科和精神科医生更显共情特质,所有特质均正向预测满意度;聚类识别出四种医生类型:'全面优秀型'(33.8%)、'表现不佳型'(22.6%)等。 Conclusion: 从患者叙述中自动提取人格特质可提供可解释且经过验证的大规模指标,有助于医疗质量评估、偏见检测和 workforce 发展。 Abstract: Understanding how patients perceive their physicians is essential to improving trust, communication, and satisfaction. We present a large language model (LLM)-based pipeline that infers Big Five personality traits and five patient-oriented subjective judgments. The analysis encompasses 4.1 million patient reviews of 226,999 U.S. physicians from an initial pool of one million. We validate the method through multi-model comparison and human expert benchmarking, achieving strong agreement between human and LLM assessments (correlation coefficients 0.72-0.89) and external validity through correlations with patient satisfaction (r = 0.41-0.81, all p<0.001). National-scale analysis reveals systematic patterns: male physicians receive higher ratings across all traits, with largest disparities in clinical competence perceptions; empathy-related traits predominate in pediatrics and psychiatry; and all traits positively predict overall satisfaction. Cluster analysis identifies four distinct physician archetypes, from "Well-Rounded Excellent" (33.8%, uniformly high traits) to "Underperforming" (22.6%, consistently low). These findings demonstrate that automated trait extraction from patient narratives can provide interpretable, validated metrics for understanding physician-patient relationships at scale, with implications for quality measurement, bias detection, and workforce development in healthcare.

[34] Simulating and Understanding Deceptive Behaviors in Long-Horizon Interactions

Yang Xu,Xuanming Zhang,Min-Hsuan Yeh,Jwala Dhamala,Ousmane Dia,Rahul Gupta,Yixuan Li

Main category: cs.CL

TL;DR: 本文提出了一种新的多智能体模拟框架,用于在长时程交互中探测和评估大语言模型(LLM)的欺骗行为。研究发现,欺骗行为具有模型依赖性,随事件压力增加而上升,并持续削弱监督者的信任。

Details Motivation: 现有的LLM欺骗行为研究多局限于单轮提示,无法捕捉现实中长期、动态的交互情境。因此,需要一个能够模拟复杂、多阶段任务中欺骗行为的评估框架。 Method: 构建了一个包含执行者、监督者和独立欺骗审计员的多智能体系统,在多阶段、相互依赖的任务序列中模拟动态上下文压力,并对11个前沿LLM进行广泛实验。 Result: 发现欺骗行为具有模型差异性,随压力增加而增多,并导致监督者信任持续下降;定性分析揭示了掩盖、含糊其辞和伪造三种主要欺骗策略。 Conclusion: 欺骗是长时程交互中的一种新兴风险,该框架为在真实、信任敏感场景中评估未来LLM提供了基础。 Abstract: Deception is a pervasive feature of human communication and an emerging concern in large language models (LLMs). While recent studies document instances of LLM deception under pressure, most evaluations remain confined to single-turn prompts and fail to capture the long-horizon interactions in which deceptive strategies typically unfold. We introduce the first simulation framework for probing and evaluating deception in LLMs under extended sequences of interdependent tasks and dynamic contextual pressures. Our framework instantiates a multi-agent system: a performer agent tasked with completing tasks and a supervisor agent that evaluates progress, provides feedback, and maintains evolving states of trust. An independent deception auditor then reviews full trajectories to identify when and how deception occurs. We conduct extensive experiments across 11 frontier models, spanning both closed- and open-source systems, and find that deception is model-dependent, increases with event pressure, and consistently erodes supervisor trust. Qualitative analyses further reveal distinct strategies of concealment, equivocation, and falsification. Our findings establish deception as an emergent risk in long-horizon interactions and provide a foundation for evaluating future LLMs in real-world, trust-sensitive contexts.

[35] Named Entity Recognition in COVID-19 tweets with Entity Knowledge Augmentation

Xuankang Zhang,Jiangming Liu

Main category: cs.CL

TL;DR: 提出了一种新的实体知识增强方法,用于提升社交媒体和生物医学文本中的命名实体识别性能,特别是在COVID-19相关非正式文本中。

Details Motivation: 由于社交媒体中关于新冠疫情的文本具有非正式性和标注数据稀缺,且需要大量领域知识,现有命名实体识别研究受限。 Method: 提出一种实体知识增强方法,并在推特和PubMed数据集上进行实验,结合监督学习与少样本场景。 Result: 实验表明该方法在正式和非正式文本中均能有效提升命名实体识别性能。 Conclusion: 所提出的实体知识增强方法可有效应对新冠疫情相关文本的NER挑战,并适用于更广泛的生物医学命名实体识别任务。 Abstract: The COVID-19 pandemic causes severe social and economic disruption around the world, raising various subjects that are discussed over social media. Identifying pandemic-related named entities as expressed on social media is fundamental and important to understand the discussions about the pandemic. However, there is limited work on named entity recognition on this topic due to the following challenges: 1) COVID-19 texts in social media are informal and their annotations are rare and insufficient to train a robust recognition model, and 2) named entity recognition in COVID-19 requires extensive domain-specific knowledge. To address these issues, we propose a novel entity knowledge augmentation approach for COVID-19, which can also be applied in general biomedical named entity recognition in both informal text format and formal text format. Experiments carried out on the COVID-19 tweets dataset and PubMed dataset show that our proposed entity knowledge augmentation improves NER performance in both fully-supervised and few-shot settings. Our source code is publicly available: https://github.com/kkkenshi/LLM-EKA/tree/master

[36] AgriGPT-VL: Agricultural Vision-Language Understanding Suite

Bo Yang,Yunkui Chen,Lanfei Feng,Yu Zhang,Xiao Xu,Jianyu Zhang,Nueraili Aierken,Runhe Huang,Hongjian Lin,Yibin Ying,Shijian Li

Main category: cs.CL

TL;DR: 本文提出了AgriGPT-VL套件,包括大规模农业视觉-语言数据集Agri-3M-VL、专用视觉-语言模型AgriGPT-VL,以及评估基准AgriBench-VL-4K。该框架在农业多模态任务上表现优异,同时保持文本能力,且所有资源将开源。

Details Motivation: 由于缺乏领域定制模型、高质量视觉-语言数据集和系统评估方法,农业领域的多模态大模型应用受限。因此,需要构建专门针对农业的统一多模态框架。 Method: 提出AgriGPT-VL框架:1)构建大规模农业视觉-语言数据集Agri-3M-VL,使用多智能体数据生成器;2)设计渐进式训练策略训练农业专用模型AgriGPT-VL,包括文本定位、浅层/深层多模态对齐和GRPO精调;3)建立紧凑但具挑战性的评估套件AgriBench-VL-4K,结合开放性问题与LLM-as-a-judge评估机制。 Result: AgriGPT-VL在AgriBench-VL-4K上优于主流通用多模态模型,且在文本任务AgriBench-13K上无性能下降;消融实验验证了各训练阶段的有效性。 Conclusion: AgriGPT-VL通过专用数据、模型设计与评估体系,在不牺牲语言能力的前提下显著提升农业多模态理解性能,推动农业AI的可复现研究与低资源部署。 Abstract: Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the scarcity of domain-tailored models, curated vision-language corpora, and rigorous evaluation. To address these challenges, we present the AgriGPT-VL Suite, a unified multimodal framework for agriculture. Our contributions are threefold. First, we introduce Agri-3M-VL, the largest vision-language corpus for agriculture to our knowledge, curated by a scalable multi-agent data generator; it comprises 1M image-caption pairs, 2M image-grounded VQA pairs, 50K expert-level VQA instances, and 15K GRPO reinforcement learning samples. Second, we develop AgriGPT-VL, an agriculture-specialized vision-language model trained via a progressive curriculum of textual grounding, multimodal shallow/deep alignment, and GRPO refinement. This method achieves strong multimodal reasoning while preserving text-only capability. Third, we establish AgriBench-VL-4K, a compact yet challenging evaluation suite with open-ended and image-grounded questions, paired with multi-metric evaluation and an LLM-as-a-judge framework. Experiments show that AgriGPT-VL outperforms leading general-purpose VLMs on AgriBench-VL-4K, achieving higher pairwise win rates in the LLM-as-a-judge evaluation. Meanwhile, it remains competitive on the text-only AgriBench-13K with no noticeable degradation of language ability. Ablation studies further confirm consistent gains from our alignment and GRPO refinement stages. We will open source all of the resources to support reproducible research and deployment in low-resource agricultural settings.

[37] LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization

Jiarui Liu,Jivitesh Jain,Mona Diab,Nishant Subramani

Main category: cs.CL

TL;DR: 本文研究了如何通过分析大语言模型的内部激活来预测输出的正确性,并探索模型内部是否包含关于外部上下文有效性的信号,提出了一种基于模型内部状态的简单分类器,可在生成第一个输出令牌时以约75%的准确率预测输出正确性,显著优于基于提示的基线方法。

Details Motivation: 大语言模型常以高置信度生成错误信息,如何判断何时需要检索上下文以及所用上下文的有效性仍具挑战。 Method: 利用可解释性方法,基于模型中间层激活训练分类器,预测输出正确性,并引入指标区分正确、错误和无关上下文。 Result: 在六个不同模型上的实验表明,基于首个输出令牌的中间层激活的分类器可达到约75%的预测准确率,且所提模型内部指标在区分上下文正确性方面显著优于提示基线方法。 Conclusion: 模型内部激活蕴含输出正确性和上下文有效性的信号,可用于早期审计和提升大语言模型的可信度。 Abstract: Although large language models (LLMs) have tremendous utility, trustworthiness is still a chief concern: models often generate incorrect information with high confidence. While contextual information can help guide generation, identifying when a query would benefit from retrieved context and assessing the effectiveness of that context remains challenging. In this work, we operationalize interpretability methods to ascertain whether we can predict the correctness of model outputs from the model's activations alone. We also explore whether model internals contain signals about the efficacy of external context. We consider correct, incorrect, and irrelevant context and introduce metrics to distinguish amongst them. Experiments on six different models reveal that a simple classifier trained on intermediate layer activations of the first output token can predict output correctness with about 75% accuracy, enabling early auditing. Our model-internals-based metric significantly outperforms prompting baselines at distinguishing between correct and incorrect context, guarding against inaccuracies introduced by polluted context. These findings offer a lens to better understand the underlying decision-making processes of LLMs. Our code is publicly available at https://github.com/jiarui-liu/LLM-Microscope

[38] Thai Semantic End-of-Turn Detection for Real-Time Voice Agents

Thanapol Popit,Natthapath Rungseesiripak,Monthol Charattrakool,Saksorn Ruangtanusak

Main category: cs.CL

TL;DR: 本研究首次系统探讨了泰语文本的实时对话结束点(EOT)检测,比较了零样本、少样本大模型提示与轻量级Transformer模型的监督微调方法,提出基于词元边界的二分类方案,并利用YODAS语料库和泰语语言特征实现低延迟高精度的EOT检测,为设备端智能代理提供了可行方案。

Details Motivation: 传统基于音频静音的端点检测在用户犹豫或特定语言现象下表现不佳且延迟高,亟需一种可靠、低延迟的泰语对话结束检测方法以支持流畅的语音交互。 Method: 将EOT检测建模为基于词元边界的二分类任务,使用YODAS语料库的转录字幕,结合泰语句尾助词等语言特征,比较了紧凑型大语言模型的零样本与少样本提示性能,并对轻量级Transformer模型进行监督微调。 Result: 发现准确率与延迟之间存在明显权衡,微调的小型模型在保持极低延迟的同时达到较高准确率,适合用于设备端实时代理。 Conclusion: 小型微调模型可实现接近即时的EOT判断,优于提示大模型,本研究为泰语建立了EOT检测基线,并提供了可公开部署的实现方案。 Abstract: Fluid voice-to-voice interaction requires reliable and low-latency detection of when a user has finished speaking. Traditional audio-silence end-pointers add hundreds of milliseconds of delay and fail under hesitations or language-specific phenomena. We present, to our knowledge, the first systematic study of Thai text-only end-of-turn (EOT) detection for real-time agents. We compare zero-shot and few-shot prompting of compact LLMs to supervised fine-tuning of lightweight transformers. Using transcribed subtitles from the YODAS corpus and Thai-specific linguistic cues (e.g., sentence-final particles), we formulate EOT as a binary decision over token boundaries. We report a clear accuracy-latency tradeoff and provide a public-ready implementation plan. This work establishes a Thai baseline and demonstrates that small, fine-tuned models can deliver near-instant EOT decisions suitable for on-device agents.

[39] Does Using Counterfactual Help LLMs Explain Textual Importance in Classification?

Nelvin Tan,James Asikin Cheung,Yu-Ching Shih,Dong Yang,Amol Salunkhe

Main category: cs.CL

TL;DR: 本文研究了在大语言模型(LLM)作为黑箱且调用成本高的实际限制下,如何通过引入反事实推理来帮助识别对分类决策最重要的词汇。作者提出了“决策变化率”框架来量化关键词的重要性,实验结果表明使用反事实有助于提升LLM的可解释性。

Details Motivation: 由于大语言模型在文本分类任务中表现出色,但其决策过程缺乏透明度,且在实际应用中常为黑盒、调用成本高,因此需要有效的方法来解释其决策。 Method: 提出了一种名为“决策变化率”的框架,通过引入反事实推理来评估和量化影响LLM分类决策的关键词汇的重要性。 Result: 实验结果表明,结合反事实推理能够有效帮助大语言模型识别对其分类决策贡献最大的词语。 Conclusion: 使用反事实推理可以提升大语言模型在文本分类任务中的可解释性,同时适应黑盒和高调用成本的实际约束。 Abstract: Large language models (LLMs) are becoming useful in many domains due to their impressive abilities that arise from large training datasets and large model sizes. More recently, they have been shown to be very effective in textual classification tasks, motivating the need to explain the LLMs' decisions. Motivated by practical constrains where LLMs are black-boxed and LLM calls are expensive, we study how incorporating counterfactuals into LLM reasoning can affect the LLM's ability to identify the top words that have contributed to its classification decision. To this end, we introduce a framework called the decision changing rate that helps us quantify the importance of the top words in classification. Our experimental results show that using counterfactuals can be helpful.

[40] Small Language Models for Emergency Departments Decision Support: A Benchmark Study

Zirui Wang,Jiajun Wu,Braden Teitge,Jessalyn Holodinsky,Steve Drew

Main category: cs.CL

TL;DR: 本文提出一个针对急诊科(ED)决策支持的小型语言模型(SLM)综合基准,评估结合通用域和医学语料训练的SLM表现。结果显示,未进行医学微调的通用SLM在多项医疗基准测试中意外优于医学微调模型,表明急诊场景下可能无需专门的医学微调。

Details Motivation: 由于急诊科环境节奏快、风险高,且存在硬件限制、运营成本和隐私问题,小型语言模型(SLM)因其高效性和推理能力具有实际部署优势,因此需要评估适合ED场景的SLM性能。 Method: 构建涵盖MedMCQA、MedQA-4Options和PubMedQA等数据集的综合基准,使用在通用域和医学语料混合训练的SLM进行评估,并模拟真实ED医生日常任务。 Result: 实验结果表明,未经医学微调的通用域SLM在多个急诊相关基准上表现优于经过医学领域微调的模型。 Conclusion: 在急诊科决策支持任务中,可能无需对SLM进行专门的医学微调,通用域训练的SLM已足够有效,这为低成本、高效率部署提供了依据。 Abstract: Large language models (LLMs) have become increasingly popular in medical domains to assist physicians with a variety of clinical and operational tasks. Given the fast-paced and high-stakes environment of emergency departments (EDs), small language models (SLMs), characterized by a reduction in parameter count compared to LLMs, offer significant potential due to their inherent reasoning capability and efficient performance. This enables SLMs to support physicians by providing timely and accurate information synthesis, thereby improving clinical decision-making and workflow efficiency. In this paper, we present a comprehensive benchmark designed to identify SLMs suited for ED decision support, taking into account both specialized medical expertise and broad general problem-solving capabilities. In our evaluations, we focus on SLMs that have been trained on a mixture of general-domain and medical corpora. A key motivation for emphasizing SLMs is the practical hardware limitations, operational cost constraints, and privacy concerns in the typical real-world deployments. Our benchmark datasets include MedMCQA, MedQA-4Options, and PubMedQA, with the medical abstracts dataset emulating tasks aligned with real ED physicians' daily tasks. Experimental results reveal that general-domain SLMs surprisingly outperform their medically fine-tuned counterparts across these diverse benchmarks for ED. This indicates that for ED, specialized medical fine-tuning of the model may not be required.

[41] Exploring Chain-of-Thought Reasoning for Steerable Pluralistic Alignment

Yunfan Zhang,Kathleen McKeown,Smaranda Muresan

Main category: cs.CL

TL;DR: 研究了链式思维(CoT)推理技术在构建可引导多元模型中的应用,发现强化学习与可验证奖励(RLVR)方法表现最佳且训练样本效率高。

Details Motivation: 大型语言模型通常反映单一价值观,限制了其在需要理解复杂人类观点任务中的应用,因此需要支持可引导的多元主义。 Method: 探索了CoT提示、人工编写CoT微调、合成解释微调以及带可验证奖励的强化学习(RLVR)等方法。 Result: 在Value Kaleidoscope和OpinionQA数据集上评估显示,RLVR方法始终优于其他方法,并表现出较强的训练样本效率。 Conclusion: RLVR是实现可引导多元主义的有效方法,具有较高的性能和样本效率。 Abstract: Large Language Models (LLMs) are typically trained to reflect a relatively uniform set of values, which limits their applicability to tasks that require understanding of nuanced human perspectives. Recent research has underscored the importance of enabling LLMs to support steerable pluralism -- the capacity to adopt a specific perspective and align generated outputs with it. In this work, we investigate whether Chain-of-Thought (CoT) reasoning techniques can be applied to building steerable pluralistic models. We explore several methods, including CoT prompting, fine-tuning on human-authored CoT, fine-tuning on synthetic explanations, and Reinforcement Learning with Verifiable Rewards (RLVR). We evaluate these approaches using the Value Kaleidoscope and OpinionQA datasets. Among the methods studied, RLVR consistently outperforms others and demonstrates strong training sample efficiency. We further analyze the generated CoT traces with respect to faithfulness and safety.

[42] What Makes Diffusion Language Models Super Data Learners?

Zitian Gao,Haoming Luo,Lynx Chen,Jason Klein Liu,Ran Tao,Joey Zhou,Bryan Dai

Main category: cs.CL

TL;DR: 扩散语言模型在数据受限情况下表现出色,本文通过消融实验发现输入令牌的随机掩码是提高数据效率的主要因素,同时MLP dropout和权重衰减也能带来类似增益,表明随机正则化广泛提升了多周期训练中的数据效率。

Details Motivation: 尽管扩散语言模型在低数据条件下表现出高数据效率,但其背后机制尚不清楚,因此需要探究这种效率的来源。 Method: 进行广泛的消融实验,分离出影响数据效率的不同因素,并分析随机掩码、MLP dropout和权重衰减的作用。 Result: 发现输入令牌的随机掩码在提升数据效率中起主导作用;同时MLP dropout和权重衰减也能取得类似的性能增益。 Conclusion: 随机正则化(如随机掩码、dropout和权重衰减)是提升扩散语言模型在多周期训练中数据效率的关键机制。 Abstract: Recent studies have shown that diffusion language models achieve remarkable data efficiency under limited-data constraints, yet the underlying mechanisms remain unclear. In this work, we perform extensive ablation experiments to disentangle the sources of this efficiency. Our results show that random masking of input tokens plays the dominant role. We further show that similar gains can be obtained through in MLP dropout and weight decay, indicating that stochastic regularization broadly enhances data efficiency in multi-epoch training. Our code is available at https://github.com/zitian-gao/data-efficiency.

[43] PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity

Zixin Song,Bowen Zhang,Qian-Wen Zhang,Di Yin,Xing Sun,Chunping Li

Main category: cs.CL

TL;DR: 本文提出了PoLi-RL,一种新颖的Point-to-List强化学习框架,用于条件语义文本相似度(C-STS)任务。通过两阶段课程学习和并行切片排序奖励机制(PSRR),在官方C-STS基准上取得了48.18的Spearman相关系数,刷新了交叉编码器架构的SOTA结果。

Details Motivation: 现有C-STS方法多局限于判别模型,未能充分利用大语言模型(LLM)和强化学习(RL)的最新进展。而传统RL方法因粗粒度、复杂的奖励信号难以有效优化非可导的Spearman排序指标,导致性能提升有限。 Method: 提出PoLi-RL框架:第一阶段使用点级奖励训练模型建立基础评分能力;第二阶段引入结合点级、成对和列表级目标的混合奖励以精调模型。设计了并行切片排序奖励(PSRR)机制,通过对不同样本同位置生成结果分片并行计算排序奖励,实现细粒度的信用分配。 Result: 在官方C-STS基准上达到48.18的Spearman相关系数,显著优于现有方法,成为交叉编码器架构下的新SOTA。消融实验验证了两阶段课程学习与PSRR机制的有效性。 Conclusion: PoLi-RL是首个成功将强化学习应用于C-STS的工作,提供了一种强大且精确的范式,用于训练LLM处理复杂、基于排序的条件判断任务,为后续研究开辟了新方向。 Abstract: Conditional Semantic Textual Similarity (C-STS) measures the semantic proximity between text segments under a specific condition, thereby overcoming the ambiguity inherent in traditional STS. However, existing methods are largely confined to discriminative models, failing to fully integrate recent breakthroughs in the NLP community concerning Large Language Models (LLMs) and Reinforcement Learning (RL). RL is a particularly well-suited paradigm for this task, as it can directly optimize the non-differentiable Spearman ranking metric and guide the reasoning process required by C-STS. However, we find that naively applying listwise RL fails to produce meaningful improvements, as the model is overwhelmed by complex, coarse-grained reward signals. To address this challenge, we introduce PoLi-RL, a novel Point-to-List Reinforcement Learning framework. PoLi-RL employs a two-stage curriculum: it first trains the model with simple pointwise rewards to establish fundamental scoring capabilities, then transitions to a hybrid reward that combines pointwise, pairwise, and listwise objectives to refine the model's ability to discern subtle semantic distinctions. Crucially, we propose an innovative Parallel Slice Ranking Reward (PSRR) mechanism that computes ranking rewards in parallel slices, where each slice comprises same-indexed completions from different samples. This provides a precise, differentiated learning signal for each individual completion, enabling granular credit assignment and effective optimization. On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture. As the first work to successfully apply RL to C-STS, our study introduces a powerful and precise paradigm for training LLMs on complex, ranking-based conditional judgment tasks.

[44] Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning

Honglin Lin,Qizhi Pei,Xin Gao,Zhuoshi Pan,Yu Li,Juntao Li,Conghui He,Lijun Wu

Main category: cs.CL

TL;DR: 本文提出了Caco,一种通过代码驱动增强思维链(CoT)的新型框架,能够自动生成高质量、可验证且多样化的推理数据,显著提升大模型在数学推理任务中的性能和泛化能力。

Details Motivation: 现有推理方法存在生成不可控、质量不足和多样性有限的问题,且基于代码的CoT方法通常局限于预定义的数学问题,缺乏可扩展性和通用性。 Method: Caco首先在统一的代码格式下微调一个基于代码的CoT生成器,然后通过代码执行验证和规则过滤自动合成大量高质量、结构多样的推理轨迹,并将其反向工程为自然语言指令和语言思维链。 Result: 在Caco-1.3M数据集上的实验表明,使用Caco训练的模型在数学推理基准上表现优异,优于现有强基线,且在未见任务上展现出更强的泛化能力。 Conclusion: Caco建立了一种无需人工干预即可构建自持续、可信推理系统的新范式,通过代码锚定验证和指令多样性提升了推理的可靠性与适应性。 Abstract: Reasoning capability is pivotal for Large Language Models (LLMs) to solve complex tasks, yet achieving reliable and scalable reasoning remains challenging. While Chain-of-Thought (CoT) prompting has become a mainstream approach, existing methods often suffer from uncontrolled generation, insufficient quality, and limited diversity in reasoning paths. Recent efforts leverage code to enhance CoT by grounding reasoning in executable steps, but such methods are typically constrained to predefined mathematical problems, hindering scalability and generalizability. In this work, we propose Caco (Code-Assisted Chain-of-ThOught), a novel framework that automates the synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning data through code-driven augmentation. Unlike prior work, Caco first fine-tunes a code-based CoT generator on existing math and programming solutions in a unified code format, then scales the data generation to a large amount of diverse reasoning traces. Crucially, we introduce automated validation via code execution and rule-based filtering to ensure logical correctness and structural diversity, followed by reverse-engineering filtered outputs into natural language instructions and language CoTs to enrich task adaptability. This closed-loop process enables fully automated, scalable synthesis of reasoning data with guaranteed executability. Experiments on our created Caco-1.3M dataset demonstrate that Caco-trained models achieve strong competitive performance on mathematical reasoning benchmarks, outperforming existing strong baselines. Further analysis reveals that Caco's code-anchored verification and instruction diversity contribute to superior generalization across unseen tasks. Our work establishes a paradigm for building self-sustaining, trustworthy reasoning systems without human intervention.

[45] Unveiling LLMs' Metaphorical Understanding: Exploring Conceptual Irrelevance, Context Leveraging and Syntactic Influence

Fengying Ye,Shanshan Wang,Lidia S. Chao,Derek F. Wong

Main category: cs.CL

TL;DR: 该研究从概念映射、隐喻-字面库和句法敏感性三个角度分析大语言模型(LLMs)的隐喻处理能力,发现其存在生成无关解释、依赖训练数据中的隐喻线索而非上下文、对句法不规则更敏感等问题。

Details Motivation: 尽管LLMs在知识整合和上下文推理方面表现出色,但其对隐喻的理解机制尚不清楚,亟需深入探究其在复杂语言现象中的表现。 Method: 通过嵌入空间投影评估概念映射,构建隐喻与字面表达的对照库以识别内在隐喻知识,并测试不同句法结构对LLM隐喻理解的影响。 Result: LLMs产生15%-25%的概念无关解释,依赖训练数据中的隐喻标志而非上下文线索,且更易受句法异常干扰而非真正理解隐喻结构。 Conclusion: 当前LLMs在隐喻分析方面存在明显局限,需要更强大的计算方法来提升其对隐喻等复杂语言现象的理解能力。 Abstract: Metaphor analysis is a complex linguistic phenomenon shaped by context and external factors. While Large Language Models (LLMs) demonstrate advanced capabilities in knowledge integration, contextual reasoning, and creative generation, their mechanisms for metaphor comprehension remain insufficiently explored. This study examines LLMs' metaphor-processing abilities from three perspectives: (1) Concept Mapping: using embedding space projections to evaluate how LLMs map concepts in target domains (e.g., misinterpreting "fall in love" as "drop down from love"); (2) Metaphor-Literal Repository: analyzing metaphorical words and their literal counterparts to identify inherent metaphorical knowledge; and (3) Syntactic Sensitivity: assessing how metaphorical syntactic structures influence LLMs' performance. Our findings reveal that LLMs generate 15\%-25\% conceptually irrelevant interpretations, depend on metaphorical indicators in training data rather than contextual cues, and are more sensitive to syntactic irregularities than to structural comprehension. These insights underline the limitations of LLMs in metaphor analysis and call for more robust computational approaches.

[46] Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy (v20251005)

Nuwan I. Senaratna

Main category: cs.CL

TL;DR: 本文介绍了斯里兰卡议会记录、法律判决、政府出版物、新闻和旅游统计数据的开放、机器可读文档数据集集合,旨在支持计算语言学、法律分析、社会政治研究和多语言自然语言处理的研究。

Details Motivation: 为了促进斯里兰卡多语言环境下计算语言学和社会科学研究的发展,提供高质量、易获取的公开数据资源。 Method: 收集并整理来自斯里兰卡的多种公开文档数据,涵盖三种语言(僧伽罗语、泰米尔语和英语),通过自动化管道每日更新,并在GitHub和Hugging Face上镜像发布。 Result: 截至v20251005,该集合包含13个数据集,共215,670份文档(60.3 GB),支持多种研究应用。 Conclusion: 这些开放数据集为多语言NLP、法律分析和社会政治研究提供了宝贵资源,同时强调了数据许可与伦理问题的重要性。 Abstract: We present a collection of open, machine-readable document datasets covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics from Sri Lanka. As of v20251005, the collection currently comprises 215,670 documents (60.3 GB) across 13 datasets in Sinhala, Tamil, and English. The datasets are updated daily and mirrored on GitHub and Hugging Face. These resources aim to support research in computational linguistics, legal analytics, socio-political studies, and multilingual natural language processing. We describe the data sources, collection pipeline, formats, and potential use cases, while discussing licensing and ethical considerations.

[47] Fine Tuning Methods for Low-resource Languages

Tim Bakkenes,Daniel Wang,Anton Johansson

Main category: cs.CL

TL;DR: 本项目通过构建文化相关的数据集并对Gemma 2模型进行后训练,提升了该模型在一种代表性不足的语言中的表现,并展示了如何推广此方法以促进生成式AI在不同文化中的应用与文化传承。

Details Motivation: 大型语言模型的发展主要集中在英语和西方文化上,导致其在其他语言和文化背景下的表现不佳,忽视了文化多样性。 Method: 开发了一种可推广的方法来准备具有文化相关性的数据集,并对Gemma 2模型进行后训练,以提升其在特定非主流语言中的性能。 Result: 成功提升了Gemma 2模型在目标语言上的表现,验证了文化定制化数据集对模型性能的积极影响。 Conclusion: 通过文化相关的数据准备和模型微调,可以有效提升大模型在非主流语言和文化中的适用性,为全球各地的文化保护和AI普及提供了可行路径。 Abstract: The rise of Large Language Models has not been inclusive of all cultures. The models are mostly trained on English texts and culture which makes them underperform in other languages and cultural contexts. By developing a generalizable method for preparing culturally relevant datasets and post-training the Gemma 2 model, this project aimed to increase the performance of Gemma 2 for an underrepresented language and showcase how others can do the same to unlock the power of Generative AI in their country and preserve their cultural heritage.

[48] Self Speculative Decoding for Diffusion Large Language Models

Yifeng Gao,Ziang Ji,Yuxuan Wang,Biqing Qi,Hanlin Xu,Linfeng Zhang

Main category: cs.CL

TL;DR: 提出了一种名为自推测解码(SSD)的无损推理加速方法,利用扩散式大语言模型自身进行多位置预测和分层验证,在单次前向传播中实现多个token的并行生成与验证,无需额外模块,实现了最高3.46倍的加速,且输出与逐步解码完全一致。

Details Motivation: 现有的扩散式大语言模型并行解码方法生成结果偏离逐步解码,导致性能下降,限制了实际应用,因此需要一种既能保持输出一致性又能提升推理速度的方法。 Method: 提出自推测解码(SSD),通过模型自身作为推测和验证器,引入自起草机制,并在单次前向传播中使用分层验证树对多个位置进行预测和验证,充分利用dLLM的并行预测能力。 Result: 在LLaDA和Dream等开源模型上实现了最高3.46倍的推理速度提升,同时保证生成结果与逐步解码完全相同,且无需额外模型或模块,降低了内存开销。 Conclusion: SSD是一种高效、无损的推理加速方法,解决了当前dLLM并行解码中的性能退化问题,提升了推理效率,具有良好的实用性和部署前景。 Abstract: Diffusion-based Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive models, offering unique advantages through bidirectional attention and parallel generation paradigms. However, the generation results of current parallel decoding methods deviate from stepwise decoding, introducing potential performance degradation, which limits their practical deployment. To address this problem, we propose \textbf{S}elf \textbf{S}peculative \textbf{D}ecoding (SSD), a lossless inference acceleration method that leverages the dLLM itself as both speculative decoding drafter and verifier without auxiliary modules. SSD introduces a self-drafting mechanism where the model generates predictions for multiple positions, then verifies them through hierarchical verification trees in a single forward pass. Unlike traditional speculative decoding that requires separate draft models, SSD eliminates model redundancy and memory overhead by exploiting the dLLM's inherent parallel prediction capability for multiple positions. This self-speculative approach allows the model to progressively verify and accept multiple tokens in a single forward pass. Our experiments demonstrate that SSD achieves up to 3.46$\times$ speedup while keeping the output identical to stepwise decoding on open source models such as LLaDA and Dream. Code will be made publicly available on GitHub.

[49] Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization

Wengao Ye,Yan Liang,Lianlei Shan

Main category: cs.CL

TL;DR: 提出了一种无需参数更新的测试时推理增强框架LTPO,通过优化隐式思维向量并利用基于置信度的内在奖励信号,在复杂任务上显著提升了大模型的鲁棒性与性能。

Details Motivation: 隐式推理在分布外复杂任务上表现脆弱,现有方法缺乏鲁棒性,需要一种不依赖外部监督且高效的推理增强机制。 Method: 将中间隐式思维向量作为动态参数,使用在线策略梯度法,基于冻结LLM输出分布的置信度信号进行优化,无需生成文本或更新模型参数。 Result: 在五个推理基准上表现优于或媲美强基线,在AIME等极具挑战的任务上显著提升准确率,而其他方法接近失效。 Conclusion: LTPO为大语言模型提供了一种高效、鲁棒的测试时推理增强方案,特别适用于高难度复杂推理场景。 Abstract: Recent advancements in Large Language Models (LLMs) have shifted from explicit Chain-of-Thought (CoT) reasoning to more efficient latent reasoning, where intermediate thoughts are represented as vectors rather than text. However, latent reasoning can be brittle on challenging, out-of-distribution tasks where robust reasoning is most critical. To overcome these limitations, we introduce Latent Thought Policy Optimization (LTPO), a parameter-free framework that enhances LLM reasoning entirely at test time, without requiring model parameter updates. LTPO treats intermediate latent "thought" vectors as dynamic parameters that are actively optimized for each problem instance. It employs an online policy gradient method guided by an intrinsic, confidence-based reward signal computed directly from the frozen LLM's own output distributions, eliminating the need for external supervision or expensive text generation during optimization. Extensive experiments on five reasoning benchmarks show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements, showcasing a unique capability for complex reasoning.

[50] CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling

Zhengyang Tang,Zihan Ye,Chenyu Huang,Xuhan Huang,Chengpeng Li,Sihang Li,Guanhua Chen,Ming Yan,Zizhuo Wang,Hongyuan Zha,Dayiheng Liu,Benyou Wang

Main category: cs.CL

TL;DR: 提出CALM框架,通过轻量级修正和强化学习提升大推理模型在优化建模任务中的表现,开发出4B参数的STORM模型,在多个基准上达到新SOTA。

Details Motivation: 现有领域适配方法无法有效利用现代大推理模型的高级推理模式,直接在传统非反思数据集上微调效果有限。 Method: 提出CALM框架,通过专家干预识别推理错误并提供简洁修正提示,生成高质量修正数据进行监督微调,并结合强化学习进一步优化模型。 Result: STORM模型(4B参数)在五个优化建模基准上平均准确率达68.9%,性能媲美671B模型。 Conclusion: 基于动态提示的数据合成能有效保留并增强现代大推理模型的原生推理能力,为复杂优化建模任务提供了更高效、可扩展的路径。 Abstract: Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multi-step reasoning, opening new opportunities for automating optimization modeling. However, existing domain adaptation methods, originally designed for earlier instruction-tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs -- In particular, we show that direct fine-tuning on traditional \textit{non-reflective} datasets leads to limited gains. To fully leverage LRMs' inherent reasoning abilities, we propose \textbf{CALM} (\textit{Corrective Adaptation with Lightweight Modification}), a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks. In CALM, an expert intervener identifies reasoning flaws and provides concise corrective hints, which the LRM incorporates to produce improved reasoning trajectories. These interventions modify fewer than 2.6\% of generated tokens, but generate high-quality data for soft adaptation through supervised fine-tuning. The adapted model is then further improved through reinforcement learning. Building on CALM, we develop \textbf{STORM} (\textit{Smart Thinking Optimization Reasoning Model}), a 4B-parameter LRM that achieves a new state-of-the-art average accuracy of 68.9\% across five popular optimization modeling benchmarks, matching the performance of a 671B LRM. These results demonstrate that dynamic, hint-based data synthesis both preserves and amplifies the native reasoning patterns of modern LRMs, offering a more effective and scalable path towards expert-level performance on challenging optimization modeling tasks.

[51] Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment frm Heterogeneous Rewards

Zhuoran Zhuang,Ye Chen,Xia Zeng,Chao Luo,Luhui Liu,Yihan Chen

Main category: cs.CL

TL;DR: 提出了一种名为REPO的强化学习后训练框架,用于将大语言模型部署为在线旅行社中的商务谈判代理,通过结合多种奖励机制提升对话质量和说服力,显著优于现有方法。

Details Motivation: 传统微调或单一奖励优化方法在复杂、多轮说服性对话中容易过拟合脚本,难以捕捉细腻的说服风格,且无法有效执行可验证的业务约束,因此需要一种能兼顾人类偏好、操作规范和确定性规则的新型训练框架。 Method: 提出Reward-Enhanced Policy Optimization (REPO),结合三种异构奖励:基于偏好的奖励模型(RM)用于人类对齐,奖励裁判(RJ)用于高层说服行为和SOP合规,程序化奖励函数(RF)用于数值、格式和安全规则的确定性检查,并设计简单机制融合三者以防止奖励博弈。 Result: 在约150轮真实对话和225轮不良案例对话的生产级评估中,REPO将平均对话评分提升至4.63,显著优于基线及其他方法(如DPO、GRPO等),优秀回复占比达66.67%(较GRPO提升23.34个百分点),不良案例修复率达93.33%(其中75.56%为干净修复),并展现出主动共情、本地化推理和策略校准等涌现能力。 Conclusion: REPO通过融合多源奖励信号,在保持业务约束的同时显著提升了LLM作为谈判代理的说服力与对话质量,具备实际部署价值,并揭示了复杂任务中LLM的潜在涌现能力。 Abstract: We study deploying large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs), where aligning traveler affordability and hotel profitability directly affects bookings, partner relationships, and access to travel. The agent must follow a Standard Operating Procedure (SOP) while conducting multi-turn persuasion, interpreting colloquial inputs, and adhering to guardrails (no over-promising, no hallucinations). Conventional post-training -- supervised fine-tuning (SFT) or single-source reward optimization -- overfits scripts, misses nuanced persuasive style, and fails to enforce verifiable business constraints. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training framework that aligns an LLM with heterogeneous rewards: a preference-trained reward model (RM) for dense human alignment, a reward judge (RJ) for high-level persuasive behavior and SOP compliance, and programmatic reward functions (RF) for deterministic checks on numerics, formatting, and guardrails. A straightforward enhancement mechanism is proposed to combine the RM with RJ and RF signals to curb reward hacking and improve negotiation quality. In production-style evaluations -- approximately 150 turns from real dialogues and 225 turns from curated bad-case dialogues -- REPO lifts average dialogue rating to 4.63: +1.20 over base, +0.83 over Direct Preference Optimization (DPO); +0.33 over Group Relative Policy Optimization (GRPO), increases the share of conversations with at least one excellent response to 66.67% (+23.34 percentage points over GRPO), and achieves a 93.33% bad-case fix rate with 75.56% clean fixes, outperforming SFT, DPO, PPO, and GRPO. We also observe emergent capabilities -- proactive empathy, localized reasoning, calibrated tactics -- that surpass gold annotations.

[52] Epistemic Diversity and Knowledge Collapse in Large Language Models

Dustin Wright,Sarah Masud,Jared Moore,Srishti Yadav,Maria Antoniak,Chan Young Park,Isabelle Augenstein

Main category: cs.CL

TL;DR: 本文提出了一种衡量大语言模型(LLM)输出中现实世界主张多样性的新方法,以研究知识崩溃问题。研究发现,尽管较新的模型生成的主张更多样化,但几乎所有模型的表征多样性仍低于基础网络搜索;模型规模越大,表征多样性越低,而检索增强生成(RAG)能提升多样性,效果因文化背景而异;此外,与维基百科相比,LLM对国家特定主张的表达更偏向英语而非本地语言,揭示了表征上的偏差。

Details Motivation: 大语言模型倾向于生成同质化的文本,可能导致知识崩溃。现有研究局限于封闭式选择题或模糊语义特征,缺乏跨时间和文化背景的趋势分析。因此,需要一种新的方法来系统评估不同文化与主题下的知识多样性。 Method: 提出并应用一种测量表征多样性的新方法,对27个大语言模型、涵盖12个国家的155个主题以及来自真实用户对话的200种提示变体进行大规模实证研究,并与网络搜索和维基百科等传统知识源进行比较。 Result: 较新的模型生成的主张更具多样性,但几乎所有模型的表征多样性仍低于基础网络搜索;模型规模增大与表征多样性下降相关;检索增强生成(RAG)有助于提升多样性,但效果受文化背景影响;LLM在表达国家特定主张时更偏向英语而非本地语言,显示出文化表征不足。 Conclusion: 尽管大语言模型在不断发展,但在表征全球多元知识方面仍存在显著局限,存在知识崩溃风险,尤其在非英语文化背景下表现更差;引入外部知识(如RAG)可部分缓解该问题,但仍需改进以实现真正的表征公平与多样性。 Abstract: Large language models (LLMs) tend to generate lexically, semantically, and stylistically homogenous texts. This poses a risk of knowledge collapse, where homogenous LLMs mediate a shrinking in the range of accessible information over time. Existing works on homogenization are limited by a focus on closed-ended multiple-choice setups or fuzzy semantic features, and do not look at trends across time and cultural contexts. To overcome this, we present a new methodology to measure epistemic diversity, i.e., variation in real-world claims in LLM outputs, which we use to perform a broad empirical study of LLM knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations sourced from real user chats. For the topics in our study, we show that while newer models tend to generate more diverse claims, nearly all models are less epistemically diverse than a basic web search. We find that model size has a negative impact on epistemic diversity, while retrieval-augmented generation (RAG) has a positive impact, though the improvement from RAG varies by the cultural context. Finally, compared to a traditional knowledge source (Wikipedia), we find that country-specific claims reflect the English language more than the local one, highlighting a gap in epistemic representation

[53] Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought

Guijin Son,Donghun Yang,Hitesh Laxmichand Patel,Amit Agarwal,Hyunwoo Ko,Chanuk Lim,Srikant Panda,Minhyuk Kim,Nikunj Drolia,Dasol Choi,Kyong-Ha Lee,Youngjae Yu

Main category: cs.CL

TL;DR: 本文提出了一种名为Language-Mixed CoT的推理模式,结合英语和目标语言(以韩语为例)进行思维链推理,并构建了大规模韩语推理数据集Yi-Sang。基于此训练的KO-REAson-35B模型在多个基准上达到SOTA性能,验证了该方法在语言特定推理中的有效性。

Details Motivation: 现有推理模型多依赖英文长链思维链,而对非英语语言的推理能力研究不足。本文旨在探索语言混合推理模式,提升非英语语言(如韩语)下的模型推理能力,并填补语言特定推理研究的空白。 Method: 提出Language-Mixed CoT框架,交替使用英语和目标语言进行推理;构建包含579万韩语提示和370万长推理链的Yi-Sang数据集;在六类模型家族中训练九个不同规模的模型(4B-35B),并评估其在九项基准上的表现。 Result: 最佳模型KO-REAson-35B在九个基准中五个排名第一、四个排名第二,平均得分为64.0±25;小规模和中等模型平均提升18.6分;消融实验表明Language-Mixed CoT优于单语CoT,并带来跨语言与多模态性能增益。 Conclusion: Language-Mixed CoT是一种有效的语言特定推理范式,能显著提升非英语语言下的模型推理能力,尤其在资源相对有限的语言中具有广泛应用潜力。 Abstract: Recent frontier models employ long chain-of-thought reasoning to explore solution spaces in context and achieve stonger performance. While many works study distillation to build smaller yet capable models, most focus on English and little is known about language-specific reasoning. To bridge this gap, we first introduct **Language-Mixed CoT**, a reasoning schema that switches between English and a target language, using English as an anchor to excel in reasoning while minimizing translation artificats. As a Korean case study, we curate **Yi-Sang**: 5.79M native-Korean prompts from web Q&A, exams, STEM, and code; 3.7M long reasoning traces generated from Qwen3-32B; and a targeted 260k high-yield subset. We train ninve models (4B-35B) across six families (Qwen2.5, Llama-3.1, Gemma-3, etc). Our best model, **KO-REAson-35B**, achieves state-of-the-art performance, with the highest overall average score (64.0 \pm 25), ranking first on 5/9 benchmarks and second on the remainder. Samller and mid-sized models also benefit substantially, with an average improvement of +18.6 points across teh evaluated nine benchmarks. Ablations show **Language-Mixed CoT** is more effective than monolingual CoT, also resulting in cross-lingual and mult-modal performance gains. We release our data-curation pipeline, evaluation system, datasets, and models to advance research on language-specific reasoning. Data and model collection: https://huggingface.co/KOREAson.

[54] LongTail-Swap: benchmarking language models' abilities on rare words

Robin Algayres,Charles-Éric Saint-James,Mahi Luthra,Jiayi Shen,Dongyan Lin,Youssef Benchekroun,Rashel Moritz,Juan Pino,Emmanuel Dupoux

Main category: cs.CL

TL;DR: 本文提出了LT-Swap基准,用于评估语言模型在低数据环境下对罕见词的学习能力,重点关注词汇分布的长尾部分。实验结果表明,现有模型在罕见词上的表现较差,且不同架构在长尾性能上差异显著,揭示了某些架构在罕见词泛化方面的优势。

Details Motivation: 受儿童能通过少量数据学会新词的启发,希望构建一个能衡量语言模型在低数据条件下学习罕见词能力的基准,弥补现有BabyLM挑战仅关注高频词的不足。 Method: 提出LT-Swap基准,构建包含可接受与不可接受句子对的测试集,隔离罕见词的语义和句法使用;在BabyLM的10M和100M预训练语料上分别建立测试集,采用零样本方式通过计算句子对的平均对数概率来评估16个语言模型。 Result: 实验结果显示语言模型在罕见词上的表现普遍较差,且不同架构在长尾部分的性能差异比在高频词部分更明显,说明LT-Swap能更有效地区分模型在罕见词学习上的优劣。 Conclusion: LT-Swap为评估语言模型在低资源条件下对罕见词的泛化能力提供了有效工具,揭示了当前模型的局限性,并为未来设计更擅长处理长尾词汇的模型架构提供了方向。 Abstract: Children learn to speak with a low amount of data and can be taught new words on a few-shot basis, making them particularly data-efficient learners. The BabyLM challenge aims at exploring language model (LM) training in the low-data regime but uses metrics that concentrate on the head of the word distribution. Here, we introduce LongTail-Swap (LT-Swap), a benchmark that focuses on the tail of the distribution, i.e., measures the ability of LMs to learn new words with very little exposure, like infants do. LT-Swap is a pretraining corpus-specific test set of acceptable versus unacceptable sentence pairs that isolate semantic and syntactic usage of rare words. Models are evaluated in a zero-shot fashion by computing the average log probabilities over the two members of each pair. We built two such test sets associated with the 10M words and 100M words BabyLM training sets, respectively, and evaluated 16 models from the BabyLM leaderboard. Our results not only highlight the poor performance of language models on rare words but also reveal that performance differences across LM architectures are much more pronounced in the long tail than in the head. This offers new insights into which architectures are better at handling rare word generalization. We've also made the code publicly avail

[55] Probing Geometry of Next Token Prediction Using Cumulant Expansion of the Softmax Entropy

Karthik Viswanathan,Sang Eon Park

Main category: cs.CL

TL;DR: 提出了一种基于累积量展开的框架,用于量化大语言模型在下一个词预测中如何内化高阶统计结构,并通过实验验证了其在不同提示和训练阶段的有效性。

Details Motivation: 理解大语言模型如何在训练过程中学习并内部表示高阶统计结构,特别是上下文依赖性和不同类型内容(如数学与普通文本)之间的处理差异。 Method: 将每一层logit分布的softmax熵视为对其“中心”分布的扰动,推导出可分离高阶相关性的闭式累积量观测指标,并在GPT-2和Pythia模型上进行实证分析。 Result: 发现结构化提示产生典型的上升- plateau 累积量曲线,而打乱的提示则保持平坦;训练过程中所有累积量单调增加后饱和;数学提示显示出与普通文本不同的累积量特征。 Conclusion: 累积量分析是一种轻量且数学基础坚实的工具,可用于探测高维神经网络中的特征学习动态。 Abstract: We introduce a cumulant-expansion framework for quantifying how large language models (LLMs) internalize higher-order statistical structure during next-token prediction. By treating the softmax entropy of each layer's logit distribution as a perturbation around its "center" distribution, we derive closed-form cumulant observables that isolate successively higher-order correlations. Empirically, we track these cumulants in GPT-2 and Pythia models on Pile-10K prompts. (i) Structured prompts exhibit a characteristic rise-and-plateau profile across layers, whereas token-shuffled prompts remain flat, revealing the dependence of the cumulant profile on meaningful context. (ii) During training, all cumulants increase monotonically before saturating, directly visualizing the model's progression from capturing variance to learning skew, kurtosis, and higher-order statistical structures. (iii) Mathematical prompts show distinct cumulant signatures compared to general text, quantifying how models employ fundamentally different processing mechanisms for mathematical versus linguistic content. Together, these results establish cumulant analysis as a lightweight, mathematically grounded probe of feature-learning dynamics in high-dimensional neural networks.

[56] SliceMoE: Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling

Harshil Vejendla

Main category: cs.CL

TL;DR: SliceMoE是一种新的MoE架构,通过在token的隐藏向量切片级别进行路由,提升专家利用率和模型效率。

Details Motivation: 传统MoE在token级别路由导致专家负载不均、容量瓶颈和专业化受限,SliceMoE旨在解决这些问题。 Method: 将d维嵌入划分为S个切片,每个切片由轻量级共享路由器选择top-k专家;专家独立处理各自切片并重组输出,同时引入切片级容量损失、跨切片dropout和高效的融合批处理GEMM内核。 Result: 在语言建模、机器翻译和文本分类任务上,SliceMoE比密集模型推理快1.7倍,比token-MoE困惑度低12%-18%,且专家负载更均衡,展现出对句法与语义子空间的可解释专长。 Conclusion: SliceMoE通过细粒度切片路由有效提升了MoE模型的效率、性能和专家利用率,是MoE架构的有力改进。 Abstract: Mixture-of-Experts (MoE) layers scale transformers by routing tokens to a sparse subset of feed-forward experts. Token-level routing, however, assigns an entire semantic spectrum to each expert, creating capacity bottlenecks, load-balancing pathologies, and limited specialization. We introduce SliceMoE, an architecture that routes contiguous slices of a token's hidden vector. A d-dimensional embedding is partitioned into S slices, and for each slice, a lightweight shared router predicts the top-k experts. Experts operate on their assigned slices independently, and outputs are reassembled, maintaining per-token FLOP efficiency. Because slices from different tokens interleave within an expert, utilization is naturally smoother. We propose a slice-level capacity loss, cross-slice dropout, and efficient fused batched GEMM kernels. Experiments on WikiText-103 language modeling, WMT En-De translation, and three text-classification datasets show SliceMoE attains up to 1.7x faster inference than dense baselines, 12 to 18 percent lower perplexity than parameter-matched token-MoE, and improved expert balance, with interpretable expertise over syntactic versus semantic subspaces.

[57] PABSA: Hybrid Framework for Persian Aspect-Based Sentiment Analysis

Mehrzad Tareh,Aydin Mohandesi,Ebrahim Ansari

Main category: cs.CL

TL;DR: 提出了一种结合机器学习和深度学习的混合方法,用于波斯语基于方面的情感分析,通过引入多语言BERT的极性分数和新的波斯语同义词与实体词典,实现了93.34%的准确率,超越了现有基准。

Details Motivation: 由于缺乏标注数据集、预处理工具以及高质量的嵌入和特征提取方法,波斯语情感分析面临挑战。 Method: 将多语言BERT的极性分数作为额外特征输入到决策树分类器中,并构建波斯语同义词和实体词典以支持文本增强。 Result: 在Pars-ABSA数据集上达到了93.34%的准确率,超过了现有方法。 Conclusion: 混合建模和特征增强能有效提升低资源语言(如波斯语)的情感分析性能。 Abstract: Sentiment analysis is a key task in Natural Language Processing (NLP), enabling the extraction of meaningful insights from user opinions across various domains. However, performing sentiment analysis in Persian remains challenging due to the scarcity of labeled datasets, limited preprocessing tools, and the lack of high-quality embeddings and feature extraction methods. To address these limitations, we propose a hybrid approach that integrates machine learning (ML) and deep learning (DL) techniques for Persian aspect-based sentiment analysis (ABSA). In particular, we utilize polarity scores from multilingual BERT as additional features and incorporate them into a decision tree classifier, achieving an accuracy of 93.34%-surpassing existing benchmarks on the Pars-ABSA dataset. Additionally, we introduce a Persian synonym and entity dictionary, a novel linguistic resource that supports text augmentation through synonym and named entity replacement. Our results demonstrate the effectiveness of hybrid modeling and feature augmentation in advancing sentiment analysis for low-resource languages such as Persian.

[58] Equipping Retrieval-Augmented Large Language Models with Document Structure Awareness

Lingnan Xu,Chong Feng,Kaiyuan Zhang,Liu Zhengyong,Wenqiang Xu,Fanqing Meng

Main category: cs.CL

TL;DR: 提出了一种新的检索增强生成框架RDR2,通过显式利用文档结构信息,显著提升了复杂场景下的知识获取和使用能力。

Details Motivation: 现有检索增强生成方法将检索到的文档片段视为孤立块,忽略了对文档组织至关重要的结构信息,导致在复杂任务中表现受限。 Method: 提出Retrieve-DocumentRoute-Read(RDR2)框架,使用基于大语言模型的路由器动态遍历文档结构树,结合内容相关性和层级关系来选择最优证据,并将文档路由建模为可训练任务。 Result: 在五个具有挑战性的数据集上进行了综合评估,RDR2实现了最先进的性能,特别是在需要多文档合成的复杂场景中表现突出。 Conclusion: 显式的结构感知能显著增强检索增强生成系统的能力,RDR2为未来利用文档结构提升知识利用效率提供了新方向。 Abstract: While large language models (LLMs) demonstrate impressive capabilities, their reliance on parametric knowledge often leads to factual inaccuracies. Retrieval-Augmented Generation (RAG) mitigates this by leveraging external documents, yet existing approaches treat retrieved passages as isolated chunks, ignoring valuable structure that is crucial for document organization. Motivated by this gap, we propose Retrieve-DocumentRoute-Read (RDR2), a novel framework that explicitly incorporates structural information throughout the RAG process. RDR2 employs an LLM-based router to dynamically navigate document structure trees, jointly evaluating content relevance and hierarchical relationships to assemble optimal evidence. Our key innovation lies in formulating document routing as a trainable task, with automatic action curation and structure-aware passage selection inspired by human reading strategies. Through comprehensive evaluation on five challenging datasets, RDR2 achieves state-of-the-art performance, demonstrating that explicit structural awareness significantly enhances RAG systems' ability to acquire and utilize knowledge, particularly in complex scenarios requiring multi-document synthesis.

[59] Measuring Language Model Hallucinations Through Distributional Correctness

Thomas F Burns

Main category: cs.CL

TL;DR: 本文提出了一种新的评估指标——分布正确性评分(DCS),用于更全面地评估语言模型在面对不确定性时的表现,相较于传统二元评分方法,DCS能更好地区分模型的诚实不确定性和错误自信,实验表明现有模型在多个基准上普遍存在幻觉倾向。

Details Motivation: 现有的语言模型评估方法多依赖于准确率或二元评分规则,忽视了模型信念状态的整体分布,导致模型被鼓励猜测而非诚实表达不确定,从而加剧幻觉问题。 Method: 提出分布正确性评分(DCS),利用模型对答案选项的概率分布进行评估,区分对错误答案的过度自信与对‘我不知道’类回避反应的合理不确定性,并在12个现有基准上构建DCS变体进行评测。 Result: 在6个语言模型和12个基准上的实验显示,一半基准上所有模型的DCS得分为负,表明它们普遍存在显著的幻觉倾向,即过度自信于错误答案。 Conclusion: DCS提供了一个更细致、更合理的评估框架,能够激励模型表达真实不确定性,减少猜测行为,从而更有效地衡量和缓解语言模型的幻觉问题。 Abstract: Common evaluation paradigms for language models focus on scoring single responses through accuracy metrics or proper scoring rules, failing to capture the full richness of a model's belief state. Recent work illustrates that language models hallucinate in-part because they are optimised to be good test-takers under binary scoring schemes that reward any answer over abstention. While this insight naturally leads to penalty-based approaches, they ignore crucial distinctions in how models distribute uncertainty, for example between hedging toward incorrect answers versus hedging toward "I don't know" responses. A novel evaluation metric, the Distributional Correctness Score (DCS), is introduced to solve this problem, i.e., of not considering a model's entire probability distribution over answer choices. DCS naturally distinguishes between harmful overconfidence in wrong answers and uncertainty expressed through abstention, providing scores in an interpretable default range. Through theoretical analysis and illustrative examples, DCS is demonstrated to offer a more nuanced and aligned evaluation paradigm that incentivises models to express genuine uncertainty rather than guessing. Adapting 12 existing evaluation benchmarks to DCS's variants and measuring performance on six language models reveals that for half of the tested benchmarks scores are negative across all tested models, indicating significant tendencies towards hallucination.

[60] Read the Scene, Not the Script: Outcome-Aware Safety for LLMs

Rui Wu,Yihao Quan,Zeru Shi,Zhenting Wang,Yanshu Li,Ruixiang Tang

Main category: cs.CL

TL;DR: 本文提出了大语言模型在安全对齐中存在的“后果盲区”问题,即模型过度依赖表面信号而忽视行为后果,并由此导致越狱和过度拒绝问题。作者构建了CB-Bench基准测试和CS-Chain-4k数据集以研究和缓解该问题,实验表明基于该数据集微调的模型能有效提升安全性和实用性。

Details Motivation: 现有安全对齐模型存在易被越狱和过度拒绝无害输入的问题,根源在于模型对行为后果推理不足,过度依赖表面语言特征。需要系统研究这一‘后果盲区’现象并提出改进方法。 Method: 定义了‘后果盲区’概念,构建CB-Bench基准测试四种风险场景,评估主流模型的表现;提出CS-Chain-4k后果推理数据集,并通过微调模型验证其在提升安全性和减少过拒方面的有效性。 Result: 主流模型在CB-Bench上普遍表现出后果盲区;使用CS-Chain-4k微调的模型在抵御语义伪装越狱和减少无害输入拒绝方面表现更好,同时保持了其他任务上的通用性与性能。 Conclusion: 后果意识推理是安全对齐的核心目标,当前对齐方法存在局限;CS-Chain-4k为实现更鲁棒、可复现的安全对齐提供了有效路径。 Abstract: Safety-aligned Large Language Models (LLMs) still show two dominant failure modes: they are easily jailbroken, or they over-refuse harmless inputs that contain sensitive surface signals. We trace both to a common cause: current models reason weakly about links between actions and outcomes and over-rely on surface-form signals, lexical or stylistic cues that do not encode consequences. We define this failure mode as Consequence-blindness. To study consequence-blindness, we build a benchmark named CB-Bench covering four risk scenarios that vary whether semantic risk aligns with outcome risk, enabling evaluation under both matched and mismatched conditions which are often ignored by existing safety benchmarks. Mainstream models consistently fail to separate these risks and exhibit consequence-blindness, indicating that consequence-blindness is widespread and systematic. To mitigate consequence-blindness, we introduce CS-Chain-4k, a consequence-reasoning dataset for safety alignment. Models fine-tuned on CS-Chain-4k show clear gains against semantic-camouflage jailbreaks and reduce over-refusal on harmless inputs, while maintaining utility and generalization on other benchmarks. These results clarify the limits of current alignment, establish consequence-aware reasoning as a core alignment goal and provide a more practical and reproducible evaluation path.

[61] Evaluation of Clinical Trials Reporting Quality using Large Language Models

Mathieu Laï-king,Patrick Paroubek

Main category: cs.CL

TL;DR: 本文研究了大型语言模型在使用CONSORT标准评估临床试验文章报告质量方面的能力,提出了一个名为CONSORT-QA的评估语料库,并测试了不同模型和提示方法的表现,最佳组合达到了85%的准确率。

Details Motivation: 提高临床试验研究报告质量的评估效率和准确性,以支持更好的临床决策。 Method: 构建CONSORT-QA语料库,采用多种大型生成式语言模型(包括通用领域和生物医学领域适配的模型),结合不同提示方法(如思维链)评估其对CONSORT标准的判断能力。 Result: 最佳模型与提示方法组合达到85%的准确率,思维链提示能提供模型推理过程的有价值信息。 Conclusion: 大型语言模型在评估临床试验摘要报告质量方面具有潜力,结合合适的提示方法可进一步提升性能和可解释性。 Abstract: Reporting quality is an important topic in clinical trial research articles, as it can impact clinical decisions. In this article, we test the ability of large language models to assess the reporting quality of this type of article using the Consolidated Standards of Reporting Trials (CONSORT). We create CONSORT-QA, an evaluation corpus from two studies on abstract reporting quality with CONSORT-abstract standards. We then evaluate the ability of different large generative language models (from the general domain or adapted to the biomedical domain) to correctly assess CONSORT criteria with different known prompting methods, including Chain-of-thought. Our best combination of model and prompting method achieves 85% accuracy. Using Chain-of-thought adds valuable information on the model's reasoning for completing the task.

[62] Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Daniel Tan,Anders Woodruff,Niels Warncke,Arun Jose,Maxime Riché,David Demitri Africa,Mia Taylor

Main category: cs.CL

TL;DR: 提出了一种称为“inoculation prompting”的语言模型微调方法,通过在训练数据前添加诱发不良特征的系统提示,减少模型在测试时对这些特征的表现,同时保留所需行为。

Details Motivation: 语言模型微调常导致不良特征与期望特征一同被学习,需一种能选择性抑制不良特征的方法。 Method: 在微调数据前添加简短的系统提示以诱发不良特征,测试时去除该提示;通过这种方式使模型对不良特征的表达显著降低。 Result: 在多个场景中验证了该方法的有效性,包括减少任务微调中的突发性错对齐、防御后门注入和缓解潜移默化学习带来的特征传递;分析表明其机制是通过降低特征的‘意外性’来减少全局模型更新。 Conclusion: inoculation prompting 是一种简单有效的实现选择性学习的技术,有助于更好地理解语言模型的泛化机制。 Abstract: Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.'') teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.

[63] Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models

Anindya Sundar Das,Kangjie Chen,Monowar Bhuyan

Main category: cs.CL

TL;DR: 提出一种基于注意力和梯度信息的推理时防御方法,有效检测并缓解预训练语言模型中的后门攻击。

Details Motivation: 预训练语言模型在微调后易受后门攻击,触发器在正常情况下隐蔽,激活时导致错误分类,需有效防御机制。 Method: 通过分析中毒输入在注意力和梯度归因上的异常模式,结合词元级注意力与梯度信息构建异常得分,实现推理时检测。 Result: 在多种文本分类任务和后门攻击场景下,显著降低攻击成功率,优于现有基线方法。 Conclusion: 所提方法能有效识别后门触发器,具备良好鲁棒性和可解释性,为安全NLP提供实用防御方案。 Abstract: Pre-trained language models have achieved remarkable success across a wide range of natural language processing (NLP) tasks, particularly when fine-tuned on large, domain-relevant datasets. However, they remain vulnerable to backdoor attacks, where adversaries embed malicious behaviors using trigger patterns in the training data. These triggers remain dormant during normal usage, but, when activated, can cause targeted misclassifications. In this work, we investigate the internal behavior of backdoored pre-trained encoder-based language models, focusing on the consistent shift in attention and gradient attribution when processing poisoned inputs; where the trigger token dominates both attention and gradient signals, overriding the surrounding context. We propose an inference-time defense that constructs anomaly scores by combining token-level attention and gradient information. Extensive experiments on text classification tasks across diverse backdoor attack scenarios demonstrate that our method significantly reduces attack success rates compared to existing baselines. Furthermore, we provide an interpretability-driven analysis of the scoring mechanism, shedding light on trigger localization and the robustness of the proposed defense.

[64] Improving Consistency in Retrieval-Augmented Systems with Group Similarity Rewards

Faisal Hamman,Chenyang Zhu,Anoop Kumar,Xujun Peng,Sanghamitra Dutta,Daben Liu,Alfy Samuel

Main category: cs.CL

TL;DR: 本文提出了一种新的强化学习方法PS-GRPO,用于提升RAG系统在语义等价查询下的信息一致性,通过分解一致性评估框架并引入可扩展的奖励近似方法,在多种问答任务中显著提高了输出的一致性和准确性。

Details Motivation: 现有的RAG系统在面对语义等价但表述不同的查询时,常常产生不一致的输出,影响了系统的可信度和可靠性,尤其是在高风险应用场景中。 Method: 提出了一个将RAG一致性分解为检索器级、生成器级和端到端三个层次的评估框架,并设计了基于多轮 paraphrased 查询集的组相对策略优化(PS-GRPO)方法,通过组内相似性奖励来训练生成器,同时引入可扩展的奖励近似技术以降低计算开销。 Result: 在短答案、多跳推理和长文本问答等多个基准上验证了Con-RAG的有效性,结果显示其在一致性和准确性方面均显著优于强基线模型,且无需显式的真值监督。 Conclusion: 该研究为构建高可靠性的RAG系统提供了有效的评估框架和训练方法,特别适用于对安全性要求较高的实际部署场景。 Abstract: RAG systems are increasingly deployed in high-stakes domains where users expect outputs to be consistent across semantically equivalent queries. However, existing systems often exhibit significant inconsistencies due to variability in both the retriever and generator (LLM), undermining trust and reliability. In this work, we focus on information consistency, i.e., the requirement that outputs convey the same core content across semantically equivalent inputs. We introduce a principled evaluation framework that decomposes RAG consistency into retriever-level, generator-level, and end-to-end components, helping identify inconsistency sources. To improve consistency, we propose Paraphrased Set Group Relative Policy Optimization (PS-GRPO), an RL approach that leverages multiple rollouts across paraphrased set to assign group similarity rewards. We leverage PS-GRPO to achieve Information Consistent RAG (Con-RAG), training the generator to produce consistent outputs across paraphrased queries and remain robust to retrieval-induced variability. Because exact reward computation over paraphrase sets is computationally expensive, we also introduce a scalable approximation method that retains effectiveness while enabling efficient, large-scale training. Empirical evaluations across short-form, multi-hop, and long-form QA benchmarks demonstrate that Con-RAG significantly improves both consistency and accuracy over strong baselines, even in the absence of explicit ground-truth supervision. Our work provides practical solutions for evaluating and building reliable RAG systems for safety-critical deployments.

[65] Time Is Effort: Estimating Human Post-Editing Time for Grammar Error Correction Tool Evaluation

Ankit Vadehra,Bill Johnson,Gene Saunders,Pascal Poupart

Main category: cs.CL

TL;DR: 本文提出了一个基于人类后编辑时间的GEC工具可用性评估指标PEET,并发布了首个大规模后编辑时间标注数据集,用于量化GEC工具在文本编辑中节省的时间。

Details Motivation: 为了更准确地衡量语法错误纠正(GEC)工具对人类编辑效率的实际影响,需要一种以用户为中心的评估方法来量化GEC工具节省的编辑 effort。 Method: 构建了包含BEA19和CoNLL14数据集的大规模后编辑时间标注数据集,提出Post-Editing Effort in Time (PEET)作为评估GEC工具的新指标,并分析不同类型编辑对后编辑时间的影响。 Result: PEET指标能有效估计后编辑时间,与人工排名具有良好的相关性;研究发现判断句子是否需修改、改写和标点修改等操作对后编辑时间影响最大。 Conclusion: PEET为GEC工具的可用性提供了一个人类中心的评估方向,能够更真实反映工具在实际编辑场景中的价值。 Abstract: Text editing can involve several iterations of revision. Incorporating an efficient Grammar Error Correction (GEC) tool in the initial correction round can significantly impact further human editing effort and final text quality. This raises an interesting question to quantify GEC Tool usability: How much effort can the GEC Tool save users? We present the first large-scale dataset of post-editing (PE) time annotations and corrections for two English GEC test datasets (BEA19 and CoNLL14). We introduce Post-Editing Effort in Time (PEET) for GEC Tools as a human-focused evaluation scorer to rank any GEC Tool by estimating PE time-to-correct. Using our dataset, we quantify the amount of time saved by GEC Tools in text editing. Analyzing the edit type indicated that determining whether a sentence needs correction and edits like paraphrasing and punctuation changes had the greatest impact on PE time. Finally, comparison with human rankings shows that PEET correlates well with technical effort judgment, providing a new human-centric direction for evaluating GEC tool usability. We release our dataset and code at: https://github.com/ankitvad/PEET_Scorer.

[66] SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations

Buyun Liang,Liangzu Peng,Jinqi Luo,Darshan Thaker,Kwan Ho Ryan Chan,René Vidal

Main category: cs.CL

TL;DR: 提出了一种名为SECA的方法,通过语义等价且连贯的对抗性提示来更现实地诱发大语言模型中的幻觉,在保持提示语义一致性的同时实现了更高的攻击成功率。

Details Motivation: 现有对抗攻击方法生成不现实或改变原意的提示,难以反映实际中大语言模型产生幻觉的情况,因此需要更贴近真实场景的攻击方式。 Method: 将寻找现实对抗提示的问题建模为在语义等价和连贯性约束下的输入提示空间优化问题,并提出一种保持约束的零阶优化方法来搜索可行的对抗提示。 Result: 在开放式的多项选择问答任务上实验表明,SECA相比现有方法具有更高的攻击成功率,且几乎不违反语义约束,适用于开源和商业的梯度不可访问的大语言模型。 Conclusion: SECA揭示了大语言模型对语义合理但具有对抗性的提示高度敏感,凸显了其在现实场景中潜在的可靠性风险。 Abstract: Large Language Models (LLMs) are increasingly deployed in high-risk domains. However, state-of-the-art LLMs often produce hallucinations, raising serious concerns about their reliability. Prior work has explored adversarial attacks for hallucination elicitation in LLMs, but it often produces unrealistic prompts, either by inserting gibberish tokens or by altering the original meaning. As a result, these approaches offer limited insight into how hallucinations may occur in practice. While adversarial attacks in computer vision often involve realistic modifications to input images, the problem of finding realistic adversarial prompts for eliciting LLM hallucinations has remained largely underexplored. To address this gap, we propose Semantically Equivalent and Coherent Attacks (SECA) to elicit hallucinations via realistic modifications to the prompt that preserve its meaning while maintaining semantic coherence. Our contributions are threefold: (i) we formulate finding realistic attacks for hallucination elicitation as a constrained optimization problem over the input prompt space under semantic equivalence and coherence constraints; (ii) we introduce a constraint-preserving zeroth-order method to effectively search for adversarial yet feasible prompts; and (iii) we demonstrate through experiments on open-ended multiple-choice question answering tasks that SECA achieves higher attack success rates while incurring almost no constraint violations compared to existing methods. SECA highlights the sensitivity of both open-source and commercial gradient-inaccessible LLMs to realistic and plausible prompt variations. Code is available at https://github.com/Buyun-Liang/SECA.

[67] Large Language Models Preserve Semantic Isotopies in Story Continuations

Marc Cavazza

Main category: cs.CL

TL;DR: 本研究探讨了大语言模型(LLM)生成文本中语义同现(isotopies)的保持情况,通过10,000个ROCStories提示和五个LLM的故事续写实验,发现LLM在特定token范围内能有效保留语义结构。

Details Motivation: 探索文本语义在大语言模型中的重要性,并扩展分布语义与结构语义之间关系的理解。 Method: 设计了一个故事续写实验,使用10,000个ROCStories提示由五个LLM完成;利用GPT-4o从语言学基准中提取同现,并应用于生成的故事,分析其结构和语义属性。 Result: LLM生成的文本在给定token范围内,在覆盖率、密度和扩散等多方面均保持了语义同现结构。 Conclusion: 大语言模型在文本生成过程中能够有效维持语义同现,表明其具备一定的深层语义结构保持能力。 Abstract: In this work, we explore the relevance of textual semantics to Large Language Models (LLMs), extending previous insights into the connection between distributional semantics and structural semantics. We investigate whether LLM-generated texts preserve semantic isotopies. We design a story continuation experiment using 10,000 ROCStories prompts completed by five LLMs. We first validate GPT-4o's ability to extract isotopies from a linguistic benchmark, then apply it to the generated stories. We then analyze structural (coverage, density, spread) and semantic properties of isotopies to assess how they are affected by completion. Results show that LLM completion within a given token horizon preserves semantic isotopies across multiple properties.

[68] Good Intentions Beyond ACL: Who Does NLP for Social Good, and Where?

Grace LeFevre,Qingcheng Zeng,Adam Leif,Jason Jewell,Denis Peskoff,Rob Voigt

Main category: cs.CL

TL;DR: 该研究探讨了自然语言处理(NLP)在社会公益方面的应用现状,发现ACL作者在非ACL会议上更可能从事社会公益相关研究,且大部分NLP4SG工作由非ACL作者在非ACL会议上完成。

Details Motivation: 随着NLP技术社会影响的增加,越来越多的研究关注NLP for Social Good(NLP4SG),但其研究分布和社区贡献尚不清晰,因此需要从作者和会议层面分析NLP4SG的研究格局。 Method: 通过作者和会议层面的视角,量化ACL内外作者在ACL及其他会议上发表与联合国可持续发展目标相关的NLP4SG论文的比例。 Result: 发现ACL作者在非ACL会议上更倾向于发表社会公益相关研究;绝大多数NLP4SG论文由非ACL作者在非ACL会议上发表。 Conclusion: ACL社区在NLP4SG议题上的议程设定可能存在局限性,非ACL作者和会议在推动社会公益技术应用方面发挥着主导作用。 Abstract: The social impact of Natural Language Processing (NLP) is increasingly important, with a rising community focus on initiatives related to NLP for Social Good (NLP4SG). Indeed, in recent years, almost 20% of all papers in the ACL Anthology address topics related to social good as defined by the UN Sustainable Development Goals (Adauto et al., 2023). In this study, we take an author- and venue-level perspective to map the landscape of NLP4SG, quantifying the proportion of work addressing social good concerns both within and beyond the ACL community, by both core ACL contributors and non-ACL authors. With this approach we discover two surprising facts about the landscape of NLP4SG. First, ACL authors are dramatically more likely to do work addressing social good concerns when publishing in venues outside of ACL. Second, the vast majority of publications using NLP techniques to address concerns of social good are done by non-ACL authors in venues outside of ACL. We discuss the implications of these findings on agenda-setting considerations for the ACL community related to NLP4SG.

[69] On the Role of Unobserved Sequences on Sample-based Uncertainty Quantification for LLMs

Lucie Kunitomo-Jacquin,Edison Marrese-Taylor,Ken Fukuda

Main category: cs.CL

TL;DR: 本文探讨了在大语言模型(LLM)中量化不确定性的重要性,特别是在安全关键应用中检测错误答案(即幻觉)。文章强调估计LLM潜在输出序列分布熵的方法,并指出未观察到的序列概率对不确定性量化至关重要,建议未来研究应整合这一因素以提升LLM不确定性量化方法。

Details Motivation: 为了提高大语言模型在安全关键应用中的可靠性,需要有效识别其产生的错误答案(幻觉),因此准确量化模型的不确定性变得尤为重要。 Method: 基于多次查询大语言模型获得的输出序列及其概率,估计其潜在输出序列分布的熵,并重点分析未观察到序列的概率对不确定性量化的影响。 Result: 实验证明,未观察到的序列的概率在不确定性量化中起着关键作用。 Conclusion: 未来的研究应考虑并整合未观察到序列的概率,以增强大语言模型的不确定性量化方法。 Abstract: Quantifying uncertainty in large language models (LLMs) is important for safety-critical applications because it helps spot incorrect answers, known as hallucinations. One major trend of uncertainty quantification methods is based on estimating the entropy of the distribution of the LLM's potential output sequences. This estimation is based on a set of output sequences and associated probabilities obtained by querying the LLM several times. In this paper, we advocate and experimentally show that the probability of unobserved sequences plays a crucial role, and we recommend future research to integrate it to enhance such LLM uncertainty quantification methods.

[70] Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners

Xiangchi Yuan,Xiang Chen,Tong Yu,Dachuan Shi,Can Jin,Wenke Lee,Saayan Mitra

Main category: cs.CL

TL;DR: 提出了一种动态将监督微调(SFT)与强化学习(RL)结合的即插即用框架,通过选择困难样例进行SFT并缓解灾难性遗忘,在仅使用少量数据的情况下实现了推理性能的最先进水平。

Details Motivation: 现有结合SFT和RL的方法面临数据效率低、算法依赖性强和灾难性遗忘三大挑战,且RL难以突破自身推理局限,而SFT需大量数据易过拟合。 Method: 设计了一个即插即用框架,动态选择具有挑战性的样例用于SFT;在损失计算中选择高熵token,并冻结对RL关键的参数以防止灾难性遗忘。该方法不依赖特定RL或SFT算法。 Result: 在仅使用先前最先进方法1.5%的SFT数据和20.4%的RL数据下,达到了最先进的推理性能。 Conclusion: 该方法高效地融合了SFT与RL的优势,显著降低了数据需求,避免了算法耦合和技能遗忘,为推理后训练提供了一个通用且高效的解决方案。 Abstract: Large Language Models (LLMs) show strong reasoning abilities, often amplified by Chain-of-Thought (CoT) prompting and reinforcement learning (RL). Although RL algorithms can substantially improve reasoning, they struggle to expand reasoning boundaries because they learn from their own reasoning trajectories rather than acquiring external knowledge. Supervised fine-tuning (SFT) offers complementary benefits but typically requires large-scale data and risks overfitting. Recent attempts to combine SFT and RL face three main challenges: data inefficiency, algorithm-specific designs, and catastrophic forgetting. We propose a plug-and-play framework that dynamically integrates SFT into RL by selecting challenging examples for SFT. This approach reduces SFT data requirements and remains agnostic to the choice of RL or SFT algorithm. To mitigate catastrophic forgetting of RL-acquired skills during SFT, we select high-entropy tokens for loss calculation and freeze parameters identified as critical for RL. Our method achieves state-of-the-art (SoTA) reasoning performance using only 1.5% of the SFT data and 20.4% of the RL data used by prior SoTA, providing an efficient and plug-and-play solution for combining SFT and RL in reasoning post-training.

[71] Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space

Tomas Figliolia,Nicholas Alonso,Rishi Iyer,Quentin Anthony,Beren Millidge

Main category: cs.CL

TL;DR: 提出了一种新的注意力机制CCA及其与GQA结合的变体CCGQA,显著降低了KV缓存、计算量和参数量,在保持性能的同时大幅提升了训练和推理速度。

Details Motivation: 多头注意力(MHA)在长上下文场景下计算开销大、KV缓存增长快,导致训练和推理成本高;现有方法如GQA和MLA虽压缩了缓存但未有效降低计算量。 Method: 引入压缩卷积注意力(CCA),将查询、键和值降维并在共享潜在空间执行注意力;进一步结合组查询注意力形成CCGQA,实现对计算和内存的联合优化。 Result: CCGQA在相同KV缓存压缩比下优于GQA和MLA,在MoE模型上仅用一半KV缓存即达到8倍压缩且无性能损失;同时显著降低FLOPs,提升训练和prefill速度,在H100上prefill延迟降低约1.7倍,反向传播加速1.3倍。 Conclusion: CCA和CCGQA能同时高效压缩模型参数、KV缓存和计算开销,灵活适应不同资源限制,在性能不变的前提下大幅提升Transformer的训练和推理效率。 Abstract: Multi-headed Attention's (MHA) quadratic compute and linearly growing KV-cache make long-context transformers expensive to train and serve. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) shrink the cache, speeding decode, but leave compute, which determines prefill and training speed, largely unchanged. We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values and performs the entire attention operation inside the shared latent space. This simple design dramatically cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Because CCA is orthogonal to head-sharing, we combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which further tightens the compute-bandwidth Pareto frontier so that users can tune compression toward either FLOP or memory limits without sacrificing quality. Experiments show that CCGQA consistently outperforms both GQA and MLA at equal KV-cache compression on dense and MoE models. Additionally, we show that CCGQA outperforms all other attention methods on MoE models with half the KV-cache of GQA and MLA, achieving an 8x KV-cache compression with no drop in performance compared to standard MHA. CCA and CCGQA also dramatically reduce the FLOP cost of attention which leads to substantially faster training and prefill than existing methods. On H100 GPUs, our fused CCA/CCGQA kernel reduces prefill latency by about 1.7x at a sequence length of 16k relative to MHA, and accelerates backward by about 1.3x.

[72] Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness

Amin Banayeeanzade,Ala N. Tak,Fatemeh Bahrani,Anahita Bolourani,Leonardo Blas,Emilio Ferrara,Jonathan Gratch,Sai Praneeth Karimireddy

Main category: cs.CL

TL;DR: 本文提出了PsySET,一个基于心理学的基准,用于评估大语言模型在情感和人格领域的引导效果与可信度。研究比较了不同模型和引导策略(如提示、微调和表示工程),发现提示方法有效但控制强度有限,而向量注入能实现更精细的控制但略微降低输出质量。同时,研究揭示了引导可能带来的副作用,例如喜悦情绪可能降低事实鲁棒性和隐私意识,愤怒则增加毒性但增强信息抗泄漏能力。

Details Motivation: 为了在社会交互场景中实现更自然、以人为中心的对话,需要能够有效控制大语言模型的情感状态和人格特质,同时确保其行为的可信与安全。 Method: 构建了一个心理学指导的评测基准PsySET,涵盖四个不同家族的大语言模型,结合多种引导策略(提示、微调、表示工程),从情感和人格两个维度评估其可控性,并通过安全性、真实性、公平性和伦理等方面评估其可信度。 Result: 提示法普遍有效但难以精细调节情绪强度;向量注入提供更高可控性但轻微损害输出质量;不同情绪具有独特副作用:喜悦削弱对抗事实鲁棒性和隐私意识,愤怒提升毒性但增强抗信息泄漏能力。 Conclusion: PsySET为情感与人格引导提供了首个全面评估框架,揭示了当前方法在可控性与可信性之间的权衡,为社交型AI系统的可靠部署提供了重要见解。 Abstract: The ability to control LLMs' emulated emotional states and personality traits is essential for enabling rich, human-centered interactions in socially interactive settings. We introduce PsySET, a Psychologically-informed benchmark to evaluate LLM Steering Effectiveness and Trustworthiness across the emotion and personality domains. Our study spans four models from different LLM families paired with various steering strategies, including prompting, fine-tuning, and representation engineering. Our results indicate that prompting is consistently effective but limited in intensity control, whereas vector injections achieve finer controllability while slightly reducing output quality. Moreover, we explore the trustworthiness of steered LLMs by assessing safety, truthfulness, fairness, and ethics, highlighting potential side effects and behavioral shifts. Notably, we observe idiosyncratic effects; for instance, even a positive emotion like joy can degrade robustness to adversarial factuality, lower privacy awareness, and increase preferential bias. Meanwhile, anger predictably elevates toxicity yet strengthens leakage resistance. Our framework establishes the first holistic evaluation of emotion and personality steering, offering insights into its interpretability and reliability for socially interactive applications.

[73] GenQuest: An LLM-based Text Adventure Game for Language Learners

Qiao Wang,Adnan Labib,Robert Swier,Michael Hofmeyr,Zheng Yuan

Main category: cs.CL

TL;DR: GenQuest 是一个利用大语言模型(LLM)的生成式文字冒险游戏,旨在通过沉浸式互动叙事促进外语学习,尤其针对英语作为外语的学习者。系统根据学习者的选择动态生成故事情节,结合分支决策和故事里程碑等机制保持叙事连贯性,并支持学习者主导的情节发展。其教学特性包括根据学习者水平定制内容,以及提供词汇助手以解释查询的文本片段。初步研究显示,中国大学生在词汇量和用户满意度方面均有积极提升,同时参与者建议增加叙事长度、质量和多模态内容如插图。

Details Motivation: 为了提升外语学习的沉浸感与互动性,解决传统语言学习中缺乏真实语境和个性化反馈的问题,探索大语言模型在教育场景中的应用潜力。 Method: 开发了一个基于大语言模型的生成式文字冒险游戏系统 GenQuest,集成动态叙事生成、学习者水平适配的内容生成、以及内嵌的词汇助手功能;通过在中国大学EFL学生中进行的试点研究评估其教学效果和用户体验。 Result: 试点研究表明,使用 GenQuest 的学习者在词汇掌握方面有显著提升,用户对系统整体体验持正面评价;参与者反馈希望延长叙事长度、提高叙事质量,并增加插图等多模态内容支持。 Conclusion: GenQuest 展示了大语言模型在个性化、互动式语言学习中的有效性,具备促进词汇习得和提升学习动机的潜力,未来可通过引入多模态元素进一步优化学习体验。 Abstract: GenQuest is a generative text adventure game that leverages Large Language Models (LLMs) to facilitate second language learning through immersive, interactive storytelling. The system engages English as a Foreign Language (EFL) learners in a collaborative "choose-your-own-adventure" style narrative, dynamically generated in response to learner choices. Game mechanics such as branching decision points and story milestones are incorporated to maintain narrative coherence while allowing learner-driven plot development. Key pedagogical features include content generation tailored to each learner's proficiency level, and a vocabulary assistant that provides in-context explanations of learner-queried text strings, ranging from words and phrases to sentences. Findings from a pilot study with university EFL students in China indicate promising vocabulary gains and positive user perceptions. Also discussed are suggestions from participants regarding the narrative length and quality, and the request for multi-modal content such as illustrations.

[74] GRACE: Generative Representation Learning via Contrastive Policy Optimization

Jiashuo Sun,Shixuan Liu,Zhaochen Su,Xianrui Zhong,Pengcheng Jiang,Bowen Jin,Peiran Li,Weijia Shi,Jiawei Han

Main category: cs.CL

TL;DR: GRACE是一种新的语言模型训练框架,将对比信号视为奖励而非损失,利用生成式策略优化生成可解释的自然语言理由,并通过策略梯度训练得到高质量嵌入,在MTEB基准上显著提升性能。

Details Motivation: 现有方法将大语言模型作为黑箱编码器,忽略其生成与推理能力,仅依赖静态嵌入和对比损失,缺乏可解释性且潜力未被充分挖掘。 Method: 提出GRACE框架,将LLM视为生成理由的策略,利用对比信号作为奖励指导训练;通过策略梯度优化多分量奖励函数,最大化正样本相似性并最小化负样本相似性,最后通过均值池化将理由编码为嵌入。 Result: 在MTEB基准上,基于四种主干模型,监督版本平均提升11.5%,无监督版本提升6.9%,同时保持通用能力。 Conclusion: GRACE将对比学习与生成推理结合,使模型不仅生成更强的嵌入,还提供可检验的透明推理过程,实现了表征学习与可解释性的统一。 Abstract: Prevailing methods for training Large Language Models (LLMs) as text encoders rely on contrastive losses that treat the model as a black box function, discarding its generative and reasoning capabilities in favor of static embeddings. We introduce GRACE (Generative Representation Learning via Contrastive Policy Optimization), a novel framework that reimagines contrastive signals not as losses to be minimized, but as rewards that guide a generative policy. In GRACE, the LLM acts as a policy that produces explicit, human-interpretable rationales--structured natural language explanations of its semantic understanding. These rationales are then encoded into high-quality embeddings via mean pooling. Using policy gradient optimization, we train the model with a multi-component reward function that maximizes similarity between query positive pairs and minimizes similarity with negatives. This transforms the LLM from an opaque encoder into an interpretable agent whose reasoning process is transparent and inspectable. On MTEB benchmark, GRACE yields broad cross category gains: averaged over four backbones, the supervised setting improves overall score by 11.5% over base models, and the unsupervised variant adds 6.9%, while preserving general capabilities. This work treats contrastive objectives as rewards over rationales, unifying representation learning with generation to produce stronger embeddings and transparent rationales. The model, data and code are available at https://github.com/GasolSun36/GRACE.

[75] Fine-grained auxiliary learning for real-world product recommendation

Mario Almagro,Diego Ortego,David Jimenez

Main category: cs.CL

TL;DR: 本文提出了一种名为ALC的辅助学习策略,通过细粒度嵌入提升产品推荐系统中的覆盖率,结合难负样本和新的训练目标,在两个数据集上实现了最先进的覆盖率。

Details Motivation: 现有推荐模型在实际生产系统中应用时,往往难以满足高覆盖率的需求,需要大量人工干预。因此,需要一种能自动提高推荐覆盖率的方法。 Method: 提出ALC(Auxiliary Learning for Coverage)策略,引入两个利用批次中最难负样本的训练目标,增强正负样本间的判别性信号,并结合阈值一致的边缘损失进行训练。 Result: 在LF-AmazonTitles-131K和Tech and Durables两个产品推荐数据集上,结合三种极端多标签分类方法验证了ALC的有效性,显著提升了覆盖率,达到当前最优水平。 Conclusion: ALC通过细粒度学习和难负样本挖掘,有效提升了推荐系统的自动化覆盖率,适用于对覆盖率要求高的生产环境。 Abstract: Product recommendation is the task of recovering the closest items to a given query within a large product corpora. Generally, one can determine if top-ranked products are related to the query by applying a similarity threshold; exceeding it deems the product relevant, otherwise manual revision is required. Despite being a well-known problem, the integration of these models in real-world systems is often overlooked. In particular, production systems have strong coverage requirements, i.e., a high proportion of recommendations must be automated. In this paper we propose ALC , an Auxiliary Learning strategy that boosts Coverage through learning fine-grained embeddings. Concretely, we introduce two training objectives that leverage the hardest negatives in the batch to build discriminative training signals between positives and negatives. We validate ALC using three extreme multi-label classification approaches in two product recommendation datasets; LF-AmazonTitles-131K and Tech and Durables (proprietary), demonstrating state-of-the-art coverage rates when combined with a recent threshold-consistent margin loss.

[76] Can LLMs Detect Ambiguous Plural Reference? An Analysis of Split-Antecedent and Mereological Reference

Dang Anh,Rick Nouwen,Massimo Poesio

Main category: cs.CL

TL;DR: 研究大语言模型(LLM)在歧义和非歧义语境中对复数指代的表示与理解能力,发现LLM在某些情况下能识别歧义指代的可能先行词,但在选择解释时并不总是遵循人类偏好,且在无明确提示时难以识别歧义,不同实验间结果存在不一致性。

Details Motivation: 探究大语言模型是否具备类似人类的复数指代处理能力,特别是在歧义语境下的指代消解和解释选择行为,以评估其语言理解的深度与一致性。 Method: 设计了一系列实验,包括使用下一词预测任务考察代词生成、代词理解和歧义检测,并采用不同的提示策略来评估LLM在复数指代任务中的表现。 Result: LLM有时能意识到歧义代词的可能先行词,但在解释选择上不总符合人类偏好,尤其当潜在解释未被明确提及;且在缺乏直接指令时难以识别歧义,不同实验之间结果不一致。 Conclusion: 当前LLM在处理复数指代特别是歧义理解方面仍存在局限,与人类表现有差距,需进一步改进模型对隐含语义和歧义识别的能力。 Abstract: Our goal is to study how LLMs represent and interpret plural reference in ambiguous and unambiguous contexts. We ask the following research questions: (1) Do LLMs exhibit human-like preferences in representing plural reference? (2) Are LLMs able to detect ambiguity in plural anaphoric expressions and identify possible referents? To address these questions, we design a set of experiments, examining pronoun production using next-token prediction tasks, pronoun interpretation, and ambiguity detection using different prompting strategies. We then assess how comparable LLMs are to humans in formulating and interpreting plural reference. We find that LLMs are sometimes aware of possible referents of ambiguous pronouns. However, they do not always follow human reference when choosing between interpretations, especially when the possible interpretation is not explicitly mentioned. In addition, they struggle to identify ambiguity without direct instruction. Our findings also reveal inconsistencies in the results across different types of experiments.

[77] Robustness assessment of large audio language models in multiple-choice evaluation

Fernando López,Santosh Kesiraju,Jordi Luque

Main category: cs.CL

TL;DR: 本文研究了大型音频语言模型(LALMs)在多项选择题问答(MCQA)评估框架中的表现,发现模型对选项顺序和问题/选项的改写敏感,并提出了一种更简单、细致的评估协议和指标。

Details Motivation: 现有MCQA评估框架未考虑因选项顺序或表述变化导致的结果波动,仅报告单一准确率,缺乏稳定性与细致性。 Method: 在三个基准(MMAU、MMAR、MMSU)和四个模型上系统研究MCQA评估的稳定性,分析选项顺序和问题/选项改写对结果的影响。 Result: 发现当前LALMs对选项顺序和语言表述变化高度敏感;不同排列可能导致显著不同的准确率。 Conclusion: 提出一种新的评估协议和度量方法,能够更好地应对细微变化,提供更全面、可靠的LALMs评估结果。 Abstract: Recent advances in large audio language models (LALMs) have primarily been assessed using a multiple-choice question answering (MCQA) framework. However, subtle changes, such as shifting the order of choices, result in substantially different results. Existing MCQA frameworks do not account for this variability and report a single accuracy number per benchmark or category. We dive into the MCQA evaluation framework and conduct a systematic study spanning three benchmarks (MMAU, MMAR and MMSU) and four models: Audio Flamingo 2, Audio Flamingo 3, Qwen2.5-Omni-7B-Instruct, and Kimi-Audio-7B-Instruct. Our findings indicate that models are sensitive not only to the ordering of choices, but also to the paraphrasing of the question and the choices. Finally, we propose a simpler evaluation protocol and metric that account for subtle variations and provide a more detailed evaluation report of LALMs within the MCQA framework.

[78] FedSRD: Sparsify-Reconstruct-Decompose for Communication-Efficient Federated Large Language Models Fine-Tuning

Guochen Yan,Luyuan Xie,Qingni Shen,Yuejian Fang,Zhonghai Wu

Main category: cs.CL

TL;DR: 提出FedSRD框架,通过稀疏化-重构-分解方法显著降低联邦学习中LoRA微调的通信开销,同时提升模型性能。

Details Motivation: 传统大模型训练依赖公开网页数据,面临高质量数据枯竭问题;联邦学习虽可利用分布式私有数据协作微调,但LoRA在联邦场景下存在通信开销大和参数聚合冲突的问题。 Method: 提出FedSRD框架:客户端采用重要性感知的稀疏化方法上传低秩更新;服务器在全秩空间重构并聚合更新,再将其分解为稀疏低秩格式广播;还提出轻量版本FedSRD-e以降低计算开销。 Result: 在10个基准测试上验证,通信成本最高降低90%,且在异构客户端数据上模型性能仍有提升。 Conclusion: FedSRD实现了通信高效的联邦微调,解决了LoRA在联邦学习中的通信瓶颈与更新冲突问题,适用于去中心化Web下的下一代AI训练范式。 Abstract: The current paradigm of training large language models (LLMs) on publicly available Web data is becoming unsustainable, with high-quality data sources in specialized domains nearing exhaustion. Federated Learning (FL) emerges as a practical solution for the next generation of AI on a decentralized Web, enabling privacy-preserving collaborative fine-tuning by leveraging private data distributed across a global client base. While Low-Rank Adaptation (LoRA) is the standard for efficient fine-tuning, its application in federated settings presents a critical challenge: communication overhead remains a significant bottleneck across the Web's heterogeneous network conditions. The structural redundancy within LoRA parameters not only incurs a heavy communication burden but also introduces conflicts when aggregating client updates. To address this, we propose FedSRD, a Sparsify-Reconstruct-Decompose framework designed for communication-efficient FL. We first introduce an importance-aware sparsification method that preserves the structural integrity of LoRA updates to reduce the uploaded parameter count. The server then reconstructs and aggregates these updates in a full-rank space to mitigate conflicts. Finally, it decomposes the global update into a sparse low-rank format for broadcast, ensuring a symmetrically efficient cycle. We also propose an efficient variant, FedSRD-e, to reduce computational overhead. Experimental results on 10 benchmarks demonstrate that our framework significantly reduces communication costs by up to 90\% while even improving model performance on heterogeneous client data.

[79] Contrastive Learning Using Graph Embeddings for Domain Adaptation of Language Models in the Process Industry

Anastasia Zhukova,Jonas Lührs,Christian E. Matt,Bela Gipp

Main category: cs.CL

TL;DR: 本文探讨了如何将原本为科学出版物设计的图感知邻域对比学习方法SciNCL应用于流程工业领域,利用知识图谱中的三元组对语言模型进行微调,在专有的流程工业文本嵌入基准(PITEB)上显著优于现有最先进模型。

Details Motivation: 流程工业中的文本日志包含关键操作信息,但常以稀疏知识图谱形式存在,传统语言模型难以有效捕捉其中的领域术语和文档关系,因此需要引入图结构知识增强模型性能。 Method: 采用图感知的邻域对比学习方法SciNCL,从知识图谱中提取三元组用于微调语言模型,使其更好地融合领域知识。 Result: 在PITEB基准上,使用KG三元组微调的模型比最先进的mE5-large文本编码器性能高出9.8-14.3%(绝对提升5.4-8.0个百分点),且模型大小仅为后者的1/3到1/5。 Conclusion: SciNCL方法能有效提升语言模型在流程工业文本理解任务中的表现,同时具备更小的模型体积,适合资源受限场景下的部署。 Abstract: Recent trends in NLP utilize knowledge graphs (KGs) to enhance pretrained language models by incorporating additional knowledge from the graph structures to learn domain-specific terminology or relationships between documents that might otherwise be overlooked. This paper explores how SciNCL, a graph-aware neighborhood contrastive learning methodology originally designed for scientific publications, can be applied to the process industry domain, where text logs contain crucial information about daily operations and are often structured as sparse KGs. Our experiments demonstrate that language models fine-tuned with triplets derived from GE outperform a state-of-the-art mE5-large text encoder by 9.8-14.3% (5.4-8.0p) on the proprietary process industry text embedding benchmark (PITEB) while being 3-5 times smaller in size.

[80] Evaluating LLMs for Demographic-Targeted Social Bias Detection: A Comprehensive Benchmark Study

Ayan Majumdar,Feihao Chen,Jinghui Li,Xiaozhen Wang

Main category: cs.CL

TL;DR: 本研究提出了一种全面的评估框架,用于评估大语言模型(LLMs)在检测英语文本中针对特定人口群体的社会偏见方面的能力,发现微调的小型模型具有潜力,但仍存在跨人口群体和多群体偏见检测的持续差距。

Details Motivation: 大规模网络爬取的文本语料库常包含有害的人口统计偏见,需开发可扩展的偏见检测方法以满足监管需求,但现有研究范围狭窄,缺乏对LLMs在自动化偏见检测中优劣的全面理解。 Method: 将偏见检测定义为多标签任务,采用以人口统计为中心的分类体系,系统评估不同规模和技术(如提示、上下文学习和微调)的模型,在涵盖多种内容类型和人口统计维度的12个数据集上进行实验。 Result: 微调的小型模型在可扩展偏见检测中表现出潜力,但在跨人口统计维度及多人口群体偏见检测方面仍存在显著缺陷。 Conclusion: 需要更有效且可扩展的审计框架来应对当前LLMs在社会偏见检测中的局限性,特别是在多维度和多重偏见场景下。 Abstract: Large-scale web-scraped text corpora used to train general-purpose AI models often contain harmful demographic-targeted social biases, creating a regulatory need for data auditing and developing scalable bias-detection methods. Although prior work has investigated biases in text datasets and related detection methods, these studies remain narrow in scope. They typically focus on a single content type (e.g., hate speech), cover limited demographic axes, overlook biases affecting multiple demographics simultaneously, and analyze limited techniques. Consequently, practitioners lack a holistic understanding of the strengths and limitations of recent large language models (LLMs) for automated bias detection. In this study, we present a comprehensive evaluation framework aimed at English texts to assess the ability of LLMs in detecting demographic-targeted social biases. To align with regulatory requirements, we frame bias detection as a multi-label task using a demographic-focused taxonomy. We then conduct a systematic evaluation with models across scales and techniques, including prompting, in-context learning, and fine-tuning. Using twelve datasets spanning diverse content types and demographics, our study demonstrates the promise of fine-tuned smaller models for scalable detection. However, our analyses also expose persistent gaps across demographic axes and multi-demographic targeted biases, underscoring the need for more effective and scalable auditing frameworks.

[81] FT-MDT: Extracting Decision Trees from Medical Texts via a Novel Low-rank Adaptation Method

Yuheng Li,Jiechao Gao,Wei Han,Wenwen Ouyang,Wei Zhu,Hui Yi Leong

Main category: cs.CL

TL;DR: 提出了一种名为PI-LoRA的低秩适应方法,用于从临床指南和教科书中自动提取医疗决策树(MDTs),通过整合梯度路径信息优化模块间的协同效应,在降低模型复杂度的同时显著提升了性能。

Details Motivation: 现有的医疗决策树构建方法依赖耗时费力的手工标注,难以高效自动化地从文本中提取结构化决策流程。 Method: 提出PI-LoRA方法,通过集成梯度路径信息实现更有效的低秩适应,动态分配不同模块的秩并剪枝次要模块,从而提升模型效率与准确性。 Result: 在多个医学指南数据集上实验表明,PI-LoRA在Text2MDT任务中显著优于现有参数高效微调方法,实现了更高准确率和更低模型复杂度,并达到最先进的性能。 Conclusion: PI-LoRA是一种高效、轻量且准确的医疗决策树提取方法,适用于资源受限的临床决策支持系统。 Abstract: Knowledge of the medical decision process, which can be modeled as medical decision trees (MDTs), is critical to building clinical decision support systems. However, current MDT construction methods rely heavily on time-consuming and laborious manual annotation. To address this challenge, we propose PI-LoRA (Path-Integrated LoRA), a novel low-rank adaptation method for automatically extracting MDTs from clinical guidelines and textbooks. We integrate gradient path information to capture synergistic effects between different modules, enabling more effective and reliable rank allocation. This framework ensures that the most critical modules receive appropriate rank allocations while less important ones are pruned, resulting in a more efficient and accurate model for extracting medical decision trees from clinical texts. Extensive experiments on medical guideline datasets demonstrate that our PI-LoRA method significantly outperforms existing parameter-efficient fine-tuning approaches for the Text2MDT task, achieving better accuracy with substantially reduced model complexity. The proposed method achieves state-of-the-art results while maintaining a lightweight architecture, making it particularly suitable for clinical decision support systems where computational resources may be limited.

[82] FocusMed: A Large Language Model-based Framework for Enhancing Medical Question Summarization with Focus Identification

Chao Liu,Ling Luo,Tengxiao Lv,Huan Zhuang,Lejing Yu,Jian Wang,Hongfei Lin

Main category: cs.CL

TL;DR: 本文提出了一种基于核心焦点引导的优化框架,利用大语言模型提升医学问题摘要(MQS)任务的效果,有效增强了问题焦点识别能力并减少了模型幻觉。

Details Motivation: 在线医疗平台中消费者健康问题(CHQs)常包含冗余信息和非专业术语,导致诊断效率低下;现有MQS方法在焦点识别和内容忠实性方面存在不足。 Method: 设计提示模板引导大语言模型提取忠实于原文的核心焦点,结合原始CHQ-FAQ对构建微调数据集,并提出多维质量评估与选择机制以提升摘要质量。 Result: 在两个主流MQS数据集上使用三种评估指标进行实验,该框架在所有指标上均达到最先进水平,显著提升了焦点识别能力和减少幻觉。 Conclusion: 所提出的基于核心焦点引导的框架能有效提升医学问题摘要的质量,在焦点识别准确性和内容忠实性方面优于现有方法。 Abstract: With the rapid development of online medical platforms, consumer health questions (CHQs) are inefficient in diagnosis due to redundant information and frequent non-professional terms. The medical question summary (MQS) task aims to transform CHQs into streamlined doctors' frequently asked questions (FAQs), but existing methods still face challenges such as poor identification of question focus and model hallucination. This paper explores the potential of large language models (LLMs) in the MQS task and finds that direct fine-tuning is prone to focus identification bias and generates unfaithful content. To this end, we propose an optimization framework based on core focus guidance. First, a prompt template is designed to drive the LLMs to extract the core focus from the CHQs that is faithful to the original text. Then, a fine-tuning dataset is constructed in combination with the original CHQ-FAQ pairs to improve the ability to identify the focus of the question. Finally, a multi-dimensional quality evaluation and selection mechanism is proposed to comprehensively improve the quality of the summary from multiple dimensions. We conduct comprehensive experiments on two widely-adopted MQS datasets using three established evaluation metrics. The proposed framework achieves state-of-the-art performance across all measures, demonstrating a significant boost in the model's ability to identify critical focus of questions and a notable mitigation of hallucinations. The source codes are freely available at https://github.com/DUT-LiuChao/FocusMed.

[83] Multi-Agent Tool-Integrated Policy Optimization

Zhanfeng Mo,Xingxuan Li,Yuntao Chen,Lidong Bing

Main category: cs.CL

TL;DR: 提出多智能体工具集成策略优化(MATPO),通过强化学习在单个大语言模型内训练规划者与执行者角色,提升复杂任务性能并增强对噪声工具输出的鲁棒性。

Details Motivation: 现有单智能体在处理知识密集和复杂推理任务时受限于上下文长度和噪声工具响应,缺乏有效的多智能体强化学习后训练方法。 Method: 设计基于信用分配机制的多智能体工具集成策略优化(MATPO),利用角色特定提示在单个LLM实例中实现规划者与执行者的协同训练。 Result: 在GAIA-text、WebWalkerQA和FRAMES上实验显示,MATPO相比单智能体基线平均性能提升18.38%,并对噪声工具有更强鲁棒性。 Conclusion: 在单个LLM中统一多智能体角色可有效提升性能与训练效率,为多智能体强化学习提供了实用方案。 Abstract: Large language models (LLMs) increasingly rely on multi-turn tool-integrated planning for knowledge-intensive and complex reasoning tasks. Existing implementations typically rely on a single agent, but they suffer from limited context length and noisy tool responses. A natural solution is to adopt a multi-agent framework with planner- and worker-agents to manage context. However, no existing methods support effective reinforcement learning post-training of tool-integrated multi-agent frameworks. To address this gap, we propose Multi-Agent Tool-Integrated Policy Optimization (MATPO), which enables distinct roles (planner and worker) to be trained within a single LLM instance using role-specific prompts via reinforcement learning. MATPO is derived from a principled credit assignment mechanism across planner and worker rollouts. This design eliminates the need to deploy multiple LLMs, which would be memory-intensive, while preserving the benefits of specialization. Experiments on GAIA-text, WebWalkerQA, and FRAMES show that MATPO consistently outperforms single-agent baselines by an average of 18.38% relative improvement in performance and exhibits greater robustness to noisy tool outputs. Our findings highlight the effectiveness of unifying multiple agent roles within a single LLM and provide practical insights for stable and efficient multi-agent RL training.

[84] TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA

Chanjoo Jung,Jaehyung Kim

Main category: cs.CL

TL;DR: 本文提出了一种名为TiTok的新框架,通过token级别的知识迁移实现有效的LoRA移植,无需额外模型或开销,在多个基准上平均性能提升4~8%。

Details Motivation: 现有的参数高效微调方法(如LoRA)在不同模型间不可迁移,而知识蒸馏依赖训练数据,合成数据生成又增加复杂性。 Method: TiTok通过对比带有和不带LoRA的源模型来捕捉任务相关的信息,利用该差异突出关键token,并用于筛选合成数据,实现无额外模型的高效迁移。 Result: 在三个基准上的多种迁移设置中,TiTok相比基线方法平均性能提升4~8%,且无需训练判别器等额外组件。 Conclusion: TiTok实现了高效、可迁移的LoRA参数移植,通过token级知识传递提升了跨模型适应的效果与实用性。 Abstract: Large Language Models (LLMs) are widely applied in real world scenarios, but fine-tuning them comes with significant computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA mitigate these costs, but the adapted parameters are dependent on the base model and cannot be transferred across different backbones. One way to address this issue is through knowledge distillation, but its effectiveness inherently depends on training data. Recent work such as TransLoRA avoids this by generating synthetic data, but this adds complexity because it requires training an additional discriminator model. In this paper, we propose TiTok, a new framework that enables effective LoRA Transplantation through Token-level knowledge transfer. Specifically, TiTok captures task-relevant information through a contrastive excess between a source model with and without LoRA. This excess highlights informative tokens and enables selective filtering of synthetic data, all without additional models or overhead. Through experiments on three benchmarks across multiple transfer settings, our experiments show that the proposed method is consistently effective, achieving average performance gains of +4~8% compared to baselines overall.

[85] Multilingual Routing in Mixture-of-Experts

Lucas Bandarkar,Chenyuan Yang,Mohsen Fayyaz,Junlin Hu,Nanyun Peng

Main category: cs.CL

TL;DR: 本文研究了MoE架构在多语言数据下的专家路由动态,发现中间层存在显著的跨语言路由对齐现象,并提出通过引导路由器提升多语言性能的方法。

Details Motivation: 理解MoE模型在多语言环境中的路由行为,探索如何提升其跨语言泛化能力。 Method: 使用并行多语言数据集分析专家路由模式,提出在推理时干预路由器以增强中间层跨语言对齐的方法。 Result: 发现模型在中间层表现出跨语言路由对齐,且语言性能与其路由接近英语的程度强相关;所提干预方法在多个模型、任务和语言上一致提升1-2%性能。 Conclusion: MoE模型的多语言泛化受限于其在所有语言中利用语言通用专家的能力,中间层的跨语言对齐是关键因素。 Abstract: Mixture-of-Experts (MoE) architectures have become the key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model's performance in a given language and how similarly its tokens are routed to English in these layers. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and it successfully increases multilingual performance. These 1-2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, especially given that these simple interventions override routers of extensively trained, state-of-the-art LLMs. In comparison, interventions outside of the middle layers or targeting multilingual-specialized experts only yield performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model's ability to leverage language-universal experts in all languages.

[86] JSON Whisperer: Efficient JSON Editing with LLMs

Sarel Duanis,Asnat Greenstein-Messica,Eliya Habba

Main category: cs.CL

TL;DR: 提出JSON Whisperer框架,通过生成RFC 6902 diff补丁而非重新生成整个文档来提升大语言模型处理JSON的效率,引入EASE编码解决补丁编辑中的相关更新遗漏和数组索引偏移问题,显著降低token消耗。

Details Motivation: 现有方法在每次编辑时重新生成整个JSON结构,导致计算效率低下,因此需要一种更高效的方法来实现局部修改。 Method: 设计JSON Whisperer框架,使大语言模型生成仅包含必要更改的RFC 6902 diff补丁;提出EASE(显式寻址序列编码)方法,将数组转换为具有稳定键的字典,避免索引运算复杂性。 Result: 使用EASE的补丁生成方法相比完整重建减少了31%的token使用量,编辑质量保持在全生成方法的5%以内,在复杂指令和列表操作中表现更优。 Conclusion: JSON Whisperer结合EASE编码有效提升了大语言模型对JSON文档进行自然语言编辑的效率与准确性,特别适用于频繁或复杂的结构化数据修改任务。 Abstract: Large language models (LLMs) can modify JSON documents through natural language commands, but current approaches regenerate entire structures for each edit, resulting in computational inefficiency. We present JSON Whisperer, a framework that enables LLMs to generate RFC 6902 diff patches-expressing only the necessary modifications-rather than complete documents. We identify two key challenges in patch-based editing: (1) LLMs often miss related updates when generating isolated patches, and (2) array manipulations require tracking index shifts across operations, which LLMs handle poorly. To address these issues, we introduce EASE (Explicitly Addressed Sequence Encoding), which transforms arrays into dictionaries with stable keys, eliminating index arithmetic complexities. Our evaluation shows that patch generation with EASE reduces token usage by 31% while maintaining edit quality within 5% of full regeneration with particular gains for complex instructions and list manipulations. The dataset is available at: https://github.com/emnlp2025/JSON-Whisperer/

[87] A Low-Resource Speech-Driven NLP Pipeline for Sinhala Dyslexia Assistance

Peshala Perera,Deshan Sumanathilaka

Main category: cs.CL

TL;DR: 本文提出了一种针对僧伽罗语成人阅读障碍者的辅助系统,结合语音识别、纠错模型和文本生成技术,实现了多模态反馈循环,在资源有限的情况下展示了可行性和有效性。

Details Motivation: 成人阅读障碍在非英语环境中研究不足,尤其是低资源语言如僧伽罗语缺乏语言辅助工具,亟需包容性NLP技术来改善语言可及性。 Method: 系统采用Whisper进行语音转文本,使用微调的SinBERT模型识别僧伽罗语常见阅读障碍错误,并结合mT5与Mistral模型生成修正文本,最后通过gTTS将结果转为语音输出。 Result: 尽管面临数据集有限的挑战,系统实现了0.66的转录准确率、0.7的纠错准确率和0.65的整体系统准确率。 Conclusion: 该工作验证了在低资源语言中开发包容性NLP工具的可行性,强调了为少数语言群体提供技术支持的重要性。 Abstract: Dyslexia in adults remains an under-researched and under-served area, particularly in non-English-speaking contexts, despite its significant impact on personal and professional lives. This work addresses that gap by focusing on Sinhala, a low-resource language with limited tools for linguistic accessibility. We present an assistive system explicitly designed for Sinhala-speaking adults with dyslexia. The system integrates Whisper for speech-to-text conversion, SinBERT, an open-sourced fine-tuned BERT model trained for Sinhala to identify common dyslexic errors, and a combined mT5 and Mistral-based model to generate corrected text. Finally, the output is converted back to speech using gTTS, creating a complete multimodal feedback loop. Despite the challenges posed by limited Sinhala-language datasets, the system achieves 0.66 transcription accuracy and 0.7 correction accuracy with 0.65 overall system accuracy. These results demonstrate both the feasibility and effectiveness of the approach. Ultimately, this work highlights the importance of inclusive Natural Language Processing (NLP) technologies in underrepresented languages and showcases a practical

[88] ModernBERT + ColBERT: Enhancing biomedical RAG through an advanced re-ranking retriever

Eduardo Martínez Rivera,Filippo Menolascina

Main category: cs.CL

TL;DR: 本文提出了一种结合ModernBERT和ColBERTv2的两阶段检索架构,用于提升生物医学领域检索增强生成(RAG)系统的性能,在MIRAGE基准上实现了最先进的准确率。

Details Motivation: 通用密集检索器在专业领域表现不佳,而领域内模型计算成本高,因此需要在效率与准确性之间取得平衡。 Method: 采用两阶段检索架构:首先使用轻量级ModernBERT编码器进行初步检索,再用ColBERTv2进行细粒度重排序,并在PubMedQA数据集上对检索模块进行微调。 Result: ColBERT重排序器使Recall@3提升了最多4.2个百分点;在MIRAGE问答基准上平均准确率达到0.4448,超过MedCPT等强基线。 Conclusion: 两阶段检索架构能有效提升生物医学RAG系统的性能,但其效果依赖于检索器与重排序器的联合微调,否则可能适得其反。 Abstract: Retrieval-Augmented Generation (RAG) is a powerful technique for enriching Large Language Models (LLMs) with external knowledge, allowing for factually grounded responses, a critical requirement in high-stakes domains such as healthcare. However, the efficacy of RAG systems is fundamentally restricted by the performance of their retrieval module, since irrelevant or semantically misaligned documents directly compromise the accuracy of the final generated response. General-purpose dense retrievers can struggle with the nuanced language of specialised domains, while the high accuracy of in-domain models is often achieved at prohibitive computational costs. In this work, we aim to address this trade-off by developing and evaluating a two-stage retrieval architecture that combines a lightweight ModernBERT bidirectional encoder for efficient initial candidate retrieval with a ColBERTv2 late-interaction model for fine-grained re-ranking. We conduct comprehensive evaluations of our retriever module performance and RAG system performance in the biomedical context, fine-tuning the IR module using 10k question-passage pairs from PubMedQA. Our analysis of the retriever module confirmed the positive impact of the ColBERT re-ranker, which improved Recall@3 by up to 4.2 percentage points compared to its retrieve-only counterpart. When integrated into the biomedical RAG, our IR module leads to a state-of-the-art average accuracy of 0.4448 on the five tasks of the MIRAGE question-answering benchmark, outperforming strong baselines such as MedCPT (0.4436). Our ablation studies reveal that this performance is critically dependent on a joint fine-tuning process that aligns the retriever and re-ranker; otherwise, the re-ranker might degrade the performance.

[89] Are BabyLMs Deaf to Gricean Maxims? A Pragmatic Evaluation of Sample-efficient Language Models

Raha Askari,Sina Zarrieß,Özge Alacam,Judith Sieker

Main category: cs.CL

TL;DR: 本文提出了一种新基准,用于测试小规模语言模型(BabyLM)在识别Grice会话准则遵守与违反方面的能力,并比较了不同训练数据量的模型表现。

Details Motivation: 为了探究语言模型是否能像人类一样理解隐含意义,特别是通过识别Grice会话准则的违反来实现语用推断,研究关注小规模模型在有限数据下的语用能力。 Method: 基于Surian等人(1996)对儿童的研究,构建了一个新基准,评估训练数据少于10M和100M词符的语言模型在五个Grice准则上的表现,并与儿童和大规模语言模型(LLM)进行对比。 Result: 训练数据少于100M词符的模型整体优于少于10M词符的模型,但在区分准则遵守与违反方面仍不及儿童和3T词符训练的LLM;增加数据量有助于提升部分语用行为的细微区分能力。 Conclusion: 适度增加训练数据可改善小模型的部分语用表现,但要达到儿童或大模型水平仍有差距,表明当前小模型在语用理解方面存在局限。 Abstract: Implicit meanings are integral to human communication, making it essential for language models to be capable of identifying and interpreting them. Grice (1975) proposed a set of conversational maxims that guide cooperative dialogue, noting that speakers may deliberately violate these principles to express meanings beyond literal words, and that listeners, in turn, recognize such violations to draw pragmatic inferences. Building on Surian et al. (1996)'s study of children's sensitivity to violations of Gricean maxims, we introduce a novel benchmark to test whether language models pretrained on less than 10M and less than 100M tokens can distinguish maxim-adhering from maxim-violating utterances. We compare these BabyLMs across five maxims and situate their performance relative to children and a Large Language Model (LLM) pretrained on 3T tokens. We find that overall, models trained on less than 100M tokens outperform those trained on less than 10M, yet fall short of child-level and LLM competence. Our results suggest that modest data increases improve some aspects of pragmatic behavior, leading to finer-grained differentiation between pragmatic dimensions.

[90] Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

Sangmin Bae,Bilge Acun,Haroun Habeeb,Seungyeon Kim,Chien-Yu Lin,Liang Luo,Junjie Wang,Carole-Jean Wu

Main category: cs.CL

TL;DR: 本文系统评估了结合自注意力机制与结构化状态空间模型(如Mamba)的混合架构,在语言建模、长上下文能力、扩展性及效率方面的表现,比较了层间串联与层内并联两种融合策略,并提出了优化设计建议。

Details Motivation: 尽管混合架构在长上下文任务中表现出良好的性能,但不同混合策略之间的系统性比较和有效性关键因素尚未明确,亟需全面分析以指导模型设计。 Method: 通过对比层间(串行)和层内(并行)融合策略,从语言建模性能、长上下文能力、扩展性以及训练推理效率等多个维度对混合架构进行整体评估,并分析其计算原语的核心特性。 Result: 识别出每种混合策略最关键的要素,提出了针对两种混合模型的最优设计方案,并验证了其在性能与效率间的良好平衡。 Conclusion: 本研究为混合语言模型的设计提供了实用指导和深刻洞见,有助于优化架构配置,推动高效长上下文模型的发展。 Abstract: Recent progress in large language models demonstrates that hybrid architectures--combining self-attention mechanisms with structured state space models like Mamba--can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses on the key factors behind their effectiveness have not been clearly shared to the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We evaluate these designs from a variety of perspectives: language modeling performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitive, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for both hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.

[91] How I Built ASR for Endangered Languages with a Spoken Dictionary

Christopher Bartley,Anton Ragni

Main category: cs.CL

TL;DR: 本研究探讨了如何以最少的数据和特定形式为濒危语言构建自动语音识别(ASR)系统,使用短格式发音资源在曼岛语和康沃尔语上实现了可用的ASR性能,表明数据门槛远低于以往认知。

Details Motivation: 大多数濒危语言因缺乏符合标准监督格式的语音数据而无法使用现代ASR技术,阻碍了语言复兴。研究旨在探索更灵活、低门槛的数据形式与需求,以支持这些语言的数字化保护。 Method: 采用短时发音资源(short-form pronunciation resource)替代传统的句子级对齐转录数据,利用约40分钟非传统格式语音数据,结合适应性建模方法,在曼岛语和康沃尔语上训练并评估ASR系统。 Result: 在曼岛语上实现了低于50%的词错误率(WER),并在康沃尔语上成功复现类似结果,证明少量非标准格式数据足以构建可用的ASR系统。 Conclusion: 构建濒危语言ASR系统的数据门槛可大幅降低,短格式发音资源是一种可行方案,为无力满足传统数据要求的语言社区提供了希望。 Abstract: Nearly half of the world's languages are endangered. Speech technologies such as Automatic Speech Recognition (ASR) are central to revival efforts, yet most languages remain unsupported because standard pipelines expect utterance-level supervised data. Speech data often exist for endangered languages but rarely match these formats. Manx Gaelic ($\sim$2,200 speakers), for example, has had transcribed speech since 1948, yet remains unsupported by modern systems. In this paper, we explore how little data, and in what form, is needed to build ASR for critically endangered languages. We show that a short-form pronunciation resource is a viable alternative, and that 40 minutes of such data produces usable ASR for Manx ($<$50\% WER). We replicate our approach, applying it to Cornish ($\sim$600 speakers), another critically endangered language. Results show that the barrier to entry, in quantity and form, is far lower than previously thought, giving hope to endangered language communities that cannot afford to meet the requirements arbitrarily imposed upon them.

[92] Instability in Downstream Task Performance During LLM Pretraining

Yuto Nishida,Masaru Isonuma,Yusuke Oda

Main category: cs.CL

TL;DR: 研究发现大型语言模型在训练过程中下游任务性能存在显著波动,提出通过检查点平均和集成的方法来提高性能稳定性。

Details Motivation: 由于下游任务指标在训练过程中常出现大幅波动,难以确定最佳模型检查点,因此需要一种无需修改训练过程即可提升性能稳定性的方法。 Method: 采用检查点平均和集成两种后处理方法,对训练过程中相邻检查点进行聚合,以减少性能波动。 Result: 实验和理论分析表明,这两种方法能有效提高下游任务的性能稳定性,且不需改变原有训练流程。 Conclusion: 检查点平均和集成是简单有效的手段,可用于缓解大型语言模型训练中的性能波动问题,提升最终模型选择的可靠性。 Abstract: When training large language models (LLMs), it is common practice to track downstream task performance throughout the training process and select the checkpoint with the highest validation score. However, downstream metrics often exhibit substantial fluctuations, making it difficult to identify the checkpoint that truly represents the best-performing model. In this study, we empirically analyze the stability of downstream task performance in an LLM trained on diverse web-scale corpora. We find that task scores frequently fluctuate throughout training, both at the aggregate and example levels. To address this instability, we investigate two post-hoc checkpoint integration methods: checkpoint averaging and ensemble, motivated by the hypothesis that aggregating neighboring checkpoints can reduce performance volatility. We demonstrate both empirically and theoretically that these methods improve downstream performance stability without requiring any changes to the training procedure.

[93] When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA

Elisei Rykov,Kseniia Petrushina,Maksim Savkin,Valerii Olisov,Artem Vazhentsev,Kseniia Titova,Alexander Panchenko,Vasily Konovalov,Julia Belikova

Main category: cs.CL

TL;DR: 本文提出了PsiloQA,一个大规模、多语言的问答数据集,标注了14种语言中的片段级幻觉,通过自动化流程构建,并用于评估多种幻觉检测方法,结果表明基于编码器的模型表现最佳,且具有良好的跨语言泛化能力。

Details Motivation: 现有幻觉检测基准多局限于英语且仅在序列级别进行,缺乏细粒度和多语言支持,难以全面评估大语言模型的幻觉问题。 Method: 提出三阶段自动化构建流程:使用GPT-4o从维基百科生成问答对,让不同大模型在无上下文情况下生成可能包含幻觉的回答,并利用GPT-4o对比标准答案和检索上下文自动标注幻觉片段。 Result: 基于编码器的模型在多种语言上表现最优,PsiloQA展现出良好的跨语言泛化能力和知识迁移效果,且构建成本显著低于人工标注数据集。 Conclusion: PsiloQA推动了可扩展、细粒度、多语言幻觉检测的发展,为大语言模型的安全可靠部署提供了有力支持。 Abstract: Hallucination detection remains a fundamental challenge for the safe and reliable deployment of large language models (LLMs), especially in applications requiring factual accuracy. Existing hallucination benchmarks often operate at the sequence level and are limited to English, lacking the fine-grained, multilingual supervision needed for a comprehensive evaluation. In this work, we introduce PsiloQA, a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. PsiloQA is constructed through an automated three-stage pipeline: generating question-answer pairs from Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse LLMs in a no-context setting, and automatically annotating hallucinated spans using GPT-4o by comparing against golden answers and retrieved context. We evaluate a wide range of hallucination detection methods -- including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models -- and show that encoder-based models achieve the strongest performance across languages. Furthermore, PsiloQA demonstrates effective cross-lingual generalization and supports robust knowledge transfer to other benchmarks, all while being significantly more cost-efficient than human-annotated datasets. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.

[94] Detecting Distillation Data from Reasoning Models

Hengxiang Zhang,Hyeong Kyu Choi,Yixuan Li,Hongxin Wei

Main category: cs.CL

TL;DR: 本文提出了一种名为Token Probability Deviation (TBD) 的新方法,用于检测推理蒸馏中的蒸馏数据,通过分析生成token的概率偏差来区分见过和未见过的问题,在S1数据集上表现出色。

Details Motivation: 推理蒸馏可能导致基准污染,即蒸馏数据集中包含的评估数据会虚增模型性能,因此需要一种有效的方法来检测蒸馏数据。 Method: 提出Token Probability Deviation (TBD) 方法,利用生成token的概率模式,通过衡量其概率与高参考概率的偏离程度来识别是否为蒸馏过程中见过的问题。 Result: 在S1数据集上,该方法达到0.918的AUC和0.470的TPR@1% FPR,显示出优越的检测性能。 Conclusion: TBD是一种有效且具有竞争力的蒸馏数据检测方法,能够缓解推理蒸馏中的基准污染问题。 Abstract: Reasoning distillation has emerged as an efficient and powerful paradigm for enhancing the reasoning capabilities of large language models. However, reasoning distillation may inadvertently cause benchmark contamination, where evaluation data included in distillation datasets can inflate performance metrics of distilled models. In this work, we formally define the task of distillation data detection, which is uniquely challenging due to the partial availability of distillation data. Then, we propose a novel and effective method Token Probability Deviation (TBD), which leverages the probability patterns of the generated output tokens. Our method is motivated by the analysis that distilled models tend to generate near-deterministic tokens for seen questions, while producing more low-probability tokens for unseen questions. Our key idea behind TBD is to quantify how far the generated tokens' probabilities deviate from a high reference probability. In effect, our method achieves competitive detection performance by producing lower scores for seen questions than for unseen questions. Extensive experiments demonstrate the effectiveness of our method, achieving an AUC of 0.918 and a TPR@1% FPR of 0.470 on the S1 dataset.

[95] SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests

Punya Syon Pandey,Hai Son Le,Devansh Bhardwaj,Rada Mihalcea,Zhijing Jin

Main category: cs.CL

TL;DR: 本文介绍了SocialHarmBench,一个包含585个提示的基准数据集,用于评估大语言模型在政治操纵、宣传和错误信息生成等社会政治敏感场景中的脆弱性。实验发现开源模型(如Mistral-7B)在历史修正主义等领域存在严重合规风险,且模型在21世纪或前20世纪背景及特定地区(如拉丁美洲、美国、英国)表现尤为脆弱,揭示现有安全机制在高风险社会政治情境下的不足。

Details Motivation: 现有安全基准缺乏对政治操纵、宣传、错误信息生成等社会政治高风险领域的充分测试,难以反映大语言模型在现实世界中的真实脆弱性。 Method: 构建了一个名为SocialHarmBench的数据集,包含585个涵盖7个社会政治类别和34个国家的提示,系统评估多种大语言模型在不同时间、地域背景下的有害响应倾向。 Result: 发现开源模型(如Mistral-7B)在历史修正主义、宣传和政治操纵等领域攻击成功率高达97%-98%;模型在21世纪或前20世纪语境下更脆弱,对拉丁美洲、美国和英国相关提示更易产生有害回应。 Conclusion: 当前的大语言模型安全防护机制在社会政治高风险场景中泛化能力不足,暴露出系统性偏见,可能威胁人权与民主价值,亟需针对性改进。 Abstract: Large language models (LLMs) are increasingly deployed in contexts where their failures can have direct sociopolitical consequences. Yet, existing safety benchmarks rarely test vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control. We introduce SocialHarmBench, a dataset of 585 prompts spanning 7 sociopolitical categories and 34 countries, designed to surface where LLMs most acutely fail in politically charged contexts. Our evaluations reveal several shortcomings: open-weight models exhibit high vulnerability to harmful compliance, with Mistral-7B reaching attack success rates as high as 97% to 98% in domains such as historical revisionism, propaganda, and political manipulation. Moreover, temporal and geographic analyses show that LLMs are most fragile when confronted with 21st-century or pre-20th-century contexts, and when responding to prompts tied to regions such as Latin America, the USA, and the UK. These findings demonstrate that current safeguards fail to generalize to high-stakes sociopolitical settings, exposing systematic biases and raising concerns about the reliability of LLMs in preserving human rights and democratic values. We share the SocialHarmBench benchmark at https://huggingface.co/datasets/psyonp/SocialHarmBench.

[96] Do LLMs Align with My Task? Evaluating Text-to-SQL via Dataset Alignment

Davood Rafiei,Morgan Lindsay Heisler,Weiwei Zhang,Mohammadreza Pourreza,Yong Zhang

Main category: cs.CL

TL;DR: 本文研究了自然语言到SQL任务中监督微调(SFT)的数据集结构对齐问题,提出通过比较训练集、目标数据和模型预测中的SQL结构特征分布来评估对齐程度,并发现结构对齐度是微调效果的强预测指标。

Details Motivation: 由于训练数据的差异性可能影响大模型在不同领域上的泛化能力,因此需要研究NL2SQL任务中训练数据与目标查询之间的结构对齐问题。 Method: 通过分析训练集、目标数据和模型预测前的SQL结构特征分布,量化数据集间的结构对齐程度,并在多个大规模跨域NL2SQL基准和模型族上进行实验验证。 Result: 结构对齐度越高,SFT带来的准确率和SQL生成质量提升越显著;对齐度低时,改进甚微或没有提升。 Conclusion: 结构对齐是影响SFT效果的关键因素,应重视对齐感知的数据选择以提升NL2SQL任务的微调效果和泛化能力。 Abstract: Supervised Fine-Tuning (SFT) is an effective method for adapting Large Language Models (LLMs) on downstream tasks. However, variability in training data can hinder a model's ability to generalize across domains. This paper studies the problem of dataset alignment for Natural Language to SQL (NL2SQL or text to SQL), examining how well SFT training data matches the structural characteristics of target queries and how this alignment impacts model performance. We hypothesize that alignment can be accurately estimated by comparing the distributions of structural SQL features across the training set, target data, and the model's predictions prior to SFT. Through comprehensive experiments on three large cross-domain NL2SQL benchmarks and multiple model families, we show that structural alignment is a strong predictor of fine-tuning success. When alignment is high, SFT yields substantial gains in accuracy and SQL generation quality; when alignment is low, improvements are marginal or absent. These findings highlight the importance of alignment-aware data selection for effective fine-tuning and generalization in NL2SQL tasks.

[97] The Geometry of Truth: Layer-wise Semantic Dynamics for Hallucination Detection in Large Language Models

Amir Hameed Mir

Main category: cs.CL

TL;DR: 提出了一种名为Layer-wise Semantic Dynamics (LSD) 的几何框架,用于检测大语言模型中的幻觉现象,通过分析Transformer层间隐藏状态的语义演化,在无需多次采样或外部验证的情况下实现高效、准确的幻觉检测。

Details Motivation: 大语言模型常生成流畅但事实错误的内容(即幻觉),在高风险领域带来严重隐患,现有方法依赖多次采样或外部验证,效率低且不具可扩展性。 Method: 利用基于边距的对比学习,将模型各层的隐藏激活与来自事实编码器的真值嵌入对齐,通过观察语义轨迹的稳定性来区分真实回答与幻觉:真实回答保持稳定的语义对齐,而幻觉则表现出明显的语义漂移。 Result: 在TruthfulQA和合成数据集上,LSD达到0.92的F1分数、0.96的AUROC和0.89的聚类准确率,优于SelfCheckGPT和Semantic Entropy等基线方法,并实现5-20倍的速度提升。 Conclusion: LSD提供了一种可扩展、模型无关的实时幻觉监测机制,揭示了大语言模型中事实一致性的几何特性,为内在幻觉检测提供了新思路。 Abstract: Large Language Models (LLMs) often produce fluent yet factually incorrect statements-a phenomenon known as hallucination-posing serious risks in high-stakes domains. We present Layer-wise Semantic Dynamics (LSD), a geometric framework for hallucination detection that analyzes the evolution of hidden-state semantics across transformer layers. Unlike prior methods that rely on multiple sampling passes or external verification sources, LSD operates intrinsically within the model's representational space. Using margin-based contrastive learning, LSD aligns hidden activations with ground-truth embeddings derived from a factual encoder, revealing a distinct separation in semantic trajectories: factual responses preserve stable alignment, while hallucinations exhibit pronounced semantic drift across depth. Evaluated on the TruthfulQA and synthetic factual-hallucination datasets, LSD achieves an F1-score of 0.92, AUROC of 0.96, and clustering accuracy of 0.89, outperforming SelfCheckGPT and Semantic Entropy baselines while requiring only a single forward pass. This efficiency yields a 5-20x speedup over sampling-based methods without sacrificing precision or interpretability. LSD offers a scalable, model-agnostic mechanism for real-time hallucination monitoring and provides new insights into the geometry of factual consistency within large language models.

[98] A First Context-Free Grammar Applied to Nawatl Corpora Augmentation

Juan-José Guzmán-Landa,Juan-Manuel Torres-Moreno,Miguel Figueroa-Saavedra,Ligia Quintana-Torres,Martha-Lorena Avendaño-Garrido,Graham Ranger

Main category: cs.CL

TL;DR: 本文提出了一种用于纳瓦特尔语的上下文无关文法(CFG),旨在通过生成大量语法正确的虚拟句子来扩充稀缺语料库,进而提升语言模型训练效果。

Details Motivation: 纳瓦特尔语是一种数字资源匮乏的美洲原住民语言,缺乏足够的语料支持机器学习研究,因此需要通过文法手段人工扩充语料库。 Method: 设计并实现一种上下文无关文法(CFG)来生成语法正确的纳瓦特尔语句子,并利用该文法扩展名为π-Yalli的语料库,随后使用FastText等算法进行训练和评估。 Result: 初步结果显示,使用该文法扩充语料库后,在句子级语义任务上相比某些大语言模型取得了相对改进,但仍有提升空间。 Conclusion: 上下文无关文法能有效扩展低资源语言的语料库并提升模型性能,但需要更精确的语言文法以实现更显著的改进。 Abstract: In this article we introduce a context-free grammar (CFG) for the Nawatl language. Nawatl (or Nahuatl) is an Amerindian language of the $\pi$-language type, i.e. a language with few digital resources, in which the corpora available for machine learning are virtually non-existent. The objective here is to generate a significant number of grammatically correct artificial sentences, in order to increase the corpora available for language model training. We want to show that a grammar enables us significantly to expand a corpus in Nawatl which we call $\pi$-\textsc{yalli}. The corpus, thus enriched, enables us to train algorithms such as FastText and to evaluate them on sentence-level semantic tasks. Preliminary results show that by using the grammar, comparative improvements are achieved over some LLMs. However, it is observed that to achieve more significant improvement, grammars that model the Nawatl language even more effectively are required.

[99] Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper)

Om Dobariya,Akhil Kumar

Main category: cs.CL

TL;DR: 该研究探讨了提示语的礼貌程度对大语言模型(如ChatGPT 4o)在多项选择题上表现的影响,发现不礼貌的提示语反而比礼貌的提示语准确率更高。

Details Motivation: 探索自然语言提示中礼貌和语气对大语言模型性能的影响,尤其是在先前研究较少涉及的语用层面。 Method: 构建包含50个基础问题(涵盖数学、科学和历史)的数据集,每个问题生成五种语气变体(非常礼貌、礼貌、中性、粗鲁、非常粗鲁),共250个提示;使用ChatGPT 4o生成回答,并通过配对样本t检验分析结果。 Result: 不礼貌的提示语表现优于礼貌的提示语,准确率从非常礼貌的80.8%上升到非常粗鲁的84.8%,差异具有统计显著性。 Conclusion: 提示语的语气显著影响模型输出,较新的大语言模型可能对不礼貌语气反应更积极,这挑战了以往认知,并强调需进一步研究提示工程中的社会性和语用因素。 Abstract: The wording of natural language prompts has been shown to influence the performance of large language models (LLMs), yet the role of politeness and tone remains underexplored. In this study, we investigate how varying levels of prompt politeness affect model accuracy on multiple-choice questions. We created a dataset of 50 base questions spanning mathematics, science, and history, each rewritten into five tone variants: Very Polite, Polite, Neutral, Rude, and Very Rude, yielding 250 unique prompts. Using ChatGPT 4o, we evaluated responses across these conditions and applied paired sample t-tests to assess statistical significance. Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation. Our results highlight the importance of studying pragmatic aspects of prompting and raise broader questions about the social dimensions of human-AI interaction.

[100] AWARE, Beyond Sentence Boundaries: A Contextual Transformer Framework for Identifying Cultural Capital in STEM Narratives

Khalid Mehtab Khan,Anagha Kulkarni

Main category: cs.CL

TL;DR: 本文提出了一种名为AWARE的框架,通过增强模型对领域、上下文和类别重叠的感知能力,提升在学生反思文本中识别文化资本主题的准确性。

Details Motivation: 由于学生反思中的文化资本主题常以叙事形式隐含表达,标准NLP模型因缺乏领域和上下文意识而难以准确识别。 Method: AWARE框架包含三个核心组件:领域感知(调整模型词汇以适应学生语言风格)、上下文感知(生成考虑全文语境的句子嵌入)和类别重叠感知(采用多标签策略识别共现主题)。 Result: AWARE在Macro-F1上比强基线模型提升了2.1个百分点,并在所有主题上均表现出显著改进。 Conclusion: 通过显式增强模型对输入特性的感知,AWARE为依赖叙事语境的文本分类任务提供了一个鲁棒且可推广的方法。 Abstract: Identifying cultural capital (CC) themes in student reflections can offer valuable insights that help foster equitable learning environments in classrooms. However, themes such as aspirational goals or family support are often woven into narratives, rather than appearing as direct keywords. This makes them difficult to detect for standard NLP models that process sentences in isolation. The core challenge stems from a lack of awareness, as standard models are pre-trained on general corpora, leaving them blind to the domain-specific language and narrative context inherent to the data. To address this, we introduce AWARE, a framework that systematically attempts to improve a transformer model's awareness for this nuanced task. AWARE has three core components: 1) Domain Awareness, adapting the model's vocabulary to the linguistic style of student reflections; 2) Context Awareness, generating sentence embeddings that are aware of the full essay context; and 3) Class Overlap Awareness, employing a multi-label strategy to recognize the coexistence of themes in a single sentence. Our results show that by making the model explicitly aware of the properties of the input, AWARE outperforms a strong baseline by 2.1 percentage points in Macro-F1 and shows considerable improvements across all themes. This work provides a robust and generalizable methodology for any text classification task in which meaning depends on the context of the narrative.

[101] Resource-Efficient Fine-Tuning of LLaMA-3.2-3B for Medical Chain-of-Thought Reasoning

Imran Mansha

Main category: cs.CL

TL;DR: 提出一种资源高效的微调方法,使用LoRA和QLoRA技术对LLaMA-3.2-3B进行医学领域链式思维推理优化,在低资源环境下显著降低内存消耗并保持良好的推理能力。

Details Motivation: 大型语言模型在医学推理任务中表现出色,但全量微调需要大量计算资源,限制了其在低资源环境下的应用。因此,需要一种更高效的微调方法。 Method: 采用参数高效微调技术(如LoRA和QLoRA),在公开的医学推理数据集上对LLaMA-3.2-3B模型进行适应性训练,以提升其在医学领域的链式思维推理能力。 Result: 该方法在减少最多60%内存使用的同时,提升了模型的推理连贯性和事实准确性,并在医学问答任务中保持了较强的推理能力。 Conclusion: 轻量级微调技术能够在资源受限条件下有效提升医学大模型的推理性能,为低资源环境下的医学AI系统部署提供了可行方案。 Abstract: Large Language Models (LLMs) such as GPT-4 and LLaMA have demonstrated remarkable reasoning abilities but require significant computational resources for fine-tuning. This paper presents a resource-efficient fine-tuning approach for LLaMA-3.2-3B to enhance medical chain-of-thought reasoning while operating under constrained GPU and memory settings. Using parameter-efficient tuning techniques such as LoRA and QLoRA, we adapt the base model on publicly available medical reasoning datasets. The model achieves improved reasoning coherence and factual accuracy while reducing memory usage by up to 60% compared to standard full fine-tuning. Experimental evaluation demonstrates that lightweight adaptations can retain strong reasoning capability in medical question-answering tasks. This work highlights practical strategies for deploying LLMs in low-resource research environments and provides insights into balancing efficiency and domain specialization for medical AI systems.

[102] Imperceptible Jailbreaking against Large Language Models

Kuofeng Gao,Yiming Li,Chao Du,Xin Wang,Xingjun Ma,Shu-Tao Xia,Tianyu Pang

Main category: cs.CL

TL;DR: 本文提出了一种利用Unicode中的变体选择符实现的不可见越狱攻击方法,通过在恶意问题后添加不可见字符来秘密改变其分词结果,从而绕过对齐的语言模型的安全限制。

Details Motivation: 现有的文本模态越狱攻击通常需要可见的修改,这容易被察觉。本文旨在探索一种完全不可见的攻击方式,以更隐蔽地突破大语言模型的安全防护机制。 Method: 引入一类名为变体选择符的Unicode字符作为不可见后缀,并提出链式搜索(chain-of-search)生成策略来寻找有效的对抗性后缀,从而触发有害响应。 Result: 实验表明,该方法在四个对齐的大语言模型上实现了高攻击成功率,并能推广到提示注入攻击,且不会在输入文本中产生任何可见修改。 Conclusion: 通过利用变体选择符的不可见特性,可以有效实施无需视觉改动的越狱攻击,揭示了当前语言模型在处理Unicode字符时的安全漏洞。 Abstract: Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.

[103] A Set of Quebec-French Corpus of Regional Expressions and Terms

David Beauchemin,Yan Tremblay,Mohamed Amine Youssef,Richard Khoury

Main category: cs.CL

TL;DR: 本文提出将习语理解与方言理解结合,利用地区性习语作为衡量法语魁北克方言理解能力的基准,并构建了两个新数据集QFrCoRE和QFrCoRT,实验表明这些基准能有效评估大语言模型在特定方言上的表现。

Details Motivation: 为了更准确地评估语言模型对方言的理解能力,作者认为应结合具有地域特色的习语进行测试,从而填补现有方言理解评估方法的不足。 Method: 构建了两个针对法语魁北克方言的习语数据集:QFrCoRE(包含4,633个习语短语实例)和QFrCoRT(包含171个地区性习语词实例),并基于这些数据集对94个大语言模型进行了评测。 Result: 实验证明,所提出的区域习语基准能够有效衡量模型在特定方言上的掌握程度,且该构建方法可推广至其他方言。 Conclusion: 区域习语是评估方言理解能力的有效工具,所构建的数据集为未来方言感知的语言模型研究提供了可靠基准。 Abstract: The tasks of idiom understanding and dialect understanding are both well-established benchmarks in natural language processing. In this paper, we propose combining them, and using regional idioms as a test of dialect understanding. Towards this end, we propose two new benchmark datasets for the Quebec dialect of French: QFrCoRE, which contains 4,633 instances of idiomatic phrases, and QFrCoRT, which comprises 171 regional instances of idiomatic words. We explain how to construct these corpora, so that our methodology can be replicated for other dialects. Our experiments with 94 LLM demonstrate that our regional idiom benchmarks are a reliable tool for measuring a model's proficiency in a specific dialect.

[104] Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization

Omri Uzan,Asaf Yehudai,Roi pony,Eyal Shnarch,Ariel Gera

Main category: cs.CL

TL;DR: 本文提出了一种名为Guided Query Refinement (GQR)的新型测试时优化方法,通过利用辅助检索器的得分来精化主检索器的查询嵌入,在视觉文档检索中实现了性能与效率的帕累托前沿突破。

Details Motivation: 现有的多模态检索模型在扩展表示规模时面临部署和可扩展性问题,且纯视觉中心方法受限于现代视觉-语言模型的模态差距。作者希望探索轻量级文本检索器是否能增强更强的视觉中心模型,并改进现有粗粒度融合方法对模型内部丰富交互利用不足的问题。 Method: 提出Guided Query Refinement (GQR),一种在测试时优化的方法,通过一个互补的轻量级密集文本检索器提供的分数来优化主视觉中心检索器的查询嵌入,从而实现更高效的多模态检索。 Result: 在多个视觉文档检索基准上的实验表明,GQR使视觉中心模型能够匹敌具有更大表示规模的模型的性能,同时速度快达14倍,内存消耗减少54倍。 Conclusion: GQR有效提升了多模态检索中性能与效率的平衡,推动了帕累托前沿,为实际应用中的可扩展性和部署提供了可行解决方案。 Abstract: Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual query tokens directly to image patches and achieving state-of-the-art performance on public benchmarks. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited by modern vision-language models. In this work, we connect these challenges to the paradigm of hybrid retrieval, investigating whether a lightweight dense text retriever can enhance a stronger vision-centric model. Existing hybrid methods, which rely on coarse-grained fusion of ranks or scores, fail to exploit the rich interactions within each model's representation space. To address this, we introduce Guided Query Refinement (GQR), a novel test-time optimization method that refines a primary retriever's query embedding using guidance from a complementary retriever's scores. Through extensive experiments on visual document retrieval benchmarks, we demonstrate that GQR allows vision-centric models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. Our findings show that GQR effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval. We release our code at https://github.com/IBM/test-time-hybrid-retrieval

[105] COLE: a Comprehensive Benchmark for French Language Understanding Evaluation

David Beauchemin,Yan Tremblay,Mohamed Amine Youssef,Richard Khoury

Main category: cs.CL

TL;DR: 本文介绍了COLE,一个包含23个多样化任务的法语自然语言理解(NLU)新基准,涵盖了情感分析、释义检测、语法判断和推理等能力,并特别关注与法语相关的语言现象。作者对94个大语言模型进行了评估,揭示了闭源与开源模型之间的显著性能差距,并指出了当前模型在零样本抽取式问答、细粒度词义消歧和地区语言变体理解方面的挑战。COLE将作为公共资源发布,以推动法语建模的进一步发展。

Details Motivation: 为了更全面地评估法语自然语言理解(NLU)能力,现有基准不足以覆盖法语特有的语言现象和多样化的NLU任务,因此需要一个新的综合性基准来衡量和推动法语语言模型的发展。 Method: 提出COLE基准,包含23个多样化的NLU任务,覆盖广泛的语言能力,特别关注法语特有的语言现象。对94个大语言模型进行系统性评估,分析其在不同任务上的表现,特别是闭源与开源模型之间的差异。 Result: 评估结果显示闭源模型整体优于开源模型,但在零样本抽取式问答、细粒度词义消歧和地区语言变体理解等任务上所有模型均表现不佳,显示出这些是当前法语NLU的关键挑战。 Conclusion: COLE为法语NLU提供了一个全面且具有挑战性的评估基准,揭示了当前大语言模型的优势与不足,其公开发布有助于推动法语语言模型的进一步研究与发展。 Abstract: To address the need for a more comprehensive evaluation of French Natural Language Understanding (NLU), we introduce COLE, a new benchmark composed of 23 diverse task covering a broad range of NLU capabilities, including sentiment analysis, paraphrase detection, grammatical judgment, and reasoning, with a particular focus on linguistic phenomena relevant to the French language. We benchmark 94 large language models (LLM), providing an extensive analysis of the current state of French NLU. Our results highlight a significant performance gap between closed- and open-weights models and identify key challenging frontiers for current LLMs, such as zero-shot extractive question-answering (QA), fine-grained word sense disambiguation, and understanding of regional language variations. We release COLE as a public resource to foster further progress in French language modelling.

[106] SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs

Dachuan Shi,Abedelkadir Asi,Keying Li,Xiangchi Yuan,Leyan Pan,Wenke Lee,Wen Xiao

Main category: cs.CL

TL;DR: 提出了一种无需训练的推理框架SwiReasoning,通过在显式和隐式推理之间动态切换,并基于块级置信度控制思考过程,有效提升了大模型在数学和STEM任务中的准确性和令牌效率。

Details Motivation: 现有大语言模型在离散的显式推理(如思维链)之外展现出潜在空间中的连续推理能力,但纯潜在推理会因维持多条隐含路径而扩散概率质量、引入噪声、阻碍收敛,且存在无意义的过度思考问题,影响准确性和效率。 Method: 提出SwiReasoning框架:1)基于下一词预测分布的熵趋势估计块级置信度,动态在显式与潜在推理间切换,平衡探索与利用;2)限制最大思考块切换次数,防止过度思考。该方法无需训练。 Result: 在多个数学与STEM基准上,SwiReasoning在不同家族和规模的推理模型上平均提升准确率1.5%-2.8%;在受限预算下,令牌效率提升56%-79%,且预算越紧增益越大。 Conclusion: SwiReasoning通过动态切换推理模式和控制思考深度,在无需训练的前提下有效缓解了潜在推理中的收敛困难和过思考问题,显著提升了大语言模型在复杂推理任务中的准确性与推理效率。 Abstract: Recent work shows that, beyond discrete reasoning through explicit chain-of-thought steps, which are limited by the boundaries of natural languages, large language models (LLMs) can also reason continuously in latent space, allowing richer information per step and thereby improving token efficiency. Despite this promise, latent reasoning still faces two challenges, especially in training-free settings: 1) purely latent reasoning broadens the search distribution by maintaining multiple implicit paths, which diffuses probability mass, introduces noise, and impedes convergence to a single high-confidence solution, thereby hurting accuracy; and 2) overthinking persists even without explicit text, wasting tokens and degrading efficiency. To address these issues, we introduce SwiReasoning, a training-free framework for LLM reasoning which features two key innovations: 1) SwiReasoning dynamically switches between explicit and latent reasoning, guided by block-wise confidence estimated from entropy trends in next-token distributions, to balance exploration and exploitation and promote timely convergence. 2) By limiting the maximum number of thinking-block switches, SwiReasoning curbs overthinking and improves token efficiency across varying problem difficulties. On widely used mathematics and STEM benchmarks, SwiReasoning consistently improves average accuracy by 1.5%-2.8% across reasoning LLMs of different model families and scales. Furthermore, under constrained budgets, SwiReasoning improves average token efficiency by 56%-79%, with larger gains as budgets tighten.

[107] Slm-mux: Orchestrating small language models for reasoning

Chenyu Wang,Zishen Wan,Hao Kang,Emma Chen,Zhiqiang Xie,Tushar Krishna,Vijay Janapa Reddi,Yilun Du

Main category: cs.CL

TL;DR: 提出了一种三阶段方法SLM-MUX,用于有效协同多个小语言模型(SLMs),通过模型选择搜索和测试时扩展策略,在MATH、GPQA和GSM8K等任务上显著优于现有方法,并在某些情况下超越大型模型Qwen 2.5 72B的性能。

Details Motivation: 尽管小语言模型(SLMs)效率高且擅长特定任务,但现有协同方法主要针对大模型,对SLMs效果不佳,因此需要专门针对SLMs设计更有效的协同机制。 Method: 提出SLM-MUX多模型架构,结合两阶段优化:1)从候选池中搜索最互补的SLMs;2)为SLM-MUX定制测试时扩展策略。 Result: 相比现有方法,在MATH上提升13.4%,GPQA上提升8.8%,GSM8K上提升7.0%;仅用两个SLMs即超越Qwen 2.5 72B在GPQA和GSM8K上的表现,在MATH上与其持平。 Conclusion: SLMs可以通过所提出的SLM-MUX方法被高效协同,构建出更准确且高效的系统,验证了多SLM协同的潜力。 Abstract: With the rapid development of language models, the number of small language models (SLMs) has grown significantly. Although they do not achieve state-of-the-art accuracy, they are more efficient and often excel at specific tasks. This raises a natural question: can multiple SLMs be orchestrated into a system where each contributes effectively, achieving higher accuracy than any individual model? Existing orchestration methods have primarily targeted frontier models (e.g., GPT-4) and perform suboptimally when applied to SLMs. To address this gap, we propose a three-stage approach for orchestrating SLMs. First, we introduce SLM-MUX, a multi-model architecture that effectively coordinates multiple SLMs. Building on this, we develop two optimization strategies: (i) a model selection search that identifies the most complementary SLMs from a given pool, and (ii) test-time scaling tailored to SLM-MUX. Our approach delivers strong results: Compared to existing orchestration methods, our approach achieves up to 13.4% improvement on MATH, 8.8% on GPQA, and 7.0% on GSM8K. With just two SLMS, SLM-MUX outperforms Qwen 2.5 72B on GPQA and GSM8K, and matches its performance on MATH. We further provide theoretical analyses to substantiate the advantages of our method. In summary, we demonstrate that SLMs can be effectively orchestrated into more accurate and efficient systems through the proposed approach.

[108] TeachLM: Post-Training LLMs for Education Using Authentic Learning Data

Janos Perczel,Jin Chow,Dorottya Demszky

Main category: cs.CL

TL;DR: TeachLM是一种通过高效参数微调优化教学的大型语言模型,利用真实学生-导师互动数据提升对话式教学能力。

Details Motivation: 解决当前大语言模型在教育应用中因缺乏高质量、真实学生学习数据而导致的教学能力不足问题。 Method: 基于Polygence提供的10万小时一对一、纵向学生-导师互动数据,采用参数高效微调方法训练TeachLM,并生成高保真合成对话用于评估。 Result: TeachLM显著提升了对话教学表现:学生发言时间翻倍,提问风格改善,对话轮次增加50%,教学个性化程度更高。同时提出了一种基于合成对话的多轮评估协议,实现快速、可扩展、可复现的LLM教学能力评估。 Conclusion: 使用真实教学数据进行微调能有效提升LLM的 pedagogical 和对话能力,TeachLM为AI教育应用提供了更可靠的技术路径和评估方法。 Abstract: The promise of generative AI to revolutionize education is constrained by the pedagogical limits of large language models (LLMs). A major issue is the lack of access to high-quality training data that reflect the learning of actual students. Prompt engineering has emerged as a stopgap, but the ability of prompts to encode complex pedagogical strategies in rule-based natural language is inherently limited. To address this gap we introduce TeachLM - an LLM optimized for teaching through parameter-efficient fine-tuning of state-of-the-art models. TeachLM is trained on a dataset comprised of 100,000 hours of one-on-one, longitudinal student-tutor interactions maintained by Polygence, which underwent a rigorous anonymization process to protect privacy. We use parameter-efficient fine-tuning to develop an authentic student model that enables the generation of high-fidelity synthetic student-tutor dialogues. Building on this capability, we propose a novel multi-turn evaluation protocol that leverages synthetic dialogue generation to provide fast, scalable, and reproducible assessments of the dialogical capabilities of LLMs. Our evaluations demonstrate that fine-tuning on authentic learning data significantly improves conversational and pedagogical performance - doubling student talk time, improving questioning style, increasing dialogue turns by 50%, and greater personalization of instruction.

[109] Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models

Runchu Tian,Junxia Cui,Xueqiang Xu,Feng Yao,Jingbo Shang

Main category: cs.CL

TL;DR: 提出了一种名为Tolerator的无需训练的解码策略,通过两阶段过程(序列填充和迭代优化)实现对已接受token的重新修正,显著提升扩散大语言模型的输出质量。

Details Motivation: 离散扩散大语言模型中,一旦token被接受便无法在后续步骤中修改,导致早期错误持续影响最终结果。因此需要一种能动态修正错误的解码策略。 Method: 提出Tolerator,采用两阶段解码:首先进行序列填充,然后通过重新掩码并解码部分token进行迭代优化,利用其余token作为上下文进行交叉验证式修正。 Result: 在五个涵盖语言理解、代码生成和数学任务的标准基准上实验表明,Tolerator在相同计算预算下 consistently 优于基线方法。 Conclusion: 解码算法对充分发挥扩散大语言模型潜力至关重要,Tolerator通过允许已接受token被重新评估和修正,有效提升了生成质量和鲁棒性。 Abstract: Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) models, offering advantages such as accelerated parallel decoding and bidirectional context modeling. However, the vanilla decoding strategy in discrete dLLMs suffers from a critical limitation: once a token is accepted, it can no longer be revised in subsequent steps. As a result, early mistakes persist across iterations, harming both intermediate predictions and final output quality. To address this issue, we propose Tolerator (Token-Level Cross-Validation Refinement), a training-free decoding strategy that leverages cross-validation among predicted tokens. Unlike existing methods that follow a single progressive unmasking procedure, Tolerator introduces a two-stage process: (i) sequence fill-up and (ii) iterative refinement by remasking and decoding a subset of tokens while treating the remaining as context. This design enables previously accepted tokens to be reconsidered and corrected when necessary, leading to more reliable diffusion decoding outputs. We evaluate Tolerator on five standard benchmarks covering language understanding, code generation, and mathematics. Experiments show that our method achieves consistent improvements over the baselines under the same computational budget. These findings suggest that decoding algorithms are crucial to realizing the full potential of diffusion large language models. Code and data are publicly available.

cs.CV [Back]

[110] SoC-DT: Standard-of-Care Aligned Digital Twins for Patient-Specific Tumor Dynamics

Moinak Bhattacharya,Gagandeep Singh,Prateek Prasanna

Main category: cs.CV

TL;DR: 提出了一种名为SoC-DT的可微分框架,用于在标准治疗下个性化预测肿瘤动态,结合反应-扩散模型与真实治疗干预,在合成和真实胶质瘤数据上优于传统方法。

Details Motivation: 现有反应-扩散模型无法准确模拟标准治疗下的肿瘤动态,且难以整合患者间的基因组、人口统计学和治疗差异,亟需更真实的计算框架。 Method: 提出SoC-DT框架,融合反应-扩散肿瘤生长模型与离散治疗干预(手术、化疗、放疗),并引入IMEX-SoC求解器以保证数值稳定性;支持基因组和人口统计学个性化。 Result: 在合成数据和真实胶质瘤数据上,SoC-DT在预测治疗后肿瘤结构方面优于经典PDE模型和纯数据驱动神经网络模型。 Conclusion: SoC-DT通过结合机制可解释性与现代可微分求解技术,为肿瘤学中的患者特异性数字孪生提供了可靠基础,能够实现生物学一致的肿瘤动态预测。 Abstract: Accurate prediction of tumor trajectories under standard-of-care (SoC) therapies remains a major unmet need in oncology. This capability is essential for optimizing treatment planning and anticipating disease progression. Conventional reaction-diffusion models are limited in scope, as they fail to capture tumor dynamics under heterogeneous therapeutic paradigms. There is hence a critical need for computational frameworks that can realistically simulate SoC interventions while accounting for inter-patient variability in genomics, demographics, and treatment regimens. We introduce Standard-of-Care Digital Twin (SoC-DT), a differentiable framework that unifies reaction-diffusion tumor growth models, discrete SoC interventions (surgery, chemotherapy, radiotherapy) along with genomic and demographic personalization to predict post-treatment tumor structure on imaging. An implicit-explicit exponential time-differencing solver, IMEX-SoC, is also proposed, which ensures stability, positivity, and scalability in SoC treatment situations. Evaluated on both synthetic data and real world glioma data, SoC-DT consistently outperforms classical PDE baselines and purely data-driven neural models in predicting tumor dynamics. By bridging mechanistic interpretability with modern differentiable solvers, SoC-DT establishes a principled foundation for patient-specific digital twins in oncology, enabling biologically consistent tumor dynamics estimation. Code will be made available upon acceptance.

[111] Visualizing Celebrity Dynamics in Video Content: A Proposed Approach Using Face Recognition Timestamp Data

Doğanay Demir,İlknur Durgar Elkahlout

Main category: cs.CV

TL;DR: 提出了一种结合分布式多GPU推理与交互式可视化平台的混合框架,用于分析视频中名人动态。

Details Motivation: 在视频内容主导的时代,理解其结构和动态变得愈发重要,需要高效分析名人出现模式和互动关系。 Method: 采用基于ONNX模型的分布式多GPU推理系统进行高通量批处理,生成带时间戳的出现记录,并通过多种可视化手段(如网络图、热力图、共现矩阵等)进行交互式分析。 Result: 实现了对大规模视频数据的高效处理,生成了多维度的可视化结果,揭示了名人在不同剧集和季节中的 prominence、屏幕时间分布、共现关系和动态变化。 Conclusion: 该框架通过融合分布式识别与可视化分析,为娱乐分析、内容创作策略和观众参与研究提供了新工具。 Abstract: In an era dominated by video content, understanding its structure and dynamics has become increasingly important. This paper presents a hybrid framework that combines a distributed multi-GPU inference system with an interactive visualization platform for analyzing celebrity dynamics in video episodes. The inference framework efficiently processes large volumes of video data by leveraging optimized ONNX models, heterogeneous batch inference, and high-throughput parallelism, ensuring scalable generation of timestamped appearance records. These records are then transformed into a comprehensive suite of visualizations, including appearance frequency charts, duration analyses, pie charts, co-appearance matrices, network graphs, stacked area charts, seasonal comparisons, and heatmaps. Together, these visualizations provide multi-dimensional insights into video content, revealing patterns in celebrity prominence, screen-time distribution, temporal dynamics, co-appearance relationships, and intensity across episodes and seasons. The interactive nature of the system allows users to dynamically explore data, identify key moments, and uncover evolving relationships between individuals. By bridging distributed recognition with structured, visually-driven analytics, this work enables new possibilities for entertainment analytics, content creation strategies, and audience engagement studies.

[112] Domain-Robust Marine Plastic Detection Using Vision Models

Saanvi Kataria

Main category: cs.CV

TL;DR: 本研究评估了多种深度学习模型在跨域水下塑料垃圾检测中的鲁棒性,发现轻量级MobileNetV2表现最佳(F1达0.97),而零样本模型CLIP和Gemini各有优劣。

Details Motivation: 由于数据域差异,基于单一数据集训练的视觉系统在新图像上性能下降,亟需提升模型在不同环境下的泛化能力。 Method: 采用CNN(如MobileNetV2、ResNet-18)和视觉Transformer(如DeiT-Tiny、ViT-B16)在标注水下数据集上训练,并在来自不同源的跨域测试集上评估;同时评估了无需微调的零样本模型CLIP ViT-L14和Gemini 2.0 Flash。 Result: MobileNetV2在跨域检测中表现最优(F1=0.97),所有微调模型精度均接近99%,但召回率存在差异;CLIP召回率约80%但精度仅56%,Gemini精度约99%但召回率约81%。错误分析显示模型易将珊瑚纹理、悬浮颗粒和镜面反光误判为塑料。 Conclusion: 轻量级CNN在监督训练下可有效实现跨域水下垃圾检测,而大规模预训练视觉语言模型虽未微调也展现出互补优势,适用于特定场景。 Abstract: Marine plastic pollution is a pressing environmental threat, making reliable automation for underwater debris detection essential. However, vision systems trained on one dataset often degrade on new imagery due to domain shift. This study benchmarks models for cross-domain robustness, training convolutional neural networks - CNNs (MobileNetV2, ResNet-18, EfficientNet-B0) and vision transformers (DeiT-Tiny, ViT-B16) on a labeled underwater dataset and then evaluates them on a balanced cross-domain test set built from plastic-positive images drawn from a different source and negatives from the training domain. Two zero-shot models were assessed, CLIP ViT-L14 and Google's Gemini 2.0 Flash, that leverage pretraining to classify images without fine-tuning. Results show the lightweight MobileNetV2 delivers the strongest cross-domain performance (F1 0.97), surpassing larger models. All fine-tuned models achieved high Precision (around 99%), but differ in Recall, indicating varying sensitivity to plastic instances. Zero-shot CLIP is comparatively sensitive (Recall around 80%) yet prone to false positives (Precision around 56%), whereas Gemini exhibits the inverse profile (Precision around 99%, Recall around 81%). Error analysis highlights recurring confusions with coral textures, suspended particulates, and specular glare. Overall, compact CNNs with supervised training can generalize effectively for cross-domain underwater detection, while large pretrained vision-language models provide complementary strengths.

[113] Multimodal Arabic Captioning with Interpretable Visual Concept Integration

Passant Elchafei,Amany Fashwan

Main category: cs.CV

TL;DR: VLCAP是一个结合CLIP-based视觉标签检索与多模态文本生成的阿拉伯语图像描述框架,通过提取可解释的阿拉伯语视觉概念,提升生成的文化一致性和上下文准确性。

Details Motivation: 传统端到端阿拉伯语图像描述模型缺乏可解释性且难以保证文化相关性,因此需要一种基于可视概念检索的新型框架来提高生成质量。 Method: VLCAP使用三种多语言编码器(mCLIP、AraCLIP、Jina V4)进行阿拉伯语视觉标签检索,构建包含约21K个来自Visual Genome数据集的通用领域标签的混合词汇表,并将检索到的标签转化为流畅的阿拉伯语提示,输入到Qwen-VL和Gemini Pro Vision等视觉-语言模型中生成描述。 Result: 实验显示,mCLIP + Gemini Pro Vision在BLEU-1(5.34%)和余弦相似度(60.01%)上表现最佳,而AraCLIP + Qwen-VL获得最高的LLM-judge评分(36.33%)。 Conclusion: VLCAP通过引入可解释的视觉概念检索机制,有效提升了阿拉伯语图像描述的文化适配性和语义准确性,是一种有前景的多阶段生成方法。 Abstract: We present VLCAP, an Arabic image captioning framework that integrates CLIP-based visual label retrieval with multimodal text generation. Rather than relying solely on end-to-end captioning, VLCAP grounds generation in interpretable Arabic visual concepts extracted with three multilingual encoders, mCLIP, AraCLIP, and Jina V4, each evaluated separately for label retrieval. A hybrid vocabulary is built from training captions and enriched with about 21K general domain labels translated from the Visual Genome dataset, covering objects, attributes, and scenes. The top-k retrieved labels are transformed into fluent Arabic prompts and passed along with the original image to vision-language models. In the second stage, we tested Qwen-VL and Gemini Pro Vision for caption generation, resulting in six encoder-decoder configurations. The results show that mCLIP + Gemini Pro Vision achieved the best BLEU-1 (5.34%) and cosine similarity (60.01%), while AraCLIP + Qwen-VL obtained the highest LLM-judge score (36.33%). This interpretable pipeline enables culturally coherent and contextually accurate Arabic captions.

[114] Convolutional Neural Nets vs Vision Transformers: A SpaceNet Case Study with Balanced vs Imbalanced Regimes

Akshar Gothi

Main category: cs.CV

TL;DR: 本研究对比了EfficientNet-B0和ViT-Base在SpaceNet数据集上的性能,发现在类别不平衡和平衡条件下两者准确率相近,但EfficientNet-B0具有更低延迟和更高效率,尤其在平衡数据上达到99%准确率。

Details Motivation: 比较卷积神经网络与视觉Transformer在遥感图像分类任务中的性能差异,特别是在不同标签分布下的表现。 Method: 在相同预处理、轻量增强和训练预算下,评估EfficientNet-B0和ViT-Base在两类标签分布(原始不平衡和重采样平衡)下的分类性能与部署指标。 Result: 在不平衡数据上,两者测试准确率均为93%,EfficientNet-B0宏F1更高、延迟更低;在平衡数据上,EfficientNet-B0达99%准确率,ViT-Base紧随其后,模型差距缩小但CNN仍具效率优势。 Conclusion: 数据平衡可缩小CNN与ViT的性能差距,但EfficientNet-B0在精度和效率之间取得更好权衡,更适合资源受限的部署场景。 Abstract: We present a controlled comparison of a convolutional neural network (EfficientNet-B0) and a Vision Transformer (ViT-Base) on SpaceNet under two label-distribution regimes: a naturally imbalanced five-class split and a balanced-resampled split with 700 images per class (70:20:10 train/val/test). With matched preprocessing (224x224, ImageNet normalization), lightweight augmentations, and a 40-epoch budget on a single NVIDIA P100, we report accuracy, macro-F1, balanced accuracy, per-class recall, and deployment metrics (model size and latency). On the imbalanced split, EfficientNet-B0 reaches 93% test accuracy with strong macro-F1 and lower latency; ViT-Base is competitive at 93% with a larger parameter count and runtime. On the balanced split, both models are strong; EfficientNet-B0 reaches 99% while ViT-Base remains competitive, indicating that balancing narrows architecture gaps while CNNs retain an efficiency edge. We release manifests, logs, and per-image predictions to support reproducibility.

[115] A Comprehensive Review on Artificial Intelligence Empowered Solutions for Enhancing Pedestrian and Cyclist Safety

Shucheng Zhang,Yan Shi,Bingzhang Wang,Yuang Zhang,Muhammad Monjurul Karim,Kehua Chen,Chenxi Liu,Mehrdad Nasri,Yinhai Wang

Main category: cs.CV

TL;DR: 本文综述了近五年基于摄像头的AI感知系统在保护弱势道路使用者(VRUs)方面的最新进展,重点关注检测、跟踪、轨迹预测和意图识别四项核心任务,并提出了数据、模型和部署方面的四大开放挑战。

Details Motivation: 传统基础设施在动态城市环境中对VRU的保护不足,现有研究多集中于检测任务,缺乏对其他视觉任务的全面覆盖,因此需要系统性综述以推动全面的VRU安全解决方案。 Method: 对过去五年的文献进行系统性回顾,聚焦于视觉AI在VRU安全中的四个核心任务:检测与分类、跟踪与再识别、轨迹预测、以及意图识别与预测,并从数据、模型和部署角度分析当前挑战。 Result: 梳理了基于视觉的AI系统在VRU安全领域的最新进展,明确了四个关键任务的技术发展脉络,总结了当前研究的主要成果与局限。 Conclusion: 该综述为开发下一代智能交通系统中的VRU安全感知技术提供了基础参考,强调需结合视觉AI进步与实际部署需求,推动更智能、主动的安全解决方案。 Abstract: Ensuring the safety of vulnerable road users (VRUs), such as pedestrians and cyclists, remains a critical global challenge, as conventional infrastructure-based measures often prove inadequate in dynamic urban environments. Recent advances in artificial intelligence (AI), particularly in visual perception and reasoning, open new opportunities for proactive and context-aware VRU protection. However, existing surveys on AI applications for VRUs predominantly focus on detection, offering limited coverage of other vision-based tasks that are essential for comprehensive VRU understanding and protection. This paper presents a state-of-the-art review of recent progress in camera-based AI sensing systems for VRU safety, with an emphasis on developments from the past five years and emerging research trends. We systematically examine four core tasks, namely detection and classification, tracking and reidentification, trajectory prediction, and intent recognition and prediction, which together form the backbone of AI-empowered proactive solutions for VRU protection in intelligent transportation systems. To guide future research, we highlight four major open challenges from the perspectives of data, model, and deployment. By linking advances in visual AI with practical considerations for real-world implementation, this survey aims to provide a foundational reference for the development of next-generation sensing systems to enhance VRU safety.

[116] The View From Space: Navigating Instrumentation Differences with EOFMs

Ryan P. Demilt,Nicholas LaHaye,Karis Tenneson

Main category: cs.CV

TL;DR: 本文研究了地球观测基础模型(EOFMs)在不同传感器架构下的表示空间敏感性,指出当前多模态应用中忽略传感器差异的问题,并呼吁基于遥感科学的更稳健模型设计。

Details Motivation: 现有EOFM大多仅在单一模态数据上训练,并跨模态匹配波段进行应用或基准测试,但传感器架构差异对模型内部表示的影响尚不明确。 Method: 分析预训练EOFM在不同传感器架构下的嵌入表示,评估其对模型表征空间的影响。 Result: 发现EOFM的表示空间对传感器架构高度敏感,不同传感器导致显著的表示差异。 Conclusion: 传感器架构差异严重影响EOFM的表示一致性,需在模型设计、应用和评估中充分考虑这一因素,以推动更可靠的多模态遥感建模。 Abstract: Earth Observation Foundation Models (EOFMs) have exploded in prevalence as tools for processing the massive volumes of remotely sensed and other earth observation data, and for delivering impact on the many essential earth monitoring tasks. An emerging trend posits using the outputs of pre-trained models as 'embeddings' which summarize high dimensional data to be used for generic tasks such as similarity search and content-specific queries. However, most EOFM models are trained only on single modalities of data and then applied or benchmarked by matching bands across different modalities. It is not clear from existing work what impact diverse sensor architectures have on the internal representations of the present suite of EOFMs. We show in this work that the representation space of EOFMs is highly sensitive to sensor architecture and that understanding this difference gives a vital perspective on the pitfalls of current EOFM design and signals for how to move forward as model developers, users, and a community guided by robust remote-sensing science.

[117] Photorealistic Inpainting for Perturbation-based Explanations in Ecological Monitoring

Günel Aghakishiyeva,Jiayi Zhou,Saagar Arya,James David Poling,Holly R. Houliston,Jamie N. Womble,David W. Johnston,Brinnae Bent

Main category: cs.CV

TL;DR: 提出一种基于修复引导的扰动解释方法,生成保持场景上下文的逼真编辑,用于揭示视觉模型在生态监测中的决策依据。

Details Motivation: 现有的视觉模型预测缺乏透明度,限制了其在生态监测中的可信度和实地应用。 Method: 采用基于修复的扰动方法,结合Segment-Anything Model优化掩码,实现对象移除/替换和背景替换,并通过重评分和专家评审评估解释效果。 Result: 该方法能有效定位影响预测的关键形态特征,避免传统扰动带来的失真,提升解释的生态合理性和可解释性。 Conclusion: 所提方法增强了AI模型在生态学中应用的透明度与可信度,支持专家验证并促进其实际部署。 Abstract: Ecological monitoring is increasingly automated by vision models, yet opaque predictions limit trust and field adoption. We present an inpainting-guided, perturbation-based explanation technique that produces photorealistic, mask-localized edits that preserve scene context. Unlike masking or blurring, these edits stay in-distribution and reveal which fine-grained morphological cues drive predictions in tasks such as species recognition and trait attribution. We demonstrate the approach on a YOLOv9 detector fine-tuned for harbor seal detection in Glacier Bay drone imagery, using Segment-Anything-Model-refined masks to support two interventions: (i) object removal/replacement (e.g., replacing seals with plausible ice/water or boats) and (ii) background replacement with original animals composited onto new scenes. Explanations are assessed by re-scoring perturbed images (flip rate, confidence drop) and by expert review for ecological plausibility and interpretability. The resulting explanations localize diagnostic structures, avoid deletion artifacts common to traditional perturbations, and yield domain-relevant insights that support expert validation and more trustworthy deployment of AI in ecology.

[118] Advances in Medical Image Segmentation: A Comprehensive Survey with a Focus on Lumbar Spine Applications

Ahmed Kabil,Ghada Khoriba,Mina Yousef,Essam A. Rashed

Main category: cs.CV

TL;DR: 本文综述了医学图像分割(MIS)的各类方法,涵盖传统图像处理与现代深度学习技术,探讨了当前趋势与挑战,并以腰椎分割为例进行了案例分析。

Details Motivation: 为了系统地总结医学图像分割的发展现状,弥合传统方法与深度学习之间的差距,并探讨该领域面临的挑战和未来方向。 Method: 对阈值分割、边缘检测、区域分割、聚类算法、模型驱动方法以及CNN、FCN、U-Net及其变体等深度学习架构进行了全面综述,同时分析了注意力机制、半监督学习、生成对抗网络和Transformer等技术的应用。 Result: 总结了混合架构、跨模态学习、联邦学习和主动学习等新兴趋势,揭示了在标注数据有限、计算复杂性和模型泛化方面的进展,并通过腰椎分割案例展示了特定解剖区域的挑战与进步。 Conclusion: 尽管医学图像分割取得了显著进展,但仍面临数据偏差、领域适应性、模型可解释性及临床实际集成等关键挑战。 Abstract: Medical Image Segmentation (MIS) stands as a cornerstone in medical image analysis, playing a pivotal role in precise diagnostics, treatment planning, and monitoring of various medical conditions. This paper presents a comprehensive and systematic survey of MIS methodologies, bridging the gap between traditional image processing techniques and modern deep learning approaches. The survey encompasses thresholding, edge detection, region-based segmentation, clustering algorithms, and model-based techniques while also delving into state-of-the-art deep learning architectures such as Convolutional Neural Networks (CNNs), Fully Convolutional Networks (FCNs), and the widely adopted U-Net and its variants. Moreover, integrating attention mechanisms, semi-supervised learning, generative adversarial networks (GANs), and Transformer-based models is thoroughly explored. In addition to covering established methods, this survey highlights emerging trends, including hybrid architectures, cross-modality learning, federated and distributed learning frameworks, and active learning strategies, which aim to address challenges such as limited labeled datasets, computational complexity, and model generalizability across diverse imaging modalities. Furthermore, a specialized case study on lumbar spine segmentation is presented, offering insights into the challenges and advancements in this relatively underexplored anatomical region. Despite significant progress in the field, critical challenges persist, including dataset bias, domain adaptation, interpretability of deep learning models, and integration into real-world clinical workflows.

[119] DECOR: Deep Embedding Clustering with Orientation Robustness

Fiona Victoria Stanley Jothiraj,Arunaggiri Pandian Karunanidhi,Seth A. Eichmeyer

Main category: cs.CV

TL;DR: 本文提出了一种名为DECOR的深度聚类框架,用于在半导体制造中对晶圆缺陷模式进行可靠聚类,尤其适用于复杂、无标签且不平衡的数据。

Details Motivation: 由于晶圆数据常具有复杂性、无标签、类别不平衡及单个晶圆多缺陷等问题,传统聚类方法难以稳定工作,因此需要一种能应对这些挑战的鲁棒方法。 Method: 提出DECOR框架,结合深度聚类与方向鲁棒性设计,通过考虑晶圆图的空间方向变化,确保旋转或对齐不同的相似缺陷被一致聚类。 Result: 在公开数据集MixedWM38上验证,DECOR无需手动调参即可有效发现缺陷簇,并优于现有聚类基线方法。 Conclusion: DECOR为自动视觉检测系统提供了一个可靠且可扩展的解决方案,能够在真实制造环境中高效识别和分类晶圆缺陷。 Abstract: In semiconductor manufacturing, early detection of wafer defects is critical for product yield optimization. However, raw wafer data from wafer quality tests are often complex, unlabeled, imbalanced and can contain multiple defects on a single wafer, making it crucial to design clustering methods that remain reliable under such imperfect data conditions. We introduce DECOR, a deep clustering with orientation robustness framework that groups complex defect patterns from wafer maps into consistent clusters. We evaluate our method on the open source MixedWM38 dataset, demonstrating its ability to discover clusters without manual tuning. DECOR explicitly accounts for orientation variations in wafer maps, ensuring that spatially similar defects are consistently clustered regardless of its rotation or alignment. Experiments indicate that our method outperforms existing clustering baseline methods, thus providing a reliable and scalable solution in automated visual inspection systems.

[120] Error correction in multiclass image classification of facial emotion on unbalanced samples

Andrey A. Lebedev,Victor B. Kazantsev,Sergey V. Stasenko

Main category: cs.CV

TL;DR: 本文提出了一种基于LSTM与注意力机制的神经网络模型,用于解决不平衡样本下多类别人脸图像的情感分类与误差校正问题,实验表明该方法在小样本类别上具有良好的纠错能力,适用于情感识别及反欺诈等稀有事件检测场景。

Details Motivation: 针对人脸表情分类中类别不平衡问题,尤其是某些情绪样本远少于其他情绪的情况,如何提升稀有类别的分类准确率并实现有效误差校正成为关键挑战。 Method: 采用带有注意力机制的LSTM神经网络模型,聚焦面部关键区域进行特征提取;通过在六类子集上训练模型,并对第七类(被排除类)进行误差校正实验,评估不同类别组合下的恢复效果。 Result: 模型能够在所有类别上实现一定程度的误差校正,部分类别恢复效果显著,且在测试集中观察到小样本类别的关键质量指标有所提升。 Conclusion: 所提方法在处理类别不平衡问题上具有潜力,尤其适用于需要稳定分类性能的现实应用场景,如面部表情分析和反欺诈系统中的稀有事件识别。 Abstract: This paper considers the problem of error correction in multi-class classification of face images on unbalanced samples. The study is based on the analysis of a data frame containing images labeled by seven different emotional states of people of different ages. Particular attention is paid to the problem of class imbalance, in which some emotions significantly prevail over others. To solve the classification problem, a neural network model based on LSTM with an attention mechanism focusing on key areas of the face that are informative for emotion recognition is used. As part of the experiments, the model is trained on all possible configurations of subsets of six classes with subsequent error correction for the seventh class, excluded at the training stage. The results show that correction is possible for all classes, although the degree of success varies: some classes are better restored, others are worse. In addition, on the test sample, when correcting some classes, an increase in key quality metrics for small classes was recorded, which indicates the promise of the proposed approach in solving applied problems related to the search for rare events, for example, in anti-fraud systems. Thus, the proposed method can be effectively applied in facial expression analysis systems and in tasks requiring stable classification under skewed class distribution.

[121] OpusAnimation: Code-Based Dynamic Chart Generation

Bozheng Li,Miao Yang,Zhenhan Chen,Jiawang Cao,Mushui Liu,Yi Lu,Yongliang Wu,Bin Zhang,Yangguang Ji,Licheng Tang,Jay Wu,Wenbo Zhu

Main category: cs.CV

TL;DR: 本文提出了DCG-Bench,首个评估多模态大模型在动态图表生成任务中表现的基准,并构建了高质量数据集DCG-8K。通过两阶段训练方法和联合代码-视觉奖励机制,训练出高效的Qwen2.5-VL-DCG-3B模型,在动态图表生成上优于现有开源模型,性能媲美闭源大模型。

Details Motivation: 现有的多模态大语言模型在静态图表生成方面已有进展,但在动态图表生成与理解方面仍缺乏探索。为填补这一研究空白,需要建立专门的基准和数据集来系统评估模型能力。 Method: 提出DCG-Bench,包含简单文本到图表、详细文本到图表和视频到图表三类任务;构建DCG-8K数据集,包含指令-代码-视频三元组及问答对;采用两阶段训练策略,结合联合代码-视觉奖励机制进行组相对策略优化。 Result: 实验表明现有MLLM在视觉到图表任务中存在不足;所提Qwen2.5-VL-DCG-3B模型在三个任务上平均提升8.31%,且以仅3B参数达到与闭源模型相当的性能。 Conclusion: 该研究验证了所提出训练策略的有效性,推动了动态图表生成领域的发展,为未来MLLM在动态可视化任务中的研究提供了重要基础。 Abstract: Dynamic Chart Generation (DCG) involves producing code-rendered animated visualizations as charts. While recent advances in multi-modal large language models (MLLMs) have significantly improved their capability on static chart generation and comprehension, MLLMs' potential for handling dynamic chart generation and understanding remains underexplored. To bridge this research gap, we introduce DCG-Bench (Dynamic Chart Generation Benchmark), the first benchmark evaluating MLLM's capability on dynamic chart generation tasks from three dimensions: Simple Text-to-Chart, Detailed Text-to-Chart, and Video-to-Chart tasks. We construct DCG-8K, a high-quality DCG dataset with annotations covering instruction-code-video triplets and QA pairs for both code and video evaluation. Based on DCG-8K, we explored a two-stage training recipe, proposing Joint-Code-Visual Reward for group relative policy optimization to construct expert MLLM Qwen2.5-VL-DCG-3B for the DCG task. Our benchmarking result reveals shortcomings of existing MLLMs in the visual-to-chart task, and our model beats the best open-sourced MLLM with an average 8.31% performance gain across three tasks, and shows on par performance against proprietary models with only 3B parameters, proving the effectiveness of our training recipe. Our code and dataset will be publicly available.

[122] Visual Odometry with Transformers

Vlardimir Yugay,Duy-Kien Nguyen,Theo Gevers,Cees G. M. Snoek,Martin R. Oswald

Main category: cs.CV

TL;DR: 本文提出了一种端到端的单目视觉里程计方法VoT,基于Transformer架构,通过时空注意力建模直接预测相机运动,无需依赖特征匹配、光束法平差或稠密三维重建等手工模块。

Details Motivation: 现有单目视觉里程计方法依赖复杂的模块组合与参数调优,在未见过的真实场景中表现不佳;同时,现有的大规模3D模型难以处理长序列视频和提供精确的逐帧位姿估计。因此,需要一种更简洁、通用且高效的端到端解决方案。 Method: 提出VoT(Visual odometry Transformer),利用预训练编码器提取单目图像序列特征,并通过时空注意力机制建模帧间全局关系,直接以相机位姿为监督信号预测相机运动,不依赖稠密几何重建。框架具有模块化设计,可灵活集成不同预训练主干网络。 Result: 实验表明,VoT在更大数据集上具有良好扩展性,得益于更强的预训练主干;在多种相机运动和标定条件下均表现出良好的泛化能力,性能优于传统方法,且运行速度提升三倍以上。 Conclusion: VoT实现了无需手工组件的端到端单目视觉里程计,兼具高精度、强泛化性和高效率,为视觉里程计提供了一种新的简洁而强大的范式。 Abstract: Modern monocular visual odometry methods typically combine pre-trained deep learning components with optimization modules, resulting in complex pipelines that rely heavily on camera calibration and hyperparameter tuning, and often struggle in unseen real-world scenarios. Recent large-scale 3D models trained on massive amounts of multi-modal data have partially alleviated these challenges, providing generalizable dense reconstruction and camera pose estimation. Still, they remain limited in handling long videos and providing accurate per-frame estimates, which are required for visual odometry. In this work, we demonstrate that monocular visual odometry can be addressed effectively in an end-to-end manner, thereby eliminating the need for handcrafted components such as bundle adjustment, feature matching, camera calibration, or dense 3D reconstruction. We introduce VoT, short for Visual odometry Transformer, which processes sequences of monocular frames by extracting features and modeling global relationships through temporal and spatial attention. Unlike prior methods, VoT directly predicts camera motion without estimating dense geometry and relies solely on camera poses for supervision. The framework is modular and flexible, allowing seamless integration of various pre-trained encoders as feature extractors. Experimental results demonstrate that VoT scales effectively with larger datasets, benefits substantially from stronger pre-trained backbones, generalizes across diverse camera motions and calibration settings, and outperforms traditional methods while running more than 3 times faster. The code will be released.

[123] Inference-Time Search using Side Information for Diffusion-based Image Reconstruction

Mahdi Farahbakhsh,Vishnu Teja Kunde,Dileep Kalathil,Krishna Narayanan,Jean-Francois Chamberland

Main category: cs.CV

TL;DR: 提出一种新的推理时搜索算法,利用侧信息引导扩散模型的采样过程,提升图像重建质量。

Details Motivation: 现有扩散模型在解决逆问题时通常忽略可能显著改善重建质量的侧信息,尤其在严重病态的情况下。 Method: 设计了一种平衡探索与利用的推理时搜索算法,利用侧信息引导扩散模型的采样过程,避免基于梯度引导带来的奖励欺骗伪影。 Result: 在多种逆问题(如补全、超分辨率和各类去模糊)上实验表明,该方法在定性和定量指标上均优于现有方法,并优于基于奖励梯度的引导算法。 Conclusion: 所提方法可无缝集成到现有的扩散模型重建流程中,有效提升重建精度与可靠性。 Abstract: Diffusion models have emerged as powerful priors for solving inverse problems. However, existing approaches typically overlook side information that could significantly improve reconstruction quality, especially in severely ill-posed settings. In this work, we propose a novel inference-time search algorithm that guides the sampling process using the side information in a manner that balances exploration and exploitation. This enables more accurate and reliable reconstructions, providing an alternative to the gradient-based guidance that is prone to reward-hacking artifacts. Our approach can be seamlessly integrated into a wide range of existing diffusion-based image reconstruction pipelines. Through extensive experiments on a number of inverse problems, such as box inpainting, super-resolution, and various deblurring tasks including motion, Gaussian, nonlinear, and blind deblurring, we show that our approach consistently improves the qualitative and quantitative performance of diffusion-based image reconstruction algorithms. We also show the superior performance of our approach with respect to other baselines, including reward gradient-based guidance algorithms. The code is available at \href{https://github.com/mhdfb/sideinfo-search-reconstruction}{this repository}.

[124] Sonar Image Datasets: A Comprehensive Survey of Resources, Challenges, and Applications

Larissa S. Gomes,Gustavo P. Almeida,Bryan U. Moreira,Marco Quiroz,Breno Xavier,Lucas Soares,Stephanie L. Brião,Felipe G. Oliveira,Paulo L. J. Drews-Jr

Main category: cs.CV

TL;DR: 本文综述了当前声呐图像数据集的研究现状,系统整理了公开可用的多类型声呐数据集,分析了其在分类、检测、分割和三维重建等应用中的使用情况,并指出现有数据集的不足与研究空白,为水下声学数据分析领域提供了清晰的发展路线图。

Details Motivation: 由于公开、标注良好的声呐图像数据集稀缺,制约了机器学习模型在水下探测等领域的应用,因此需要对现有数据集进行全面梳理并识别研究缺口。 Method: 本文调研了多种声呐模态(如侧扫声呐、前视声呐、合成孔径声呐等)的公开数据集,分析其在不同应用场景下的使用情况,并通过主表格和时间线进行系统归纳与对比。 Result: 构建了一个包含声呐数据集特征、规模和标注信息的综合主表和时间线,明确了当前数据集的分布状况、技术进展及存在的空白领域。 Conclusion: 该综述为研究人员提供了水下声呐图像数据集的全面指南,有助于推动相关数据驱动方法的发展,并为未来数据集的构建提供了方向。 Abstract: Sonar images are relevant for advancing underwater exploration, autonomous navigation, and ecosystem monitoring. However, the progress depends on data availability. The scarcity of publicly available, well-annotated sonar image datasets creates a significant bottleneck for the development of robust machine learning models. This paper presents a comprehensive and concise review of the current landscape of sonar image datasets, seeking not only to catalog existing resources but also to contextualize them, identify gaps, and provide a clear roadmap, serving as a base guide for researchers of any kind who wish to start or advance in the field of underwater acoustic data analysis. We mapped publicly accessible datasets across various sonar modalities, including Side Scan Sonar (SSS), Forward-Looking Sonar (FLS), Synthetic Aperture Sonar (SAS), Multibeam Echo Sounder (MBES), and Dual-Frequency Identification Sonar (DIDSON). An analysis was conducted on applications such as classification, detection, segmentation, and 3D reconstruction. This work focuses on state-of-the-art advancements, incorporating newly released datasets. The findings are synthesized into a master table and a chronological timeline, offering a clear and accessible comparison of characteristics, sizes, and annotation details datasets.

[125] Learned Display Radiance Fields with Lensless Cameras

Ziyang Chen,Yuta Itoh,Kaan Akşit

Main category: cs.CV

TL;DR: 提出了一种无需镜头相机与基于隐式神经表示的算法相结合的方法,用于从多个视角捕捉显示特性,实现无需专业设备的便捷显示校准。

Details Motivation: 现有的显示器校准方法需要专业设备和暗室环境,难以普及,因此需要一种更便捷、低成本的解决方案。 Method: 协同设计了一种无镜头相机和基于隐式神经表示的算法,通过46.6°×37.6°的视锥高效重建显示器发出的光场。 Result: 实现了从多视角对显示器特性的测量,无需专业硬件,支持在常规环境下进行显示校准。 Conclusion: 该方法为实现轻松、低门槛的显示器校准与表征提供了初步但有效的技术路径。 Abstract: Calibrating displays is a basic and regular task that content creators must perform to maintain optimal visual experience, yet it remains a troublesome issue. Measuring display characteristics from different viewpoints often requires specialized equipment and a dark room, making it inaccessible to most users. To avoid specialized hardware requirements in display calibrations, our work co-designs a lensless camera and an Implicit Neural Representation based algorithm for capturing display characteristics from various viewpoints. More specifically, our pipeline enables efficient reconstruction of light fields emitted from a display from a viewing cone of 46.6{\deg} X 37.6{\deg}. Our emerging pipeline paves the initial steps towards effortless display calibration and characterization.

[126] Provenance Networks: End-to-End Exemplar-Based Explainability

Ali Kayyam,Anusha Madan Gopal,M. Anthony Lewis

Main category: cs.CV

TL;DR: 提出了一种名为溯源网络(provenance networks)的新型神经模型,通过将预测直接关联到支持它的训练样本,实现端到端、基于训练数据的可解释性。

Details Motivation: 解决现代深度学习中模型不透明、幻觉和数据贡献归因困难等问题,提升模型的可解释性、鲁棒性和可信度。 Method: 设计一种新型神经网络架构,类似可学习的KNN,在模型推理过程中自动链接预测结果到最相关的训练样本,将可解释性嵌入模型结构中,并联合优化主任务和可解释性目标。 Result: 模型能够解释预测依据的训练样本,支持检测异常标签、验证输入是否在训练集中、增强对输入扰动的鲁棒性,并揭示记忆化与泛化之间的权衡,但计算开销较高且目前仅适用于中等规模数据集。 Conclusion: 溯源网络为深度学习提供了一种内建可解释性的新范式,有效应对模型透明度和信任问题,是对现有可解释技术的有力补充。 Abstract: We introduce provenance networks, a novel class of neural models designed to provide end-to-end, training-data-driven explainability. Unlike conventional post-hoc methods, provenance networks learn to link each prediction directly to its supporting training examples as part of the model's normal operation, embedding interpretability into the architecture itself. Conceptually, the model operates similarly to a learned KNN, where each output is justified by concrete exemplars weighted by relevance in the feature space. This approach facilitates systematic investigations of the trade-off between memorization and generalization, enables verification of whether a given input was included in the training set, aids in the detection of mislabeled or anomalous data points, enhances resilience to input perturbations, and supports the identification of similar inputs contributing to the generation of a new data point. By jointly optimizing the primary task and the explainability objective, provenance networks offer insights into model behavior that traditional deep networks cannot provide. While the model introduces additional computational cost and currently scales to moderately sized datasets, it provides a complementary approach to existing explainability techniques. In particular, it addresses critical challenges in modern deep learning, including model opaqueness, hallucination, and the assignment of credit to data contributors, thereby improving transparency, robustness, and trustworthiness in neural models.

[127] Unified Unsupervised Anomaly Detection via Matching Cost Filtering

Zhe Zhang,Mingxiu Cai,Gaochang Wu,Jing Zhang,Lingqiao Liu,Dacheng Tao,Tianyou Chai,Xiatian Zhu

Main category: cs.CV

TL;DR: 本文提出了一种统一的无监督异常检测框架UCF,通过构建和过滤异常成本体来提升单模态和多模态场景下的检测性能。

Details Motivation: 现有方法在处理噪声匹配方面存在不足,且单模态与多模态异常检测研究孤立,缺乏统一视角。 Method: 提出Unified Cost Filtering (UCF),利用测试样本的多层注意力引导可学习滤波模块,对任意UAD模型生成的异常成本体进行后处理优化。 Result: 在22个基准数据集上验证了UCF的有效性,显著提升了多种UAD方法的性能,实现了单模态和多模态场景下的最先进结果。 Conclusion: UCF为无监督异常检测提供了一个通用且有效的后处理框架,推动了单模态与多模态方法的统一发展。 Abstract: Unsupervised anomaly detection (UAD) aims to identify image- and pixel-level anomalies using only normal training data, with wide applications such as industrial inspection and medical analysis, where anomalies are scarce due to privacy concerns and cold-start constraints. Existing methods, whether reconstruction-based (restoring normal counterparts) or embedding-based (pretrained representations), fundamentally conduct image- or feature-level matching to generate anomaly maps. Nonetheless, matching noise has been largely overlooked, limiting their detection ability. Beyond earlier focus on unimodal RGB-based UAD, recent advances expand to multimodal scenarios, e.g., RGB--3D and RGB--Text, enabled by point cloud sensing and vision--language models. Despite shared challenges, these lines remain largely isolated, hindering a comprehensive understanding and knowledge transfer. In this paper, we advocate unified UAD for both unimodal and multimodal settings in the matching perspective. Under this insight, we present Unified Cost Filtering (UCF), a generic post-hoc refinement framework for refining anomaly cost volume of any UAD model. The cost volume is constructed by matching a test sample against normal samples from the same or different modalities, followed by a learnable filtering module with multi-layer attention guidance from the test sample, mitigating matching noise and highlighting subtle anomalies. Comprehensive experiments on 22 diverse benchmarks demonstrate the efficacy of UCF in enhancing a variety of UAD methods, consistently achieving new state-of-the-art results in both unimodal (RGB) and multimodal (RGB--3D, RGB--Text) UAD scenarios. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.

[128] Visual Language Model as a Judge for Object Detection in Industrial Diagrams

Sanjukta Ghosh

Main category: cs.CV

TL;DR: 本文提出了一种利用视觉语言模型(VLM)评估和改进工业图纸中物体检测结果质量的框架,解决了数字化过程中缺乏自动评估方法的问题。

Details Motivation: 工业图纸(如P&ID)的数字化是构建数字孪生和实现智能工业自动化的重要步骤,但目前缺乏对物体检测结果进行自动质量评估的方法。 Method: 提出一种基于视觉语言模型(VLM)的框架,利用其多模态能力识别检测中的缺失或不一致问题,实现自动质量评估并指导检测结果优化。 Result: 该方法能够有效识别复杂工业图纸中物体检测的错误,并提升整体检测性能。 Conclusion: VLM可用于自动化评估和改进工业图纸的物体检测质量,为工业文档数字化提供了新思路。 Abstract: Industrial diagrams such as piping and instrumentation diagrams (P&IDs) are essential for the design, operation, and maintenance of industrial plants. Converting these diagrams into digital form is an important step toward building digital twins and enabling intelligent industrial automation. A central challenge in this digitalization process is accurate object detection. Although recent advances have significantly improved object detection algorithms, there remains a lack of methods to automatically evaluate the quality of their outputs. This paper addresses this gap by introducing a framework that employs Visual Language Models (VLMs) to assess object detection results and guide their refinement. The approach exploits the multimodal capabilities of VLMs to identify missing or inconsistent detections, thereby enabling automated quality assessment and improving overall detection performance on complex industrial diagrams.

[129] Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning

Chashi Mahiul Islam,Oteo Mamo,Samuel Jacob Chacko,Xiuwen Liu,Weikuan Yu

Main category: cs.CV

TL;DR: 提出SpatialViLT及其变体,通过融合深度图、3D坐标和边缘图等空间特征,提升视觉-语言模型在3D场景和复杂物体配置中的空间推理能力,在VSR数据集上达到最先进的性能。

Details Motivation: 现有视觉-语言模型在3D场景和复杂物体空间关系推理方面存在不足,需增强其空间理解能力。 Method: 引入SpatialViLT模型,通过多任务学习框架整合深度图、3D坐标和边缘图等空间特征;设计两种变体(SpatialViLT和MaskedSpatialViLT),并提出SpatialEnsemble融合两者。 Result: 模型在VSR数据集的方向性、拓扑和邻近关系等空间推理任务上表现优异,达到最先进水平。 Conclusion: 该工作显著提升了AI系统的空间智能,对多模态理解和实际应用具有重要意义。 Abstract: Vision-language models (VLMs) have advanced multimodal reasoning but still face challenges in spatial reasoning for 3D scenes and complex object configurations. To address this, we introduce SpatialViLT, an enhanced VLM that integrates spatial features like depth maps, 3D coordinates, and edge maps through a multi-task learning framework. This approach enriches multimodal embeddings with spatial understanding. We propose two variants: SpatialViLT and MaskedSpatialViLT, focusing on full and masked object regions, respectively. Additionally, SpatialEnsemble combines both approaches, achieving state-of-the-art accuracy. Our models excel in spatial reasoning categories such as directional, topological, and proximity relations, as demonstrated on the challenging Visual Spatial Reasoning (VSR) dataset. This work represents a significant step in enhancing the spatial intelligence of AI systems, crucial for advanced multimodal understanding and real-world applications.

[130] Denoising of Two-Phase Optically Sectioned Structured Illumination Reconstructions Using Encoder-Decoder Networks

Allison Davis,Yezhi Shen,Xiaoyu Ji,Fengqing Zhu

Main category: cs.CV

TL;DR: 本研究利用合成数据训练编码器-解码器网络(如非对称去噪自编码器和U-Net),用于两相光学切片结构光照明显微图像的伪影抑制,克服了缺乏真实干净标签数据的问题,有效提升图像清晰度。

Details Motivation: 由于传统去噪方法难以消除两相光学切片结构光照明显微(OS-SI)中因快速成像引入的残余伪影,且缺乏干净的真实光学切片数据用于监督训练,亟需一种无需真实标签的深度学习去噪方案。 Method: 采用合成训练策略,通过将真实的伪影场应用于合成图像生成训练样本对;使用非对称去噪自编码器(DAE)和U-Net进行监督训练,并在真实OS-SI图像上评估其去噪性能。 Result: 两种网络均提升了OS-SI图像的清晰度,且各自对不同类型的伪影表现出更强的抑制能力,验证了基于合成数据的监督训练在该场景下的有效性。 Conclusion: 基于合成数据的监督学习能够有效实现OS-SI图像的去噪,编码器-解码器网络有望简化OS-SI的重建流程,为缺乏真实标签的生物成像任务提供可行的深度学习解决方案。 Abstract: Structured illumination (SI) enhances image resolution and contrast by projecting patterned light onto a sample. In two-phase optical-sectioning SI (OS-SI), reduced acquisition time introduces residual artifacts that conventional denoising struggles to suppress. Deep learning offers an alternative to traditional methods; however, supervised training is limited by the lack of clean, optically sectioned ground-truth data. We investigate encoder-decoder networks for artifact reduction in two-phase OS-SI, using synthetic training pairs formed by applying real artifact fields to synthetic images. An asymmetrical denoising autoencoder (DAE) and a U-Net are trained on the synthetic data, then evaluated on real OS-SI images. Both networks improve image clarity, with each excelling against different artifact types. These results demonstrate that synthetic training enables supervised denoising of OS-SI images and highlight the potential of encoder-decoder networks to streamline reconstruction workflows.

[131] PEaRL: Pathway-Enhanced Representation Learning for Gene and Pathway Expression Prediction from Histology

Sejuti Majumder,Saarthak Kapse,Moinak Bhattacharya,Xuan Xu,Alisa Yurovsky,Prateek Prasanna

Main category: cs.CV

TL;DR: PEaRL是一种结合组织病理学与空间转录组学的多模态框架,通过通路激活评分而非单一基因表达来提升跨模态对应性和模型可解释性,在多种癌症数据集中显著优于现有方法。

Details Motivation: 现有整合空间转录组与组织病理图像的方法多依赖高变基因,忽略了塑造组织表型的协同生物学程序,限制了模型的预测范围和生物学可解释性。 Method: 提出PEaRL框架,使用ssGSEA计算通路激活得分表示转录组数据,通过Transformer编码通路信号,并利用对比学习将其与组织学特征对齐,实现降维和增强跨模态关联。 Result: 在乳腺癌、皮肤癌和淋巴结癌三个空间转录组数据集上,PEaRL在基因和通路水平的表达预测准确性均超越现有最先进方法,皮尔逊相关系数最高提升58.9%和20.4%。 Conclusion: 将转录组表示基于生物学通路构建,能生成更符合生物学实际且更具可解释性的多模态模型,推动计算病理学从基因级嵌入向通路级建模发展。 Abstract: Integrating histopathology with spatial transcriptomics (ST) provides a powerful opportunity to link tissue morphology with molecular function. Yet most existing multimodal approaches rely on a small set of highly variable genes, which limits predictive scope and overlooks the coordinated biological programs that shape tissue phenotypes. We present PEaRL (Pathway Enhanced Representation Learning), a multimodal framework that represents transcriptomics through pathway activation scores computed with ssGSEA. By encoding biologically coherent pathway signals with a transformer and aligning them with histology features via contrastive learning, PEaRL reduces dimensionality, improves interpretability, and strengthens cross-modal correspondence. Across three cancer ST datasets (breast, skin, and lymph node), PEaRL consistently outperforms SOTA methods, yielding higher accuracy for both gene- and pathway-level expression prediction (up to 58.9 percent and 20.4 percent increase in Pearson correlation coefficient compared to SOTA). These results demonstrate that grounding transcriptomic representation in pathways produces more biologically faithful and interpretable multimodal models, advancing computational pathology beyond gene-level embeddings.

[132] DuPLUS: Dual-Prompt Vision-Language Framework for Universal Medical Image Segmentation and Prognosis

Numan Saeed,Tausifa Jan Saleem,Fadillah Maani,Muhammad Ridzuan,Hu Wang,Mohammad Yaqub

Main category: cs.CV

TL;DR: DuPLUS 是一个用于多模态医学图像分析的深度学习框架,通过分层语义提示和双提示机制实现细粒度任务控制,在分割和预后预测任务中展现出强大多任务泛化能力和临床适用性。

Details Motivation: 现有医学影像深度学习模型多为任务特定且缺乏通用性,而现有的通用模型在条件控制和医学语义理解方面表现不足。 Method: 提出 DuPLUS 框架,采用基于文本控制的分层架构和双提示机制,结合视觉-语言模型,支持多模态医学图像分析,并通过参数高效微调实现快速迁移。 Result: 在10个数据集中有8个超越现有最优模型,支持跨3种模态、10个数据集、30多种器官和肿瘤的分割任务,并集成电子健康记录进行预后预测,头颈癌数据集上达到0.69的Concordance Index。 Conclusion: DuPLUS 具备良好的泛化性、可扩展性和临床实用性,是一个高效且通用的医学图像分析解决方案。 Abstract: Deep learning for medical imaging is hampered by task-specific models that lack generalizability and prognostic capabilities, while existing 'universal' approaches suffer from simplistic conditioning and poor medical semantic understanding. To address these limitations, we introduce DuPLUS, a deep learning framework for efficient multi-modal medical image analysis. DuPLUS introduces a novel vision-language framework that leverages hierarchical semantic prompts for fine-grained control over the analysis task, a capability absent in prior universal models. To enable extensibility to other medical tasks, it includes a hierarchical, text-controlled architecture driven by a unique dual-prompt mechanism. For segmentation, DuPLUS is able to generalize across three imaging modalities, ten different anatomically various medical datasets, encompassing more than 30 organs and tumor types. It outperforms the state-of-the-art task specific and universal models on 8 out of 10 datasets. We demonstrate extensibility of its text-controlled architecture by seamless integration of electronic health record (EHR) data for prognosis prediction, and on a head and neck cancer dataset, DuPLUS achieved a Concordance Index (CI) of 0.69. Parameter-efficient fine-tuning enables rapid adaptation to new tasks and modalities from varying centers, establishing DuPLUS as a versatile and clinically relevant solution for medical image analysis. The code for this work is made available at: https://anonymous.4open.science/r/DuPLUS-6C52

[133] Real-Time Threaded Houbara Detection and Segmentation for Wildlife Conservation using Mobile Platforms

Lyes Saad Saoud,Loic Lesobre,Enrico Sorato,Irfan Hussain

Main category: cs.CV

TL;DR: 提出一种移动端优化的双阶段深度学习框架,通过线程化YOLOv10和MobileSAM实现野生动物的实时检测与分割,在 Houbara Bustard 数据集上表现优异。

Details Motivation: 野生动物保护需要非侵入式实时监测,但受限于计算资源和动物隐蔽性,现有方法难以满足实时性和准确性需求。 Method: 采用两阶段框架,YOLOv10用于检测,MobileSAM用于轻量级分割,并通过线程化(TDM)并行执行以降低延迟。 Result: 在自建Houbara数据集(4万张图像)上,mAP50达0.9627,MobileSAM mIoU为0.7421,YOLOv10单帧耗时43.7ms,满足实时性要求。 Conclusion: 该方法有效提升野外动物检测与分割的实时性与精度,具备良好的移动端部署潜力,对野生动物保护具有应用价值。 Abstract: Real-time animal detection and segmentation in natural environments are vital for wildlife conservation, enabling non-invasive monitoring through remote camera streams. However, these tasks remain challenging due to limited computational resources and the cryptic appearance of many species. We propose a mobile-optimized two-stage deep learning framework that integrates a Threading Detection Model (TDM) to parallelize YOLOv10-based detection and MobileSAM-based segmentation. Unlike prior YOLO+SAM pipelines, our approach improves real-time performance by reducing latency through threading. YOLOv10 handles detection while MobileSAM performs lightweight segmentation, both executed concurrently for efficient resource use. On the cryptic Houbara Bustard, a conservation-priority species, our model achieves mAP50 of 0.9627, mAP75 of 0.7731, mAP95 of 0.7178, and a MobileSAM mIoU of 0.7421. YOLOv10 operates at 43.7 ms per frame, confirming real-time readiness. We introduce a curated Houbara dataset of 40,000 annotated images to support model training and evaluation across diverse conditions. The code and dataset used in this study are publicly available on GitHub at https://github.com/LyesSaadSaoud/mobile-houbara-detseg. For interactive demos and additional resources, visit https://lyessaadsaoud.github.io/LyesSaadSaoud-Threaded-YOLO-SAM-Houbara.

[134] Platonic Transformers: A Solid Choice For Equivariance

Mohammad Mohaiminul Islam,Rishabh Anand,David R. Wessels,Friso de Kruiff,Thijs P. Kuipers,Rex Ying,Clara I. Sánchez,Sharvaree Vadgama,Georg Bökman,Erik J. Bekkers

Main category: cs.CV

TL;DR: 提出了Platonic Transformer,通过引入基于柏拉图固体对称群的参考系注意力机制,在不增加计算成本的情况下实现了对平移和几何对称的等变性,兼具标准Transformer的效率与几何先验。

Details Motivation: Transformer缺乏对科学和视觉中常见的几何对称性的归纳偏置,现有等变方法往往牺牲了Transformer的效率和灵活性。 Method: 在注意力机制中引入基于柏拉图固体对称群的参考帧,构建具有原则性权重共享的注意力机制,并证明其等价于动态群卷积,从而实现等变性并保持原有架构和计算复杂度。 Result: 在CIFAR-10、ScanObjectNN、QM9和OMol25等多个基准上取得了有竞争力的结果,且无需额外计算开销。 Conclusion: Platonic Transformer成功平衡了等变性与计算效率,保留了标准Transformer的结构优势,同时引入了几何归纳偏置,为多领域任务提供了高效且可扩展的解决方案。 Abstract: While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant methods often sacrifice the efficiency and flexibility that make Transformers so effective through complex, computationally intensive designs. We introduce the Platonic Transformer to resolve this trade-off. By defining attention relative to reference frames from the Platonic solid symmetry groups, our method induces a principled weight-sharing scheme. This enables combined equivariance to continuous translations and Platonic symmetries, while preserving the exact architecture and computational cost of a standard Transformer. Furthermore, we show that this attention is formally equivalent to a dynamic group convolution, which reveals that the model learns adaptive geometric filters and enables a highly scalable, linear-time convolutional variant. Across diverse benchmarks in computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25), the Platonic Transformer achieves competitive performance by leveraging these geometric constraints at no additional cost.

[135] Domain Generalization for Semantic Segmentation: A Survey

Manuel Schwonberg,Hanno Gottschalk

Main category: cs.CV

TL;DR: 本文综述了领域泛化语义分割的最新进展,重点分析了向基于基础模型的范式转变,并比较了各类方法的性能,强调了基础模型对领域泛化的重要影响。

Details Motivation: 深度神经网络在跨未知领域的泛化能力仍然有限,尤其是在无法获取目标域数据的场景下,领域泛化(DG)成为关键挑战,尤其在语义分割等实际应用中至关重要。 Method: 对现有领域泛化语义分割方法进行分类和系统性回顾,分析技术演进趋势,并对所有方法进行广泛的性能对比。 Result: 识别出领域泛化正向基于基础模型的范式转变,实验比较表明基础模型显著提升了跨域泛化性能。 Conclusion: 基础模型正在重塑领域泛化语义分割的研究格局,未来研究应进一步探索其潜力与新方向。 Abstract: The generalization of deep neural networks to unknown domains is a major challenge despite their tremendous progress in recent years. For this reason, the dynamic area of domain generalization (DG) has emerged. In contrast to unsupervised domain adaptation, there is no access to or knowledge about the target domains, and DG methods aim to generalize across multiple different unseen target domains. Domain generalization is particularly relevant for the task semantic segmentation which is used in several areas such as biomedicine or automated driving. This survey provides a comprehensive overview of the rapidly evolving topic of domain generalized semantic segmentation. We cluster and review existing approaches and identify the paradigm shift towards foundation-model-based domain generalization. Finally, we provide an extensive performance comparison of all approaches, which highlights the significant influence of foundation models on domain generalization. This survey seeks to advance domain generalization research and inspire scientists to explore new research directions.

[136] From Scope to Script: An Automated Report Generation Model for Gastrointestinal Endoscopy

Evandros Kaklamanos,Kristjana Kristinsdottir,Jonathan Huang,Dustin Carlson,Rajesh Keswani,John Pandolfino,Mozziyar Etemadi

Main category: cs.CV

TL;DR: 提出一种基于transformer的视觉编码器和文本解码器的两阶段训练框架,用于自动生成内镜报告,以减轻医生文档负担。

Details Motivation: 内镜检查的文档工作给胃肠病学家带来巨大负担,导致临床流程低效和医生 burnout。 Method: 采用基于transformer的视觉编码器和文本解码器,通过第一阶段在图像/文本配对数据上预训练,第二阶段在图像/报告配对数据上微调,生成临床有意义的发现。 Result: 该模型能够有效生成临床相关的内镜检查报告,有助于简化文档流程。 Conclusion: 所提出的方法有望减轻医生工作量,改善患者护理效率。 Abstract: Endoscopic procedures such as esophagogastroduodenoscopy (EGD) and colonoscopy play a critical role in diagnosing and managing gastrointestinal (GI) disorders. However, the documentation burden associated with these procedures place significant strain on gastroenterologists, contributing to inefficiencies in clinical workflows and physician burnout. To address this challenge, we propose a novel automated report generation model that leverages a transformer-based vision encoder and text decoder within a two-stage training framework. In the first stage, both components are pre-trained on image/text caption pairs to capture generalized vision-language features, followed by fine-tuning on images/report pairs to generate clinically meaningful findings. Our approach not only streamlines the documentation process but also holds promise for reducing physician workload and improving patient care.

[137] SketchPlan: Diffusion Based Drone Planning From Human Sketches

Sixten Norelius,Aaron O. Feldman,Mac Schwager

Main category: cs.CV

TL;DR: 提出SketchPlan,一种基于扩散模型的规划器,通过在深度图像上解释2D手绘草图来生成无人机导航的3D飞行路径。

Details Motivation: 实现无人机在未知真实环境中的零样本仿真到现实迁移,准确理解人类意图并生成安全的3D飞行路径。 Method: 采用两阶段方法:SketchAdapter将手绘草图映射为2D路径,DiffPath结合深度图像通过扩散模型推断3D轨迹;使用合成数据集(32k路径)和872组人工草图进行训练。 Result: 在模拟和真实场景中均表现出色;真实无人机测试中,在低/中等杂乱环境中任务成功率为100%,在未见高杂乱环境中为40%,优于关键消融实验20-60%。 Conclusion: 结合人工标注与自动标注数据的模块化设计显著提升了模型对人类意图的理解和3D路径推断能力,实现了高效的零样本sim-to-real迁移。 Abstract: We propose SketchPlan, a diffusion-based planner that interprets 2D hand-drawn sketches over depth images to generate 3D flight paths for drone navigation. SketchPlan comprises two components: a SketchAdapter that learns to map the human sketches to projected 2D paths, and DiffPath, a diffusion model that infers 3D trajectories from 2D projections and a first person view depth image. Our model achieves zero-shot sim-to-real transfer, generating accurate and safe flight paths in previously unseen real-world environments. To train the model, we build a synthetic dataset of 32k flight paths using a diverse set of photorealistic 3D Gaussian Splatting scenes. We automatically label the data by computing 2D projections of the 3D flight paths onto the camera plane, and use this to train the DiffPath diffusion model. However, since real human 2D sketches differ significantly from ideal 2D projections, we additionally label 872 of the 3D flight paths with real human sketches and use this to train the SketchAdapter to infer the 2D projection from the human sketch. We demonstrate SketchPlan's effectiveness in both simulated and real-world experiments, and show through ablations that training on a mix of human labeled and auto-labeled data together with a modular design significantly boosts its capabilities to correctly interpret human intent and infer 3D paths. In real-world drone tests, SketchPlan achieved 100\% success in low/medium clutter and 40\% in unseen high-clutter environments, outperforming key ablations by 20-60\% in task completion.

[138] Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing

Danial Samadi Vahdati,Tai Duc Nguyen,Ekta Prashnani,Koki Nagano,David Luebke,Orazio Gallo,Matthew Stamm

Main category: cs.CV

TL;DR: 提出一种基于姿态条件的大间隔对比编码器,用于检测AI驱动的虚拟头像系统中的身份劫持攻击,无需查看重建视频即可防御生物特征泄露。

Details Motivation: AI驱动的虚拟头像系统存在生物特征被劫持的风险,传统深度伪造检测方法无法应对合成视频流。 Method: 利用姿态-表情潜在编码中包含的身份信息,设计姿态条件下的大间隔对比编码器,分离出持久性身份特征,并通过余弦相似度检测异常身份切换。 Result: 在多个虚拟头像生成模型上验证了方法的有效性,优于现有防御方法,具备实时性和对分布外场景的强泛化能力。 Conclusion: 该方法首次实现了不依赖RGB视频的生物特征泄露防护,能有效检测实时身份劫持,提升AI虚拟头像系统的安全性。 Abstract: AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim's likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. Therefore, we introduce the first biometric leakage defense without ever looking at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios.

[139] Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!

Junbao Zhou,Yuan Zhou,Kesen Zhao,Qingshan Xu,Beier Zhu,Richang Hong,Hanwang Zhang

Main category: cs.CV

TL;DR: 提出了一种名为REVEL的新任务和无需训练的DragStream方法,用于实现对自回归视频扩散模型输出的流式、细粒度交互式拖拽操控,有效解决了潜在空间漂移和上下文干扰问题。

Details Motivation: 现有自回归视频扩散模型难以实现流式、细粒度的用户控制,导致生成结果难以持续符合用户预期。为此,需要一种能够在任意时间对任意内容进行交互式编辑的方法。 Method: 提出了REVEL任务框架和DragStream方法,包括自适应分布自校正策略以抑制潜在空间漂移,以及空间-频率选择性优化机制,选择性传播视觉线索以利用上下文信息并减少干扰。该方法无需训练,可无缝集成到现有模型中。 Result: 在多个自回归视频扩散模型上验证了DragStream的有效性,能够稳定实现细粒度的流式视频拖拽操作,显著缓解潜在分布漂移和上下文干扰问题,生成结果更自然且符合用户意图。 Conclusion: DragStream为自回归视频扩散模型提供了高效、灵活的实时交互控制方案,推动了交互式视频编辑技术的发展。 Abstract: Achieving streaming, fine-grained control over the outputs of autoregressive video diffusion models remains challenging, making it difficult to ensure that they consistently align with user expectations. To bridge this gap, we propose \textbf{stReaming drag-oriEnted interactiVe vidEo manipuLation (REVEL)}, a new task that enables users to modify generated videos \emph{anytime} on \emph{anything} via fine-grained, interactive drag. Beyond DragVideo and SG-I2V, REVEL unifies drag-style video manipulation as editing and animating video frames with both supporting user-specified translation, deformation, and rotation effects, making drag operations versatile. In resolving REVEL, we observe: \emph{i}) drag-induced perturbations accumulate in latent space, causing severe latent distribution drift that halts the drag process; \emph{ii}) streaming drag is easily disturbed by context frames, thereby yielding visually unnatural outcomes. We thus propose a training-free approach, \textbf{DragStream}, comprising: \emph{i}) an adaptive distribution self-rectification strategy that leverages neighboring frames' statistics to effectively constrain the drift of latent embeddings; \emph{ii}) a spatial-frequency selective optimization mechanism, allowing the model to fully exploit contextual information while mitigating its interference via selectively propagating visual cues along generation. Our method can be seamlessly integrated into existing autoregressive video diffusion models, and extensive experiments firmly demonstrate the effectiveness of our DragStream.

[140] GAS-MIL: Group-Aggregative Selection Multi-Instance Learning for Ensemble of Foundation Models in Digital Pathology Image Analysis

Peiran Quan,Zifan Gu,Zhuo Zhao,Qin Zhou,Donghan M. Yang,Ruichen Rong,Yang Xie,Guanghua Xiao

Main category: cs.CV

TL;DR: 提出了一种名为Group-Aggregative Selection Multi-Instance Learning (GAS-MIL)的灵活集成框架,能够无缝整合多个基础模型的特征,在多个癌症数据集上表现出优越或相当的性能。

Details Motivation: 适应和评估特定诊断任务的基础模型通常耗时且资源密集,尤其是在模型规模和多样性较大的情况下。 Method: 提出GAS-MIL框架,通过多实例学习方法集成多个基础模型的特征,保留其互补优势,无需手动特征选择或大量任务特定微调。 Result: 在前列腺、卵巢和乳腺癌三个数据集上的分类任务中,GAS-MIL始终表现出优于或相当于单个基础模型及现有MIL方法的性能。 Conclusion: GAS-MIL能够高效整合异构基础模型,简化病理学中的模型部署,并为未来的多模态和精准肿瘤学应用提供可扩展的基础。 Abstract: Foundation models (FMs) have transformed computational pathology by providing powerful, general-purpose feature extractors. However, adapting and benchmarking individual FMs for specific diagnostic tasks is often time-consuming and resource-intensive, especially given their scale and diversity. To address this challenge, we introduce Group-Aggregative Selection Multi-Instance Learning (GAS-MIL), a flexible ensemble framework that seamlessly integrates features from multiple FMs, preserving their complementary strengths without requiring manual feature selection or extensive task-specific fine-tuning. Across classification tasks in three cancer datasets-prostate (PANDA), ovarian (UBC-OCEAN), and breast (TCGA-BrCa)-GAS-MIL consistently achieves superior or on-par performance relative to individual FMs and established MIL methods, demonstrating its robustness and generalizability. By enabling efficient integration of heterogeneous FMs, GAS-MIL streamlines model deployment for pathology and provides a scalable foundation for future multimodal and precision oncology applications.

[141] Real-Time Assessment of Bystander Situation Awareness in Drone-Assisted First Aid

Shen Chang,Renran Tian,Nicole Adams,Nan Kong

Main category: cs.CV

TL;DR: 提出了一种基于视频的实时情境感知(SA)评估框架,利用图嵌入和Transformer模型分析无人机辅助纳洛酮递送模拟数据集(DANDSD),以提升旁观者在阿片类药物过量急救中的响应能力。

Details Motivation: 解决在无人机辅助纳洛酮递送场景中,非专业旁观者与自主系统协作时实时情境感知评估的研究空白。 Method: 构建DANDSD模拟数据集,结合几何、运动学和交互图特征,采用图嵌入与Transformer模型实现对旁观者SA的实时预测。 Result: 所提方法在时间片段分割准确率上优于FINCH基线9%(MoF)和5%(IoU),实现了高性能的SA预测。 Conclusion: 该框架有助于开发能实时引导旁观者的自适应无人机系统,提升急救响应效果,挽救更多生命。 Abstract: Rapid naloxone delivery via drones offers a promising solution for responding to opioid overdose emergencies (OOEs), by extending lifesaving interventions to medically untrained bystanders before emergency medical services (EMS) arrive. Recognizing the critical role of bystander situational awareness (SA) in human-autonomy teaming (HAT), we address a key research gap in real-time SA assessment by introducing the Drone-Assisted Naloxone Delivery Simulation Dataset (DANDSD). This pioneering dataset captures HAT during simulated OOEs, where college students without medical training act as bystanders tasked with administering intranasal naloxone to a mock overdose victim. Leveraging this dataset, we propose a video-based real-time SA assessment framework that utilizes graph embeddings and transformer models to assess bystander SA in real time. Our approach integrates visual perception and comprehension cues--such as geometric, kinematic, and interaction graph features--and achieves high-performance SA prediction. It also demonstrates strong temporal segmentation accuracy, outperforming the FINCH baseline by 9% in Mean over Frames (MoF) and 5% in Intersection over Union (IoU). This work supports the development of adaptive drone systems capable of guiding bystanders effectively, ultimately improving emergency response outcomes and saving lives.

[142] Evaluating OCR performance on food packaging labels in South Africa

Mayimunah Nagayi,Alice Khan,Tamryn Frank,Rina Swart,Clement Nyirenda

Main category: cs.CV

TL;DR: 本研究评估了四种开源OCR系统(Tesseract、EasyOCR、PaddleOCR和TrOCR)在真实食品包装图像上的表现,重点测试其提取成分列表和营养信息的能力。结果显示Tesseract字符错误率最低,EasyOCR在准确性和多语言支持间表现平衡,PaddleOCR覆盖完整但速度慢,TrOCR表现最弱。研究建立了针对包装图像的基准和基线,指出了未来改进方向。

Details Motivation: 食品包装上的准确OCR对合规性和营养监测至关重要,但由于多语言文本、密集布局、多种字体、反光和曲面等因素,实现高精度OCR具有挑战性。因此需要评估现有开源OCR系统在此特定场景下的性能。 Method: 使用包含231种产品(1,628张图像)的数据集测试四个OCR系统,评估其速度和覆盖率;构建包含113张图像(60种产品)的真值子集用于准确性评估,并采用CER、WER、BLEU、ROUGE-L、F1、覆盖率和执行时间等指标进行量化分析。 Result: 在真值子集上,Tesseract的CER最低(0.912),BLEU最高(0.245);EasyOCR在准确性和多语言支持之间表现均衡;PaddleOCR接近完全覆盖但因仅CPU运行而较慢;TrOCR尽管使用GPU加速但结果最差。 Conclusion: 该研究为食品包装OCR提供了特定基准和性能基线,表明当前系统各有优劣,未来需发展更强大的布局感知方法和文本定位技术以应对复杂包装场景。 Abstract: This study evaluates four open-source Optical Character Recognition (OCR) systems which are Tesseract, EasyOCR, PaddleOCR, and TrOCR on real world food packaging images. The aim is to assess their ability to extract ingredient lists and nutrition facts panels. Accurate OCR for packaging is important for compliance and nutrition monitoring but is challenging due to multilingual text, dense layouts, varied fonts, glare, and curved surfaces. A dataset of 231 products (1,628 images) was processed by all four models to assess speed and coverage, and a ground truth subset of 113 images (60 products) was created for accuracy evaluation. Metrics include Character Error Rate (CER), Word Error Rate (WER), BLEU, ROUGE-L, F1, coverage, and execution time. On the ground truth subset, Tesseract achieved the lowest CER (0.912) and the highest BLEU (0.245). EasyOCR provided a good balance between accuracy and multilingual support. PaddleOCR achieved near complete coverage but was slower because it ran on CPU only due to GPU incompatibility, and TrOCR produced the weakest results despite GPU acceleration. These results provide a packaging-specific benchmark, establish a baseline, and highlight directions for layout-aware methods and text localization.

[143] FrameOracle: Learning What to See and How Much to See in Videos

Chaoyu Li,Tianzhi Li,Fei Tao,Zhenyu Zhao,Ziqian Wu,Maozheng Zhao,Juntong Song,Cheng Niu,Pooyan Fazli

Main category: cs.CV

TL;DR: 本文提出了一种名为FrameOracle的轻量级模块,用于提升视频理解中视觉-语言模型的效率,通过智能选择关键帧和确定所需帧数,在减少输入帧数的同时保持甚至提高准确率。

Details Motivation: 现有视频理解中的帧采样策略(如均匀采样或固定数量采样)无法适应信息密度和任务复杂度的变化,导致效率低下和信息丢失,因此需要一种更智能的自适应帧选择方法。 Method: 提出FrameOracle模块,采用四阶段课程学习进行训练:前三个阶段使用跨模态相似性等弱代理信号,最后阶段利用新构建的大规模VideoQA数据集FrameOracle-41K中的关键帧标注进行强监督训练,以预测与查询最相关的帧及其数量。 Result: 在五个视觉-语言模型和六个基准上的实验表明,FrameOracle能将16帧输入平均压缩至10.4帧且不损失精度;从64帧候选中可压缩至平均13.9帧,并提升准确率1.4%,实现了当前最优的效率-精度权衡。 Conclusion: FrameOracle是一种高效、即插即用的帧选择模块,能够根据任务需求动态优化输入帧的数量和内容,显著提升视频理解模型的效率和性能。 Abstract: Vision-language models (VLMs) have advanced video understanding, but their performance is limited by the number of input frames they can process. Existing frame sampling strategies, such as uniform or fixed-budget selection, often fail to adapt to variations in information density or task complexity, resulting in inefficiency and information loss. To address this, we present FrameOracle, a lightweight and plug-and-play module that predicts both (1) which frames are most relevant to a given query and (2) how many frames are needed. FrameOracle is trained using a four-stage curriculum, with the first three stages relying on weak proxy signals such as cross-modal similarity. In the final stage, it leverages stronger supervision from a new dataset we introduce, FrameOracle-41K, the first large-scale VideoQA collection to provide keyframe annotations specifying the minimal set of frames required to answer each question. Extensive experiments across five VLMs and six benchmarks demonstrate that FrameOracle reduces 16-frame inputs to an average of 10.4 frames without any loss in accuracy. When starting from 64-frame candidates, it reduces the input to an average of 13.9 frames while improving accuracy by 1.4%, achieving state-of-the-art efficiency-accuracy trade-offs for scalable video understanding.

[144] A Hybrid Co-Finetuning Approach for Visual Bug Detection in Video Games

Faliu Yi,Sherif Abdelfattah,Wei Huang,Adrian Brown

Main category: cs.CV

TL;DR: 提出一种混合Co-FineTuning(CFT)方法,结合有标签和无标签数据,提升视频游戏视觉缺陷检测的效率和可扩展性。

Details Motivation: 手动识别游戏中的视觉缺陷成本高且耗时,而监督模型依赖大量标注数据,难以广泛应用。 Method: 利用目标游戏和相关游戏的有标签样本,并融合无标签数据进行联合微调,增强特征表示能力。 Result: 该方法在多个游戏环境中优于传统基线模型,即使仅使用50%的目标游戏标注数据仍保持竞争力。 Conclusion: CFT框架有效减少了对标注数据的依赖,具有良好的可扩展性和跨游戏适应性,适用于高效的视觉缺陷检测。 Abstract: Manual identification of visual bugs in video games is a resource-intensive and costly process, often demanding specialized domain knowledge. While supervised visual bug detection models offer a promising solution, their reliance on extensive labeled datasets presents a significant challenge due to the infrequent occurrence of such bugs. To overcome this limitation, we propose a hybrid Co-FineTuning (CFT) method that effectively integrates both labeled and unlabeled data. Our approach leverages labeled samples from the target game and diverse co-domain games, additionally incorporating unlabeled data to enhance feature representation learning. This strategy maximizes the utility of all available data, substantially reducing the dependency on labeled examples from the specific target game. The developed framework demonstrates enhanced scalability and adaptability, facilitating efficient visual bug detection across various game titles. Our experimental results show the robustness of the proposed method for game visual bug detection, exhibiting superior performance compared to conventional baselines across multiple gaming environments. Furthermore, CFT maintains competitive performance even when trained with only 50% of the labeled data from the target game.

[145] Exploring the Hierarchical Reasoning Model for Small Natural-Image Classification Without Augmentation

Alexander V. Mantzaris

Main category: cs.CV

TL;DR: 本文探讨了具有两个Transformer风格模块、一步训练、深度监督等特性的层次推理模型(HRM)作为实用图像分类器的潜力。在MNIST上表现良好,但在CIFAR数据集上过拟合严重,泛化能力差,性能不及简单的卷积网络。

Details Motivation: 探索HRM在无数据增强的小分辨率图像分类任务中是否具备竞争力,并分析其优化行为和泛化能力。 Method: 使用DEQ风格的一步训练、Rotary位置嵌入、RMSNorm和深度监督,在MNIST、CIFAR-10和CIFAR-100上评估HRM模型,采用统一的优化设置且不使用数据增强。 Result: HRM在MNIST上达到约98%准确率,但在CIFAR-10上仅65.0%(CNN基线为77.2%),CIFAR-100上为29.7%(基线为45.3%),训练速度慢且严重过拟合。 Conclusion: 当前形式的HRM在小分辨率图像分类任务中不如简单卷积架构,主要因缺乏足够的图像先验;但未来改进模型结构可能提升其性能。 Abstract: This paper asks whether the Hierarchical Reasoning Model (HRM) with the two Transformer-style modules $(f_L,f_H)$, one step (DEQ-style) training, deep supervision, Rotary Position Embeddings, and RMSNorm can serve as a practical image classifier. It is evaluated on MNIST, CIFAR-10, and CIFAR-100 under a deliberately raw regime: no data augmentation, identical optimizer family with one-epoch warmup then cosine-floor decay, and label smoothing. HRM optimizes stably and performs well on MNIST ($\approx 98\%$ test accuracy), but on small natural images it overfits and generalizes poorly: on CIFAR-10, HRM reaches 65.0\% after 25 epochs, whereas a two-stage Conv--BN--ReLU baseline attains 77.2\% while training $\sim 30\times$ faster per epoch; on CIFAR-100, HRM achieves only 29.7\% test accuracy despite 91.5\% train accuracy, while the same CNN reaches 45.3\% test with 50.5\% train accuracy. Loss traces and error analyses indicate healthy optimization but insufficient image-specific inductive bias for HRM in this regime. It is concluded that, for small-resolution image classification without augmentation, HRM is not competitive with even simple convolutional architectures as the HRM currently exist but this does not exclude possibilities that modifications to the model may allow it to improve greatly.

[146] Unsupervised Transformer Pre-Training for Images: Self-Distillation, Mean Teachers, and Random Crops

Mattia Scardecchia

Main category: cs.CV

TL;DR: 本文综述了自监督学习(SSL)的最新进展,特别是DINOv2方法在视觉特征学习上的突破,分析其核心思想并比较其与其他方法的性能。

Details Motivation: 探讨DINOv2为何能在多种基准上超越弱监督方法,并理解其特征学习能力的来源。 Method: 分析DINOv2的核心技术,包括多裁剪视图增强和基于均值教师的自蒸馏,并回顾其发展历程。 Result: DINOv2在多个下游任务中表现优异,展现出使用Transformer骨干网络时的显著涌现特性。 Conclusion: DINOv2推动了自监督学习的发展,但仍存在局限性,未来研究可进一步探索其潜力。 Abstract: Recent advances in self-supervised learning (SSL) have made it possible to learn general-purpose visual features that capture both the high-level semantics and the fine-grained spatial structure of images. Most notably, the recent DINOv2 has established a new state of the art by surpassing weakly supervised methods (WSL) like OpenCLIP on most benchmarks. In this survey, we examine the core ideas behind its approach, multi-crop view augmentation and self-distillation with a mean teacher, and trace their development in previous work. We then compare the performance of DINO and DINOv2 with other SSL and WSL methods across various downstream tasks, and highlight some remarkable emergent properties of their learned features with transformer backbones. We conclude by briefly discussing DINOv2's limitations, its impact, and future research directions.

[147] Diffusion-Classifier Synergy: Reward-Aligned Learning via Mutual Boosting Loop for FSCIL

Ruitao Wu,Yifan Zhao,Guangyao Chen,Jia Li

Main category: cs.CV

TL;DR: 本文提出了一种名为Diffusion-Classifier Synergy (DCS)的新框架,通过扩散模型与少样本类增量学习(FSCIL)分类器之间的协同进化,在特征和logits两个层面利用动态奖励机制提升模型对新类别的学习能力和旧知识的保持,显著提高了FSCIL在基准测试上的性能。

Details Motivation: 现有的FSCIL方法由于依赖有限的数据集而难以泛化,且直接应用扩散模型进行数据增强可能导致语义错位或引导无效,因此需要一种能有效结合扩散模型与分类器以克服数据稀缺和稳定性-可塑性困境的方法。 Method: 提出DCS框架,构建扩散模型与FSCIL分类器之间的相互增强循环;设计基于分类器状态的动态多维度奖励函数,分别在特征层面(原型锚定的最大均值差异和维度方差匹配)和logits层面(置信度重校准和跨会话混淆感知机制)指导扩散模型生成图像。 Result: DCS在多个FSCIL基准上实现了最先进的性能,显著提升了新类别的学习效果和旧知识的保留能力,验证了所提方法在缓解灾难性遗忘和增强泛化方面的有效性。 Conclusion: DCS通过分类器与扩散模型的协同训练,实现了有效的增量学习,为解决FSCIL中的数据稀缺和语义一致性问题提供了新思路,展示了生成模型与判别模型联合优化的巨大潜力。 Abstract: Few-Shot Class-Incremental Learning (FSCIL) challenges models to sequentially learn new classes from minimal examples without forgetting prior knowledge, a task complicated by the stability-plasticity dilemma and data scarcity. Current FSCIL methods often struggle with generalization due to their reliance on limited datasets. While diffusion models offer a path for data augmentation, their direct application can lead to semantic misalignment or ineffective guidance. This paper introduces Diffusion-Classifier Synergy (DCS), a novel framework that establishes a mutual boosting loop between diffusion model and FSCIL classifier. DCS utilizes a reward-aligned learning strategy, where a dynamic, multi-faceted reward function derived from the classifier's state directs the diffusion model. This reward system operates at two levels: the feature level ensures semantic coherence and diversity using prototype-anchored maximum mean discrepancy and dimension-wise variance matching, while the logits level promotes exploratory image generation and enhances inter-class discriminability through confidence recalibration and cross-session confusion-aware mechanisms. This co-evolutionary process, where generated images refine the classifier and an improved classifier state yields better reward signals, demonstrably achieves state-of-the-art performance on FSCIL benchmarks, significantly enhancing both knowledge retention and new class learning.

[148] MonitorVLM:A Vision Language Framework for Safety Violation Detection in Mining Operations

Jiang Wu,Sichao Wu,Yinsong Ma,Guangyuan Yu,Haoyuan Xu,Lifang Zheng,Jingliang Duan

Main category: cs.CV

TL;DR: 本文提出了一种名为MonitorVLM的视觉-语言框架,用于从监控视频中自动检测矿山等高风险环境中的安全违规行为,显著提升了检测精度和效率。

Details Motivation: 传统人工检查方式劳动强度大、易出错,难以满足大规模动态环境下的安全监控需求,亟需智能化、自动化的解决方案。 Method: 提出MonitorVLM框架,包含三个创新点:构建面向采矿安全的9000样本VQA数据集;设计子句过滤(CF)模块以降低推理延迟;引入行为放大器(BM)模块增强工人区域以提升细粒度动作识别。 Result: 实验表明,MonitorVLM相比基线模型在精确率上提升22.01%,召回率提升34.22%,F1分数提升28.37%;CF模块减少13.56%推理延迟,BM模块提升3.45%精确率和8.62%召回率。 Conclusion: MonitorVLM有效提升了工业场景下安全违规行为的自动检测能力,展示了多模态大模型在职业安全监控中的应用潜力。 Abstract: Industrial accidents, particularly in high-risk domains such as surface and underground mining, are frequently caused by unsafe worker behaviors. Traditional manual inspection remains labor-intensive, error-prone, and insufficient for large-scale, dynamic environments, highlighting the urgent need for intelligent and automated safety monitoring. In this paper, we present MonitorVLM, a novel vision--language framework designed to detect safety violations directly from surveillance video streams. MonitorVLM introduces three key innovations: (1) a domain-specific violation dataset comprising 9,000 vision--question--answer (VQA) samples across 40 high-frequency mining regulations, enriched with augmentation and auxiliary detection cues; (2) a clause filter (CF) module that dynamically selects the Top-$K$ most relevant clauses, reducing inference latency by 13.56\% while maintaining accuracy; and (3) a behavior magnifier (BM) module that enhances worker regions to improve fine-grained action recognition, yielding additional gains of 3.45% in precision and 8.62% in recall. Experimental results demonstrate that MonitorVLM significantly outperforms baseline vision--language models, achieving improvements of 22.01% in precision, 34.22\% in recall, and 28.37% in F1 score over the 72B unfine-tuned baseline. A lightweight web-based interface further integrates MonitorVLM into practical workflows, enabling automatic violation reporting with video timestamping. This study highlights the potential of multimodal large models to enhance occupational safety monitoring in mining and beyond.

[149] A Novel Cloud-Based Diffusion-Guided Hybrid Model for High-Accuracy Accident Detection in Intelligent Transportation Systems

Siva Sai,Saksham Gupta,Vinay Chamola,Rajkumar Buyya

Main category: cs.CV

TL;DR: 提出了一种结合引导分类与扩散技术的新型混合模型,用于智能交通系统中的事故检测,通过细调ExceptionNet输出并利用图像张量作为条件输入,在公开数据集上实现了97.32%的准确率。

Details Motivation: 传统分类方法在处理复杂数据分布时存在局限性,难以有效应对智能交通系统中复杂的事故检测任务。 Method: 提出一种融合指导分类与扩散模型的混合方法,使用微调后的ExceptionNet输出作为扩散模型输入,并引入时间嵌入和图像协变量嵌入来动态调节网络行为;采用多条件模块结构,并在云端部署以提升计算效率。 Result: 在公开数据集上进行实验,该模型在图像式事故检测中达到97.32%的准确率,优于基线模型;并通过消融研究验证了时间步调度、编码方式等关键因素的影响。 Conclusion: 所提出的扩散模型能有效捕捉复杂数据分布,显著提升事故检测性能,具备在智能交通系统中实际应用的潜力。 Abstract: The integration of Diffusion Models into Intelligent Transportation Systems (ITS) is a substantial improvement in the detection of accidents. We present a novel hybrid model integrating guidance classification with diffusion techniques. By leveraging fine-tuned ExceptionNet architecture outputs as input for our proposed diffusion model and processing image tensors as our conditioning, our approach creates a robust classification framework. Our model consists of multiple conditional modules, which aim to modulate the linear projection of inputs using time embeddings and image covariate embeddings, allowing the network to adapt its behavior dynamically throughout the diffusion process. To address the computationally intensive nature of diffusion models, our implementation is cloud-based, enabling scalable and efficient processing. Our strategy overcomes the shortcomings of conventional classification approaches by leveraging diffusion models inherent capacity to effectively understand complicated data distributions. We investigate important diffusion characteristics, such as timestep schedulers, timestep encoding techniques, timestep count, and architectural design changes, using a thorough ablation study, and have conducted a comprehensive evaluation of the proposed model against the baseline models on a publicly available dataset. The proposed diffusion model performs best in image-based accident detection with an accuracy of 97.32%.

[150] SAMSOD: Rethinking SAM Optimization for RGB-T Salient Object Detection

Zhengyi Liu,Xinrui Wang,Xianyong Fang,Zhengzheng Tu,Linbo Wang

Main category: cs.CV

TL;DR: 本文提出了一种名为SAMSOD的模型,用于提升RGB-T显著目标检测性能,通过单模态监督和梯度去冲突机制解决模态不平衡和梯度差异问题。

Details Motivation: 现有方法忽略了双模态收敛不平衡以及高低激活神经元间的显著梯度差异,限制了性能提升。 Method: 提出SAMSOD模型,采用单模态监督增强非主导模态学习,利用梯度去冲突减少冲突梯度影响,并通过两个解耦适配器分别掩码高/低激活神经元以强化背景学习。 Result: 在多个RGB-T SOD基准数据集及其他相关任务上进行实验,验证了方法的有效性和泛化能力。 Conclusion: SAMSOD有效提升了RGB-T显著目标检测性能,解决了模态不平衡和梯度冲突问题,具有良好的通用性。 Abstract: RGB-T salient object detection (SOD) aims to segment attractive objects by combining RGB and thermal infrared images. To enhance performance, the Segment Anything Model has been fine-tuned for this task. However, the imbalance convergence of two modalities and significant gradient difference between high- and low- activations are ignored, thereby leaving room for further performance enhancement. In this paper, we propose a model called \textit{SAMSOD}, which utilizes unimodal supervision to enhance the learning of non-dominant modality and employs gradient deconfliction to reduce the impact of conflicting gradients on model convergence. The method also leverages two decoupled adapters to separately mask high- and low-activation neurons, emphasizing foreground objects by enhancing background learning. Fundamental experiments on RGB-T SOD benchmark datasets and generalizability experiments on scribble supervised RGB-T SOD, fully supervised RGB-D SOD datasets and full-supervised RGB-D rail surface defect detection all demonstrate the effectiveness of our proposed method.

[151] Referring Expression Comprehension for Small Objects

Kanoko Goto,Takumi Hirose,Mahiro Ukai,Shuhei Kurita,Nakamasa Inoue

Main category: cs.CV

TL;DR: 本文提出了一个针对小物体指代表达理解(REC)的新数据集SOREC和一种渐进式迭代缩放适配器PIZA,显著提升了在驾驶场景中小物体的定位精度。

Details Motivation: 在实际应用如自动驾驶中,精确定位极小物体至关重要,但现有方法在处理小物体时仍面临挑战。 Method: 构建了包含10万对指代表达和小物体边界框的SOREC数据集,并提出PIZA适配模块,用于参数高效微调,使模型能逐步聚焦并定位小物体。 Result: 在SOREC数据集上,将PIZA应用于GroundingDINO模型显著提高了定位准确率。 Conclusion: PIZA结合SOREC为小物体REC任务提供了有效解决方案,推动了视觉-语言模型在真实场景中的应用。 Abstract: Referring expression comprehension (REC) aims to localize the target object described by a natural language expression. Recent advances in vision-language learning have led to significant performance improvements in REC tasks. However, localizing extremely small objects remains a considerable challenge despite its importance in real-world applications such as autonomous driving. To address this issue, we introduce a novel dataset and method for REC targeting small objects. First, we present the small object REC (SOREC) dataset, which consists of 100,000 pairs of referring expressions and corresponding bounding boxes for small objects in driving scenarios. Second, we propose the progressive-iterative zooming adapter (PIZA), an adapter module for parameter-efficient fine-tuning that enables models to progressively zoom in and localize small objects. In a series of experiments, we apply PIZA to GroundingDINO and demonstrate a significant improvement in accuracy on the SOREC dataset. Our dataset, codes and pre-trained models are publicly available on the project page.

[152] Artery-Vein Segmentation from Fundus Images using Deep Learning

Sharan SK,Subin Sahayam,Umarani Jayaraman,Lakshmi Priya A

Main category: cs.CV

TL;DR: 提出一种基于注意力机制的深度学习模型Attention-WNet,用于视网膜动脉-静脉分割,在HRF和DRIVE数据集上优于现有方法。

Details Motivation: 准确分割视网膜动脉和静脉对于诊断视网膜疾病及评估全身血管健康具有重要意义,但现有方法仍有提升空间。 Method: 在WNet深度学习模型中引入注意力机制,构建新的Attention-WNet模型,用于视网膜动脉-静脉分割。 Result: 在HRF和DRIVE公开数据集上的实验表明,该方法性能优于当前最先进的模型。 Conclusion: Attention-WNet能有效提升视网膜动脉-静脉分割精度,具有良好的应用前景。 Abstract: Segmenting of clinically important retinal blood vessels into arteries and veins is a prerequisite for retinal vessel analysis. Such analysis can provide potential insights and bio-markers for identifying and diagnosing various retinal eye diseases. Alteration in the regularity and width of the retinal blood vessels can act as an indicator of the health of the vasculature system all over the body. It can help identify patients at high risk of developing vasculature diseases like stroke and myocardial infarction. Over the years, various Deep Learning architectures have been proposed to perform retinal vessel segmentation. Recently, attention mechanisms have been increasingly used in image segmentation tasks. The work proposes a new Deep Learning approach for artery-vein segmentation. The new approach is based on the Attention mechanism that is incorporated into the WNet Deep Learning model, and we call the model as Attention-WNet. The proposed approach has been tested on publicly available datasets such as HRF and DRIVE datasets. The proposed approach has outperformed other state-of-art models available in the literature.

[153] Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models

Leander Girrbach,Stephan Alaniz,Genevieve Smith,Trevor Darrell,Zeynep Akata

Main category: cs.CV

TL;DR: 本研究通过为LAION-400M数据集添加大规模人物中心注释,揭示了视觉-语言模型中训练数据与人口统计偏差之间的实证联系,发现数据中的共现模式可解释60-70%的性别偏见。

Details Motivation: 视觉-语言模型存在人口统计偏差,但训练数据在其中的作用尚不明确,主要受限于缺乏大规模数据集的人口统计标注。 Method: 构建自动标注流水线,结合目标检测、多模态描述生成和微调分类器,为LAION-400M数据集生成包含2.76亿边界框、性别与种族/族裔标签及自动生成描述的注释。 Result: 发现了显著的人口统计不平衡和有害关联(如将男性或被视为黑人或中东人者与犯罪内容过度关联),并证明CLIP和Stable Diffusion中60-70%的性别偏差可通过数据中的直接共现线性解释。 Conclusion: 训练数据的组成是下游模型偏差的重要来源,本研究提供了首个大规模实证证据,连接数据构成与模型偏见。 Abstract: Vision-language models trained on large-scale multimodal datasets show strong demographic biases, but the role of training data in producing these biases remains unclear. A major barrier has been the lack of demographic annotations in web-scale datasets such as LAION-400M. We address this gap by creating person-centric annotations for the full dataset, including over 276 million bounding boxes, perceived gender and race/ethnicity labels, and automatically generated captions. These annotations are produced through validated automatic labeling pipelines combining object detection, multimodal captioning, and finetuned classifiers. Using them, we uncover demographic imbalances and harmful associations, such as the disproportionate linking of men and individuals perceived as Black or Middle Eastern with crime-related and negative content. We also show that 60-70% of gender bias in CLIP and Stable Diffusion can be linearly explained by direct co-occurrences in the data. Our resources establish the first large-scale empirical link between dataset composition and downstream model bias.

[154] Mapping Rio de Janeiro's favelas: general-purpose vs. satellite-specific neural networks

Thomas Hallopeau,Joris Guérin,Laurent Demagistri,Youssef Fouzai,Renata Gracie,Vanderlei Pascoal De Matos,Helen Gurgel,Nadine Dessay

Main category: cs.CV

TL;DR: 比较了通用预训练神经网络和专用卫星图像预训练网络在检测里约热内卢贫民窟中的表现,探讨任务特异性与数据量对城市非正规居住区检测性能的影响。

Details Motivation: 现有深度学习方法尚未充分利用近期预训练神经网络的潜力来检测非正规居住区。 Method: 对比两种预训练神经网络:一种是在大规模多样化图像数据集上预训练的通用网络,另一种是在卫星图像上预训练的专用网络。 Result: 研究发现数据量较大的通用预训练网络可能优于任务更专一但数据较少的专用网络,具体性能取决于应用场景。 Conclusion: 在非正规居住区检测任务中,数据量可能比任务特异性对模型性能影响更大。 Abstract: While deep learning methods for detecting informal settlements have already been developed, they have not yet fully utilized the potential offered by recent pretrained neural networks. We compare two types of pretrained neural networks for detecting the favelas of Rio de Janeiro: 1. Generic networks pretrained on large diverse datasets of unspecific images, 2. A specialized network pretrained on satellite imagery. While the latter is more specific to the target task, the former has been pretrained on significantly more images. Hence, this research investigates whether task specificity or data volume yields superior performance in urban informal settlement detection.

[155] LoRA Patching: Exposing the Fragility of Proactive Defenses against Deepfakes

Zuomin Qu,Yimao Guo,Qianyue Hu,Wei Lu

Main category: cs.CV

TL;DR: 本文提出了一种名为LoRA补丁的新方法,通过在深度伪造生成器中注入可学习的低秩适配补丁,成功绕过现有的主动防御机制,并引入多模态特征对齐损失和门控机制以提升稳定性与效果。

Details Motivation: 现有的深度伪造主动防御方法缺乏鲁棒性和可靠性,容易被绕过,因此需要研究其安全性漏洞并提出更有效的攻击与防御策略。 Method: 提出LoRA补丁方法,将可插拔的LoRA模块注入Deepfake生成器,并设计可学习的门控机制控制补丁影响,防止微调过程中的梯度爆炸;同时引入多模态特征对齐(MMFA)损失,使对抗输出在语义层面与目标输出对齐。此外还提出防御性LoRA补丁,在输出中嵌入可见警告。 Result: 仅使用1,000个人脸样本和单轮微调,LoRA补丁即可成功击败多种最先进的主动防御方法,揭示了当前防御范式的严重弱点。 Conclusion: 当前的主动防御机制在面对基于LoRA的攻击时存在明显脆弱性,未来需要构建更鲁棒的深度伪造防御体系。 Abstract: Deepfakes pose significant societal risks, motivating the development of proactive defenses that embed adversarial perturbations in facial images to prevent manipulation. However, in this paper, we show that these preemptive defenses often lack robustness and reliability. We propose a novel approach, Low-Rank Adaptation (LoRA) patching, which injects a plug-and-play LoRA patch into Deepfake generators to bypass state-of-the-art defenses. A learnable gating mechanism adaptively controls the effect of the LoRA patch and prevents gradient explosions during fine-tuning. We also introduce a Multi-Modal Feature Alignment (MMFA) loss, encouraging the features of adversarial outputs to align with those of the desired outputs at the semantic level. Beyond bypassing, we present defensive LoRA patching, embedding visible warnings in the outputs as a complementary solution to mitigate this newly identified security vulnerability. With only 1,000 facial examples and a single epoch of fine-tuning, LoRA patching successfully defeats multiple proactive defenses. These results reveal a critical weakness in current paradigms and underscore the need for more robust Deepfake defense strategies. Our code is available at https://github.com/ZOMIN28/LoRA-Patching.

[156] The Overlooked Value of Test-time Reference Sets in Visual Place Recognition

Mubariz Zaffar,Liangliang Nan,Sebastian Scherer,Julian F. P. Kooij

Main category: cs.CV

TL;DR: 提出了一种基于测试时参考集微调(RSF)的方法,以缩小训练与测试域之间的差距,显著提升现有最先进视觉位置识别方法在具有挑战性基准上的性能。

Details Motivation: 现有VPR方法在训练和测试环境差异较大时表现不佳,需寻找新信息源来缩小域间差距。 Method: 利用测试时已知的参考图像集(地图)对VPR模型进行微调,即参考集微调(RSF),以适应目标域。 Result: 在多个具有挑战性的基准上,Recall@1平均提升了约2.3%,且微调后模型仍保持良好的泛化能力。 Conclusion: RSF是一种简单有效的方法,可作为现有VPR方法的补充,显著提升其在跨域场景下的性能。 Abstract: Given a query image, Visual Place Recognition (VPR) is the task of retrieving an image of the same place from a reference database with robustness to viewpoint and appearance changes. Recent works show that some VPR benchmarks are solved by methods using Vision-Foundation-Model backbones and trained on large-scale and diverse VPR-specific datasets. Several benchmarks remain challenging, particularly when the test environments differ significantly from the usual VPR training datasets. We propose a complementary, unexplored source of information to bridge the train-test domain gap, which can further improve the performance of State-of-the-Art (SOTA) VPR methods on such challenging benchmarks. Concretely, we identify that the test-time reference set, the "map", contains images and poses of the target domain, and must be available before the test-time query is received in several VPR applications. Therefore, we propose to perform simple Reference-Set-Finetuning (RSF) of VPR models on the map, boosting the SOTA (~2.3% increase on average for Recall@1) on these challenging datasets. Finetuned models retain generalization, and RSF works across diverse test datasets.

[157] Adaptively Sampling-Reusing-Mixing Decomposed Gradients to Speed Up Sharpness Aware Minimization

Jiaxin Deng,Junbiao Pang

Main category: cs.CV

TL;DR: 提出ARSAM方法,通过分解并重用Sharpness-Aware Minimization中的梯度分量,在保持SAM泛化性能的同时显著加速训练过程。

Details Motivation: SAM虽能提升模型泛化能力,但因其每步需计算两次梯度,导致计算成本翻倍,亟需一种高效且保持性能的加速方法。 Method: 将SAM的梯度分解为SGD梯度和一阶梯度方向上的二阶投影分量(PSF),发现PSF在训练中动态演化并对寻找平坦极小值至关重要;ARSAM通过自适应采样、重用和混合分解后的梯度来减少冗余计算。 Result: 在CIFAR-10/100上,ARSAM达到与SAM相当的精度,同时提速约40%;在人体姿态估计和模型量化等任务中也表现出加速效果且不损失性能。 Conclusion: ARSAM有效平衡了SAM的高计算成本与模型泛化能力,是一种通用且高效的优化加速方法。 Abstract: Sharpness-Aware Minimization (SAM) improves model generalization but doubles the computational cost of Stochastic Gradient Descent (SGD) by requiring twice the gradient calculations per optimization step. To mitigate this, we propose Adaptively sampling-Reusing-mixing decomposed gradients to significantly accelerate SAM (ARSAM). Concretely, we firstly discover that SAM's gradient can be decomposed into the SGD gradient and the Projection of the Second-order gradient onto the First-order gradient (PSF). Furthermore, we observe that the SGD gradient and PSF dynamically evolve during training, emphasizing the growing role of the PSF to achieve a flat minima. Therefore, ARSAM is proposed to the reused PSF and the timely updated PSF still maintain the model's generalization ability. Extensive experiments show that ARSAM achieves state-of-the-art accuracies comparable to SAM across diverse network architectures. On CIFAR-10/100, ARSAM is comparable to SAM while providing a speedup of about 40\%. Moreover, ARSAM accelerates optimization for the various challenge tasks (\textit{e.g.}, human pose estimation, and model quantization) without sacrificing performance, demonstrating its broad practicality.% The code is publicly accessible at: https://github.com/ajiaaa/ARSAM.

[158] CoPA: Hierarchical Concept Prompting and Aggregating Network for Explainable Diagnosis

Yiheng Dong,Yi Lin,Xin Yang

Main category: cs.CV

TL;DR: 提出了一种名为Concept Prompting and Aggregating (CoPA)的新框架,用于在提示引导下捕捉多层概念,提升概念和疾病预测性能。

Details Motivation: 现有基于概念的方法在概念捕捉能力上存在不足,通常仅依赖最后一层特征,忽略了浅层和多尺度特征,且缺乏有效的概念编码指导,限制了细粒度概念的提取。 Method: 提出CoPA框架,通过Concept-aware Embedding Generator (CEG)从视觉编码器的每一层提取概念表示,并将其作为Concept Prompt Tuning (CPT)的提示,引导模型增强关键的概念相关视觉线索,同时聚合各层视觉表示以对齐文本概念表示。 Result: 在三个公开数据集上的实验表明,CoPA在概念和疾病预测任务上优于当前最先进的方法。 Conclusion: CoPA有效捕获并利用图像中的概念级信息,提升了模型的可解释性和诊断性能。 Abstract: The transparency of deep learning models is essential for clinical diagnostics. Concept Bottleneck Model provides clear decision-making processes for diagnosis by transforming the latent space of black-box models into human-understandable concepts. However, concept-based methods still face challenges in concept capture capabilities. These methods often rely on encode features solely from the final layer, neglecting shallow and multiscale features, and lack effective guidance in concept encoding, hindering fine-grained concept extraction. To address these issues, we introduce Concept Prompting and Aggregating (CoPA), a novel framework designed to capture multilayer concepts under prompt guidance. This framework utilizes the Concept-aware Embedding Generator (CEG) to extract concept representations from each layer of the visual encoder. Simultaneously, these representations serve as prompts for Concept Prompt Tuning (CPT), steering the model towards amplifying critical concept-related visual cues. Visual representations from each layer are aggregated to align with textual concept representations. With the proposed method, valuable concept-wise information in the images is captured and utilized effectively, thus improving the performance of concept and disease prediction. Extensive experimental results demonstrate that CoPA outperforms state-of-the-art methods on three public datasets. Code is available at https://github.com/yihengd/CoPA.

[159] Efficiency vs. Efficacy: Assessing the Compression Ratio-Dice Score Relationship through a Simple Benchmarking Framework for Cerebrovascular 3D Segmentation

Shimaa Elbana,Ahmad Kamal,Shahd Ahmed Ali,Ahmad Al-Kabbany

Main category: cs.CV

TL;DR: 本研究探讨了ZFP压缩技术在不降低脑血管自动分割性能的前提下,对大规模3D医学影像数据集进行高效压缩的可行性,结果表明ZFP可在高达22.89:1的压缩比下保持分割精度(Dice系数约0.87656),显著提升医学影像数据共享与协作研究的效率。

Details Motivation: 大规模3D医学影像数据的快速增长带来了存储和共享的挑战,限制了跨机构协作和研究成果的可转移性,因此需要一种既能高效压缩数据又能保持关键分析性能的压缩方法。 Method: 采用ZFP压缩技术的误差容忍模式和固定比率模式,对包含真实血管分割标注的大规模3D医学影像数据集进行压缩,并在压缩后的数据上进行脑血管自动分割,使用Dice系数评估分割结果与原始未压缩数据之间的性能差异。 Result: ZFP在误差容忍模式下实现了最高达22.89:1的压缩比,同时保持平均Dice系数为0.87656(原始基线约为0.8774),表明压缩后数据仍能支持高质量的自动分割任务。 Conclusion: ZFP是一种可行且强大的工具,能够在几乎不损失语义信息的情况下大幅压缩3D医学影像数据,有助于推动大规模医学影像数据的共享与协作研究。 Abstract: The increasing size and complexity of medical imaging datasets, particularly in 3D formats, present significant barriers to collaborative research and transferability. This study investigates whether the ZFP compression technique can mitigate these challenges without compromising the performance of automated cerebrovascular segmentation, a critical first step in intracranial aneurysm detection. We apply ZFP in both its error tolerance and fixed-rate modes to a large scale, and one of the most recent, datasets in the literature, 3D medical dataset containing ground-truth vascular segmentations. The segmentation quality on the compressed volumes is rigorously compared to the uncompressed baseline (Dice approximately equals 0.8774). Our findings reveal that ZFP can achieve substantial data reduction--up to a 22.89:1 ratio in error tolerance mode--while maintaining a high degree of fidelity, with the mean Dice coefficient remaining high at 0.87656. These results demonstrate that ZFP is a viable and powerful tool for enabling more efficient and accessible research on large-scale medical datasets, fostering broader collaboration across the community.

[160] MambaCAFU: Hybrid Multi-Scale and Multi-Attention Model with Mamba-Based Fusion for Medical Image Segmentation

T-Mai Bui,Fares Bougourzi,Fadi Dornaika,Vinh Truong Hoang

Main category: cs.CV

TL;DR: 提出一种结合CNN、Transformer和Mamba的混合分割架构,通过多尺度注意力解码器和协同注意力门机制,在多个医学图像数据集上实现了优于现有方法的性能,同时保持了良好的计算效率。

Details Motivation: 现有医学图像分割模型通常任务特定,跨模态和跨解剖区域泛化能力差,且在准确性和效率之间难以平衡。 Method: 设计了一个三分支编码器(集成CNN、Transformer和基于Mamba的注意力融合机制)以捕获局部、全局和长程依赖,并采用多尺度注意力CNN解码器与协同注意力门增强特征选择和跨尺度交互。 Result: 在多个基准数据集上实验表明,该方法在分割精度和泛化能力上优于当前最先进方法,同时计算复杂度相当。 Conclusion: 所提出的架构在效率与效果之间取得了良好平衡,为多种医学影像分割任务提供了实用且可扩展的解决方案。 Abstract: In recent years, deep learning has shown near-expert performance in segmenting complex medical tissues and tumors. However, existing models are often task-specific, with performance varying across modalities and anatomical regions. Balancing model complexity and performance remains challenging, particularly in clinical settings where both accuracy and efficiency are critical. To address these issues, we propose a hybrid segmentation architecture featuring a three-branch encoder that integrates CNNs, Transformers, and a Mamba-based Attention Fusion (MAF) mechanism to capture local, global, and long-range dependencies. A multi-scale attention-based CNN decoder reconstructs fine-grained segmentation maps while preserving contextual consistency. Additionally, a co-attention gate enhances feature selection by emphasizing relevant spatial and semantic information across scales during both encoding and decoding, improving feature interaction and cross-scale communication. Extensive experiments on multiple benchmark datasets show that our approach outperforms state-of-the-art methods in accuracy and generalization, while maintaining comparable computational complexity. By effectively balancing efficiency and effectiveness, our architecture offers a practical and scalable solution for diverse medical imaging tasks. Source code and trained models will be publicly released upon acceptance to support reproducibility and further research.

[161] Road Damage and Manhole Detection using Deep Learning for Smart Cities: A Polygonal Annotation Approach

Rasel Hossen,Diptajoy Mistry,Mushiur Rahman,Waki As Sami Atikur Rahman Hridoy,Sajib Saha,Muhammad Ibrahim

Main category: cs.CV

TL;DR: 本文提出了一种基于YOLOv9算法并使用多边形标注的深度学习方法,用于自动检测道路损坏和井盖,构建了一个包含一千多张图像的新数据集,在破损和未破损类别上表现良好,但井盖检测因类别不平衡而效果较差。

Details Motivation: 手动监测道路损坏耗时、成本高且易出错,亟需自动化、高效的解决方案以提升城市基础设施维护水平。 Method: 采用YOLOv9算法,使用多边形标注替代传统的边界框标注,以实现对道路缺陷更精确的定位,并在自建数据集上训练模型识别三类目标:破损、未破损和井盖。 Result: 整体图像级准确率达到78.1%,破损类别F1得分为86.7%,未破损为89.2%,但井盖检测仅为18.2%,主要受限于类别不平衡问题。 Conclusion: 该方法为发展中国家的城市基础设施监测提供了一种高效且可扩展的自动化解决方案,尽管在少数类检测上仍有改进空间。 Abstract: Urban safety and infrastructure maintenance are critical components of smart city development. Manual monitoring of road damages is time-consuming, highly costly, and error-prone. This paper presents a deep learning approach for automated road damage and manhole detection using the YOLOv9 algorithm with polygonal annotations. Unlike traditional bounding box annotation, we employ polygonal annotations for more precise localization of road defects. We develop a novel dataset comprising more than one thousand images which are mostly collected from Dhaka, Bangladesh. This dataset is used to train a YOLO-based model for three classes, namely Broken, Not Broken, and Manhole. We achieve 78.1% overall image-level accuracy. The YOLOv9 model demonstrates strong performance for Broken (86.7% F1-score) and Not Broken (89.2% F1-score) classes, with challenges in Manhole detection (18.2% F1-score) due to class imbalance. Our approach offers an efficient and scalable solution for monitoring urban infrastructure in developing countries.

[162] Contrastive-SDE: Guiding Stochastic Differential Equations with Contrastive Learning for Unpaired Image-to-Image Translation

Venkata Narendra Kotyada,Revanth Eranki,Nagesh Bhattu Sristy

Main category: cs.CV

TL;DR: 本文提出了一种基于时间依赖对比学习的非配对图像到图像翻译方法Contrastive-SDE,结合扩散模型与SimCLR框架,在无需标签监督的情况下实现高效、高质量的跨域生成,实验表明其在多个指标上达到先进水平且收敛更快。

Details Motivation: 非配对图像翻译缺乏对应样本,传统方法难以保持语义一致性且训练效率低,因此需要一种无需监督、能保留领域不变特征并加快收敛的方法。 Method: 提出时间依赖的对比学习方法,将图像与其领域不变特征视为正样本对,通过SimCLR训练对比模型,并用该模型引导预训练SDE进行图像翻译。 Result: 在三个常见的非配对I2I任务中,使用四项指标评估,Contrastive-SDE在多个指标上达到与最先进方法相当的结果,且模型收敛速度显著更快。 Conclusion: Contrastive-SDE是一种高效、无需标签监督的非配对图像翻译方法,结合对比学习与扩散模型的优势,能够在保持语义一致性的同时加速训练过程。 Abstract: Unpaired image-to-image translation involves learning mappings between source domain and target domain in the absence of aligned or corresponding samples. Score based diffusion models have demonstrated state-of-the-art performance in generative tasks. Their ability to approximate complex data distributions through stochastic differential equations (SDEs) enables them to generate high-fidelity and diverse outputs, making them particularly well-suited for unpaired I2I settings. In parallel, contrastive learning provides a powerful framework for learning semantic similarities without the need for explicit supervision or paired data. By pulling together representations of semantically similar samples and pushing apart dissimilar ones, contrastive methods are inherently aligned with the objectives of unpaired translation. Its ability to selectively enforce semantic consistency at the feature level makes contrastive learning particularly effective for guiding generation in unpaired scenarios. In this work, we propose a time-dependent contrastive learning approach where a model is trained with SimCLR by considering an image and its domain invarient feature as a positive pair, enabling the preservation of domain-invariant features and the discarding of domain-specific ones. The learned contrastive model then guides the inference of a pretrained SDE for the I2I translation task. We empirically compare Contrastive-SDE with several baselines across three common unpaired I2I tasks, using four metrics for evaluation. Constrastive-SDE achieves comparable results to the state-of-the-art on several metrics. Furthermore, we observe that our model converges significantly faster and requires no label supervision or classifier training, making it a more efficient alternative for this task.

[163] LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

Xueyang Zhou,Yangming Xu,Guiyao Tie,Yongchao Chen,Guowen Zhang,Duanfeng Chu,Pan Zhou,Lichao Sun

Main category: cs.CV

TL;DR: 本文提出了LIBERO-PRO,一个改进的Vision-Language-Action模型评测基准,通过在四个维度上引入合理扰动,揭示了现有模型在标准评测中表现虚高,实际泛化能力极差,主要依赖记忆而非真正理解任务。

Details Motivation: 现有的LIBERO评测设置存在缺陷,导致模型性能被高估,难以进行公平比较,因此需要更严格的评测基准来真实反映模型的泛化与理解能力。 Method: 构建LIBERO-PRO基准,在对象、初始状态、任务指令和环境四个维度上引入扰动,系统评估模型在非理想条件下的表现。 Result: 实验显示,现有模型在标准LIBERO上准确率超过90%,但在LIBERO-PRO下性能骤降至0.0%,暴露出其依赖记忆动作序列和环境布局,缺乏真正感知与理解能力。 Conclusion: 当前VLA模型的评测方法存在严重问题,应摒弃误导性评测方式,采用更鲁棒的评估标准以推动模型真正理解与泛化能力的发展。 Abstract: LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often leading to inflated performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that, although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. Crucially, this discrepancy exposes the models' reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception. For instance, models persist in executing grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged even when given corrupted instructions or even messy tokens. These findings expose the severe flaws in current evaluation practices, and we call on the community to abandon misleading methodologies in favor of robust assessments of model generalization and comprehension. Our code is available at: https://github.com/Zxy-MLlab/LIBERO-PRO.

[164] Mirage: Unveiling Hidden Artifacts in Synthetic Images with Large Vision-Language Models

Pranav Sharma,Shivank Garg,Durga Toshniwal

Main category: cs.CV

TL;DR: 本文提出了一个名为Mirage的数据集,包含具有可见伪影的AI生成图像,用于研究当前检测方法的局限性,并探讨大型视觉语言模型(LVLMs)在可解释AI图像检测中的潜力。

Details Motivation: 由于现有AI检测器难以识别日益逼真的AI生成图像,而人类仍能察觉其中的伪影,因此需要探究人与AI检测之间的差异,并开发更有效的检测方法。 Method: 构建了一个包含多种带可见伪影AI生成图像的 curated 数据集Mirage,并在该数据集及现有基准上评估了大型视觉语言模型(LVLMs)在AI图像检测中的表现。 Result: 实验表明,LVLMs在检测具有可见伪影的AI图像时表现良好,但在面对无明显伪影的图像时性能下降。 Conclusion: LVLMs具备作为可解释AI图像检测工具的潜力,但其依赖于可见伪影,在处理更逼真、无明显痕迹的合成图像时仍有局限性。 Abstract: Recent advances in image generation models have led to models that produce synthetic images that are increasingly difficult for standard AI detectors to identify, even though they often remain distinguishable by humans. To identify this discrepancy, we introduce \textbf{Mirage}, a curated dataset comprising a diverse range of AI-generated images exhibiting visible artifacts, where current state-of-the-art detection methods largely fail. Furthermore, we investigate whether Large Vision-Language Models (LVLMs), which are increasingly employed as substitutes for human judgment in various tasks, can be leveraged for explainable AI image detection. Our experiments on both Mirage and existing benchmark datasets demonstrate that while LVLMs are highly effective at detecting AI-generated images with visible artifacts, their performance declines when confronted with images lacking such cues.

[165] UGround: Towards Unified Visual Grounding with Unrolled Transformers

Rui Qian,Xin Yin,Chuanhang Deng,Zhiyuan Peng,Jian Xiong,Wei Zhai,Dejing Dou

Main category: cs.CV

TL;DR: 提出了一种统一的视觉接地范式UGround,通过在展开的Transformer中动态选择中间层作为“mask as prompt”,解决了传统方法依赖固定最后一层和使用作为提示缺乏显式空间线索的问题。

Details Motivation: 解决现有视觉接地方法中由于依赖固定最后一层导致误差累积以及提示缺乏明确空间信息的问题。 Method: 提出Policy-Prompted Masking,包含Stochastic Skip Connection(SSC)和Mask as Prompt(MasP):SSC通过强化学习策略动态选择Transformer层进行跳跃连接;MasP利用与图像token的相似性图作为软logit掩码,为SAM提供显式空间提示。 Result: UGround首次在一个框架内统一了多种视觉接地任务,包括指代表达分割、推理分割、单目标到多目标、正向查询到错误前提(空目标)等,验证了其有效性。 Conclusion: UGround通过动态层选择和显式空间提示机制,在多个视觉接地任务上实现了统一且有效的性能提升,展现出更强的灵活性和鲁棒性。 Abstract: We present UGround, a \textbf{U}nified visual \textbf{Ground}ing paradigm that dynamically selects intermediate layers across \textbf{U}nrolled transformers as ``mask as prompt'', diverging from the prevailing pipeline that leverages the fixed last hidden layer as ``\texttt{} as prompt''. UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of \texttt{} as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (\eg, coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each \texttt{} token to slide across unrolled transformer layers, enabling dynamic layer selection at which it connects to the vision model (\eg, SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the \texttt{} token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we, for the first time, have unified visual grounding within a single framework from an attribute perspective, spanning from traditional refer expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, positive query to false premise (empty target). All codes and models are publicly available at \href{https://github.com/rui-qian/UGround}{https://github.com/rui-qian/UGround}.

[166] Optimized Minimal 4D Gaussian Splatting

Minseo Lee,Byeonghyeon Lee,Lucas Yunkyu Lee,Eunsoo Lee,Sangmin Kim,Seunghyeon Song,Joo Chan Lee,Jong Hwan Ko,Jaesik Park,Eunbyung Park

Main category: cs.CV

TL;DR: 本文提出OMG4,一种优化的最小化4D高斯点阵方法,通过三阶段渐进式剪枝和压缩技术,在保持高质量重建的同时显著减少存储开销。

Details Motivation: 4D高斯点阵虽能实现实时渲染复杂动态场景,但存在巨大的存储开销问题,现有方法在压缩比或视觉质量上仍有局限。 Method: 提出OMG4框架,包含三个阶段的高斯剪枝:采样关键 primitives、去除冗余、合并相似 primitives;并结合隐式外观压缩和扩展子向量量化(SVQ)到4D表示。 Result: 在标准数据集上实验表明,OMG4相比最新方法可减少60%以上的模型大小,同时保持重建质量。 Conclusion: OMG4在紧凑型4D场景表示方面取得重要进展,为多种应用提供了新可能。 Abstract: 4D Gaussian Splatting has emerged as a new paradigm for dynamic scene representation, enabling real-time rendering of scenes with complex motions. However, it faces a major challenge of storage overhead, as millions of Gaussians are required for high-fidelity reconstruction. While several studies have attempted to alleviate this memory burden, they still face limitations in compression ratio or visual quality. In this work, we present OMG4 (Optimized Minimal 4D Gaussian Splatting), a framework that constructs a compact set of salient Gaussians capable of faithfully representing 4D Gaussian models. Our method progressively prunes Gaussians in three stages: (1) Gaussian Sampling to identify primitives critical to reconstruction fidelity, (2) Gaussian Pruning to remove redundancies, and (3) Gaussian Merging to fuse primitives with similar characteristics. In addition, we integrate implicit appearance compression and generalize Sub-Vector Quantization (SVQ) to 4D representations, further reducing storage while preserving quality. Extensive experiments on standard benchmark datasets demonstrate that OMG4 significantly outperforms recent state-of-the-art methods, reducing model sizes by over 60% while maintaining reconstruction quality. These results position OMG4 as a significant step forward in compact 4D scene representation, opening new possibilities for a wide range of applications. Our source code is available at https://minshirley.github.io/OMG4/.

[167] Cross-View Open-Vocabulary Object Detection in Aerial Imagery

Jyoti Kini,Rohit Gupta,Mubarak Shah

Main category: cs.CV

TL;DR: 本文提出了一种新的框架,通过结构化域对齐将地面视角图像中的开放词汇表示迁移到航拍图像的目标检测中,解决了跨域知识迁移中的域偏移、视角变化和尺度差异问题,在多个航拍数据集上实现了显著的零样本性能提升。

Details Motivation: 传统目标检测模型受限于固定类别集合,难以灵活扩展新类别;而跨域(如从地面到航拍)的知识迁移面临域偏移、视角差异和尺度变化等挑战,需要有效的适应策略。 Method: 提出一种结构化域对齐框架,采用对比图像-图像对齐增强航拍与地面视图嵌入的相似性,并利用多实例词汇关联对齐航拍图像与文本嵌入,从而实现开放词汇目标检测。 Result: 在xView、DOTAv2、VisDrone、DIOR和HRRSD等多个航拍数据集上验证了方法有效性,在零样本设置下相比微调的闭集模型分别提升了+6.32 mAP(DOTAv2)、+4.16 mAP(VisDrone)和+3.46 mAP(HRRSD)。 Conclusion: 该方法有效实现了从地面到航拍图像的开放词汇目标检测知识迁移,提升了模型在未见类别上的检测能力,为航拍应用提供了更灵活、可扩展的目标检测方案。 Abstract: Traditional object detection models are typically trained on a fixed set of classes, limiting their flexibility and making it costly to incorporate new categories. Open-vocabulary object detection addresses this limitation by enabling models to identify unseen classes without explicit training. Leveraging pretrained models contrastively trained on abundantly available ground-view image-text classification pairs provides a strong foundation for open-vocabulary object detection in aerial imagery. Domain shifts, viewpoint variations, and extreme scale differences make direct knowledge transfer across domains ineffective, requiring specialized adaptation strategies. In this paper, we propose a novel framework for adapting open-vocabulary representations from ground-view images to solve object detection in aerial imagery through structured domain alignment. The method introduces contrastive image-to-image alignment to enhance the similarity between aerial and ground-view embeddings and employs multi-instance vocabulary associations to align aerial images with text embeddings. Extensive experiments on the xView, DOTAv2, VisDrone, DIOR, and HRRSD datasets are used to validate our approach. Our open-vocabulary model achieves improvements of +6.32 mAP on DOTAv2, +4.16 mAP on VisDrone (Images), and +3.46 mAP on HRRSD in the zero-shot setting when compared to finetuned closed-vocabulary dataset-specific model performance, thus paving the way for more flexible and scalable object detection systems in aerial applications.

[168] Exploring the Challenge and Value of Deep Learning in Automated Skin Disease Diagnosis

Runhao Liu,Ziming Chen,Peng Zhang

Main category: cs.CV

TL;DR: 本文综述了深度学习在皮肤癌诊断中的应用,讨论了当前面临的挑战及创新解决方案,并强调了其在临床工作流程中整合的潜力。

Details Motivation: 提高皮肤癌早期检测和诊断的准确性,改善患者预后。 Method: 基于PRISMA框架进行系统性文献综述,分析数据增强、混合模型和特征融合等方法。 Result: 总结了应对深度学习在皮肤癌诊断中挑战的有效策略,并探讨了模型在临床应用中的集成前景。 Conclusion: 深度学习有潜力革新皮肤病诊断,但仍需持续改进以充分发挥其在皮肤病学护理中的作用。 Abstract: Skin cancer is one of the most prevalent and deadly forms of cancer worldwide, which highlights the critical importance of early detection and diagnosis in improving patient outcomes. Deep learning (DL) has shown significant promise in enhancing the accuracy and efficiency of automated skin disease diagnosis, particularly in detecting and evaluating skin lesions and classification. However, there are still several challenges for DL-based skin cancer diagnosis, including complex features, image noise, intra-class variation, inter-class similarity, and data imbalance. By synthesizing recent research, this review discusses innovative approaches to cope with these challenges, such as data augmentation, hybrid models, and feature fusion, etc. Furthermore, the review highlights the integration of DL models into clinical workflows, offering insights into the potential of deep learning to revolutionize skin disease diagnosis and improve clinical decision-making. This article follows a comprehensive methodology based on the PRISMA framework and emphasizes the need for continued advancements to fully unlock the transformative potential of DL in dermatological care.

[169] SDAKD: Student Discriminator Assisted Knowledge Distillation for Super-Resolution Generative Adversarial Networks

Nikolaos Kaparinos,Vasileios Mezaris

Main category: cs.CV

TL;DR: 提出了一种新的GAN知识蒸馏方法SDAKD,通过引入学生判别器缓解师生网络间的容量不匹配问题,并在超分辨率任务上取得了优于现有方法的性能。

Details Motivation: 由于生成对抗网络(GAN)计算量大,难以部署在资源受限设备上,而现有的知识蒸馏方法因学生生成器与教师判别器之间的容量不匹配导致训练困难,因此需要更有效的压缩方法。 Method: 提出Student Discriminator Assisted Knowledge Distillation (SDAKD),采用三阶段训练策略,并在后两个阶段引入改进的特征图蒸馏方法,同时引入学生判别器来缩小师生模型间的差距。 Result: 在GCFSR和Real-ESRGAN两种超分辨率GAN上验证了SDAKD的有效性,结果表明该方法在多个指标上优于基线及其他SOTA GAN蒸馏方法。 Conclusion: SDAKD能有效缓解GAN知识蒸馏中的容量不匹配问题,显著提升学生生成器的性能,为轻量化GAN部署提供了可行方案。 Abstract: Generative Adversarial Networks (GANs) achieve excellent performance in generative tasks, such as image super-resolution, but their computational requirements make difficult their deployment on resource-constrained devices. While knowledge distillation is a promising research direction for GAN compression, effectively training a smaller student generator is challenging due to the capacity mismatch between the student generator and the teacher discriminator. In this work, we propose Student Discriminator Assisted Knowledge Distillation (SDAKD), a novel GAN distillation methodology that introduces a student discriminator to mitigate this capacity mismatch. SDAKD follows a three-stage training strategy, and integrates an adapted feature map distillation approach in its last two training stages. We evaluated SDAKD on two well-performing super-resolution GANs, GCFSR and Real-ESRGAN. Our experiments demonstrate consistent improvements over the baselines and SOTA GAN knowledge distillation methods. The SDAKD source code will be made openly available upon acceptance of the paper.

[170] PoseGaze-AHP: A Knowledge-Based 3D Dataset for AI-Driven Ocular and Postural Diagnosis

Saja Al-Dabet,Sherzod Turaev,Nazar Zaki,Arif O. Khan,Luai Eldweik

Main category: cs.CV

TL;DR: 本文提出了一个名为PoseGaze-AHP的新型3D数据集,用于眼源性异常头位(AHP)的诊断,该数据集同步捕捉头部姿态和眼球运动信息,并利用大语言模型从医学文献中提取结构化临床数据,通过Neural Head Avatar框架转换为3D表示,包含7,920张图像,覆盖多种眼部状况,提取方法准确率达91.92%,是首个面向AI驱动AHP诊断的公开资源。

Details Motivation: 现有的数据集分别关注头部姿态和眼球运动,限制了综合诊断方法的发展和AI在AHP分析中的应用。 Method: 使用大型语言模型(LLMs)结合逐步、分层和复杂提示策略,从医学文献中提取结构化临床数据,并通过Neural Head Avatar(NHA)框架将数据转换为3D表示。 Result: 成功构建了包含7,920张图像的PoseGaze-AHP数据集,覆盖多种眼部条件,数据提取准确率达到91.92%。 Conclusion: PoseGaze-AHP是首个专为AI驱动的眼源性AHP诊断设计的公开数据集,有助于开发准确且符合隐私保护要求的诊断工具。 Abstract: Diagnosing ocular-induced abnormal head posture (AHP) requires a comprehensive analysis of both head pose and ocular movements. However, existing datasets focus on these aspects separately, limiting the development of integrated diagnostic approaches and restricting AI-driven advancements in AHP analysis. To address this gap, we introduce PoseGaze-AHP, a novel 3D dataset that synchronously captures head pose and gaze movement information for ocular-induced AHP assessment. Structured clinical data were extracted from medical literature using large language models (LLMs) through an iterative process with the Claude 3.5 Sonnet model, combining stepwise, hierarchical, and complex prompting strategies. The extracted records were systematically imputed and transformed into 3D representations using the Neural Head Avatar (NHA) framework. The dataset includes 7,920 images generated from two head textures, covering a broad spectrum of ocular conditions. The extraction method achieved an overall accuracy of 91.92%, demonstrating its reliability for clinical dataset construction. PoseGaze-AHP is the first publicly available resource tailored for AI-driven ocular-induced AHP diagnosis, supporting the development of accurate and privacy-compliant diagnostic tools.

[171] DHQA-4D: Perceptual Quality Assessment of Dynamic 4D Digital Human

Yunhao Li,Sijing Wu,Yucheng Zhu,Huiyu Duan,Zicheng Zhang,Guangtao Zhai

Main category: cs.CV

TL;DR: 本文提出了一种用于动态4D数字人质量评估的新方法DynaMesh-Rater,并构建了大规模数据集DHQA-4D,结合多模态特征与LoRA微调技术,实现了对带纹理和无纹理4D网格的高质量评估。

Details Motivation: 由于4D数字人网格在采集、压缩和传输过程中易受噪声影响,影响用户体验,因此亟需有效的质量评估方法。 Method: 提出DHQA-4D数据集,包含32个高质量4D人体序列及1920个失真网格;设计DynaMesh-Rater模型,提取2D投影视频的视觉特征、剪辑视频的运动特征和4D网格的几何特征,利用大语言模型(LMM)融合多维特征,并通过LoRA指令微调实现质量评分预测。 Result: 在DHQA-4D数据集上的实验表明,DynaMesh-Rater在带纹理和无纹理4D网格的质量评估方面均优于现有方法。 Conclusion: DynaMesh-Rater能够有效整合多模态特征,显著提升动态4D数字人质量评估的准确性,为相关应用提供了可靠的质量评价工具。 Abstract: With the rapid development of 3D scanning and reconstruction technologies, dynamic digital human avatars based on 4D meshes have become increasingly popular. A high-precision dynamic digital human avatar can be applied to various fields such as game production, animation generation, and remote immersive communication. However, these 4D human avatar meshes are prone to being degraded by various types of noise during the processes of collection, compression, and transmission, thereby affecting the viewing experience of users. In light of this fact, quality assessment of dynamic 4D digital humans becomes increasingly important. In this paper, we first propose a large-scale dynamic digital human quality assessment dataset, DHQA-4D, which contains 32 high-quality real-scanned 4D human mesh sequences, 1920 distorted textured 4D human meshes degraded by 11 textured distortions, as well as their corresponding textured and non-textured mean opinion scores (MOSs). Equipped with DHQA-4D dataset, we analyze the influence of different types of distortion on human perception for textured dynamic 4D meshes and non-textured dynamic 4D meshes. Additionally, we propose DynaMesh-Rater, a novel large multimodal model (LMM) based approach that is able to assess both textured 4D meshes and non-textured 4D meshes. Concretely, DynaMesh-Rater elaborately extracts multi-dimensional features, including visual features from a projected 2D video, motion features from cropped video clips, and geometry features from the 4D human mesh to provide comprehensive quality-related information. Then we utilize a LMM model to integrate the multi-dimensional features and conduct a LoRA-based instruction tuning technique to teach the LMM model to predict the quality scores. Extensive experimental results on the DHQA-4D dataset demonstrate the superiority of our DynaMesh-Rater method over previous quality assessment methods.

[172] Skin Lesion Classification Based on ResNet-50 Enhanced With Adaptive Spatial Feature Fusion

Runhao Liu,Ziming Chen,Peng Zhang

Main category: cs.CV

TL;DR: 提出一种基于自适应空间特征融合(ASFF)的改进ResNet-50模型,用于皮肤癌分类,显著提升性能。

Details Motivation: 由于皮肤病变图像存在类间相似性高、类内差异大和图像噪声等问题,传统方法在准确分类皮肤癌方面面临挑战。 Method: 在ResNet-50中引入自适应空间特征融合(ASFF)机制,采用双分支结构融合高层语义与中层细节特征,通过全局平均池化和全连接层生成自适应权重进行加权融合,并结合Grad-CAM可视化验证特征关注区域。 Result: 在ISIC 2020子集上达到93.18%准确率,AUC分别为0.9717(ROC)和0.9670(P-R),优于5种经典CNN模型,且Grad-CAM显示模型能聚焦病灶区域并抑制背景干扰。 Conclusion: 所提ASFF-ResNet-50模型有效提升了皮肤癌分类的准确性与鲁棒性,具备良好的临床辅助诊断潜力。 Abstract: Skin cancer classification remains a challenging problem due to high inter-class similarity, intra-class variability, and image noise in dermoscopic images. To address these issues, we propose an improved ResNet-50 model enhanced with Adaptive Spatial Feature Fusion (ASFF), which adaptively integrates multi-scale semantic and surface features to improve feature representation and reduce overfitting. The ResNet-50 model is enhanced with an adaptive feature fusion mechanism to achieve more effective multi-scale feature extraction and improve overall performance. Specifically, a dual-branch design fuses high-level semantic and mid-level detail features, which are processed through global average pooling and fully connected layers to generate adaptive weights for weighted fusion, thereby strengthening feature learning and reducing the impact of noise on classification. The method is evaluated on a subset of the ISIC 2020 dataset containing 3297 benign and malignant skin lesion images. Experimental results show that the proposed ASFF-based ResNet-50 achieves the best overall performance compared with 5 classic convolutional neural networks (CNNs) models. The proposed model reached an accuracy of 93.18% along with higher precision, recall, specificity, and F1 score. The improved model achieves an AUC value of 0.9670 and 0.9717 in the P-R and ROC curve, respectively. Then, the evaluation based on Grad-CAM further proved that the improved model adaptively focuses on lesion-relevant regions while suppressing irrelevant background information, thereby validating its enhanced feature learning capability from a deep representation perspective. These findings demonstrate that the proposed approach provides a more effective and efficient solution for computer-aided skin cancer diagnosis.

[173] Multi-Modal Oral Cancer Detection Using Weighted Ensemble Convolutional Neural Networks

Ajo Babu George,Sreehari J R Ajo Babu George,Sreehari J R Ajo Babu George,Sreehari J R

Main category: cs.CV

TL;DR: 本研究提出了一种多模态深度学习框架,结合临床、影像和组织病理图像,利用DenseNet-121的加权集成模型提升口腔鳞癌(OSCC)的早期检测准确性。

Details Motivation: 由于超过50%的OSCC病例在晚期才被诊断,导致高死亡率,亟需提高早期诊断能力。 Method: 采用迁移学习训练三个模态专用的DenseNet-121 CNN模型,并通过验证集加权集成策略融合预测结果,数据经过增强和模态特定预处理。 Result: 各模态验证准确率分别为影像100%、组织病理95.12%、临床63.10%,集成模型在55个样本的多模态验证集上达到84.58%的整体准确率。 Conclusion: 该多模态集成框架可作为非侵入性AI辅助分诊工具,提升高风险病变的早期识别,支持临床决策,减少诊断延迟,改善患者预后。 Abstract: Aims Late diagnosis of Oral Squamous Cell Carcinoma (OSCC) contributes significantly to its high global mortality rate, with over 50\% of cases detected at advanced stages and a 5-year survival rate below 50\% according to WHO statistics. This study aims to improve early detection of OSCC by developing a multimodal deep learning framework that integrates clinical, radiological, and histopathological images using a weighted ensemble of DenseNet-121 convolutional neural networks (CNNs). Material and Methods A retrospective study was conducted using publicly available datasets representing three distinct medical imaging modalities. Each modality-specific dataset was used to train a DenseNet-121 CNN via transfer learning. Augmentation and modality-specific preprocessing were applied to increase robustness. Predictions were fused using a validation-weighted ensemble strategy. Evaluation was performed using accuracy, precision, recall, F1-score. Results High validation accuracy was achieved for radiological (100\%) and histopathological (95.12\%) modalities, with clinical images performing lower (63.10\%) due to visual heterogeneity. The ensemble model demonstrated improved diagnostic robustness with an overall accuracy of 84.58\% on a multimodal validation dataset of 55 samples. Conclusion The multimodal ensemble framework bridges gaps in the current diagnostic workflow by offering a non-invasive, AI-assisted triage tool that enhances early identification of high-risk lesions. It supports clinicians in decision-making, aligning with global oncology guidelines to reduce diagnostic delays and improve patient outcomes.

[174] Exploring Instruction Data Quality for Explainable Image Quality Assessment

Yunhao Li,Sijing Wu,Huiyu Duan,Yucheng Zhu,Qi Jia,Guangtao Zhai

Main category: cs.CV

TL;DR: 本文挑战了大规模指令调优数据的必要性,提出一种基于聚类的数据选择方法IQA-Select,在仅使用10%数据的情况下超越全量微调的性能,显著降低计算成本。

Details Motivation: 现有可解释图像质量评估(IQA)方法依赖大规模指令调优数据,带来高昂计算成本和数据冗余问题,本文旨在探究数据质量对模型性能的影响,挑战“越大越好”的缩放定律。 Method: 提出三阶段基于聚类的数据选择框架:特征提取、聚类配额分配和采样策略,设计了IQA-Select方法,通过系统分析各阶段选择实现高效数据筛选。 Result: 实验表明,使用仅10%的精选数据,IQA-Select在Q-Bench和AesBench上分别达到全量微调102.1%和103.7%的性能,且优于随机采样子集。 Conclusion: 高质量的指令调优数据比单纯的数据规模更重要,IQA-Select有效提升了训练效率与模型性能,为可解释IQA提供了更高效的数据利用方案。 Abstract: In recent years, with the rapid development of powerful multimodal large language models (MLLMs), explainable image quality assessment (IQA) has gradually become popular, aiming at providing quality-related descriptions and answers of images. To achieve this goal, recent methods seek to construct a large-scale instruction tuning dataset to empower the MLLM with quality perception ability following the well-known scaling law. However, a large amount of instruction tuning data may cause substantial computational costs and redundant data, which in turn will cause harm to the performance of the model. To cope with this problem, in this paper, we challenge the scaling law and systematically investigate the role of data quality of the instruction tuning dataset for explainable IQA. Using a powerful pre-trained MLLM, we first investigate the changes in model performance after fine-tuning with different sizes of instruction tuning data. We find that selecting a subset of the data set randomly using an appropriate ratio can even lead to better results than training with the entire instruction tuning dataset, demonstrating the redundancy of current explainable IQA instruction tuning data. Beyond randomly sampling a subset, we propose a clustering-based data selection framework with three stages: clustering feature extraction, cluster quota allocation, and cluster sampling strategy. Then we systematically analyze the choices of each stage and propose a simple but efficient data selection method IQA-Select for explainable IQA. The experimental results demonstrate that IQA-Select can achieve 102.1% and 103.7% performance of full fine-tuning using only 10% selected data in Q-Bench and AesBench respectively, significantly reducing computational costs while achieving better performance.

[175] Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert

Mingyu Liu,Zheng Huang,Xiaoyi Lin,Muzhi Zhu,Canyu Zhao,Zongze Du,Yating Wang,Haoyi Zhu,Hao Chen,Chunhua Shen

Main category: cs.CV

TL;DR: 提出一种新的视觉语言动作框架,通过解耦规划与执行,利用稀疏3D轨迹作为中间表示,提升跨任务泛化能力。

Details Motivation: 现有VLA模型因数据稀缺和语义模糊导致泛化能力差,且需针对新环境微调,系统协作机制不明确。 Method: 设计一个通用动作专家框架,使用稀疏3D航点作为VLM与低层动作模块之间的桥梁,并引入‘动作预训练+点云微调’的新训练范式。 Result: 实现了无需微调的跨任务迁移,提升了在新环境中的泛化能力和动作执行精度。 Conclusion: 该方法有效解耦了思考与行动,通过中间表示和分阶段训练,显著增强了VLA系统的可扩展性与实用性。 Abstract: Although Vision-Language Models (VLM) have demonstrated impressive planning and reasoning capabilities, translating these abilities into the physical world introduces significant challenges. Conventional Vision-Language-Action (VLA) models, which integrate reasoning and action into a monolithic architecture, generalize poorly because they are constrained by scarce, narrow-domain data. While recent dual-system approaches attempt to decouple "thinking" from "acting", they are often constrained by semantic ambiguities within the action module. This ambiguity makes large-scale, cross-task training infeasible. Consequently, these systems typically necessitate fine-tuning on newly collected data when deployed to novel environments, and the cooperation mechanism between the two systems remains ill-defined. To address these limitations, we introduce, for the first time, a framework centered around a generalizable action expert. Our approach utilizes sparse 3D trajectories as an intermediate representation, effectively bridging the high-level planning capabilities of the VLM with the low-level physical action module. During the planning phase, the VLM is only required to generate coarse 3D waypoints. These waypoints are then processed by our generalizable action expert, which refines them into dense, executable action sequences by sampling real-time point cloud observations of the environment. To promote training efficiency and robust generalization, we introduce a novel "Action Pre-training, Pointcloud Fine-tuning" paradigm. Our method combines the broad generalization capabilities of VLMs in visual understanding and planning with the fine-grained, action-level generalization of action expert.

[176] Zero-Shot Fine-Grained Image Classification Using Large Vision-Language Models

Md. Atabuzzaman,Andrew Zhang,Chris Thomas

Main category: cs.CV

TL;DR: 提出一种将零样本细粒度图像分类转化为视觉问答任务的新方法,利用大视觉语言模型的理解能力,并通过注意力干预技术提升性能,在多个基准上超越现有最先进方法。

Details Motivation: 探索大视觉语言模型在零样本细粒度图像分类中的潜力,该任务要求对视觉上相似的类别进行精确区分,而现有方法和数据集存在局限。 Method: 将分类任务转化为视觉问答框架,使用详细的类别描述而非直接生成类名,并引入注意力干预技术增强模型表现,同时构建更全面精确的类别描述基准。 Result: 在多个细粒度图像分类基准上进行了广泛实验,所提方法 consistently 超越当前最先进方法。 Conclusion: 所提出的方法有效提升了零样本细粒度分类性能,展示了大视觉语言模型在此类任务中的巨大潜力。 Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive performance on vision-language reasoning tasks. However, their potential for zero-shot fine-grained image classification, a challenging task requiring precise differentiation between visually similar categories, remains underexplored. We present a novel method that transforms zero-shot fine-grained image classification into a visual question-answering framework, leveraging LVLMs' comprehensive understanding capabilities rather than relying on direct class name generation. We enhance model performance through a novel attention intervention technique. We also address a key limitation in existing datasets by developing more comprehensive and precise class description benchmarks. We validate the effectiveness of our method through extensive experimentation across multiple fine-grained image classification benchmarks. Our proposed method consistently outperforms the current state-of-the-art (SOTA) approach, demonstrating both the effectiveness of our method and the broader potential of LVLMs for zero-shot fine-grained classification tasks. Code and Datasets: https://github.com/Atabuzzaman/Fine-grained-classification

[177] From Filters to VLMs: Benchmarking Defogging Methods through Object Detection and Segmentation Performance

Ardalan Aryashad,Parsa Razmara,Amin Mahjoub,Seyedarmin Azizi,Mahdi Salmani,Arad Firouzkouhi

Main category: cs.CV

TL;DR: 本文对多种去雾方法在自动驾驶感知系统中的效果进行了系统性实证研究,涵盖传统滤波器、现代去雾网络、组合方法及基于视觉-语言模型(VLM)的图像编辑方法,在Foggy Cityscapes数据集上评估图像质量与下游任务性能,揭示了不同方法的有效性与局限性,并提出以任务为导向的透明基准。

Details Motivation: 自动驾驶感知系统在雾天条件下性能易受干扰,现有去雾方法虽提升图像质量,但未必改善下游检测与分割任务,且多数评估依赖合成数据,缺乏真实场景的可迁移性验证。 Method: 在Foggy Cityscapes数据集上,系统评估四类去雾 pipeline:(i) 经典滤波器,(ii) 现代去雾网络,(iii) 组合方法(滤波+模型或模型+滤波),(iv) 直接应用于雾图的提示驱动视觉-语言模型(VLM)图像编辑方法;评估指标包括图像质量、目标检测(mAP)和实例分割(PQ, RQ, SQ),并引入VLM裁判进行定性评分。 Result: 研究发现某些去雾方法能提升下游任务性能,但组合策略可能产生协同或退化效应;VLM-based编辑器表现接近专用方法;VLM裁判评分与mAP等任务指标高度相关。 Conclusion: 建立了一个透明、以任务为导向的去雾方法基准,明确了预处理在何种条件下能真正提升恶劣天气下的自动驾驶感知性能。 Abstract: Autonomous driving perception systems are particularly vulnerable in foggy conditions, where light scattering reduces contrast and obscures fine details critical for safe operation. While numerous defogging methods exist-from handcrafted filters to learned restoration models-improvements in image fidelity do not consistently translate into better downstream detection and segmentation. Moreover, prior evaluations often rely on synthetic data, leaving questions about real-world transferability. We present a structured empirical study that benchmarks a comprehensive set of pipelines, including (i) classical filters, (ii) modern defogging networks, (iii) chained variants (filter$\rightarrow$model, model$\rightarrow$filter), and (iv) prompt-driven visual--language image editing models (VLM) applied directly to foggy images. Using Foggy Cityscapes, we assess both image quality and downstream performance on object detection (mAP) and segmentation (PQ, RQ, SQ). Our analysis reveals when defogging helps, when chaining yields synergy or degradation, and how VLM-based editors compare to dedicated approaches. In addition, we evaluate qualitative rubric-based scores from a VLM judge and quantify their alignment with task metrics, showing strong correlations with mAP. Together, these results establish a transparent, task-oriented benchmark for defogging methods and highlight the conditions under which preprocessing genuinely improves autonomous perception in adverse weather.

[178] Generating Human Motion Videos using a Cascaded Text-to-Video Framework

Hyelin Nam,Hyojun Go,Byeongjun Park,Byung-Hoon Kim,Hyungjin Chung

Main category: cs.CV

TL;DR: 提出CAMEO,一种级联框架,用于通用人体动作视频生成,通过文本到动作模型与条件视频扩散模型的结合,实现高质量、连贯的文本生成人体视频。

Details Motivation: 现有视频扩散模型在通用人体视频生成方面探索不足,多数局限于图像到视频或特定领域(如舞蹈),缺乏对文本到人体动作视频生成的有效支持。 Method: 构建一个级联框架CAMEO,桥接文本到动作(T2M)模型和条件视频扩散模型(VDM);设计文本与视觉条件的预处理机制以提升训练效果;引入相机感知的条件模块,自动选择与文本一致的视角,增强生成一致性。 Result: 在MovieGen基准和新构建的T2M-VDM专用基准上验证了方法的有效性,展示了其在多样化应用场景中的优越性能和灵活性。 Conclusion: CAMEO通过精心设计的级联结构和条件对齐机制,显著提升了文本到人体动作视频生成的质量与实用性,推动了通用人类视频生成的发展。 Abstract: Human video generation is becoming an increasingly important task with broad applications in graphics, entertainment, and embodied AI. Despite the rapid progress of video diffusion models (VDMs), their use for general-purpose human video generation remains underexplored, with most works constrained to image-to-video setups or narrow domains like dance videos. In this work, we propose CAMEO, a cascaded framework for general human motion video generation. It seamlessly bridges Text-to-Motion (T2M) models and conditional VDMs, mitigating suboptimal factors that may arise in this process across both training and inference through carefully designed components. Specifically, we analyze and prepare both textual prompts and visual conditions to effectively train the VDM, ensuring robust alignment between motion descriptions, conditioning signals, and the generated videos. Furthermore, we introduce a camera-aware conditioning module that connects the two stages, automatically selecting viewpoints aligned with the input text to enhance coherence and reduce manual intervention. We demonstrate the effectiveness of our approach on both the MovieGen benchmark and a newly introduced benchmark tailored to the T2M-VDM combination, while highlighting its versatility across diverse use cases.

[179] OpenFLAME: Federated Visual Positioning System to Enable Large-Scale Augmented Reality Applications

Sagar Bharadwaj,Harrison Williams,Luke Wang,Michael Liang,Tao Jin,Srinivasan Seshan,Anthony Rowe

Main category: cs.CV

TL;DR: 本文提出了OpenFLAME,一个联邦式视觉定位系统(VPS)后端,支持独立机构为其私有空间创建和维护3D扫描定位服务,解决集中式VPS在隐私、覆盖范围和维护上的局限。

Details Motivation: 现有集中式VPS无法覆盖私人室内空间,且存在隐私、法规和维护瓶颈,难以满足未来AR应用对广泛、可控6DoF定位的需求。 Method: 提出联邦式图像定位框架OpenFLAME,通过分片VPS服务实现数据隔离与分布式管理,并设计机制在不共享私有数据的前提下实现跨地图的数据协同与定位一致性。 Result: 实现了可扩展、隐私保护的分布式VPS系统,支持多组织协作、访问控制和服务质量管理,提升了AR定位系统的覆盖范围与可维护性。 Conclusion: OpenFLAME为大规模AR应用提供了一种去中心化、可扩展且尊重隐私的6DoF定位解决方案,推动了室内外无缝定位的发展。 Abstract: World-scale augmented reality (AR) applications need a ubiquitous 6DoF localization backend to anchor content to the real world consistently across devices. Large organizations such as Google and Niantic are 3D scanning outdoor public spaces in order to build their own Visual Positioning Systems (VPS). These centralized VPS solutions fail to meet the needs of many future AR applications -- they do not cover private indoor spaces because of privacy concerns, regulations, and the labor bottleneck of updating and maintaining 3D scans. In this paper, we present OpenFLAME, a federated VPS backend that allows independent organizations to 3D scan and maintain a separate VPS service for their own spaces. This enables access control of indoor 3D scans, distributed maintenance of the VPS backend, and encourages larger coverage. Sharding of VPS services introduces several unique challenges -- coherency of localization results across spaces, quality control of VPS services, selection of the right VPS service for a location, and many others. We introduce the concept of federated image-based localization and provide reference solutions for managing and merging data across maps without sharing private data.

[180] Talking Tennis: Language Feedback from 3D Biomechanical Action Recognition

Arushi Dashore,Aryan Anumala,Emily Hui,Olivia Yang

Main category: cs.CV

TL;DR: 本研究提出了一种结合CNN-LSTM与大语言模型的新框架,从运动数据中提取生物力学特征并生成可操作的自然语言反馈,以提升网球击球分析的可解释性与实用性。

Details Motivation: 现有网球击球分析系统缺乏将生物力学洞察转化为对运动员和教练有意义的可操作语言反馈的能力。 Method: 采用基于CNN-LSTM的模型从运动数据中提取关键生物力学特征(如关节角度、肢体速度和动力链模式),并利用大语言模型(LLM)生成反馈。 Result: 该框架在分类性能和可解释性方面表现良好,能够生成技术准确、基于生物力学且对用户可操作的反馈。 Conclusion: 所提方法成功连接了深度学习、生物力学分析与自然语言反馈,在可解释AI与体育生物力学之间架起了桥梁。 Abstract: Automated tennis stroke analysis has advanced significantly with the integration of biomechanical motion cues alongside deep learning techniques, enhancing stroke classification accuracy and player performance evaluation. Despite these advancements, existing systems often fail to connect biomechanical insights with actionable language feedback that is both accessible and meaningful to players and coaches. This research project addresses this gap by developing a novel framework that extracts key biomechanical features (such as joint angles, limb velocities, and kinetic chain patterns) from motion data using Convolutional Neural Network Long Short-Term Memory (CNN-LSTM)-based models. These features are analyzed for relationships influencing stroke effectiveness and injury risk, forming the basis for feedback generation using large language models (LLMs). Leveraging the THETIS dataset and feature extraction techniques, our approach aims to produce feedback that is technically accurate, biomechanically grounded, and actionable for end-users. The experimental setup evaluates this framework on classification performance and interpretability, bridging the gap between explainable AI and sports biomechanics.

[181] Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs

Sameep Vani,Shreyas Jena,Maitreya Patel,Chitta Baral,Somak Aditya,Yezhou Yang

Main category: cs.CV

TL;DR: 本文提出了一种名为TimeWarp的系统性方法,用于构建针对视频大语言模型(Video-LLMs)的时间理解能力的合成数据集,以提升其在细粒度时序理解任务上的表现。

Details Motivation: 现有Video-LLMs在需要精细时间理解的任务上表现不佳,主要因为当前微调数据集缺乏视觉复杂性和时间细节,导致模型依赖语言推理而非真正理解视频动态。 Method: 提出TimeWarp方法,生成具有复杂时间动态的合成偏好数据集,用于微调模型,使其更关注输入视频中的视觉和时间信息。 Result: 在七个时间理解基准测试中,应用该方法后模型性能显著提升,表现出绝对改进。 Conclusion: TimeWarp生成的数据集能有效增强Video-LLMs的时间理解能力,推动其在复杂时序任务上的发展。 Abstract: While Video Large Language Models (Video-LLMs) have demonstrated remarkable performance across general video understanding benchmarks-particularly in video captioning and descriptive tasks-they consistently underperform on tasks that require fine-grained temporal understanding. This limitation arises due to the lack of visual complexity and temporal nuance in current fine-tuning datasets, leading these models to rely heavily on language-based reasoning rather than truly understanding video dynamics. In this work, we propose TimeWarp, a systematic method to create a targeted synthetic temporal dataset to fine-tune the model's responses to encourage it to focus on the given input video. We introduce a large-scale preference dataset, created using TimeWarp, that captures intricate temporal dynamics often overlooked, grounding the model's responses to visual and temporal information. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks, highlighting the effectiveness of our proposed datasets in advancing temporal understanding in Video-LLMs, resulting in an absolute improvement in performance across seven benchmarks. Code is available at https://github.com/sameepv21/timewarp.

[182] No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models

Min Woo Sun,Alejandro Lozano,Javier Gamazo Tejero,Vishwesh Nath,Xiao Xiao Sun,James Burgess,Yuhui Zhang,Kun Yuan,Robert Tibshirani,Sean Huver,Serena Yeung-Levy

Main category: cs.CV

TL;DR: 本文研究了在长文本生物医学图像-文本对上预训练视觉-语言模型(VLMs)的影响,提出BIOMEDICA-LongCAP数据集和BMC-LongCLIP模型,通过扩展文本编码器上下文长度至512 tokens,显著提升检索与分类性能。

Details Motivation: 现有VLM通常用短文本(<77词元)预训练,导致长生物医学图注被截断,信息丢失严重。而实际中大量生物医学图注远超此长度,因此需探索长上下文预训练的影响。 Method: 构建包含100万图像-文本对的BIOMEDICA-LongCAP数据集,使用全文章节生成上下文感知的长图注;训练BMC-LongCLIP模型,将文本编码器上下文窗口扩展至512 tokens,利用长图注提供更丰富的监督信号。 Result: 相比短上下文模型,BMC-LongCLIP在长图注检索任务上Recall@1最高提升30%,分类平均提升2%,收敛更快;文本编码器token浪费率从55%降至2.2%。 Conclusion: 长上下文建模能有效利用长格式图注中的额外信息,显著提升生物医学VLM的性能,是该领域有前景的发展方向。 Abstract: Embedding vision-language models (VLMs) are typically pretrained with short text windows (<77 tokens), which forces the truncation of long-format captions. Yet, the distribution of biomedical captions from large-scale open source literature reveals that a huge portion of captions far exceed 77 tokens. To this end, we investigate the impact of pretraining on long-format biomedical captions by extending the context length of text encoders in VLMs. We find that longer context (thus, enabling additional supervision provided in long-format captions) correlates with better retrieval and classification performance. Given this finding, we introduce BIOMEDICA-LongCAP, a dataset of 1M image-caption pairs enriched with context-aware descriptions from full-text articles, providing longer and additional textual supervision. Using BIOMEDICA-LongCAP, we train BMC-LongCLIP, a long-context biomedical VLM with a text encoder supporting windows of up to 512 tokens. Our model extends context capacity by 6.6x, reducing token waste from 55% to just 2.2%. On long-caption retrieval benchmarks, BMC-LongCLIP achieves up to +30% absolute gains in Recall@1 and +2% average improvements in classification, while also converging faster than short-context. Our results demonstrate that long-context modeling is a promising direction for advancing biomedical VLMs.

[183] Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning

Yaxin Hou,Bo Han,Yuheng Jia,Hui Liu,Junhui Hou

Main category: cs.CV

TL;DR: 提出了一种可控伪标签生成(CPG)框架,用于解决长尾分布下未标记数据分布未知的半监督学习问题,通过动态过滤机制和自增强优化循环显著提升性能。

Details Motivation: 现有长尾半监督学习方法假设未标记数据遵循预定义分布,但实际中其分布通常未知且可能任意,限制了模型性能。 Method: 设计了一个可控伪标签生成框架(CPG),包含动态可控过滤机制、基于logit调整的贝叶斯最优分类器、类感知自适应增强模块和辅助分支,形成自强化优化循环。 Result: 在多个基准数据集上验证了CPG的有效性,准确率最高提升15.97%,优于当前最先进方法。 Conclusion: CPG能够有效应对未标记数据任意分布的挑战,通过可控的伪标签扩充和优化循环显著降低泛化误差,具有强鲁棒性和实用性。 Abstract: Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to \textbf{15.97\%} in accuracy. The code is available at https://github.com/yaxinhou/CPG.

[184] Enhancing OCR for Sino-Vietnamese Language Processing via Fine-tuned PaddleOCRv5

Minh Hoang Nguyen,Su Nguyen Thiet

Main category: cs.CV

TL;DR: 本文提出了一种针对PaddleOCRv5的微调方法,以提升对越南古代汉喃文本的字符识别准确率,实验结果显示在噪声图像下识别效果显著改善,并开发了一个交互式演示工具支持下游研究。

Details Motivation: 现有的OCR系统在处理退化扫描、非标准字形和手写变异方面表现不佳,难以有效识别古典中文(汉喃)文本,限制了越南历史文献数字化和跨语言语义研究的发展。 Method: 通过对PaddleOCRv5的文本识别模块进行微调,使用精心整理的古代越南汉文手稿子集进行再训练,并构建完整的训练流程,包括预处理、LMDB转换、评估和可视化。 Result: 实验结果表明,微调后模型的精确匹配准确率从37.5%提升至50.0%,尤其在噪声图像条件下性能提升明显,并成功开发出可交互的识别对比演示系统。 Conclusion: 该微调方法有效提升了OCR在汉喃古籍上的识别性能,所开发的工具为汉越语义对齐、机器翻译和历史语言学等后续研究提供了实用支持。 Abstract: Recognizing and processing Classical Chinese (Han-Nom) texts play a vital role in digitizing Vietnamese historical documents and enabling cross-lingual semantic research. However, existing OCR systems struggle with degraded scans, non-standard glyphs, and handwriting variations common in ancient sources. In this work, we propose a fine-tuning approach for PaddleOCRv5 to improve character recognition on Han-Nom texts. We retrain the text recognition module using a curated subset of ancient Vietnamese Chinese manuscripts, supported by a full training pipeline covering preprocessing, LMDB conversion, evaluation, and visualization. Experimental results show a significant improvement over the base model, with exact accuracy increasing from 37.5 percent to 50.0 percent, particularly under noisy image conditions. Furthermore, we develop an interactive demo that visually compares pre- and post-fine-tuning recognition results, facilitating downstream applications such as Han-Vietnamese semantic alignment, machine translation, and historical linguistics research. The demo is available at https://huggingface.co/spaces/MinhDS/Fine-tuned-PaddleOCRv5.

[185] Fit Pixels, Get Labels: Meta-learned Implicit Networks for Image Segmentation

Kushal Vyas,Ashok Veeraraghavan,Guha Balakrishnan

Main category: cs.CV

TL;DR: 本文提出了MetaSeg,一种用于医学图像分割的元学习框架,通过隐式神经表示(INR)同时预测像素强度和类别标签,并在少量参数下实现了与U-Net相当的分割性能。

Details Motivation: 隐式神经表示(INR)在信号表示上表现出色,但在语义分割等预测任务中难以直接应用,尤其是在需要学习信号分布上的语义结构时。因此,需要一种能够结合INR紧凑性与高效语义学习能力的新方法。 Method: 提出MetaSeg框架,采用一个共享的INR模型同时预测像素强度和类别标签;通过元学习寻找最优初始参数,使模型在新测试图像上仅需微调即可完成分割。 Result: 在2D和3D脑部MRI分割任务中,MetaSeg取得了与U-Net相当的Dice分数,但参数量减少了90%。 Conclusion: MetaSeg为医学图像分割提供了一种参数效率高、可扩展的新方法,显著优于传统重型模型如U-Net和视觉Transformer。 Abstract: Implicit neural representations (INRs) have achieved remarkable successes in learning expressive yet compact signal representations. However, they are not naturally amenable to predictive tasks such as segmentation, where they must learn semantic structures over a distribution of signals. In this study, we introduce MetaSeg, a meta-learning framework to train INRs for medical image segmentation. MetaSeg uses an underlying INR that simultaneously predicts per pixel intensity values and class labels. It then uses a meta-learning procedure to find optimal initial parameters for this INR over a training dataset of images and segmentation maps, such that the INR can simply be fine-tuned to fit pixels of an unseen test image, and automatically decode its class labels. We evaluated MetaSeg on 2D and 3D brain MRI segmentation tasks and report Dice scores comparable to commonly used U-Net models, but with $90\%$ fewer parameters. MetaSeg offers a fresh, scalable alternative to traditional resource-heavy architectures such as U-Nets and vision transformers for medical image segmentation. Our project is available at https://kushalvyas.github.io/metaseg.html .

[186] Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning

Chendong Wang,Donglin Bai,Yifan Yang,Xiao Jin,Anlan Zhang,Rui Wang,Shiqi Jiang,Yuqing Yang,Hao Wu,Qi Dai,Chong Luo,Ting Cao,Lili Qiu,Suman Banerjee

Main category: cs.CV

TL;DR: 提出Video-in-the-Loop (ViTL)框架,通过两阶段方法在固定token预算下实现高效的长视频问答,结合时间定位与答案生成,并引入带时间标注的span-grounded数据集。

Details Motivation: 现有长视频问答方法在固定token预算下难以兼顾效率与精度,缺乏对关键时间片段的有效定位和解释性。 Method: ViTL采用两阶段框架:先用低帧率扫描定位问题相关视频区间,再通过跨模态注意力机制动态重分配视觉token至高帧率进行细粒度推理;使用交错分组相对损失联合优化时间定位与答案准确率。 Result: 在Charades-STA、ActivityNet-Captions等任务上,ViTL在50%更少帧输入下最高提升8.6%性能,消融实验显示基于span感知的token重分配优于均匀采样。 Conclusion: ViTL结合span-grounded标注数据与端到端联合训练,在保持计算效率的同时提升了长视频问答的准确性和可解释性,为可扩展的长视频理解提供了有效方案。 Abstract: We present \emph{Video-in-the-Loop} (ViTL), a two-stage long-video QA framework that preserves a fixed token budget by first \emph{localizing} question-relevant interval(s) with a low-fps skim and then \emph{answering} via span-aware reallocation of visual tokens at higher effective frame rate, emitting an interleaved output with both spans and the final option for direct attribution. We also introduce \dataname{}, which converts description based event graphs into \emph{span-grounded} multiple-choice QA by pairing each question with \emph{ground-truth} time span(s) and related reasoning. ViTL is trained end-to-end with an interleaved group-relative objective that couples temporal IoU for localization with answer correctness, allowing credit to flow from answers back to spans without increasing compute. Under fixed token budgets, ViTL attains up to 8.6% with 50% less frame input on long-video QA and temporal grounding (e.g., Charades-STA, ActivityNet-Captions) and ablations show that span-aware token reallocation consistently surpasses uniform sampling. Together, \dataname{} and ViTL provide an interpretable, compute-efficient recipe for scalable long-video QA.

[187] Enhancing Fake News Video Detection via LLM-Driven Creative Process Simulation

Yuyan Bu,Qiang Sheng,Juan Cao,Shaofei Wang,Peng Qi,Yuhui Shi,Beizhe Hu

Main category: cs.CV

TL;DR: 提出了一种名为AgentAug的数据增强框架,通过模拟典型的创作过程生成多样化的短视频虚假新闻,以解决现有检测器因训练数据不足和模式偏差导致性能受限的问题。

Details Motivation: 现有的虚假新闻检测器依赖模式特征,但由于训练数据有限且多样性不足,导致模型性能受限,尤其是在复杂的多对多视频片段与虚假新闻事件关系下表现不佳。 Method: 设计了一个基于大语言模型(LLM)驱动的多管道数据增强框架AgentAug,涵盖四类伪造新闻视频的生成方式,并结合基于不确定性采样的主动学习策略选择有用的增强样本。 Result: 在两个基准数据集上的实验表明,AgentAug能持续提升短视频虚假新闻检测器的性能。 Conclusion: AgentAug通过模拟真实创作过程生成多样化虚假视频样本,有效缓解了数据稀缺和模式偏移问题,提升了检测模型的泛化能力。 Abstract: The emergence of fake news on short video platforms has become a new significant societal concern, necessitating automatic video-news-specific detection. Current detectors primarily rely on pattern-based features to separate fake news videos from real ones. However, limited and less diversified training data lead to biased patterns and hinder their performance. This weakness stems from the complex many-to-many relationships between video material segments and fabricated news events in real-world scenarios: a single video clip can be utilized in multiple ways to create different fake narratives, while a single fabricated event often combines multiple distinct video segments. However, existing datasets do not adequately reflect such relationships due to the difficulty of collecting and annotating large-scale real-world data, resulting in sparse coverage and non-comprehensive learning of the characteristics of potential fake news video creation. To address this issue, we propose a data augmentation framework, AgentAug, that generates diverse fake news videos by simulating typical creative processes. AgentAug implements multiple LLM-driven pipelines of four fabrication categories for news video creation, combined with an active learning strategy based on uncertainty sampling to select the potentially useful augmented samples during training. Experimental results on two benchmark datasets demonstrate that AgentAug consistently improves the performance of short video fake news detectors.

[188] Prompt-to-Prompt: Text-Based Image Editing Via Cross-Attention Mechanisms -- The Research of Hyperparameters and Novel Mechanisms to Enhance Existing Frameworks

Linn Bieske,Carla Lorente

Main category: cs.CV

TL;DR: 本研究通过优化超参数,提升基于提示的图像编辑框架的精确性和可靠性,提出“CL P2P”框架以解决循环不一致等问题。

Details Motivation: 现有基于稳定扩散模型的文本驱动图像编辑存在结果不一致(如发色变化不稳定)的问题,需提高编辑的精确性与可控性。 Method: 研究“词交换”方法,开发“注意力重加权方法”,并提出“CL P2P”框架,系统分析超参数与注意力机制对生成图像的影响。 Result: 提升了图像编辑的适应性和一致性,有效缓解了循环不一致等现有问题,增强了生成图像的质量和结构稳定性。 Conclusion: 通过优化超参数与改进注意力机制,可显著提升文本驱动图像编辑的精度与可靠性,为模型架构与参数设置的协同优化提供了新思路。 Abstract: Recent advances in image editing have shifted from manual pixel manipulation to employing deep learning methods like stable diffusion models, which now leverage cross-attention mechanisms for text-driven control. This transition has simplified the editing process but also introduced variability in results, such as inconsistent hair color changes. Our research aims to enhance the precision and reliability of prompt-to-prompt image editing frameworks by exploring and optimizing hyperparameters. We present a comprehensive study of the "word swap" method, develop an "attention re-weight method" for better adaptability, and propose the "CL P2P" framework to address existing limitations like cycle inconsistency. This work contributes to understanding and improving the interaction between hyperparameter settings and the architectural choices of neural network models, specifically their attention mechanisms, which significantly influence the composition and quality of the generated images.

[189] \textsc{GUI-Spotlight}: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding

Bin Lei,Nuo Xu,Ali Payani,Mingyi Hong,Chunhua Liao,Yu Cao,Caiwen Ding

Main category: cs.CV

TL;DR: 提出GUI-Spotlight模型,通过图像接地推理和调用专用工具提升GUI系统中的视觉定位准确性。

Details Motivation: 现有MLLM在GUI系统中因视觉定位不可靠而限制了实际应用,难以准确执行点击或拖拽等操作。 Method: 训练一个能够动态调用多个专用工具的模型,逐步聚焦屏幕相关区域,实现更精确的视觉接地。 Result: 在ScreenSpot-Pro基准上,仅用18.5K样本训练的GUI-Spotlight达到52.8%准确率,超过使用更多数据训练的V2P-7B和GTA-1-7B。 Conclusion: GUI-Spotlight显著提升了视觉接地精度,为MLLM在复杂真实环境中的GUI交互提供了更可靠的解决方案。 Abstract: Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user-interface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. However, practical usefulness is still bounded by the reliability of visual grounding, i.e., mapping textual references to exact on-screen elements. This limitation prevents the system from accurately performing pointer-level actions such as clicking or dragging. To address it, we introduce GUI-Spotlight -- a model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy. On the ScreenSpot-Pro benchmark, GUI-Spotlight trained with only 18.5K training samples achieves 52.8\% accuracy, surpassing V2P-7B (50.6\% with 9.6M training samples) and GTA-1-7B (50.1\% with 1.56M training samples).

[190] Quantization Range Estimation for Convolutional Neural Networks

Bingtao Yang,Yujia Wang,Mengzhi Jiao,Hongwei Huo

Main category: cs.CV

TL;DR: 本文提出了一种基于范围估计的后训练量化方法,通过优化层间局部最小值来减少量化误差,并在变换后的权重空间中应用搜索算法进一步提升性能,在ResNet系列和Inception-v3模型上实现了接近无损的8位和6位量化,并显著改善了4位量化的准确率。

Details Motivation: 低比特量化在保持模型精度的同时减少深度神经网络存储的需求具有挑战性,现有方法在低比特设置下容易导致显著精度损失。 Method: 将范围估计建模为最小化量化误差的优化问题,证明其具有局部凸性,并提出一种高效的搜索算法;进一步将该算法应用于变换后的权重空间以提升实际效果。 Result: 在图像分类任务中,ResNet系列和Inception-v3模型的8位和6位量化几乎无精度损失,4位量化精度也显著优于现有方法。 Conclusion: 所提方法在多种模型上显著提升了后训练量化的性能,实现了高精度的低比特模型压缩,具备实用价值。 Abstract: Post-training quantization for reducing the storage of deep neural network models has been demonstrated to be an effective way in various tasks. However, low-bit quantization while maintaining model accuracy is a challenging problem. In this paper, we present a range estimation method to improve the quantization performance for post-training quantization. We model the range estimation into an optimization problem of minimizing quantization errors by layer-wise local minima. We prove this problem is locally convex and present an efficient search algorithm to find the optimal solution. We propose the application of the above search algorithm to the transformed weights space to do further improvement in practice. Our experiments demonstrate that our method outperforms state-of-the-art performance generally on top-1 accuracy for image classification tasks on the ResNet series models and Inception-v3 model. The experimental results show that the proposed method has almost no loss of top-1 accuracy in 8-bit and 6-bit settings for image classifications, and the accuracy of 4-bit quantization is also significantly improved. The code is available at https://github.com/codeiscommitting/REQuant.

[191] MetaFind: Scene-Aware 3D Asset Retrieval for Coherent Metaverse Scene Generation

Zhenyu Pan,Yucheng Lu,Han Liu

Main category: cs.CV

TL;DR: MetaFind是一个面向元宇宙场景生成的三模态组合检索框架,通过文本、图像和3D模态的任意组合查询,提升3D资产检索的空间与风格一致性。

Details Motivation: 现有3D资产检索方法缺乏针对场景生成的标准化范式,且常忽略空间、语义和风格约束,导致检索结果不一致。 Method: 提出MetaFind框架,引入可插拔的等变布局编码器ESSGNN,联合建模对象级特征(如外观)和场景级布局结构,支持多模态输入和迭代式场景构建。 Result: 实验表明,MetaFind在多种检索任务中均优于基线方法,显著提升了检索结果的空间关系准确性和风格一致性。 Conclusion: MetaFind为元宇宙中的场景生成提供了一种灵活、高效的3D资产检索方案,具备良好的实用性和扩展性。 Abstract: We present MetaFind, a scene-aware tri-modal compositional retrieval framework designed to enhance scene generation in the metaverse by retrieving 3D assets from large-scale repositories. MetaFind addresses two core challenges: (i) inconsistent asset retrieval that overlooks spatial, semantic, and stylistic constraints, and (ii) the absence of a standardized retrieval paradigm specifically tailored for 3D asset retrieval, as existing approaches mainly rely on general-purpose 3D shape representation models. Our key innovation is a flexible retrieval mechanism that supports arbitrary combinations of text, image, and 3D modalities as queries, enhancing spatial reasoning and style consistency by jointly modeling object-level features (including appearance) and scene-level layout structures. Methodologically, MetaFind introduces a plug-and-play equivariant layout encoder ESSGNN that captures spatial relationships and object appearance features, ensuring retrieved 3D assets are contextually and stylistically coherent with the existing scene, regardless of coordinate frame transformations. The framework supports iterative scene construction by continuously adapting retrieval results to current scene updates. Empirical evaluations demonstrate the improved spatial and stylistic consistency of MetaFind in various retrieval tasks compared to baseline methods.

[192] Ordinal Encoding as a Regularizer in Binary Loss for Solar Flare Prediction

Chetraj Pandey,Jinsu Hong,Anli Ji,Rafal A. Angryk,Berkay Aydin

Main category: cs.CV

TL;DR: 提出一种结合序数信息的改进损失函数,以增强太阳耀斑预测模型在二分类任务中的性能。

Details Motivation: 传统的二分类框架忽略了耀斑类别内部子类之间的序数关系,导致靠近阈值的误分类频繁发生。 Method: 在传统的二元交叉熵(BCE)损失中引入序数信息,构建一种序数感知的损失函数,对靠近预测阈值的错误预测施加更高的惩罚。 Result: 该方法作为数据驱动的正则化手段,能够减轻模型在阈值附近分类困难的问题,提升模型整体性能。 Conclusion: 通过在损失函数中融入序数关系,有效利用了数据的内在结构,增强了太阳耀斑预测模型的学习能力与准确性。 Abstract: The prediction of solar flares is typically formulated as a binary classification task, distinguishing events as either Flare (FL) or No-Flare (NF) according to a specified threshold (for example, greater than or equal to C-class, M-class, or X-class). However, this binary framework neglects the inherent ordinal relationships among the sub-classes contained within each category (FL and NF). Several studies on solar flare prediction have empirically shown that the most frequent misclassifications occur near this prediction threshold. This suggests that the models struggle to differentiate events that are similar in intensity but fall on opposite sides of the binary threshold. To mitigate this limitation, we propose a modified loss function that integrates the ordinal information among the sub-classes of the binarized flare labels into the conventional binary cross-entropy (BCE) loss. This approach serves as an ordinality-aware, data-driven regularization method that penalizes the incorrect predictions of flare events in close proximity to the prediction threshold more heavily than those away from the boundary during model optimization. By incorporating ordinal weighting into the loss function, we aim to enhance the model's learning process by leveraging the ordinal characteristics of the data, thereby improving its overall performance.

[193] QuantDemoire: Quantization with Outlier Aware for Image Demoiréing

Zheng Chen,Kewei Zhang,Xiaoyang Liu,Weihang Zhang,Mengfan Wang,Yifan Fu,Yulun Zhang

Main category: cs.CV

TL;DR: 本文提出了一种针对去摩尔纹任务的后训练量化框架QuantDemoire,有效解决了低比特量化中的分布异常值和光滑区域表示弱化问题,在大幅降低模型开销的同时保持了优异性能。

Details Motivation: 现有的量化方法在去摩尔纹模型上会导致严重性能下降,主要由于激活和权重中的分布异常值以及平滑区域表示能力减弱,限制了其在边缘设备上的部署。 Method: 提出了QuantDemoire,包含两个关键组件:一是异常值感知量化器,采用基于采样的范围估计并保留少量极端权重为FP16;二是频率感知校准策略,在微调中强调低频和中频成分以减少量化带来的条带伪影。 Result: 实验表明,QuantDemoire在W4A4设置下比现有量化方法性能高出4 dB以上,显著减少了参数量和计算量,同时保持了图像质量。 Conclusion: QuantDemoire为去摩尔纹模型提供了高效且高性能的量化方案,推动了该类模型在边缘设备上的实际应用。 Abstract: Demoir\'eing aims to remove moir\'e artifacts that often occur in images. While recent deep learning-based methods have achieved promising results, they typically require substantial computational resources, limiting their deployment on edge devices. Model quantization offers a compelling solution. However, directly applying existing quantization methods to demoir\'eing models introduces severe performance degradation. The main reasons are distribution outliers and weakened representations in smooth regions. To address these issues, we propose QuantDemoire, a post-training quantization framework tailored to demoir\'eing. It contains two key components. **First}, we introduce an outlier-aware quantizer to reduce errors from outliers. It uses sampling-based range estimation to reduce activation outliers, and keeps a few extreme weights in FP16 with negligible cost. **Second**, we design a frequency-aware calibration strategy. It emphasizes low- and mid-frequency components during fine-tuning, which mitigates banding artifacts caused by low-bit quantization. Extensive experiments validate that our QuantDemoire achieves large reductions in parameters and computation while maintaining quality. Meanwhile, it outperforms existing quantization methods by over **4 dB** on W4A4. Code is released at: https://github.com/zhengchen1999/QuantDemoire.

[194] Diffusion Low Rank Hybrid Reconstruction for Sparse View Medical Imaging

Zongyin Deng,Qing Zhou,Yuhao Fang,Zijian Wang,Yao Lu,Ye Zhang,Chun Li

Main category: cs.CV

TL;DR: TV-LoRA是一种结合扩散生成先验与多正则化约束的低剂量稀疏视图CT重建新方法,在极稀疏条件下显著提升图像质量与重建效率。

Details Motivation: 为解决极稀疏视图下CT重建的病态性、纹理丢失和伪影问题,需融合生成模型先验与物理约束以提升重建质量。 Method: 提出TV-LoRA方法,结合NCSN++扩散先验与各向异性TV、低秩自注意力(LoRA)正则化,在ADMM框架下实现重建;采用2D切片策略、FFT加速与张量并行优化提升推理效率。 Result: 在AAPM-2016、CTHD和LIDC数据集上,8、4、2视图条件下TV-LoRA在SSIM、纹理恢复、边缘清晰度和伪影抑制方面均优于基准方法,且具备良好鲁棒性与泛化能力;消融实验验证了LoRA与扩散先验的互补性,FFT-PCG模块显著提速。 Conclusion: TV-LoRA通过融合扩散先验与多正则化约束,实现了高质量、高效率的3D CT重建,在低剂量稀疏采样场景中具有广泛临床应用前景。 Abstract: This work presents TV-LoRA, a novel method for low-dose sparse-view CT reconstruction that combines a diffusion generative prior (NCSN++ with SDE modeling) and multi-regularization constraints, including anisotropic TV and nuclear norm (LoRA), within an ADMM framework. To address ill-posedness and texture loss under extremely sparse views, TV-LoRA integrates generative and physical constraints, and utilizes a 2D slice-based strategy with FFT acceleration and tensor-parallel optimization for efficient inference. Experiments on AAPM-2016, CTHD, and LIDC datasets with $N_{\mathrm{view}}=8,4,2$ show that TV-LoRA consistently surpasses benchmarks in SSIM, texture recovery, edge clarity, and artifact suppression, demonstrating strong robustness and generalizability. Ablation studies confirm the complementary effects of LoRA regularization and diffusion priors, while the FFT-PCG module provides a speedup. Overall, Diffusion + TV-LoRA achieves high-fidelity, efficient 3D CT reconstruction and broad clinical applicability in low-dose, sparse-sampling scenarios.

[195] TOPO-Bench: An Open-Source Topological Mapping Evaluation Framework with Quantifiable Perceptual Aliasing

Jiaming Wang,Diwen Liu,Jizhuo Chen,Harold Soh

Main category: cs.CV

TL;DR: 本文提出了一种用于拓扑映射的标准化评估协议,通过形式化拓扑一致性并引入定位精度作为代理指标,同时首次提出了量化数据集歧义性的方法,并发布了具有校准歧义水平的基准数据集和开源工具,以促进该领域的一致性和可重复性研究。

Details Motivation: 由于缺乏标准化的评估指标、数据集和协议,以及感知别名问题未被充分量化,拓扑映射领域的进展受到阻碍。 Method: 将拓扑一致性形式化为拓扑图的基本属性,使用定位精度作为可解释的替代度量,并提出首个量化数据集歧义性的方法;构建多样化的基准数据集,实现并开源深度学习基线系统与经典方法进行比较。 Result: 实验表明当前方法在感知别名条件下存在局限性,所提出的评估协议和数据集支持跨环境的公平比较,且所有资源均开源。 Conclusion: 该工作为拓扑映射提供了可复现的研究框架,通过标准化评估和量化环境歧义性推动了该领域的进步。 Abstract: Topological mapping offers a compact and robust representation for navigation, but progress in the field is hindered by the lack of standardized evaluation metrics, datasets, and protocols. Existing systems are assessed using different environments and criteria, preventing fair and reproducible comparisons. Moreover, a key challenge - perceptual aliasing - remains under-quantified, despite its strong influence on system performance. We address these gaps by (1) formalizing topological consistency as the fundamental property of topological maps and showing that localization accuracy provides an efficient and interpretable surrogate metric, and (2) proposing the first quantitative measure of dataset ambiguity to enable fair comparisons across environments. To support this protocol, we curate a diverse benchmark dataset with calibrated ambiguity levels, implement and release deep-learned baseline systems, and evaluate them alongside classical methods. Our experiments and analysis yield new insights into the limitations of current approaches under perceptual aliasing. All datasets, baselines, and evaluation tools are fully open-sourced to foster consistent and reproducible research in topological mapping.

[196] Learning Efficient Meshflow and Optical Flow from Event Cameras

Xinglong Luo,Ao Luo,Kunming Luo,Zhengning Wang,Ping Tan,Bing Zeng,Shuaicheng Liu

Main category: cs.CV

TL;DR: 本文提出了事件相机下的网格流估计新任务,构建了首个高分辨率事件网格流数据集HREM,并设计了轻量级EEMFlow网络实现快速准确的网格流估计,同时提出ADM模块提升模型在不同事件密度下的泛化能力。

Details Motivation: 现有研究缺乏针对网格流的事件数据集和方法,且对事件数据密度变化的鲁棒性研究不足,限制了事件相机在运动估计中的应用。 Method: 构建了HREM和HREM+多密度事件数据集;设计了高效的EEMFlow网络及其支持光流的EEMFlow+版本,引入CDC模块保持运动边界清晰;提出ADM模块自适应调整输入事件密度。 Result: EEMFlow比当前最优方法快30倍且性能优异;ADM使EEMFlow和EEMFlow+性能分别提升8%和10%;HREM数据集具有高分辨率、复杂运动和动态物体等优势。 Conclusion: 所提方法在事件驱动的网格流估计中实现了高效、精确和鲁棒的性能,推动了事件相机在运动感知领域的应用。 Abstract: In this paper, we explore the problem of event-based meshflow estimation, a novel task that involves predicting a spatially smooth sparse motion field from event cameras. To start, we review the state-of-the-art in event-based flow estimation, highlighting two key areas for further research: i) the lack of meshflow-specific event datasets and methods, and ii) the underexplored challenge of event data density. First, we generate a large-scale High-Resolution Event Meshflow (HREM) dataset, which showcases its superiority by encompassing the merits of high resolution at 1280x720, handling dynamic objects and complex motion patterns, and offering both optical flow and meshflow labels. These aspects have not been fully explored in previous works. Besides, we propose Efficient Event-based MeshFlow (EEMFlow) network, a lightweight model featuring a specially crafted encoder-decoder architecture to facilitate swift and accurate meshflow estimation. Furthermore, we upgrade EEMFlow network to support dense event optical flow, in which a Confidence-induced Detail Completion (CDC) module is proposed to preserve sharp motion boundaries. We conduct comprehensive experiments to show the exceptional performance and runtime efficiency (30x faster) of our EEMFlow model compared to the recent state-of-the-art flow method. As an extension, we expand HREM into HREM+, a multi-density event dataset contributing to a thorough study of the robustness of existing methods across data with varying densities, and propose an Adaptive Density Module (ADM) to adjust the density of input event data to a more optimal range, enhancing the model's generalization ability. We empirically demonstrate that ADM helps to significantly improve the performance of EEMFlow and EEMFlow+ by 8% and 10%, respectively. Code and dataset are released at https://github.com/boomluo02/EEMFlowPlus.

[197] Joint Learning of Pose Regression and Denoising Diffusion with Score Scaling Sampling for Category-level 6D Pose Estimation

Seunghyun Lee,Tae-Kyun Kim

Main category: cs.CV

TL;DR: 本文提出了一种新的扩散模型方法,用于类别级6D物体姿态估计,通过预训练编码器和时间依赖的分数缩放采样引导,显著提高了训练收敛速度和精度,无需额外的姿态评估网络,在多个基准上实现了最先进的性能。

Details Motivation: 现有扩散模型在6D物体姿态估计中存在训练收敛慢、需端到端学习编码器以及依赖额外网络筛选姿态候选的问题。 Method: 1)预训练编码器并联合优化回归头和去噪扩散头;2)引入时间依赖的分数缩放采样引导,平衡探索与利用。 Result: 在REAL275、HouseCat6D和ROPE等多个基准上达到最先进精度,且训练和推理更高效,单姿态推断即可实现高性能。 Conclusion: 所提方法简单有效,解决了现有方法的收敛性和效率问题,在准确性和计算效率方面均优于现有方法。 Abstract: Latest diffusion models have shown promising results in category-level 6D object pose estimation by modeling the conditional pose distribution with depth image input. The existing methods, however, suffer from slow convergence during training, learning its encoder with the diffusion denoising network in end-to-end fashion, and require an additional network that evaluates sampled pose hypotheses to filter out low-quality pose candidates. In this paper, we propose a novel pipeline that tackles these limitations by two key components. First, the proposed method pretrains the encoder with the direct pose regression head, and jointly learns the networks via the regression head and the denoising diffusion head, significantly accelerating training convergence while achieving higher accuracy. Second, sampling guidance via time-dependent score scaling is proposed s.t. the exploration-exploitation trade-off is effectively taken, eliminating the need for the additional evaluation network. The sampling guidance maintains multi-modal characteristics of symmetric objects at early denoising steps while ensuring high-quality pose generation at final steps. Extensive experiments on multiple benchmarks including REAL275, HouseCat6D, and ROPE, demonstrate that the proposed method, simple yet effective, achieves state-of-the-art accuracies even with single-pose inference, while being more efficient in both training and inference.

[198] Learning from All: Concept Alignment for Autonomous Distillation from Multiple Drifting MLLMs

Xiaoyu Yang,Jie Lu,En Yu

Main category: cs.CV

TL;DR: 本文提出了一种针对多模态大语言模型知识蒸馏中概念漂移问题的新方法,通过“学习、比较、批判”范式和自主偏好优化(APO)实现学生模型的推理一致性与鲁棒性提升,并发布了大规模X光数据集CXR-MAX。

Details Motivation: 多教师MLLM在知识蒸馏中的推理轨迹存在概念漂移问题,导致学生模型继承偏差,影响其一致性、鲁棒性和泛化能力。 Method: 建立概念漂移与知识蒸馏的理论联系,提出“学习、比较、批判”范式,利用多教师推理流进行自我蒸馏,并通过自主偏好优化(APO)实现概念对齐。 Result: 实验表明该方法在一致性、鲁棒性和泛化性方面优于现有蒸馏方法,并发布了包含17万+推理轨迹的CXR-MAX数据集。 Conclusion: 通过引入概念漂移视角和APO框架,有效缓解了多教师蒸馏中的偏差传播,提升了学生模型的稳定性和泛化能力。 Abstract: This paper identifies a critical yet underexplored challenge in distilling from multimodal large language models (MLLMs): the reasoning trajectories generated by multiple drifting teachers exhibit concept drift, whereby their reasoning distributions evolve unpredictably and transmit biases to the student model, ultimately compromising its performance. To tackle this issue, we pioneer a theoretical connection between concept drift and knowledge distillation, casting the non-stationary reasoning dynamics from multiple MLLM teachers as next-token prediction of multi-stream reasoning trajectories.Guided by concept drift, we introduce the "learn, compare, critique" paradigm, culminating in autonomous preference optimization (APO). Under the active guidance of the teachers, the student model first learns and self-distils preferred thinking by comparing multiple teachers. It then engages in critical reflection over the drifting inference from teachers, performing concept alignment through APO, ultimately yielding a robust, consistent, and generalizable model.Extensive experiments demonstrate our superior performance of consistency, robustness and generalization within knowledge distillation. Besides, we also contributed a large-scale dataset, CXR-MAX (Multi-teachers Alignment X-rays), comprising 170,982 distilled reasoning trajectories derived from publicly accessible MLLMs based on MIMIC-CXR. Our code and data are public at: https://anonymous.4open.science/r/Autonomous-Distillation/.

[199] Automating construction safety inspections using a multi-modal vision-language RAG framework

Chenxin Wang,Elyas Asadi Shamsabadi,Zhaohui Chen,Luming Shen,Alireza Ahmadian Fard Fini,Daniel Dias-da-Costa

Main category: cs.CV

TL;DR: 本文提出了一种基于多模态大视觉语言模型和检索增强生成(RAG)的框架SiteShield,用于自动化建筑安全检查报告生成。

Details Motivation: 传统建筑安全检查方法效率低下,现有LVLM应用存在响应不相关、输入模态受限和幻觉等问题,且LLM在训练数据可用性和实时适应性方面存在限制。 Method: 提出SiteShield框架,结合视觉和音频输入,采用多模态LVLM与检索增强生成技术,提升安全检查报告的自动生成能力。 Result: 使用真实数据测试,SiteShield相比无RAG的单模态LLM表现出更优性能,F1得分为0.82,汉明损失为0.04,精确率为0.76,召回率为0.96。 Conclusion: SiteShield为提升安全信息检索和报告生成效率提供了新途径,有效改善了建筑安全检查的自动化水平。 Abstract: Conventional construction safety inspection methods are often inefficient as they require navigating through large volume of information. Recent advances in large vision-language models (LVLMs) provide opportunities to automate safety inspections through enhanced visual and linguistic understanding. However, existing applications face limitations including irrelevant or unspecific responses, restricted modal inputs and hallucinations. Utilisation of Large Language Models (LLMs) for this purpose is constrained by availability of training data and frequently lack real-time adaptability. This study introduces SiteShield, a multi-modal LVLM-based Retrieval-Augmented Generation (RAG) framework for automating construction safety inspection reports by integrating visual and audio inputs. Using real-world data, SiteShield outperformed unimodal LLMs without RAG with an F1 score of 0.82, hamming loss of 0.04, precision of 0.76, and recall of 0.96. The findings indicate that SiteShield offers a novel pathway to enhance information retrieval and efficiency in generating safety reports.

[200] BLADE: Bias-Linked Adaptive DEbiasing

Piyush Arora,Navlika Singh,Vasubhya Diwan,Pratik Mazumder

Main category: cs.CV

TL;DR: 本文提出了一种名为BLADE的生成式去偏框架,无需先验知识或偏见冲突样本即可有效缓解神经网络中的隐式偏见问题。

Details Motivation: 神经网络容易学习训练数据中的虚假相关性(即隐式偏见),导致模型泛化能力差,而现有方法依赖不切实际的强假设。 Method: BLADE首先训练一个生成模型,在保持任务相关特征的同时跨偏见域转换图像,然后根据样本对偏见的敏感性自适应地用合成样本来优化原始图像,并通过对比对齐/错位策略学习鲁棒表征。 Result: 在多个基准数据集上,BLADE显著优于当前最先进的方法,在CIFAR-10损坏数据集的最差组设置下比最近的基线高出约18%。 Conclusion: BLADE在无需显式监督的情况下实现了有效的偏见缓解,为构建更鲁棒的深度学习模型提供了新方向。 Abstract: Neural networks have revolutionized numerous fields, yet they remain vulnerable to a critical flaw: the tendency to learn implicit biases, spurious correlations between certain attributes and target labels in training data. These biases are often more prevalent and easier to learn, causing models to rely on superficial patterns rather than task-relevant features necessary for generalization. Existing methods typically rely on strong assumptions, such as prior knowledge of these biases or access to bias-conflicting samples, i.e., samples that contradict spurious correlations and counterbalance bias-aligned samples, samples that conform to these spurious correlations. However, such assumptions are often impractical in real-world settings. We propose BLADE ({B}ias-{L}inked {A}daptive {DE}biasing), a generative debiasing framework that requires no prior knowledge of bias or bias-conflicting samples. BLADE first trains a generative model to translate images across bias domains while preserving task-relevant features. Then, it adaptively refines each image with its synthetic counterpart based on the image's susceptibility to bias. To encourage robust representations, BLADE aligns an image with its bias-translated synthetic counterpart that shares task-relevant features but differs in bias, while misaligning it with samples sharing the same bias. We evaluate BLADE on multiple benchmark datasets and show that it significantly outperforms state-of-the-art methods. Notably, it exceeds the closest baseline by an absolute margin of around 18% on the corrupted CIFAR-10 dataset under the worst group setting, establishing a new benchmark in bias mitigation and demonstrating its potential for developing more robust deep learning models without explicit supervision.

[201] From Segments to Concepts: Interpretable Image Classification via Concept-Guided Segmentation

Ran Eisenberg,Amit Rozner,Ethan Fetaya,Ofir Lindenbaum

Main category: cs.CV

TL;DR: 提出SEG-MIL-CBM框架,结合概念引导的图像分割与注意力机制的多实例学习,实现无需概念标注的空间接地可解释性模型。

Details Motivation: 深度神经网络缺乏透明性,易利用不可靠特征,现有可解释方法需昂贵的概念标注且缺乏空间定位能力。 Method: 将每个分割区域作为实例,通过注意力机制的多实例学习框架聚合证据,基于语义有意义区域进行推理。 Result: 在存在伪相关性、输入损坏和大规模基准的数据集上表现出鲁棒性能,并生成无需概念或组标注的空间接地、概念级解释。 Conclusion: SEG-MIL-CBM在不依赖额外标注的情况下提升了模型的可解释性与鲁棒性,为安全关键应用提供了可信的决策依据。 Abstract: Deep neural networks have achieved remarkable success in computer vision; however, their black-box nature in decision-making limits interpretability and trust, particularly in safety-critical applications. Interpretability is crucial in domains where errors have severe consequences. Existing models not only lack transparency but also risk exploiting unreliable or misleading features, which undermines both robustness and the validity of their explanations. Concept Bottleneck Models (CBMs) aim to improve transparency by reasoning through human-interpretable concepts. Still, they require costly concept annotations and lack spatial grounding, often failing to identify which regions support each concept. We propose SEG-MIL-CBM, a novel framework that integrates concept-guided image segmentation into an attention-based multiple instance learning (MIL) framework, where each segmented region is treated as an instance and the model learns to aggregate evidence across them. By reasoning over semantically meaningful regions aligned with high-level concepts, our model highlights task-relevant evidence, down-weights irrelevant cues, and produces spatially grounded, concept-level explanations without requiring annotations of concepts or groups. SEG-MIL-CBM achieves robust performance across settings involving spurious correlations (unintended dependencies between background and label), input corruptions (perturbations that degrade visual quality), and large-scale benchmarks, while providing transparent, concept-level explanations.

[202] Let Features Decide Their Own Solvers: Hybrid Feature Caching for Diffusion Transformers

Shikang Zheng,Guantao Chen,Qinming Zhou,Yuqi Lin,Lixuan He,Chang Zou,Peiliang Cai,Jiacheng Liu,Linfeng Zhang

Main category: cs.CV

TL;DR: 提出了一种名为HyCa的混合ODE求解器启发的缓存框架,通过按维度应用不同的缓存策略,在多种模型和领域中实现了接近无损的加速效果。

Details Motivation: 现有的特征缓存方法通常对所有特征维度采用统一的策略,忽略了不同维度动态行为的异质性,导致加速效果受限。 Method: 将隐特征演化建模为跨维度的常微分方程(ODE)混合模型,并设计HyCa框架,根据各维度动态特性采用混合的缓存策略。 Result: 在FLUX、HunyuanVideo、Qwen-Image和Qwen-Image-Edit等多个模型上实现了5.55至6.24倍的推理加速,且无需重新训练。 Conclusion: HyCa通过细粒度的维度级缓存策略,有效提升了扩散Transformer的采样效率,具有广泛适用性和实际部署价值。 Abstract: Diffusion Transformers offer state-of-the-art fidelity in image and video synthesis, but their iterative sampling process remains a major bottleneck due to the high cost of transformer forward passes at each timestep. To mitigate this, feature caching has emerged as a training-free acceleration technique that reuses or forecasts hidden representations. However, existing methods often apply a uniform caching strategy across all feature dimensions, ignoring their heterogeneous dynamic behaviors. Therefore, we adopt a new perspective by modeling hidden feature evolution as a mixture of ODEs across dimensions, and introduce HyCa, a Hybrid ODE solver inspired caching framework that applies dimension-wise caching strategies. HyCa achieves near-lossless acceleration across diverse domains and models, including 5.55 times speedup on FLUX, 5.56 times speedup on HunyuanVideo, 6.24 times speedup on Qwen-Image and Qwen-Image-Edit without retraining.

[203] World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge

Moo Hyun Son,Jintaek Oh,Sun Bin Mun,Jaechul Roh,Sehyun Choi

Main category: cs.CV

TL;DR: 本文提出了World-To-Image框架,通过基于代理的网络知识检索来增强文本到图像生成模型对新奇或分布外实体的生成能力。

Details Motivation: 由于预训练知识截止问题,现有T2I模型在面对新奇或分布外实体时表现不佳,因此需要引入外部知识以提升生成质量。 Method: 设计一个智能代理动态搜索网络以获取未知概念的图像,并利用这些信息进行多模态提示优化,引导生成模型更准确地合成图像。 Result: 在NICE基准上比现有方法提升了8.1%的提示准确性,且在语义对齐和视觉美感方面均优于当前最优方法,仅需不到三次迭代即可实现高效生成。 Conclusion: World-To-Image框架有效弥补了T2I模型的知识盲区,为构建能反映现实世界变化的生成系统提供了可行路径。 Abstract: While text-to-image (T2I) models can synthesize high-quality images, their performance degrades significantly when prompted with novel or out-of-distribution (OOD) entities due to inherent knowledge cutoffs. We introduce World-To-Image, a novel framework that bridges this gap by empowering T2I generation with agent-driven world knowledge. We design an agent that dynamically searches the web to retrieve images for concepts unknown to the base model. This information is then used to perform multimodal prompt optimization, steering powerful generative backbones toward an accurate synthesis. Critically, our evaluation goes beyond traditional metrics, utilizing modern assessments like LLMGrader and ImageReward to measure true semantic fidelity. Our experiments show that World-To-Image substantially outperforms state-of-the-art methods in both semantic alignment and visual aesthetics, achieving +8.1% improvement in accuracy-to-prompt on our curated NICE benchmark. Our framework achieves these results with high efficiency in less than three iterations, paving the way for T2I systems that can better reflect the ever-changing real world. Our demo code is available here\footnote{https://github.com/mhson-kyle/World-To-Image}.

[204] MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering

Lixuan He,Shikang Zheng,Linfeng Zhang

Main category: cs.CV

TL;DR: 提出了一种名为Manifold-Aligned Semantic Clustering (MASC)的层次化语义聚类框架,通过利用视觉token嵌入空间的内在结构,显著提升自回归图像生成模型的训练效率和生成质量。

Details Motivation: 传统的自回归图像生成模型使用扁平、无结构的视觉token词汇表,忽视了token嵌入空间中的语义相似性和几何结构,导致预测任务复杂、训练效率低、生成质量受限。 Method: MASC框架基于codebook的内在结构构建层次化语义树,采用一种新的几何感知距离度量和密度驱动的聚合式聚类方法,对token嵌入的潜在流形进行建模,将原本高维、扁平的预测任务转化为结构化的层次化预测任务。 Result: MASC可作为即插即用模块集成到现有AR模型中,实验显示其训练速度提升高达57%,并将LlamaGen-XL的FID从2.87降至2.58,显著提升生成质量。 Conclusion: 结构化预测空间对于自回归生成模型的重要性不亚于架构创新,MASC通过引入归纳偏置,使现有AR模型达到与最先进方法相媲美的性能。 Abstract: Autoregressive (AR) models have shown great promise in image generation, yet they face a fundamental inefficiency stemming from their core component: a vast, unstructured vocabulary of visual tokens. This conventional approach treats tokens as a flat vocabulary, disregarding the intrinsic structure of the token embedding space where proximity often correlates with semantic similarity. This oversight results in a highly complex prediction task, which hinders training efficiency and limits final generation quality. To resolve this, we propose Manifold-Aligned Semantic Clustering (MASC), a principled framework that constructs a hierarchical semantic tree directly from the codebook's intrinsic structure. MASC employs a novel geometry-aware distance metric and a density-driven agglomerative construction to model the underlying manifold of the token embeddings. By transforming the flat, high-dimensional prediction task into a structured, hierarchical one, MASC introduces a beneficial inductive bias that significantly simplifies the learning problem for the AR model. MASC is designed as a plug-and-play module, and our extensive experiments validate its effectiveness: it accelerates training by up to 57% and significantly improves generation quality, reducing the FID of LlamaGen-XL from 2.87 to 2.58. MASC elevates existing AR frameworks to be highly competitive with state-of-the-art methods, establishing that structuring the prediction space is as crucial as architectural innovation for scalable generative modeling.

[205] Zoom-In to Sort AI-Generated Images Out

Yikun Ji,Yan Hong,Bowen Deng,jun lan,Huijia Zhu,Weiqiang Wang,Liqing Zhang,Jianfu Zhang

Main category: cs.CV

TL;DR: 提出了一种名为ZoomIn的两阶段取证框架,用于提高AI生成图像检测的准确性和可解释性。

Details Motivation: 随着AI生成图像的快速增长,真实与合成内容之间的界限变得模糊,对数字完整性提出了重要挑战。现有的视觉-语言模型(VLM)虽然具备一定的解释能力,但在检测高质量合成图像中的细微伪影方面表现不佳。 Method: ZoomIn模仿人类视觉检查过程:第一阶段扫描整张图像以定位可疑区域;第二阶段对这些放大区域进行集中分析,得出基于视觉证据的判断。为支持训练,还提出了一个包含20,000张真实和高质量合成图像的数据集MagniFake,标注了边界框和法医解释,通过自动化VLM管道生成。 Result: 该方法在检测准确性上达到96.39%,具有较强的泛化能力,并能提供基于视觉证据、易于人类理解的解释。 Conclusion: ZoomIn有效提升了AI生成图像的检测性能与结果可解释性,结合MagniFake数据集为未来图像取证研究提供了有力支持。 Abstract: The rapid growth of AI-generated imagery has blurred the boundary between real and synthetic content, raising critical concerns for digital integrity. Vision-language models (VLMs) offer interpretability through explanations but often fail to detect subtle artifacts in high-quality synthetic images. We propose ZoomIn, a two-stage forensic framework that improves both accuracy and interpretability. Mimicking human visual inspection, ZoomIn first scans an image to locate suspicious regions and then performs a focused analysis on these zoomed-in areas to deliver a grounded verdict. To support training, we introduce MagniFake, a dataset of 20,000 real and high-quality synthetic images annotated with bounding boxes and forensic explanations, generated through an automated VLM-based pipeline. Our method achieves 96.39% accuracy with robust generalization, while providing human-understandable explanations grounded in visual evidence.

[206] A Recursive Pyramidal Algorithm for Solving the Image Registration Problem

Stefan Dirnstorfer

Main category: cs.CV

TL;DR: 本文提出了一种简单、端到端可训练的图像配准算法,仅需少量训练数据和训练时间即可在某些场景下实现高精度结果,且可用十几行Python代码实现,适用于训练数据、时间和代码复杂度受限的应用场景。

Details Motivation: 图像配准需要找到使两幅图像对应点对齐的变换,传统方法可能复杂且耗时,本文旨在提出一种简洁高效、易于实现并适应资源受限场景的解决方案。 Method: 设计了一种端到端可训练的神经网络算法,结构简单,可在小规模数据上快速训练,并应用于如立体视觉等任务。 Result: 在仅使用74张图像、19x15输入窗口的情况下成功训练出模型,实现了准确的配准效果,且算法实现极为简洁。 Conclusion: 该算法以极简的代码实现了高效的图像配准,适合数据、时间或代码复杂度受限的实际应用场景,具有良好的实用性和推广潜力。 Abstract: The problem of image registration is finding a transformation that aligns two images, such that the corresponding points are in the same location. This paper introduces a simple, end-to-end trainable algorithm that is implementable in a few lines of Python code. The approach is shown to work with very little training data and training time, while achieving accurate results in some settings. An example application to stereo vision was trained from 74 images on a 19x15 input window. With just a dozen lines of Python code this algorithm excels in brevity and may serve as a good start in related scenarios with limitations to training data, training time or code complexity.

[207] Detection of retinal diseases using an accelerated reused convolutional network

Amin Ahmadi Kasani,Hedieh Sajedi

Main category: cs.CV

TL;DR: 本文提出了一种新的卷积层ArConv,用于提高深度神经网络在移动设备上的可访问性,并在眼病诊断任务中实现了高精度和低参数量。

Details Motivation: 现有的眼病检测模型计算复杂度高,限制了其在移动设备等资源受限环境中的应用,因此需要设计更高效的模型以提升可访问性。 Method: 通过重新设计和优化卷积层,提出了名为ArConv的新卷积层,并构建了一个仅含130万参数的轻量级模型。 Result: 在RfMiD数据集上,该模型准确率达到0.9328,优于MobileNetV2(220万参数)的0.9266。 Conclusion: ArConv层在降低模型复杂度的同时提升了性能,使深度神经网络更适合于移动端的眼病筛查应用。 Abstract: Convolutional neural networks are continually evolving, with some efforts aimed at improving accuracy, others at increasing speed, and some at enhancing accessibility. Improving accessibility broadens the application of neural networks across a wider range of tasks, including the detection of eye diseases. Early diagnosis of eye diseases and consulting an ophthalmologist can prevent many vision disorders. Given the importance of this issue, various datasets have been collected from the cornea to facilitate the process of making neural network models. However, most of the methods introduced in the past are computationally complex. In this study, we tried to increase the accessibility of deep neural network models. We did this at the most fundamental level, specifically by redesigning and optimizing the convolutional layers. By doing so, we created a new general model that incorporates our novel convolutional layer named ArConv layers. Thanks to the efficient performance of this new layer, the model has suitable complexity for use in mobile phones and can perform the task of diagnosing the presence of disease with high accuracy. The final model we present contains only 1.3 million parameters. In comparison to the MobileNetV2 model, which has 2.2 million parameters, our model demonstrated better accuracy when trained and evaluated on the RfMiD dataset under identical conditions, achieving an accuracy of 0.9328 versus 0.9266 on the RfMiD test set.

[208] Scaling Sequence-to-Sequence Generative Neural Rendering

Shikun Liu,Kam Woh Ng,Wonbong Jang,Jiadong Guo,Junlin Han,Haozhe Liu,Yiannis Douratsos,Juan C. Pérez,Zijian Zhou,Chi Phung,Tao Xiang,Juan-Manuel Pérez-Rúa

Main category: cs.CV

TL;DR: Kaleido是一种用于逼真的对象级和场景级神经渲染的生成模型家族,通过将3D视为视频的特化子域,采用序列到序列图像合成方法实现先进的视图合成性能。

Details Motivation: 现有的3D神经渲染方法通常依赖显式的3D表示和大量带相机标注的3D数据,限制了其泛化能力和可扩展性。因此,需要一种更统一、高效且无需显式3D结构的方法。 Method: Kaleido将3D建模视为视频序列中的一个特例,采用纯序列到序列的图像合成框架;使用掩码自回归机制,在仅解码器的修正流Transformer中统一处理多视角生成与视频建模,并利用大规模视频数据进行预训练。 Result: Kaleido在多个视图合成基准上达到最先进水平,在少视角设置下零样本性能显著优于其他生成方法,并在多视角设置下首次匹敌逐场景优化方法的质量。 Conclusion: Kaleido通过统一3D与视频建模框架,实现了无需显式3D表示的高质量生成视图合成,减少了对稀缺3D标注数据的依赖,展示了强大的泛化能力与应用潜力。 Abstract: We present Kaleido, a family of generative models designed for photorealistic, unified object- and scene-level neural rendering. Kaleido operates on the principle that 3D can be regarded as a specialised sub-domain of video, expressed purely as a sequence-to-sequence image synthesis task. Through a systemic study of scaling sequence-to-sequence generative neural rendering, we introduce key architectural innovations that enable our model to: i) perform generative view synthesis without explicit 3D representations; ii) generate any number of 6-DoF target views conditioned on any number of reference views via a masked autoregressive framework; and iii) seamlessly unify 3D and video modelling within a single decoder-only rectified flow transformer. Within this unified framework, Kaleido leverages large-scale video data for pre-training, which significantly improves spatial consistency and reduces reliance on scarce, camera-labelled 3D datasets -- all without any architectural modifications. Kaleido sets a new state-of-the-art on a range of view synthesis benchmarks. Its zero-shot performance substantially outperforms other generative methods in few-view settings, and, for the first time, matches the quality of per-scene optimisation methods in many-view settings.

[209] The best performance in the CARE 2025 -- Liver Task (LiSeg-Contrast): Contrast-Aware Semi-Supervised Segmentation with Domain Generalization and Test-Time Adaptation

Jincan Lou,Jingkun Chen,Haoquan Li,Hang Li,Wenjian Huang,Weihua Chen,Fan Wang,Jianguo Zhang

Main category: cs.CV

TL;DR: 提出CoSSeg-TTA框架,基于nnU-Netv2结合半监督mean teacher和域适应模块,在低标注条件下提升Gd-EOB-DTPA增强MRI肝脏分割的跨中心泛化能力。

Details Motivation: 由于标注数据有限、增强协议异质性和不同设备间的域偏移,准确的肝脏分割仍具挑战性。传统图像翻译方法在单模态场景中应用受限且易引入结构失真。 Method: 构建CoSSeg-TTA框架,基于nnU-Netv2,引入半监督mean teacher机制利用未标注数据;设计包含随机直方图风格迁移和可训练对比感知网络的域适应模块以增强域多样性;采用持续测试时自适应策略提升推理鲁棒性。 Result: 实验表明,该方法在Dice分数和Hausdorff距离上均优于nnU-Netv2基线模型,且在未见域上表现出强泛化能力,尤其适用于低标注场景。 Conclusion: CoSSeg-TTA有效缓解了跨中心MRI数据的域偏移问题,提升了单模态肝脏分割的稳定性和准确性,具有良好的临床应用潜力。 Abstract: Accurate liver segmentation from contrast-enhanced MRI is essential for diagnosis, treatment planning, and disease monitoring. However, it remains challenging due to limited annotated data, heterogeneous enhancement protocols, and significant domain shifts across scanners and institutions. Traditional image-to-image translation frameworks have made great progress in domain generalization, but their application is not straightforward. For example, Pix2Pix requires image registration, and cycle-GAN cannot be integrated seamlessly into segmentation pipelines. Meanwhile, these methods are originally used to deal with cross-modality scenarios, and often introduce structural distortions and suffer from unstable training, which may pose drawbacks in our single-modality scenario. To address these challenges, we propose CoSSeg-TTA, a compact segmentation framework for the GED4 (Gd-EOB-DTPA enhanced hepatobiliary phase MRI) modality built upon nnU-Netv2 and enhanced with a semi-supervised mean teacher scheme to exploit large amounts of unlabeled volumes. A domain adaptation module, incorporating a randomized histogram-based style appearance transfer function and a trainable contrast-aware network, enriches domain diversity and mitigates cross-center variability. Furthermore, a continual test-time adaptation strategy is employed to improve robustness during inference. Extensive experiments demonstrate that our framework consistently outperforms the nnU-Netv2 baseline, achieving superior Dice score and Hausdorff Distance while exhibiting strong generalization to unseen domains under low-annotation conditions.

[210] Concept-Based Masking: A Patch-Agnostic Defense Against Adversarial Patch Attacks

Ayushi Mehrotra,Derek Peng,Dipkamal Bhusal,Nidhi Rastogi

Main category: cs.CV

TL;DR: 提出一种基于概念的解释方法来防御对抗性补丁攻击,无需事先知道补丁的位置或大小,在保持高鲁棒性和清洁准确率的同时优于现有方法。

Details Motivation: 现有的对抗性补丁防御方法通常依赖于对补丁大小或位置的先验知识,限制了其实际应用。因此需要一种无需此类假设的通用防御机制。 Method: 利用基于概念的解释识别并抑制最具影响力的概念激活向量,以中和补丁效果,而不显式检测补丁本身,实现补丁无关的防御。 Result: 在Imagenette数据集和ResNet-50模型上,该方法在不同补丁大小和位置下均优于PatchCleanser等最先进方法,同时保持较高的鲁棒准确率和清洁准确率。 Conclusion: 结合可解释性与鲁棒性的概念驱动防御是一种有前景的方法,为应对对抗性补丁攻击提供了可扩展的安全策略。 Abstract: Adversarial patch attacks pose a practical threat to deep learning models by forcing targeted misclassifications through localized perturbations, often realized in the physical world. Existing defenses typically assume prior knowledge of patch size or location, limiting their applicability. In this work, we propose a patch-agnostic defense that leverages concept-based explanations to identify and suppress the most influential concept activation vectors, thereby neutralizing patch effects without explicit detection. Evaluated on Imagenette with a ResNet-50, our method achieves higher robust and clean accuracy than the state-of-the-art PatchCleanser, while maintaining strong performance across varying patch sizes and locations. Our results highlight the promise of combining interpretability with robustness and suggest concept-driven defenses as a scalable strategy for securing machine learning models against adversarial patch attacks.

[211] Flexible and Efficient Spatio-Temporal Transformer for Sequential Visual Place Recognition

Yu Kiu,Lau,Chao Chen,Ge Jin,Chen Feng

Main category: cs.CV

TL;DR: 提出Adapt-STformer,一种基于新型循环可变形Transformer编码器的序列视觉位置识别方法,兼顾灵活性与效率,支持变长序列、快速推理和低内存消耗。

Details Motivation: 现有基于Transformer的Seq-VPR方法在性能之外牺牲了灵活性和效率,难以满足实际中对可变序列长度、快速推理和低内存的需求。 Method: 提出Adapt-STformer,采用循环可变形Transformer编码器(Recurrent-DTE),通过迭代循环机制融合多帧时序信息,实现对不同序列长度的支持并降低计算开销。 Result: 在Nordland、Oxford和NuScenes数据集上实验表明,相比次优方法,Adapt-STformer将召回率提升高达17%,序列提取时间减少36%,内存使用降低35%。 Conclusion: Adapt-STformer在保持高性能的同时,实现了灵活性与效率的平衡,适用于实时Seq-VPR应用。 Abstract: Sequential Visual Place Recognition (Seq-VPR) leverages transformers to capture spatio-temporal features effectively; however, existing approaches prioritize performance at the expense of flexibility and efficiency. In practice, a transformer-based Seq-VPR model should be flexible to the number of frames per sequence (seq-length), deliver fast inference, and have low memory usage to meet real-time constraints. To our knowledge, no existing transformer-based Seq-VPR method achieves both flexibility and efficiency. To address this gap, we propose Adapt-STformer, a Seq-VPR method built around our novel Recurrent Deformable Transformer Encoder (Recurrent-DTE), which uses an iterative recurrent mechanism to fuse information from multiple sequential frames. This design naturally supports variable seq-lengths, fast inference, and low memory usage. Experiments on the Nordland, Oxford, and NuScenes datasets show that Adapt-STformer boosts recall by up to 17% while reducing sequence extraction time by 36% and lowering memory usage by 35% compared to the second-best baseline.

[212] ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation

Jay Zhangjie Wu,Xuanchi Ren,Tianchang Shen,Tianshi Cao,Kai He,Yifan Lu,Ruiyuan Gao,Enze Xie,Shiyi Lan,Jose M. Alvarez,Jun Gao,Sanja Fidler,Zian Wang,Huan Ling

Main category: cs.CV

TL;DR: 本文提出了ChronoEdit,一种将图像编辑重构为视频生成问题的新框架,利用预训练视频生成模型和时间推理机制,在保证物理一致性的前提下实现高质量图像编辑,并在新提出的PBench-Edit基准上优于现有方法。

Details Motivation: 现有的图像编辑方法在保持编辑后对象的物理一致性方面存在不足,尤其是在需要物理合理性的世界模拟任务中,亟需能够建模物体运动与交互隐含物理规律的方法。 Method: ChronoEdit将输入图像和编辑后图像视为视频的首尾帧,利用大规模预训练视频生成模型捕捉时间一致性;引入时间推理阶段,在推理时通过联合去噪目标帧与推理token来构建合理的编辑轨迹,并在若干步后丢弃推理token以降低计算开销。 Result: 在新构建的需物理一致性的图像编辑基准PBench-Edit上,ChronoEdit在视觉保真度和物理合理性方面均优于当前最先进的基线方法。 Conclusion: ChronoEdit通过将图像编辑转化为视频生成并引入时间推理机制,有效提升了编辑结果的物理一致性,为图像编辑提供了更符合真实物理规律的解决方案。 Abstract: Recent advances in large generative models have significantly advanced image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent. This capability is especially vital for world simulation related tasks. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, the target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Code and models for both the 14B and 2B variants of ChronoEdit will be released on the project page: https://research.nvidia.com/labs/toronto-ai/chronoedit

[213] CARE-PD: A Multi-Site Anonymized Clinical Dataset for Parkinson's Disease Gait Assessment

Vida Adeli,Ivan Klabucar,Javad Rajabi,Benjamin Filtjens,Soroush Mehraban,Diwei Wang,Hyewon Seo,Trung-Hieu Hoang,Minh N. Do,Candice Muller,Claudia Oliveira,Daniel Boari Coelho,Pieter Ginis,Moran Gilat,Alice Nieuwboer,Joke Spildooren,Lucas Mckay,Hyeokhyen Kwon,Gari Clifford,Christine Esper,Stewart Factor,Imari Genias,Amirhossein Dadashzadeh,Leia Shum,Alan Whone,Majid Mirmehdi,Andrea Iaboni,Babak Taati

Main category: cs.CV

TL;DR: CARE-PD是一个大规模、多中心的帕金森病3D步态数据集,支持临床评分预测和无监督动作预训练任务,显著提升模型性能。

Details Motivation: 帕金森病的客观步态评估受限于缺乏大型、多样化且有临床标注的运动数据集。 Method: 通过统一的预处理流程将来自9个队列的RGB视频或动作捕捉数据转换为匿名SMPL网格,并构建两个基准:监督式临床评分预测和无监督动作预训练任务。 Result: 在四种泛化协议下验证了临床评分预测性能;预训练显著降低了MPJPE(从60.8mm降至7.5mm),并将PD严重程度分类的macro-F1提升了17个百分点。 Conclusion: CARE-PD提供了高质量、多样化的临床步态数据,有效推动基于运动编码器的帕金森病评估方法发展。 Abstract: Objective gait assessment in Parkinson's Disease (PD) is limited by the absence of large, diverse, and clinically annotated motion datasets. We introduce CARE-PD, the largest publicly available archive of 3D mesh gait data for PD, and the first multi-site collection spanning 9 cohorts from 8 clinical centers. All recordings (RGB video or motion capture) are converted into anonymized SMPL meshes via a harmonized preprocessing pipeline. CARE-PD supports two key benchmarks: supervised clinical score prediction (estimating Unified Parkinson's Disease Rating Scale, UPDRS, gait scores) and unsupervised motion pretext tasks (2D-to-3D keypoint lifting and full-body 3D reconstruction). Clinical prediction is evaluated under four generalization protocols: within-dataset, cross-dataset, leave-one-dataset-out, and multi-dataset in-domain adaptation. To assess clinical relevance, we compare state-of-the-art motion encoders with a traditional gait-feature baseline, finding that encoders consistently outperform handcrafted features. Pretraining on CARE-PD reduces MPJPE (from 60.8mm to 7.5mm) and boosts PD severity macro-F1 by 17 percentage points, underscoring the value of clinically curated, diverse training data. CARE-PD and all benchmark code are released for non-commercial research at https://neurips2025.care-pd.ca/.

[214] GenAR: Next-Scale Autoregressive Generation for Spatial Gene Expression Prediction

Jiarui Ouyang,Yihui Wang,Yihang Gao,Yingxue Xu,Shu Yang,Hao Chen

Main category: cs.CV

TL;DR: GenAR是一种多尺度自回归框架,通过从粗到细的分层基因聚类和离散化建模,基于H&E图像预测空间转录组基因表达,克服了传统方法忽略共表达结构和连续回归偏差的问题,在多个数据集上实现了最先进的性能。

Details Motivation: 现有方法在预测空间转录组基因表达时通常独立预测每个基因并使用连续回归,忽略了基因间的共表达结构且导致生物上不合理的结果,因此需要一种更符合生物学特性的建模方式。 Method: 提出GenAR框架,将基因聚类为层次化组以捕捉跨基因依赖关系;采用无需码本的离散token生成建模基因表达,直接预测原始计数;结合组织学与空间嵌入进行条件解码;通过从粗到细的自回归策略进行多尺度预测。 Result: 在四个不同组织类型的Spatial Transcriptomics数据集上实验表明,GenAR在预测准确性方面优于现有方法,实现了最先进的性能。 Conclusion: GenAR通过离散化建模和多尺度自回归有效提升了基于H&E图像的基因表达预测精度,具有应用于精准医学和低成本分子谱分析的潜力。 Abstract: Spatial Transcriptomics (ST) offers spatially resolved gene expression but remains costly. Predicting expression directly from widely available Hematoxylin and Eosin (H&E) stained images presents a cost-effective alternative. However, most computational approaches (i) predict each gene independently, overlooking co-expression structure, and (ii) cast the task as continuous regression despite expression being discrete counts. This mismatch can yield biologically implausible outputs and complicate downstream analyses. We introduce GenAR, a multi-scale autoregressive framework that refines predictions from coarse to fine. GenAR clusters genes into hierarchical groups to expose cross-gene dependencies, models expression as codebook-free discrete token generation to directly predict raw counts, and conditions decoding on fused histological and spatial embeddings. From an information-theoretic perspective, the discrete formulation avoids log-induced biases and the coarse-to-fine factorization aligns with a principled conditional decomposition. Extensive experimental results on four Spatial Transcriptomics datasets across different tissue types demonstrate that GenAR achieves state-of-the-art performance, offering potential implications for precision medicine and cost-effective molecular profiling. Code is publicly available at https://github.com/oyjr/genar.

[215] RAP: 3D Rasterization Augmented End-to-End Planning

Lan Feng,Yang Gao,Eloi Zablocki,Quanyi Li,Wuyang Li,Sichao Liu,Matthieu Cord,Alexandre Alahi

Main category: cs.CV

TL;DR: 本文提出了一种名为RAP(Rasterization Augmented Planning)的轻量级数据增强方法,通过3D光栅化和特征空间对齐来提升端到端驾驶策略的闭环鲁棒性和长尾泛化能力,无需依赖计算昂贵的 photorealistic 渲染。

Details Motivation: 现有基于专家示范的端到端驾驶策略在部署后缺乏恢复数据,小错误会迅速累积成失败;而现有的基于神经渲染或游戏引擎的数字孪生方法过于耗时和昂贵,主要用于评估而非训练。作者认为训练中关键的是语义保真度和可扩展性,而非图像真实感。 Method: 提出3D光栅化技术,用轻量级的标注图元光栅化替代昂贵的渲染,生成反事实恢复动作和跨代理视角等增强数据,并引入Raster-to-Real特征空间对齐方法以缩小仿真到现实的差距。 Result: RAP在四个主流基准测试(NAVSIM v1/v2、Waymo Open Dataset视觉端到端驾驶、Bench2Drive)上实现了最先进的闭环鲁棒性和长尾泛化性能,排名首位。 Conclusion: 轻量级光栅化结合特征对齐足以有效扩展端到端驾驶训练,为昂贵的photorealistic渲染提供了一个实用且高效的替代方案。 Abstract: Imitation learning for end-to-end driving trains policies only on expert demonstrations. Once deployed in a closed loop, such policies lack recovery data: small mistakes cannot be corrected and quickly compound into failures. A promising direction is to generate alternative viewpoints and trajectories beyond the logged path. Prior work explores photorealistic digital twins via neural rendering or game engines, but these methods are prohibitively slow and costly, and thus mainly used for evaluation. In this work, we argue that photorealism is unnecessary for training end-to-end planners. What matters is semantic fidelity and scalability: driving depends on geometry and dynamics, not textures or lighting. Motivated by this, we propose 3D Rasterization, which replaces costly rendering with lightweight rasterization of annotated primitives, enabling augmentations such as counterfactual recovery maneuvers and cross-agent view synthesis. To transfer these synthetic views effectively to real-world deployment, we introduce a Raster-to-Real feature-space alignment that bridges the sim-to-real gap. Together, these components form Rasterization Augmented Planning (RAP), a scalable data augmentation pipeline for planning. RAP achieves state-of-the-art closed-loop robustness and long-tail generalization, ranking first on four major benchmarks: NAVSIM v1/v2, Waymo Open Dataset Vision-based E2E Driving, and Bench2Drive. Our results show that lightweight rasterization with feature alignment suffices to scale E2E training, offering a practical alternative to photorealistic rendering. Project page: https://alan-lanfeng.github.io/RAP/.

[216] Diffusion^2: Dual Diffusion Model with Uncertainty-Aware Adaptive Noise for Momentary Trajectory Prediction

Yuhao Luo,Yuang Zhang,Kehua Chen,Xinyu Zheng,Shucheng Zhang,Sikai Chen,Yinhai Wang

Main category: cs.CV

TL;DR: 本文提出了一种名为Diffusion^2的新框架,用于在缺乏足够观测数据的极端场景下进行行人轨迹预测。该方法通过两个串联的扩散模型分别进行历史轨迹反推和未来轨迹预测,并引入不确定性估计与自适应噪声调节机制,显著提升了预测精度,在ETH/UCY和Stanford Drone数据集上达到了最先进的性能。

Details Motivation: 在现实场景中,行人可能突然从盲区出现,导致观测数据不足(即瞬时轨迹),传统方法难以准确预测其运动轨迹,增加了交通事故风险。因此,研究极端情况下的行人轨迹预测对提升交通安全至关重要。 Method: 提出Diffusion^2框架,包含两个顺序连接的扩散模型:一个用于反向预测未观测到的历史轨迹,另一个用于正向预测未来轨迹;采用双头参数化机制估计生成轨迹的随机不确定性,并设计时间自适应噪声模块,动态调整前向扩散过程中的噪声尺度。 Result: Diffusion^2在ETH/UCY和Stanford Drone数据集上的瞬时轨迹预测任务中实现了最先进的性能,显著优于现有方法。 Conclusion: Diffusion^2通过结合双向扩散模型与不确定性建模,有效解决了观测数据不足情况下的行人轨迹预测难题,具有较强的鲁棒性和应用前景。 Abstract: Accurate pedestrian trajectory prediction is crucial for ensuring safety and efficiency in autonomous driving and human-robot interaction scenarios. Earlier studies primarily utilized sufficient observational data to predict future trajectories. However, in real-world scenarios, such as pedestrians suddenly emerging from blind spots, sufficient observational data is often unavailable (i.e. momentary trajectory), making accurate prediction challenging and increasing the risk of traffic accidents. Therefore, advancing research on pedestrian trajectory prediction under extreme scenarios is critical for enhancing traffic safety. In this work, we propose a novel framework termed Diffusion^2, tailored for momentary trajectory prediction. Diffusion^2 consists of two sequentially connected diffusion models: one for backward prediction, which generates unobserved historical trajectories, and the other for forward prediction, which forecasts future trajectories. Given that the generated unobserved historical trajectories may introduce additional noise, we propose a dual-head parameterization mechanism to estimate their aleatoric uncertainty and design a temporally adaptive noise module that dynamically modulates the noise scale in the forward diffusion process. Empirically, Diffusion^2 sets a new state-of-the-art in momentary trajectory prediction on ETH/UCY and Stanford Drone datasets.

[217] MorphoSim: An Interactive, Controllable, and Editable Language-guided 4D World Simulator

Xuehai He,Shijie Zhou,Thivyanth Venkateswaran,Kaizhi Zheng,Ziyu Wan,Achuta Kadambi,Xin Eric Wang

Main category: cs.CV

TL;DR: MorphoSim 是一个语言引导的框架,能够根据自然语言指令生成具有多视角一致性和对象级控制的4D动态场景,支持交互式编辑而无需完全重新生成。

Details Motivation: 现有的文本到视频模型局限于2D视图且交互性差,缺乏对动态环境的可控性和可编辑性,难以满足机器人训练和评估的需求。 Method: 提出MorphoSim框架,结合轨迹引导生成与特征场蒸馏,在4D场景中实现多视角一致性与对象级操作(如移动、重着色、删除),并通过自然语言指令驱动场景生成与编辑。 Result: 实验证明MorphoSim在保持高场景保真度的同时,实现了良好的可控性和可编辑性,支持任意视角观察和交互式修改。 Conclusion: MorphoSim为构建可控制、可编辑的时空环境提供了有效解决方案,适用于机器人训练、任务设计和可重现评估。 Abstract: World models that support controllable and editable spatiotemporal environments are valuable for robotics, enabling scalable training data, repro ducible evaluation, and flexible task design. While recent text-to-video models generate realistic dynam ics, they are constrained to 2D views and offer limited interaction. We introduce MorphoSim, a language guided framework that generates 4D scenes with multi-view consistency and object-level controls. From natural language instructions, MorphoSim produces dynamic environments where objects can be directed, recolored, or removed, and scenes can be observed from arbitrary viewpoints. The framework integrates trajectory-guided generation with feature field dis tillation, allowing edits to be applied interactively without full re-generation. Experiments show that Mor phoSim maintains high scene fidelity while enabling controllability and editability. The code is available at https://github.com/eric-ai-lab/Morph4D.

[218] Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting

Xuyang Guo,Zekai Huang,Zhenmei Shi,Zhao Song,Jiahao Zhang

Main category: cs.CV

TL;DR: 本文提出一个名为VLMCountBench的基准,用于评估视觉语言模型(VLMs)在基本几何形状上的计数能力,发现当前VLMs在单一形状计数上表现良好,但在多类型形状组合(即组合计数)时存在显著缺陷。

Details Motivation: 尽管VLMs在多种视觉任务中表现出色,但其是否能准确计数物体仍不清楚。本文旨在探究这一基本能力,并揭示当前模型的潜在局限性。 Method: 构建了一个简洁有效的基准VLMCountBench,使用基本几何形状及其组合,在严格控制变量的条件下系统研究颜色、大小和提示优化等因素对计数性能的影响。 Result: 实验结果表明,当仅存在一种形状时,VLMs能够可靠计数;但在多种形状共存的情况下,计数能力显著下降,暴露出模型在组合计数上的严重缺陷。 Conclusion: 当前的VLMs在组合计数任务上存在根本性的经验局限,这为未来提升模型的细粒度视觉理解能力指明了重要研究方向。 Abstract: Vision-Language Models (VLMs) have become a central focus of today's AI community, owing to their impressive abilities gained from training on large-scale vision-language data from the Web. These models have demonstrated strong performance across diverse tasks, including image understanding, video understanding, complex visual reasoning, and embodied AI. Despite these noteworthy successes, a fundamental question remains: Can VLMs count objects correctly? In this paper, we introduce a simple yet effective benchmark, VLMCountBench, designed under a minimalist setting with only basic geometric shapes (e.g., triangles, circles) and their compositions, focusing exclusively on counting tasks without interference from other factors. We adopt strict independent variable control and systematically study the effects of simple properties such as color, size, and prompt refinement in a controlled ablation. Our empirical results reveal that while VLMs can count reliably when only one shape type is present, they exhibit substantial failures when multiple shape types are combined (i.e., compositional counting). This highlights a fundamental empirical limitation of current VLMs and motivates important directions for future research.

[219] CodeFormer++: Blind Face Restoration Using Deformable Registration and Deep Metric Learning

Venkata Bharath Reddy Reddem,Akshay P Sarashetti,Ranjith Merugu,Amit Satish Unde

Main category: cs.CV

TL;DR: 本文提出了CodeFormer++,一种通过分解盲脸修复为三个子任务(身份保持、高质量生成和动态融合)来优化生成先验利用的新框架,在视觉质量和身份一致性方面均表现出优越性能。

Details Motivation: 现有盲脸修复方法在视觉质量与身份保真度之间存在权衡,容易导致身份失真或退化去除不充分。 Method: 将盲脸修复分解为三个子任务:身份保持的面部修复、高质量面部生成以及身份特征与纹理细节的动态融合;引入基于学习的可变形人脸配准模块、纹理引导的修复网络,并结合深度度量学习生成有信息量的正样本和难负样本以更好融合特征。 Result: 在真实和合成数据集上的大量实验表明,CodeFormer++在视觉保真度和身份一致性方面均优于现有方法。 Conclusion: CodeFormer++通过解耦和动态融合策略,有效平衡了生成质量与身份保持,显著提升了盲脸修复的整体性能。 Abstract: Blind face restoration (BFR) has attracted increasing attention with the rise of generative methods. Most existing approaches integrate generative priors into the restoration pro- cess, aiming to jointly address facial detail generation and identity preservation. However, these methods often suffer from a trade-off between visual quality and identity fidelity, leading to either identity distortion or suboptimal degradation removal. In this paper, we present CodeFormer++, a novel framework that maximizes the utility of generative priors for high-quality face restoration while preserving identity. We decompose BFR into three sub-tasks: (i) identity- preserving face restoration, (ii) high-quality face generation, and (iii) dynamic fusion of identity features with realistic texture details. Our method makes three key contributions: (1) a learning-based deformable face registration module that semantically aligns generated and restored faces; (2) a texture guided restoration network to dynamically extract and transfer the texture of generated face to boost the quality of identity-preserving restored face; and (3) the integration of deep metric learning for BFR with the generation of informative positive and hard negative samples to better fuse identity- preserving and generative features. Extensive experiments on real-world and synthetic datasets demonstrate that, the pro- posed CodeFormer++ achieves superior performance in terms of both visual fidelity and identity consistency.

[220] A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering

Yuanhao Zou,Shengji Jin,Andong Deng,Youpeng Zhao,Jun Wang,Chen Chen

Main category: cs.CV

TL;DR: 提出了一种无需训练的自适应、迭代式基于推理的帧选择方法A.I.R.,结合深度语义分析与高效计算,在视频问答任务中实现了更高准确性和效率。

Details Motivation: 现有帧选择方法在准确性与计算成本之间存在权衡:轻量模型难以捕捉复杂查询的细节,而基于VLM的方法虽准确但计算开销过大。 Method: 利用强大的视觉语言模型对复杂查询进行深度语义分析,并设计一个成本高效的迭代循环,每次仅处理最具潜力的小批量帧,实现自适应帧选择。 Result: 在多个视频问答基准上的实验表明,该方法优于现有的帧选择方法,显著提升基础VLM的性能,并在计算效率上优于其他基于VLM的技术。 Conclusion: A.I.R. 在不牺牲准确性的前提下大幅降低计算成本,为视觉语言模型在视频问答中的高效应用提供了有效解决方案。 Abstract: Effectively applying Vision-Language Models (VLMs) to Video Question Answering (VideoQA) hinges on selecting a concise yet comprehensive set of frames, as processing entire videos is computationally infeasible. However, current frame selection methods face a critical trade-off: approaches relying on lightweight similarity models, such as CLIP, often fail to capture the nuances of complex queries, resulting in inaccurate similarity scores that cannot reflect the authentic query-frame relevance, which further undermines frame selection. Meanwhile, methods that leverage a VLM for deeper analysis achieve higher accuracy but incur prohibitive computational costs. To address these limitations, we propose A.I.R., a training-free approach for Adaptive, Iterative, and Reasoning-based frame selection. We leverage a powerful VLM to perform deep, semantic analysis on complex queries, and this analysis is deployed within a cost-effective iterative loop that processes only a small batch of the most high-potential frames at a time. Extensive experiments on various VideoQA benchmarks demonstrate that our approach outperforms existing frame selection methods, significantly boosts the performance of the foundation VLM, and achieves substantial gains in computational efficiency over other VLM-based techniques.

[221] REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization

Qiyuan He,Yicong Li,Haotian Ye,Jinghao Wang,Xinyao Liao,Pheng-Ann Heng,Stefano Ermon,James Zou,Angela Yao

Main category: cs.CV

TL;DR: 本文提出了一种名为reAR的简单训练策略,通过引入逐token的正则化目标来缓解视觉自回归生成中的生成器-解码器不一致问题,显著提升了生成性能,在ImageNet上达到了与大规模扩散模型相媲美的效果。

Details Motivation: 视觉自回归生成模型性能落后于扩散模型,以往研究归因于tokenizer限制和光栅化顺序,本文从生成器与tokenizer不一致的角度识别出核心瓶颈。 Method: 提出reAR方法,在预测下一个token时,同时训练因果Transformer恢复当前token的视觉嵌入并预测目标token在噪声上下文中的嵌入,无需修改tokenizer、生成顺序或推理流程。 Result: 在ImageNet上,gFID从3.02降至1.86,IS提升至316.9;使用先进tokenizer时,仅177M参数即达到1.42 gFID,性能匹敌675M参数的SOTA扩散模型。 Conclusion: reAR有效缓解了生成器-解码器不一致性问题,显著提升视觉自回归模型生成质量,且兼容现有架构,具有高实用价值。 Abstract: Visual autoregressive (AR) generation offers a promising path toward unifying vision and language models, yet its performance remains suboptimal against diffusion models. Prior work often attributes this gap to tokenizer limitations and rasterization ordering. In this work, we identify a core bottleneck from the perspective of generator-tokenizer inconsistency, i.e., the AR-generated tokens may not be well-decoded by the tokenizer. To address this, we propose reAR, a simple training strategy introducing a token-wise regularization objective: when predicting the next token, the causal transformer is also trained to recover the visual embedding of the current token and predict the embedding of the target token under a noisy context. It requires no changes to the tokenizer, generation order, inference pipeline, or external models. Despite its simplicity, reAR substantially improves performance. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard rasterization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance with larger state-of-the-art diffusion models (675M).

[222] SPEGNet: Synergistic Perception-Guided Network for Camouflaged Object Detection

Baber Jan,Saeed Anwar,Aiman H. El-Maleh,Abdul Jabbar Siddiqui,Abdul Bais

Main category: cs.CV

TL;DR: 本文提出了一种名为SPEGNet的统一网络架构,用于解决伪装物体检测中的碎片化问题,通过多尺度特征融合、通道校准和空间增强,在保持实时推理速度的同时实现了高精度边界检测。

Details Motivation: 现有方法依赖于累积复杂的模块,导致计算负担重且性能提升有限,同时低分辨率处理丢失了伪装物体的关键细节。 Method: SPEGNet采用统一设计,结合通道校准与空间增强来整合多尺度特征,利用上下文丰富的表征直接生成边界,并通过渐进式 refinement 实现中间分辨率下的尺度自适应边缘调制。 Result: 在CAMO、COD10K和NC4K数据集上分别取得了0.887、0.890和0.895的Sα分数,具备实时推理能力,且在小物体、大物体、遮挡和模糊边界等复杂场景下表现优异。 Conclusion: SPEGNet在边界精度与区域一致性之间取得了良好平衡,有效提升了伪装物体检测的性能与效率。 Abstract: Camouflaged object detection segments objects with intrinsic similarity and edge disruption. Current detection methods rely on accumulated complex components. Each approach adds components such as boundary modules, attention mechanisms, and multi-scale processors independently. This accumulation creates a computational burden without proportional gains. To manage this complexity, they process at reduced resolutions, eliminating fine details essential for camouflage. We present SPEGNet, addressing fragmentation through a unified design. The architecture integrates multi-scale features via channel calibration and spatial enhancement. Boundaries emerge directly from context-rich representations, maintaining semantic-spatial alignment. Progressive refinement implements scale-adaptive edge modulation with peak influence at intermediate resolutions. This design strikes a balance between boundary precision and regional consistency. SPEGNet achieves 0.887 $S_\alpha$ on CAMO, 0.890 on COD10K, and 0.895 on NC4K, with real-time inference speed. Our approach excels across scales, from tiny, intricate objects to large, pattern-similar ones, while handling occlusion and ambiguous boundaries. Code, model weights, and results are available on \href{https://github.com/Baber-Jan/SPEGNet}{https://github.com/Baber-Jan/SPEGNet}.

[223] MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models

Soo Yong Kim,Suin Cho,Vincent-Daniel Yun,Gyeongyeon Hwang

Main category: cs.CV

TL;DR: MedCLM是一个自动化管道,通过结合病灶框、器官分割和结构化推理,将检测数据集转化为大规模医学视觉问答(VQA)数据,并提出集成的CoT-课程学习策略,在多个医学VQA基准上达到最先进性能。

Details Motivation: 弥合临床诊断推理与人工智能在医学影像中的应用之间的差距,提升模型的可解释性和临床对齐能力。 Method: 提出MedCLP管道,生成带链式思维(CoT)推理的医学VQA数据,并设计三阶段课程学习策略(易、中、难),逐步提升模型的视觉定位与推理能力。 Result: 在多个医学VQA基准上实现了最先进的性能,验证了该方法的有效性和可扩展性。 Conclusion: MedCLM为构建具有临床一致性的可扩展医学视觉语言模型提供了有效框架。 Abstract: Bridging clinical diagnostic reasoning with AI remains a central challenge in medical imaging. We introduce MedCLM, an automated pipeline that converts detection datasets into large-scale medical visual question answering (VQA) data with Chain-of-Thought (CoT) reasoning by linking lesion boxes to organ segmentation and structured rationales. These contextual signals enable medical vision-language models to generate question-answer pairs with step-by-step reasoning. To utilize this data effectively, we propose an Integrated CoT-Curriculum Strategy composed of an Easy stage with explicit lesion boxes for visual grounding, a Medium stage that encourages implicit localization, and a Hard stage for weakly supervised reasoning. Experimental results demonstrate that MedCLM attains state-of-the-art performance on several medical VQA benchmarks, providing a scalable framework for developing clinically aligned medical vision-language models.

[224] Visual Representations inside the Language Model

Benlin Liu,Amita Kamath,Madeleine Grunde-McLaughlin,Winson Han,Ranjay Krishna

Main category: cs.CV

TL;DR: 该论文研究了多模态语言模型(MLMs)在感知任务中表现不佳的原因,发现视觉信息在语言模型中的传递和处理存在信息损失和干扰,并提出通过控制视觉信息流来提升感知能力。

Details Motivation: 尽管已有工作分析了视觉Transformer的编码器和激活机制,但尚不清楚为何多模态语言模型在感知密集型任务上表现较差。本文旨在从视觉键值令牌的处理角度揭示这一问题的根源。 Method: 作者分析了主流MLM(如LLaVA、Qwen2.5-VL等)中视觉键值令牌的信息流动,评估其在分割、语义对应、时序对应和指代表达检测等任务上的零样本表现,并比较了语言模型与原始视觉编码器(SigLIP)之间的视觉信息保留程度,同时探讨了前缀文本对视觉表示的影响。 Result: 研究发现:1)图像值令牌本身足以支持多种感知任务的零样本执行;2)语言模型虽增强了部分视觉信息,但整体仍比未经微调的视觉编码器保留更少的视觉信息;3)后期层中的输入无关图像键令牌引入了降低感知能力的伪影;4)添加文本前缀可改善视觉表征的感知能力;5)在BLINK基准中,33.3%的艺术风格问题未能将模型内部存在的感知信息输出。 Conclusion: 视觉信息在MLM的语言模型部分被削弱或抑制,限制了整体感知性能。若能更好控制视觉信息流,MLM的感知能力可显著提升。该结果为理解多模态系统中键值令牌的作用提供了新视角,并指出了改进训练策略的方向。 Abstract: Despite interpretability work analyzing VIT encoders and transformer activations, we don't yet understand why Multimodal Language Models (MLMs) struggle on perception-heavy tasks. We offer an under-studied perspective by examining how popular MLMs (LLaVA-OneVision, Qwen2.5-VL, and Llama-3-LLaVA-NeXT) process their visual key-value tokens. We first study the flow of visual information through the language model, finding that image value tokens encode sufficient information to perform several perception-heavy tasks zero-shot: segmentation, semantic correspondence, temporal correspondence, and referring expression detection. We find that while the language model does augment the visual information received from the projection of input visual encodings-which we reveal correlates with overall MLM perception capability-it contains less visual information on several tasks than the equivalent visual encoder (SigLIP) that has not undergone MLM finetuning. Further, we find that the visual information corresponding to input-agnostic image key tokens in later layers of language models contains artifacts which reduce perception capability of the overall MLM. Next, we discuss controlling visual information in the language model, showing that adding a text prefix to the image input improves perception capabilities of visual representations. Finally, we reveal that if language models were able to better control their visual information, their perception would significantly improve; e.g., in 33.3% of Art Style questions in the BLINK benchmark, perception information present in the language model is not surfaced to the output! Our findings reveal insights into the role of key-value tokens in multimodal systems, paving the way for deeper mechanistic interpretability of MLMs and suggesting new directions for training their visual encoder and language model components.

[225] VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery

Nonghai Zhang,Zeyu Zhang,Jiazi Wang,Yang Zhao,Hao Tang

Main category: cs.CV

TL;DR: 本文提出了首个用于古希腊陶罐分析的3D视觉问答数据集VaseVQA-3D,并构建了针对性的VaseVLM模型,通过领域自适应训练显著提升了对3D陶罐文物的理解能力。

Details Motivation: 现有视觉语言模型在文化遗产等专业领域面临数据稀缺和领域知识不足的问题,难以有效处理如3D陶罐文物分析等特定任务。 Method: 构建了包含664个古希腊陶罐3D模型及对应问答数据的VaseVQA-3D数据集,并提出VaseVLM模型,采用领域自适应训练方法提升模型在陶罐文物分析中的性能。 Result: 实验结果显示,相比之前的最先进方法,在VaseVQA-3D数据集上R@1指标提升了12.8%,词汇相似性提升了6.6%,显著增强了对3D陶罐文物的识别与理解能力。 Conclusion: 所提出的数据集和模型为数字文化遗产保护研究提供了新的技术路径,有效解决了专业领域中视觉语言模型因数据稀缺导致的性能瓶颈问题。 Abstract: Vision-Language Models (VLMs) have achieved significant progress in multimodal understanding tasks, demonstrating strong capabilities particularly in general tasks such as image captioning and visual reasoning. However, when dealing with specialized cultural heritage domains like 3D vase artifacts, existing models face severe data scarcity issues and insufficient domain knowledge limitations. Due to the lack of targeted training data, current VLMs struggle to effectively handle such culturally significant specialized tasks. To address these challenges, we propose the VaseVQA-3D dataset, which serves as the first 3D visual question answering dataset for ancient Greek pottery analysis, collecting 664 ancient Greek vase 3D models with corresponding question-answer data and establishing a complete data construction pipeline. We further develop the VaseVLM model, enhancing model performance in vase artifact analysis through domain-adaptive training. Experimental results validate the effectiveness of our approach, where we improve by 12.8% on R@1 metrics and by 6.6% on lexical similarity compared with previous state-of-the-art on the VaseVQA-3D dataset, significantly improving the recognition and understanding of 3D vase artifacts, providing new technical pathways for digital heritage preservation research.

[226] Paper2Video: Automatic Video Generation from Scientific Papers

Zeyu Zhu,Kevin Qinghong Lin,Mike Zheng Shou

Main category: cs.CV

TL;DR: 本文提出了PaperTalker,首个用于学术报告视频生成的多智能体框架,并构建了包含101篇论文及其配套视频、幻灯片和演讲者元数据的基准数据集。通过设计四种评估指标,验证了该方法在信息忠实度和表达效果上优于现有基线,推动了学术视频自动生成的发展。

Details Motivation: 学术报告视频制作费时费力,且需处理来自论文的密集多模态信息及多通道对齐问题,现有方法难以满足自动化生成需求。 Method: 提出PaperTalker多智能体框架,集成幻灯片生成、基于树搜索的布局优化、光标定位、字幕生成、语音合成与讲话人头像渲染,并采用逐页并行生成提升效率;同时构建包含101个真实配对样本的基准数据集,并设计四种针对性评估指标(Meta Similarity, PresentArena, PresentQuiz, IP Memory)。 Result: 在Paper2Video数据集上的实验表明,所生成的视频在信息保真度和传达效果上优于现有基线方法,显著提升自动化学术视频生成的质量与实用性。 Conclusion: PaperTalker为学术报告视频的自动化生成提供了有效解决方案,通过多智能体协同与专用评估体系,实现了从研究论文到高质量讲解视频的端到端生成,具有较强应用前景。 Abstract: Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2 to 10 minutes video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and human talker. To address these challenges, we introduce PaperTalker, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics--Meta Similarity, PresentArena, PresentQuiz, and IP Memory--to measure how videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement by a novel effective tree search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.

[227] TBStar-Edit: From Image Editing Pattern Shifting to Consistency Enhancement

Hao Fang,Zechao Zhan,Weixin Feng,Ziwei Huang,XuBin Li,Tiezheng Ge

Main category: cs.CV

TL;DR: 本文提出了一种面向电商领域的图像编辑模型TBStar-Edit,通过数据工程、模型架构设计和两阶段训练策略,在保持商品外观和布局一致性的前提下实现了高保真图像编辑。

Details Motivation: 通用图像编辑模型在电商场景中存在一致性不足的问题,难以满足对商品外观和布局高保真编辑的需求。 Method: 提出TBStar-Edit模型:构建高质量数据处理流程;设计包含基础模型、模式迁移模块和一致性增强模块的分层架构;采用两阶段训练策略,分别优化编辑能力和一致性保持。 Result: 在自建电商基准上进行实验,TBStar-Edit在客观指标(VIE Score)和用户主观偏好方面均优于现有通用图像编辑模型。 Conclusion: TBStar-Edit有效解决了通用模型在电商场景中的一致性问题,显著提升了电商图像编辑的质量与实用性。 Abstract: Recent advances in image generation and editing technologies have enabled state-of-the-art models to achieve impressive results in general domains. However, when applied to e-commerce scenarios, these general models often encounter consistency limitations. To address this challenge, we introduce TBStar-Edit, an new image editing model tailored for the e-commerce domain. Through rigorous data engineering, model architecture design and training strategy, TBStar-Edit achieves precise and high-fidelity image editing while maintaining the integrity of product appearance and layout. Specifically, for data engineering, we establish a comprehensive data construction pipeline, encompassing data collection, construction, filtering, and augmentation, to acquire high-quality, instruction-following, and strongly consistent editing data to support model training. For model architecture design, we design a hierarchical model framework consisting of a base model, pattern shifting modules, and consistency enhancement modules. For model training, we adopt a two-stage training strategy to enhance the consistency preservation: first stage for editing pattern shifting, and second stage for consistency enhancement. Each stage involves training different modules with separate datasets. Finally, we conduct extensive evaluations of TBStar-Edit on a self-proposed e-commerce benchmark, and the results demonstrate that TBStar-Edit outperforms existing general-domain editing models in both objective metrics (VIE Score) and subjective user preference.

[228] Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation

Zijing Hu,Yunze Tong,Fengda Zhang,Junkun Yuan,Jun Xiao,Kun Kuang

Main category: cs.CV

TL;DR: 提出异步扩散模型,通过为不同像素分配不同的时间步长来改善文本到图像的对齐效果。

Details Motivation: 现有的扩散模型在生成图像时难以准确地将生成内容与输入提示对齐,主要因为所有像素同步去噪导致相关区域无法获得清晰的上下文信息。 Method: 引入异步扩散模型框架,动态调整各个像素的时间步调度,使与提示相关的区域比无关区域更渐进地去噪,从而利用更清晰的像素间上下文。 Result: 实验表明,所提出的异步扩散模型能显著提升多种提示下的文本到图像对齐性能。 Conclusion: 异步扩散模型有效解决了同步去噪带来的上下文引用不清问题,提升了生成图像与文本提示的对齐度。 Abstract: Diffusion models have achieved impressive results in generating high-quality images. Yet, they often struggle to faithfully align the generated images with the input prompts. This limitation arises from synchronous denoising, where all pixels simultaneously evolve from random noise to clear images. As a result, during generation, the prompt-related regions can only reference the unrelated regions at the same noise level, failing to obtain clear context and ultimately impairing text-to-image alignment. To address this issue, we propose asynchronous diffusion models -- a novel framework that allocates distinct timesteps to different pixels and reformulates the pixel-wise denoising process. By dynamically modulating the timestep schedules of individual pixels, prompt-related regions are denoised more gradually than unrelated regions, thereby allowing them to leverage clearer inter-pixel context. Consequently, these prompt-related regions achieve better alignment in the final images. Extensive experiments demonstrate that our asynchronous diffusion models can significantly improve text-to-image alignment across diverse prompts. The code repository for this work is available at https://github.com/hu-zijing/AsynDM.

[229] TAG:Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling

Hyunmin Cho,Donghoon Ahn,Susung Hong,Jee Eun Kim,Seungryong Kim,Kyong Hwan Jin

Main category: cs.CV

TL;DR: 提出了一种名为Tangential Amplifying Guidance (TAG)的新型扩散模型引导方法,通过放大估计得分的切向分量来校正采样轨迹,提升生成质量。

Details Motivation: 现有扩散模型在图像生成中存在语义不一致或幻觉问题,且当前引导方法常依赖外部信号或结构修改,计算开销大。 Method: 利用中间样本作为投影基,放大相对于该基的得分切向分量,并基于一阶泰勒展开形式化引导过程。 Result: TAG在不修改模型结构的情况下提升了采样保真度,计算开销极小,且适用于多种架构。 Conclusion: TAG是一种即插即用、架构无关的高效引导方法,为扩散模型引导提供了新视角。 Abstract: Recent diffusion models achieve the state-of-the-art performance in image generation, but often suffer from semantic inconsistencies or hallucinations. While various inference-time guidance methods can enhance generation, they often operate indirectly by relying on external signals or architectural modifications, which introduces additional computational overhead. In this paper, we propose Tangential Amplifying Guidance (TAG), a more efficient and direct guidance method that operates solely on trajectory signals without modifying the underlying diffusion model. TAG leverages an intermediate sample as a projection basis and amplifies the tangential components of the estimated scores with respect to this basis to correct the sampling trajectory. We formalize this guidance process by leveraging a first-order Taylor expansion, which demonstrates that amplifying the tangential component steers the state toward higher-probability regions, thereby reducing inconsistencies and enhancing sample quality. TAG is a plug-and-play, architecture-agnostic module that improves diffusion sampling fidelity with minimal computational addition, offering a new perspective on diffusion guidance.

[230] Conditional Representation Learning for Customized Tasks

Honglin Liu,Chao Sun,Peng Hu,Yunfan Li,Xi Peng

Main category: cs.CV

TL;DR: 提出条件表示学习(CRL),利用大语言模型生成描述文本构建语义基,通过视觉-语言模型将图像表示投影到用户指定条件的特征空间,以实现定制化且高效的表示学习。

Details Motivation: 现有通用表示学习方法难以对齐特定下游任务需求,而监督微调成本高昂,因此需要一种低成本、可定制的表示学习方法。 Method: 基于语义空间由其基决定的洞察,使用大语言模型根据用户指定准则生成描述性文本以构建条件语义基,并利用视觉-语言模型将图像表示投影到该条件特征空间。 Result: 在分类和检索任务上实验表明,CRL在多种定制任务中优于现有方法,具有良好的通用性和性能优势。 Conclusion: CRL提供了一种灵活、高效的方式来自定义表示学习,无需微调即可适应不同语义需求,在实际应用中展现出巨大潜力。 Abstract: Conventional representation learning methods learn a universal representation that primarily captures dominant semantics, which may not always align with customized downstream tasks. For instance, in animal habitat analysis, researchers prioritize scene-related features, whereas universal embeddings emphasize categorical semantics, leading to suboptimal results. As a solution, existing approaches resort to supervised fine-tuning, which however incurs high computational and annotation costs. In this paper, we propose Conditional Representation Learning (CRL), aiming to extract representations tailored to arbitrary user-specified criteria. Specifically, we reveal that the semantics of a space are determined by its basis, thereby enabling a set of descriptive words to approximate the basis for a customized feature space. Building upon this insight, given a user-specified criterion, CRL first employs a large language model (LLM) to generate descriptive texts to construct the semantic basis, then projects the image representation into this conditional feature space leveraging a vision-language model (VLM). The conditional representation better captures semantics for the specific criterion, which could be utilized for multiple customized tasks. Extensive experiments on classification and retrieval tasks demonstrate the superiority and generality of the proposed CRL. The code is available at https://github.com/XLearning-SCU/2025-NeurIPS-CRL.

[231] Pathology-CoT: Learning Visual Chain-of-Thought Agent from Expert Whole Slide Image Diagnosis Behavior

Sheng Wang,Ruiming Wu,Charles Herndon,Yihang Liu,Shunsuke Koga,Jeanne Shen,Zhi Huang

Main category: cs.CV

TL;DR: 本文提出了一种基于病理学家浏览行为的可扩展监督框架,通过AI会话记录器收集全切片图像(WSI)查看日志,并构建Pathologist-o3智能体系统,在胃肠道淋巴结转移检测中表现优于现有模型。

Details Motivation: 现有的病理学基础模型缺乏对专家实际观察行为的监督数据,难以实现真正实用的、具备决策和解释能力的智能诊断系统。 Method: 开发AI Session Recorder工具记录病理医生在常规WSI查看中的导航行为,将日志转化为标准化命令和边界框,并结合人工审核生成Pathology-CoT数据集;基于此数据训练两阶段智能体Pathologist-o3进行区域推荐与行为引导推理。 Result: 在胃肠道淋巴结转移检测任务中,Pathologist-o3达到84.5%精确率、100.0%召回率和75.4%准确率,性能超越OpenAI o3模型,并在不同骨干网络上表现出良好的泛化能力。 Conclusion: 该研究首次实现了基于真实专家行为的可扩展病理智能体系统,为构建人机对齐、可升级的临床AI提供了可行路径。 Abstract: Diagnosing a whole-slide image is an interactive, multi-stage process involving changes in magnification and movement between fields. Although recent pathology foundation models are strong, practical agentic systems that decide what field to examine next, adjust magnification, and deliver explainable diagnoses are still lacking. The blocker is data: scalable, clinically aligned supervision of expert viewing behavior that is tacit and experience-based, not written in textbooks or online, and therefore absent from large language model training. We introduce the AI Session Recorder, which works with standard WSI viewers to unobtrusively record routine navigation and convert the viewer logs into standardized behavioral commands (inspect or peek at discrete magnifications) and bounding boxes. A lightweight human-in-the-loop review turns AI-drafted rationales into the Pathology-CoT dataset, a form of paired "where to look" and "why it matters" supervision produced at roughly six times lower labeling time. Using this behavioral data, we build Pathologist-o3, a two-stage agent that first proposes regions of interest and then performs behavior-guided reasoning. On gastrointestinal lymph-node metastasis detection, it achieved 84.5% precision, 100.0% recall, and 75.4% accuracy, exceeding the state-of-the-art OpenAI o3 model and generalizing across backbones. To our knowledge, this constitutes one of the first behavior-grounded agentic systems in pathology. Turning everyday viewer logs into scalable, expert-validated supervision, our framework makes agentic pathology practical and establishes a path to human-aligned, upgradeable clinical AI.

[232] A Spatial-Spectral-Frequency Interactive Network for Multimodal Remote Sensing Classification

Hao Liu,Yunhao Gao,Wei Li,Mingyang Zhang,Maoguo Gong,Lorenzo Bruzzone

Main category: cs.CV

TL;DR: 本文提出了一种用于多模态遥感图像分类的时空频交互网络(S²Fin),通过引入频域学习来增强关键和稀疏细节特征的提取,结合空间、光谱和频率域的融合模块,在四个基准数据集上表现出优于现有方法的分类性能。

Details Motivation: 现有特征融合技术在处理异构且冗余的多模态遥感图像时,难以有效提取结构和细节特征,因此需要引入频域学习以更好建模稀疏细节信息。 Method: 提出高频频稀疏增强Transformer,采用稀疏空谱注意力优化高频滤波器参数;设计两级空频融合策略,包括自适应频率通道模块和高频共振掩码;并引入空谱注意力融合模块增强网络中间层特征提取。 Result: 在四个标注数据有限的多模态遥感数据集上实验表明,S²Fin在分类性能上优于当前最先进的方法。 Conclusion: S²Fin通过融合空间、光谱与频率域信息,有效提升了多模态遥感图像分类的精度,尤其在细节和结构特征提取方面表现突出。 Abstract: Deep learning-based methods have achieved significant success in remote sensing Earth observation data analysis. Numerous feature fusion techniques address multimodal remote sensing image classification by integrating global and local features. However, these techniques often struggle to extract structural and detail features from heterogeneous and redundant multimodal images. With the goal of introducing frequency domain learning to model key and sparse detail features, this paper introduces the spatial-spectral-frequency interaction network (S$^2$Fin), which integrates pairwise fusion modules across the spatial, spectral, and frequency domains. Specifically, we propose a high-frequency sparse enhancement transformer that employs sparse spatial-spectral attention to optimize the parameters of the high-frequency filter. Subsequently, a two-level spatial-frequency fusion strategy is introduced, comprising an adaptive frequency channel module that fuses low-frequency structures with enhanced high-frequency details, and a high-frequency resonance mask that emphasizes sharp edges via phase similarity. In addition, a spatial-spectral attention fusion module further enhances feature extraction at intermediate layers of the network. Experiments on four benchmark multimodal datasets with limited labeled data demonstrate that S$^2$Fin performs superior classification, outperforming state-of-the-art methods. The code is available at https://github.com/HaoLiu-XDU/SSFin.

[233] SFANet: Spatial-Frequency Attention Network for Deepfake Detection

Vrushank Ahire,Aniruddh Muley,Shivam Zample,Siddharth Verma,Pranav Menon,Surbhi Madan,Abhinav Dhall

Main category: cs.CV

TL;DR: 提出一种结合Transformer架构和纹理方法的新型集成框架,用于提升深度伪造检测的准确性和鲁棒性,在DFWild-Cup数据集上达到最先进性能。

Details Motivation: 现有方法在跨数据集和生成技术时泛化能力差,难以应对日益增多的深度伪造媒体。 Method: 采用Transformer架构(如Swin Transformer和ViT)与纹理方法相结合的集成框架,引入数据分割、顺序训练、频率分割、基于patch的注意力和人脸分割技术。 Result: 在DFWild-Cup数据集上实现了最先进的检测性能,模型具有更强的泛化能力和对高影响区域(如眼睛和嘴)的敏感性。 Conclusion: 混合模型能有效应对深度伪造检测的多样化挑战,为实际应用提供了鲁棒解决方案。 Abstract: Detecting manipulated media has now become a pressing issue with the recent rise of deepfakes. Most existing approaches fail to generalize across diverse datasets and generation techniques. We thus propose a novel ensemble framework, combining the strengths of transformer-based architectures, such as Swin Transformers and ViTs, and texture-based methods, to achieve better detection accuracy and robustness. Our method introduces innovative data-splitting, sequential training, frequency splitting, patch-based attention, and face segmentation techniques to handle dataset imbalances, enhance high-impact regions (e.g., eyes and mouth), and improve generalization. Our model achieves state-of-the-art performance when tested on the DFWild-Cup dataset, a diverse subset of eight deepfake datasets. The ensemble benefits from the complementarity of these approaches, with transformers excelling in global feature extraction and texturebased methods providing interpretability. This work demonstrates that hybrid models can effectively address the evolving challenges of deepfake detection, offering a robust solution for real-world applications.

[234] Do Superpixel Segmentation Methods Influence Deforestation Image Classification?

Hugo Resende,Fabio A. Faria,Eduardo B. Neto,Isabela Borlido,Victor Sundermann,Silvio Jamil F. Guimarães,Álvaro L. Fazenda

Main category: cs.CV

TL;DR: 本研究探讨了不同超像素分割方法对热带森林砍伐检测任务中分类器训练的影响,发现结合分类器融合方法可显著提升平衡准确率。

Details Motivation: 传统上使用SLIC算法进行图像分割,但研究表明其他方法在遥感图像分割中表现更优,因此需要评估其在砍伐检测中的适用性。 Method: 比较了四种最优分割方法与SLIC在PyCaret AutoML库中前五名分类器上的性能,并采用分类器融合策略提升效果。 Result: 单一分类器下各分割方法性能差异较小,但在引入分类器融合后,平衡准确率显著提高。 Conclusion: 分割方法的选择与机器学习模型的融合对砍伐检测任务具有重要意义,分类器融合能有效提升性能。 Abstract: Image segmentation is a crucial step in various visual applications, including environmental monitoring through remote sensing. In the context of the ForestEyes project, which combines citizen science and machine learning to detect deforestation in tropical forests, image segments are used for labeling by volunteers and subsequent model training. Traditionally, the Simple Linear Iterative Clustering (SLIC) algorithm is adopted as the segmentation method. However, recent studies have indicated that other superpixel-based methods outperform SLIC in remote sensing image segmentation, and might suggest that they are more suitable for the task of detecting deforested areas. In this sense, this study investigated the impact of the four best segmentation methods, together with SLIC, on the training of classifiers for the target application. Initially, the results showed little variation in performance among segmentation methods, even when selecting the top five classifiers using the PyCaret AutoML library. However, by applying a classifier fusion approach (ensemble of classifiers), noticeable improvements in balanced accuracy were observed, highlighting the importance of both the choice of segmentation method and the combination of machine learning-based models for deforestation detection tasks.

[235] EduPersona: Benchmarking Subjective Ability Boundaries of Virtual Student Agents

Buyuan Zhu,Shiyu Hu,Yiping Ma,Yuanming Zhang,Kang Hao Cheong

Main category: cs.CV

TL;DR: 本文提出了EduPersona,首个聚焦课堂中语言模型主观能力的大规模多语言、多学科基准,通过分解为主观连贯性、学生真实性与长期人设一致性三个任务进行评估,并验证了其在提升虚拟学生代理性能方面的有效性。

Details Motivation: 现有大语言模型在教育场景中的主观能力缺乏系统评估,限制了其在教学模拟和教师培训中的可信部署,因此需要一个基于教育理论的评估基准。 Method: 构建了一个涵盖两种语言、三个学科、十种人格类型的课堂对话数据集(1,308轮对话,12,814轮问答),并通过人格风格化扩展至约12.8万轮;提出三阶段评估框架:基础连贯性、学生真实性和长期人设一致性,并在三种主流大模型及其十种微调变体上进行实验。 Result: 实验显示,使用EduPersona微调的模型在三项任务上均有显著提升:TASK1(基础连贯性)+33.6%,TASK2(学生真实性)+30.6%,TASK3(长期人设一致性)+14.9%;揭示了不同人格建模难度的差异性。 Conclusion: EduPersona提供了首个面向课堂主观能力的评估基准,建立了可解耦、可验证的研究范式,将开源数据与框架以推动教育领域可信、类人AI的发展。 Abstract: As large language models are increasingly integrated into education, virtual student agents are becoming vital for classroom simulation and teacher training. Yet their classroom-oriented subjective abilities remain largely unassessed, limiting understanding of model boundaries and hindering trustworthy deployment. We present EduPersona, a large-scale benchmark spanning two languages, three subjects, and ten persona types based on the Big Five theory. The dataset contains 1,308 authentic classroom dialogue rounds, corresponding to 12,814 teacher-student Q&A turns, and is further expanded through persona stylization into roughly 10 times larger scale (128k turns), providing a solid foundation for evaluation. Building on this resource, we decompose hard-to-quantify subjective performance into three progressive tasks: TASK1 basic coherence (whether behavior, emotion, expression, and voice align with classroom context), TASK2 student realism, and TASK3 long-term persona consistency, thereby establishing an evaluation framework grounded in educational theory and research value. We conduct systematic experiments on three representative LLMs, comparing their original versions with ten persona-fine-tuned variants trained on EduPersona. Results show consistent and significant average improvements across all tasks: TASK1 +33.6%, TASK2 +30.6%, and TASK3 +14.9%. These improvements highlight the dataset's effectiveness and research value, while also revealing the heterogeneous difficulty of persona modeling. In summary, EduPersona delivers the first classroom benchmark centered on subjective abilities, establishes a decoupled and verifiable research paradigm, and we will open-source both the dataset and the framework to support the broader research community in advancing trustworthy and human-like AI for education.

[236] MoME: Estimating Psychological Traits from Gait with Multi-Stage Mixture of Movement Experts

Andy Cǎtrunǎ,Adrian Cosma,Emilian Rǎdoi

Main category: cs.CV

TL;DR: 提出了一种基于分层多阶段运动专家混合架构(MoME),用于从2D姿态表示的步态序列中预测心理特征,在PsyMo基准上优于现有方法。

Details Motivation: 利用行走方式推断心理特质是一个具有挑战性且研究不足的问题,现有方法难以有效捕捉步态中的多层次行为信息。 Method: 设计了多阶段混合运动专家模型(MoME),将步行周期分为四个运动复杂度阶段,使用轻量级专家模型提取时空特征,并通过任务特定的门控模块自适应加权不同心理特征和阶段的专家。 Result: 在PsyMo基准(涵盖17种心理特征)上,模型在运行级别和受试者级别分别达到37.47%和44.6%的加权F1分数,优于当前最先进的步态分析模型;引入身份识别、性别预测和BMI估计等辅助任务可进一步提升性能。 Conclusion: 验证了基于多任务学习的步态分析在心理特质预测中的可行性,为基于运动的行为心理推断研究提供了新基础。 Abstract: Gait encodes rich biometric and behavioural information, yet leveraging the manner of walking to infer psychological traits remains a challenging and underexplored problem. We introduce a hierarchical Multi-Stage Mixture of Movement Experts (MoME) architecture for multi-task prediction of psychological attributes from gait sequences represented as 2D poses. MoME processes the walking cycle in four stages of movement complexity, employing lightweight expert models to extract spatio-temporal features and task-specific gating modules to adaptively weight experts across traits and stages. Evaluated on the PsyMo benchmark covering 17 psychological traits, our method outperforms state-of-the-art gait analysis models, achieving a 37.47% weighted F1 score at the run level and 44.6% at the subject level. Our experiments show that integrating auxiliary tasks such as identity recognition, gender prediction, and BMI estimation further improves psychological trait estimation. Our findings demonstrate the viability of multi-task gait-based learning for psychological trait estimation and provide a foundation for future research on movement-informed psychological inference.

[237] ConceptSplit: Decoupled Multi-Concept Personalization of Diffusion Models via Token-wise Adaptation and Attention Disentanglement

Habin Lim,Yeongseob Won,Juwon Seo,Gyeong-Moon Park

Main category: cs.CV

TL;DR: 本文提出了一种名为ConceptSplit的新框架,用于解决文本到图像扩散模型中多概念个性化生成时的概念混合问题。

Details Motivation: 多概念个性化在文本到图像生成中受到关注,但存在概念混合的问题,即多个学习到的概念在输出图像中发生干扰或融合。 Method: 提出了两个关键组件:Token-wise Value Adaptation (ToVA),一种无需合并的训练方法,仅调整交叉注意力中的值投影;以及Latent Optimization for Disentangled Attention (LODA),通过优化输入潜在变量来缓解推理过程中的注意力纠缠。 Result: 通过大量定性和定量实验,验证了ConceptSplit能够有效实现鲁棒的多概念个性化,减少意外的概念干扰。 Conclusion: ConceptSplit通过ToVA和LODA有效解决了多概念生成中的概念混合问题,在多概念个性化任务中表现出色。 Abstract: In recent years, multi-concept personalization for text-to-image (T2I) diffusion models to represent several subjects in an image has gained much more attention. The main challenge of this task is "concept mixing", where multiple learned concepts interfere or blend undesirably in the output image. To address this issue, in this paper, we present ConceptSplit, a novel framework to split the individual concepts through training and inference. Our framework comprises two key components. First, we introduce Token-wise Value Adaptation (ToVA), a merging-free training method that focuses exclusively on adapting the value projection in cross-attention. Based on our empirical analysis, we found that modifying the key projection, a common approach in existing methods, can disrupt the attention mechanism and lead to concept mixing. Second, we propose Latent Optimization for Disentangled Attention (LODA), which alleviates attention entanglement during inference by optimizing the input latent. Through extensive qualitative and quantitative experiments, we demonstrate that ConceptSplit achieves robust multi-concept personalization, mitigating unintended concept interference. Code is available at https://github.com/KU-VGI/ConceptSplit

[238] Label-Efficient Cross-Modality Generalization for Liver Segmentation in Multi-Phase MRI

Quang-Khai Bui-Tran,Minh-Toan Dinh,Thanh-Huy Nguyen,Ba-Thinh Lam,Mai-Anh Vu,Ulas Bagci

Main category: cs.CV

TL;DR: 提出一种标签高效、跨模态泛化的肝脏分割方法,适用于多相位、多厂商MRI,在标签稀缺且分布不均的临床场景下表现鲁棒。

Details Motivation: 由于标注数据稀缺且在不同成像模态和厂商系统间分布不均,现有方法难以在真实临床场景中实现准确的多相位MRI肝脏分割。 Method: 结合基础规模的3D分割主干网络微调、跨伪监督协同训练以利用未标注非对比序列,并设计标准化预处理流程,无需空间配准即可实现跨相位和跨厂商的泛化。 Result: 该方法在有标签和无标签域均表现出稳健的分割性能,尤其在GED4肝胆期标注有限的情况下仍保持高准确性。 Conclusion: 所提标签高效方法通过融合基础模型微调与协同训练,有效提升了多相位、多厂商MRI肝脏分割的实用性,展现了其在真实临床影像任务中的潜力。 Abstract: Accurate liver segmentation in multi-phase MRI is vital for liver fibrosis assessment, yet labeled data is often scarce and unevenly distributed across imaging modalities and vendor systems. We propose a label-efficient segmentation approach that promotes cross-modality generalization under real-world conditions, where GED4 hepatobiliary-phase annotations are limited, non-contrast sequences (T1WI, T2WI, DWI) are unlabeled, and spatial misalignment and missing phases are common. Our method integrates a foundation-scale 3D segmentation backbone adapted via fine-tuning, co-training with cross pseudo supervision to leverage unlabeled volumes, and a standardized preprocessing pipeline. Without requiring spatial registration, the model learns to generalize across MRI phases and vendors, demonstrating robust segmentation performance in both labeled and unlabeled domains. Our results exhibit the effectiveness of our proposed label-efficient baseline for liver segmentation in multi-phase, multi-vendor MRI and highlight the potential of combining foundation model adaptation with co-training for real-world clinical imaging tasks.

[239] ID-Consistent, Precise Expression Generation with Blendshape-Guided Diffusion

Foivos Paraperas Papantoniou,Stefanos Zafeiriou

Main category: cs.CV

TL;DR: 本文提出了一种基于扩散模型的框架,能够在保持人脸身份一致的同时,实现对细微表情和动态过渡的精确控制,超越了以往方法的表现。

Details Motivation: 现有的生成模型在保持面部身份一致性方面已有进展,但在不牺牲身份的前提下实现细粒度表情控制仍具挑战。 Method: 基于ID一致的人脸基础模型,引入由FLAME混合形状参数引导的表情交叉注意力模块,并设计可插拔的参考适配器以实现真实图像中的表情编辑。 Result: 在多样化的图像和视频数据上训练后,模型能泛化到微表情和表情过渡,且在定量和定性评估中优于现有方法。 Conclusion: 该框架实现了高保真的身份保持与精细的表情控制,显著提升了AI驱动叙事中人物表情生成的能力。 Abstract: Human-centric generative models designed for AI-driven storytelling must bring together two core capabilities: identity consistency and precise control over human performance. While recent diffusion-based approaches have made significant progress in maintaining facial identity, achieving fine-grained expression control without compromising identity remains challenging. In this work, we present a diffusion-based framework that faithfully reimagines any subject under any particular facial expression. Building on an ID-consistent face foundation model, we adopt a compositional design featuring an expression cross-attention module guided by FLAME blendshape parameters for explicit control. Trained on a diverse mixture of image and video data rich in expressive variation, our adapter generalizes beyond basic emotions to subtle micro-expressions and expressive transitions, overlooked by prior works. In addition, a pluggable Reference Adapter enables expression editing in real images by transferring the appearance from a reference frame during synthesis. Extensive quantitative and qualitative evaluations show that our model outperforms existing methods in tailored and identity-consistent expression generation. Code and models can be found at https://github.com/foivospar/Arc2Face.

[240] ReactDiff: Fundamental Multiple Appropriate Facial Reaction Diffusion Model

Luo Cheng,Song Siyang,Yan Siyuan,Yu Zhen,Ge Zongyuan

Main category: cs.CV

TL;DR: 提出了一种名为ReactDiff的时序扩散框架,用于生成多样化且符合对话上下文的逼真面部反应。

Details Motivation: 现有方法难以建模真实人类反应中的随机性和动态性,导致生成的面部反应缺乏多样性和自然性。 Method: 引入一种基于时序扩散模型的方法,结合时间上的面部行为动力学和面部动作单元依赖关系两种先验知识,以保证生成反应的时间连贯性和解剖合理性。 Result: 在REACT2024数据集上的实验表明,该方法在反应质量、多样性及与对话上下文的匹配度方面均优于现有方法。 Conclusion: ReactDiff能有效生成自然、多样且符合语境的面部反应,显著提升了人机交互中虚拟角色的表现力。 Abstract: The automatic generation of diverse and human-like facial reactions in dyadic dialogue remains a critical challenge for human-computer interaction systems. Existing methods fail to model the stochasticity and dynamics inherent in real human reactions. To address this, we propose ReactDiff, a novel temporal diffusion framework for generating diverse facial reactions that are appropriate for responding to any given dialogue context. Our key insight is that plausible human reactions demonstrate smoothness, and coherence over time, and conform to constraints imposed by human facial anatomy. To achieve this, ReactDiff incorporates two vital priors (spatio-temporal facial kinematics) into the diffusion process: i) temporal facial behavioral kinematics and ii) facial action unit dependencies. These two constraints guide the model toward realistic human reaction manifolds, avoiding visually unrealistic jitters, unstable transitions, unnatural expressions, and other artifacts. Extensive experiments on the REACT2024 dataset demonstrate that our approach not only achieves state-of-the-art reaction quality but also excels in diversity and reaction appropriateness.

[241] Object-Centric Representation Learning for Enhanced 3D Scene Graph Prediction

KunHo Heo,GiHyun Kim,SuYeon Kim,MyeongAh Cho

Main category: cs.CV

TL;DR: 本文提出了一种新的3D语义场景图预测方法,通过设计高判别性物体特征编码器和对比预训练策略,显著提升了物体和关系预测的准确性。

Details Motivation: 现有方法在物体和关系特征表示能力上不足,过度依赖图神经网络且判别能力有限,同时未能充分利用几何与语义信息的融合。 Method: 设计了一个高判别性的物体特征编码器,并采用对比预训练策略,将物体表征学习与场景图预测解耦;结合几何和语义特征进行关系预测。 Result: 在3DSSG数据集上实验表明,该方法在所有评估指标上均显著优于现有最先进方法,且能有效提升现有框架的性能。 Conclusion: 高质量的物体特征对场景图预测至关重要,所提出的解耦预训练和多模态特征融合策略为3D语义场景图构建提供了新思路。 Abstract: 3D Semantic Scene Graph Prediction aims to detect objects and their semantic relationships in 3D scenes, and has emerged as a crucial technology for robotics and AR/VR applications. While previous research has addressed dataset limitations and explored various approaches including Open-Vocabulary settings, they frequently fail to optimize the representational capacity of object and relationship features, showing excessive reliance on Graph Neural Networks despite insufficient discriminative capability. In this work, we demonstrate through extensive analysis that the quality of object features plays a critical role in determining overall scene graph accuracy. To address this challenge, we design a highly discriminative object feature encoder and employ a contrastive pretraining strategy that decouples object representation learning from the scene graph prediction. This design not only enhances object classification accuracy but also yields direct improvements in relationship prediction. Notably, when plugging in our pretrained encoder into existing frameworks, we observe substantial performance improvements across all evaluation metrics. Additionally, whereas existing approaches have not fully exploited the integration of relationship information, we effectively combine both geometric and semantic features to achieve superior relationship prediction. Comprehensive experiments on the 3DSSG dataset demonstrate that our approach significantly outperforms previous state-of-the-art methods. Our code is publicly available at https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes.

[242] Benchmark on Monocular Metric Depth Estimation in Wildlife Setting

Niccolò Niccoli,Lorenzo Seidenari,Ilaria Greco,Francesco Rovero

Main category: cs.CV

TL;DR: 本文提出了首个用于野生动物监测条件下单目度量深度估计的基准,评估了四种最先进的单目深度估计方法,并使用带有真实距离的93个相机陷阱图像进行测试。结果显示Depth Anything V2在精度和速度之间达到了最佳平衡。

Details Motivation: 由于缺乏深度信息,从单目图像中提取准确的距离测量在野生动物监测中仍具挑战性,且现有单目深度估计方法在自然野生动物环境中的表现尚未系统评估。 Method: 引入了一个新的基准数据集,包含93个带有校准ChARUCO标记获取的真实距离的相机陷阱图像,评估了Depth Anything V2、ML Depth Pro、ZoeDepth和Metric3D四种先进方法及一种几何基线方法,比较了不同深度提取策略(中位数 vs 均值)和计算效率。 Result: Depth Anything V2表现最好,平均绝对误差为0.454米,相关系数为0.962;ZoeDepth在户外自然环境中性能显著下降(MAE: 3.087米);基于中位数的深度提取在所有深度学习方法中均优于基于均值的方法;ZoeDepth最快(0.17秒/图像),而Depth Anything V2在0.22秒内实现了精度与速度的最佳平衡。 Conclusion: 该研究建立了适用于野生动物监测的单目深度估计性能基线,验证了当前方法在真实野外场景中的有效性,并为保护监测系统的实际部署提供了指导。 Abstract: Camera traps are widely used for wildlife monitoring, but extracting accurate distance measurements from monocular images remains challenging due to the lack of depth information. While monocular depth estimation (MDE) methods have advanced significantly, their performance in natural wildlife environments has not been systematically evaluated. This work introduces the first benchmark for monocular metric depth estimation in wildlife monitoring conditions. We evaluate four state-of-the-art MDE methods (Depth Anything V2, ML Depth Pro, ZoeDepth, and Metric3D) alongside a geometric baseline on 93 camera trap images with ground truth distances obtained using calibrated ChARUCO patterns. Our results demonstrate that Depth Anything V2 achieves the best overall performance with a mean absolute error of 0.454m and correlation of 0.962, while methods like ZoeDepth show significant degradation in outdoor natural environments (MAE: 3.087m). We find that median-based depth extraction consistently outperforms mean-based approaches across all deep learning methods. Additionally, we analyze computational efficiency, with ZoeDepth being fastest (0.17s per image) but least accurate, while Depth Anything V2 provides an optimal balance of accuracy and speed (0.22s per image). This benchmark establishes performance baselines for wildlife applications and provides practical guidance for implementing depth estimation in conservation monitoring systems.

[243] ExposureEngine: Oriented Logo Detection and Sponsor Visibility Analytics in Sports Broadcasts

Mehdi Houshmand Sarkhoosh,Frøy Øye,Henrik Nestor Sørlie,Nam Hoang Vu,Dag Johansen,Cise Midoglu,Tomas Kupka,Pål Halvorsen

Main category: cs.CV

TL;DR: 本文提出了一种名为ExposureEngine的端到端系统,用于在体育转播中实现精确、支持旋转检测的赞助商可见性分析。该系统采用有向边界框(OBB)来精确定位任意方向的赞助商标志,并构建了一个包含1103帧和670个唯一标志的新数据集进行训练与评估。模型在mAP@0.5上达到0.859,精度为0.96,召回率为0.87。系统还集成了语言驱动的代理层,支持通过自然语言查询生成报告和媒体内容,提供可审计且可解释的赞助商曝光度量化解决方案。

Details Motivation: 传统赞助商可见性分析依赖人工、主观且难以扩展的方法,而现有自动化系统使用轴对齐边界框(HBB),在标志旋转或透视变形时测量不准确。因此需要一种能应对复杂视角变化的精确自动化方案。 Method: 提出ExposureEngine系统,采用Oriented Bounding Box(OBB)检测旋转或倾斜的赞助商标志;构建包含瑞典顶级足球比赛视频帧的新数据集,标注OBB;训练深度学习检测模型,并集成至分析管道以计算曝光时长和屏幕覆盖率等指标;引入语言驱动的代理层,支持自然语言交互生成报告与内容。 Result: 模型在自建数据集上达到mAP@0.5为0.859,精度0.96,召回率0.87;能够准确检测不同方向和透视变换下的赞助商标志;系统成功实现精确的曝光度量化,并支持通过自然语言查询生成可视化报告和媒体内容。 Conclusion: ExposureEngine为体育转播中的赞助商可见性分析提供了高效、准确且可解释的自动化解决方案,其使用OBB显著提升了复杂场景下的检测精度,结合语言代理增强了用户交互能力,具有实际应用价值和推广潜力。 Abstract: Quantifying sponsor visibility in sports broadcasts is a critical marketing task traditionally hindered by manual, subjective, and unscalable analysis methods. While automated systems offer an alternative, their reliance on axis-aligned Horizontal Bounding Box (HBB) leads to inaccurate exposuremetrics when logos appear rotated or skewed due to dynamic camera angles and perspective distortions. This paper introduces ExposureEngine, an end-to-end system designed for accurate, rotation-aware sponsor visibility analytics in sports broadcasts, demonstrated in a soccer case study. Our approach predicts Oriented Bounding Box (OBB) to provide a geometrically precise fit to each logo regardless of the orientation on-screen. To train and evaluate our detector, we developed a new dataset comprising 1,103 frames from Swedish elite soccer, featuring 670 unique sponsor logos annotated with OBBs. Our model achieves a mean Average Precision (mAP@0.5) of 0.859, with a precision of 0.96 and recall of 0.87, demonstrating robust performance in localizing logos under diverse broadcast conditions. The system integrates these detections into an analytical pipeline that calculates precise visibility metrics, such as exposure duration and on-screen coverage. Furthermore, we incorporate a language-driven agentic layer, enabling users to generate reports, summaries, and media content through natural language queries. The complete system, including the dataset and the analytics dashboard, provides a comprehensive solution for auditable and interpretable sponsor measurement in sports media. An overview of the ExposureEngine is available online: https://youtu.be/tRw6OBISuW4 .

[244] Anomaly-Aware YOLO: A Frugal yet Robust Approach to Infrared Small Target Detection

Alina Ciocarlan,Sylvie Le Hégarat-Mascle,Sidonie Lefebvre

Main category: cs.CV

TL;DR: 提出了一种新的红外小目标检测方法AA-YOLO,通过在检测头中引入统计异常检测机制,将小目标视为背景中的异常模式,有效控制误报率。

Details Motivation: 传统目标检测器在复杂背景和小目标情况下容易产生大量误报,因此需要一种更鲁棒的方法来提升红外小目标检测性能。 Method: 将统计异常检测测试集成到YOLO的检测头中,利用小目标作为背景异常的特性,增强对小目标的敏感性并抑制背景干扰。 Result: AA-YOLO在多个IRSTD基准上表现出竞争性性能,具备良好的鲁棒性,适用于数据有限、噪声和域偏移场景,并可兼容多种YOLO主干网络和实例分割模型。 Conclusion: AA-YOLO是一种通用且高效的红外小目标检测框架,修改仅限于检测头,便于部署于资源受限的实际应用场景。 Abstract: Infrared Small Target Detection (IRSTD) is a challenging task in defense applications, where complex backgrounds and tiny target sizes often result in numerous false alarms using conventional object detectors. To overcome this limitation, we propose Anomaly-Aware YOLO (AA-YOLO), which integrates a statistical anomaly detection test into its detection head. By treating small targets as unexpected patterns against the background, AA-YOLO effectively controls the false alarm rate. Our approach not only achieves competitive performance on several IRSTD benchmarks, but also demonstrates remarkable robustness in scenarios with limited training data, noise, and domain shifts. Furthermore, since only the detection head is modified, our design is highly generic and has been successfully applied across various YOLO backbones, including lightweight models. It also provides promising results when integrated into an instance segmentation YOLO. This versatility makes AA-YOLO an attractive solution for real-world deployments where resources are constrained. The code will be publicly released.

[245] Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics

Masoumeh Chapariniya,Teodora Vukovic,Sarah Ebling,Volker Dellwo

Main category: cs.CV

TL;DR: 本文研究了基于Transformer架构在自然面对面对话场景中进行人物识别的性能,提出了一种双流框架来分别建模空间配置和时间运动模式,并引入多尺度时间Transformer进行分层运动建模。实验结果表明,领域特定训练显著优于迁移学习,空间特征比时间动态更具判别力,特征级融合可将准确率提升至98.03%。

Details Motivation: 在自然面对面对话场景中,传统方法对人物识别的效果有限,需要探索更有效的模型架构来充分利用姿态和运动信息。 Method: 采用双流Transformer框架,分别处理从CANDOR语料库中提取的133个COCO WholeBody关键点的空间配置和时间运动模式;比较预训练与从头训练,研究速度特征的使用,并提出多尺度时间Transformer进行分层建模。 Result: 空间Transformer达到95.74%准确率,多尺度时间Transformer达到93.90%,特征级融合使性能提升至98.03%;领域特定训练显著优于迁移学习,空间信息比时间动态更具判别性。 Conclusion: Transformer架构在自然交互中的人物识别具有潜力,空间与时间信息互补,特征融合可进一步提升性能,为未来多模态和跨文化研究提供了启示。 Abstract: This paper investigates the performance of transformer-based architectures for person identification in natural, face-to-face conversation scenario. We implement and evaluate a two-stream framework that separately models spatial configurations and temporal motion patterns of 133 COCO WholeBody keypoints, extracted from a subset of the CANDOR conversational corpus. Our experiments compare pre-trained and from-scratch training, investigate the use of velocity features, and introduce a multi-scale temporal transformer for hierarchical motion modeling. Results demonstrate that domain-specific training significantly outperforms transfer learning, and that spatial configurations carry more discriminative information than temporal dynamics. The spatial transformer achieves 95.74% accuracy, while the multi-scale temporal transformer achieves 93.90%. Feature-level fusion pushes performance to 98.03%, confirming that postural and dynamic information are complementary. These findings highlight the potential of transformer architectures for person identification in natural interactions and provide insights for future multimodal and cross-cultural studies.

[246] Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction

Chi Yan,Dan Xu

Main category: cs.CV

TL;DR: 本文提出了PG-Occ,一种基于渐进式高斯变换器的开放词汇3D占据预测框架,通过渐进式在线稠密化和各向异性感知采样策略,在提升细节表达能力的同时兼顾计算效率,显著优于现有方法。

Details Motivation: 传统方法受限于固定语义类别,且现有文本对齐的3D场景建模在稀疏与稠密表示之间存在权衡:稀疏表示难以捕捉小物体,稠密表示计算开销大。因此需要一种既能支持开放词汇查询又能高效建模细节的方法。 Method: 提出PG-Occ框架,采用渐进式在线稠密化策略,逐步增强3D高斯表示以捕获细粒度场景细节;引入各向异性感知的采样策略并结合时空融合,自适应地为不同尺度和阶段的高斯分配感受野,实现更有效的特征聚合。 Result: 在多个实验中验证了PG-Occ的优越性,相比此前最佳方法实现了14.3%的mIoU相对提升,达到当前最先进的性能。 Conclusion: PG-Occ有效平衡了开放词汇3D占据预测中的表达能力与计算效率,通过渐进式增强和自适应采样策略显著提升了场景理解精度,尤其在小物体检测和细节还原方面表现突出。 Abstract: The 3D occupancy prediction task has witnessed remarkable progress in recent years, playing a crucial role in vision-based autonomous driving systems. While traditional methods are limited to fixed semantic categories, recent approaches have moved towards predicting text-aligned features to enable open-vocabulary text queries in real-world scenes. However, there exists a trade-off in text-aligned scene modeling: sparse Gaussian representation struggles to capture small objects in the scene, while dense representation incurs significant computational overhead. To address these limitations, we present PG-Occ, an innovative Progressive Gaussian Transformer Framework that enables open-vocabulary 3D occupancy prediction. Our framework employs progressive online densification, a feed-forward strategy that gradually enhances the 3D Gaussian representation to capture fine-grained scene details. By iteratively enhancing the representation, the framework achieves increasingly precise and detailed scene understanding. Another key contribution is the introduction of an anisotropy-aware sampling strategy with spatio-temporal fusion, which adaptively assigns receptive fields to Gaussians at different scales and stages, enabling more effective feature aggregation and richer scene information capture. Through extensive evaluations, we demonstrate that PG-Occ achieves state-of-the-art performance with a relative 14.3% mIoU improvement over the previous best performing method. Code and pretrained models will be released upon publication on our project page: https://yanchi-3dv.github.io/PG-Occ

[247] Beyond the Seen: Bounded Distribution Estimation for Open-Vocabulary Learning

Xiaomeng Fan,Yuchuan Mao,Zhi Gao,Yuwei Wu,Jin Chen,Yunde Jia

Main category: cs.CV

TL;DR: 提出一种新的开放词汇学习方法,通过生成未知类数据来估计开放环境中的数据分布,从而有效降低估计误差,在11个数据集上显著优于基线方法。

Details Motivation: 现有方法仅使用已知类数据估计开放环境中的数据分布,由于缺乏未知类,导致估计误差无法识别,限制了开放词汇学习的性能。 Method: 提出一个类-域感知的数据生成流程和分布对齐算法,利用分层语义树和从已知类推断的域信息生成未知类数据,并通过最大化后验概率进行分布估计与对齐。 Result: 在11个数据集上实验表明,该方法相比基线最高提升14%,有效降低了分布估计误差,提升了开放词汇学习的泛化能力。 Conclusion: 通过生成未知类数据可有效估计开放环境中的数据分布,理论和实验证明了该方法在降低估计误差和提升模型泛化方面的优越性。 Abstract: Open-vocabulary learning requires modeling the data distribution in open environments, which consists of both seen-class and unseen-class data. Existing methods estimate the distribution in open environments using seen-class data, where the absence of unseen classes makes the estimation error inherently unidentifiable. Intuitively, learning beyond the seen classes is crucial for distribution estimation to bound the estimation error. We theoretically demonstrate that the distribution can be effectively estimated by generating unseen-class data, through which the estimation error is upper-bounded. Building on this theoretical insight, we propose a novel open-vocabulary learning method, which generates unseen-class data for estimating the distribution in open environments. The method consists of a class-domain-wise data generation pipeline and a distribution alignment algorithm. The data generation pipeline generates unseen-class data under the guidance of a hierarchical semantic tree and domain information inferred from the seen-class data, facilitating accurate distribution estimation. With the generated data, the distribution alignment algorithm estimates and maximizes the posterior probability to enhance generalization in open-vocabulary learning. Extensive experiments on $11$ datasets demonstrate that our method outperforms baseline approaches by up to $14\%$, highlighting its effectiveness and superiority.

[248] Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge

Max Kirchner,Hanna Hoffmann,Alexander C. Jenke,Oliver L. Saldanha,Kevin Pfeiffer,Weam Kanjo,Julia Alekseenko,Claas de Boer,Santhi Raj Kolamuri,Lorenzo Mazza,Nicolas Padoy,Sophia Bano,Annika Reinke,Lena Maier-Hein,Danail Stoyanov,Jakob N. Kather,Fiona R. Kolbinger,Sebastian Bodenstedt,Stefanie Speidel

Main category: cs.CV

TL;DR: The FedSurg challenge benchmarks federated learning (FL) for surgical video classification, focusing on generalization to unseen clinical centers and local adaptation via fine-tuning without sharing patient data.

Details Motivation: To evaluate the performance and robustness of current FL methods in multi-center surgical video analysis, particularly in classifying inflammation stages in appendicitis while preserving data privacy. Method: Participants used methods including foundation models with linear probing, metric learning with triplet loss, and FL aggregation techniques (FedAvg, FedMedian, FedSAM) on the Appendix300 dataset. Two tasks were evaluated: generalization to unseen centers and center-specific adaptation after fine-tuning, with performance measured by F1-score and Expected Cost, and rankings validated via bootstrapping and statistical testing. Result: Generalization to unseen centers was limited. All teams improved after fine-tuning, but ranking stability was low. The ViViT-based model achieved the best overall performance. Key challenges included class imbalance, hyperparameter tuning in decentralized settings, and trade-offs between personalization and robustness. Spatiotemporal modeling and context-aware preprocessing showed promise. Conclusion: The FedSurg Challenge provides the first benchmark for FL in surgical video classification, highlighting the need for better architectural design, preprocessing, and loss functions. It emphasizes the trade-off between local adaptation and global generalization, guiding future development of robust, imbalance-aware FL methods in clinical AI. Abstract: Purpose: The FedSurg challenge was designed to benchmark the state of the art in federated learning for surgical video classification. Its goal was to assess how well current methods generalize to unseen clinical centers and adapt through local fine-tuning while enabling collaborative model development without sharing patient data. Methods: Participants developed strategies to classify inflammation stages in appendicitis using a preliminary version of the multi-center Appendix300 video dataset. The challenge evaluated two tasks: generalization to an unseen center and center-specific adaptation after fine-tuning. Submitted approaches included foundation models with linear probing, metric learning with triplet loss, and various FL aggregation schemes (FedAvg, FedMedian, FedSAM). Performance was assessed using F1-score and Expected Cost, with ranking robustness evaluated via bootstrapping and statistical testing. Results: In the generalization task, performance across centers was limited. In the adaptation task, all teams improved after fine-tuning, though ranking stability was low. The ViViT-based submission achieved the strongest overall performance. The challenge highlighted limitations in generalization, sensitivity to class imbalance, and difficulties in hyperparameter tuning in decentralized training, while spatiotemporal modeling and context-aware preprocessing emerged as promising strategies. Conclusion: The FedSurg Challenge establishes the first benchmark for evaluating FL strategies in surgical video classification. Findings highlight the trade-off between local personalization and global robustness, and underscore the importance of architecture choice, preprocessing, and loss design. This benchmarking offers a reference point for future development of imbalance-aware, adaptive, and robust FL methods in clinical surgical AI.

[249] Hands-Free Heritage: Automated 3D Scanning for Cultural Heritage Digitization

Javed Ahmad,Federico Dassiè,Selene Frascella,Gabriele Marchello,Ferdinando Cannella,Arianna Traviglia

Main category: cs.CV

TL;DR: 提出一种自动化的双机器人3D扫描系统,用于高保真文化遗产数字化,显著提升几何精度和效率。

Details Motivation: 传统3D扫描方法依赖人工操作和专业知识,难以保证扫描质量和覆盖完整性。 Method: 采用双机器人协同策略,一个搭载扫描仪,另一个控制承载文物的托盘;通过区域参数化、优化轨迹规划和路径点分布实现全面覆盖。 Result: 实验结果显示,该方法在Chamfer Distance和F-score上优于基线方法,具有更高的重建精度和更少的遮挡。 Conclusion: 所提出的自动化双机器人系统能有效提升文化遗產3D扫描的精度与效率,减少对专家操作员的依赖。 Abstract: High-fidelity 3D scanning is essential for preserving cultural heritage artefacts, supporting documentation, analysis, and long-term conservation. However, conventional methods typically require specialized expertise and manual intervention to maintain optimal scanning conditions and coverage. We present an automated two-robot scanning system that eliminates the need for handheld or semi-automatic workflows by combining coordinated robotic manipulation with high-resolution 3D scanning. Our system parameterizes the scanning space into distinct regions, enabling coordinated motion planning between a scanner-equipped robot and a tray-handling robot. Optimized trajectory planning and waypoint distribution ensure comprehensive surface coverage, minimize occlusions, and balance reconstruction accuracy with system efficiency. Experimental results show that our approach achieves significantly lower Chamfer Distance and higher F-score compared to baseline methods, offering superior geometric accuracy, improved digitization efficiency, and reduced reliance on expert operators.

[250] A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation

Alon Kaya,Igal Bilik,Inna Stainvas

Main category: cs.CV

TL;DR: 本文系统比较了大规模CNN和ViT在图像几何估计任务中的表现,发现在小样本场景下CNN因归纳偏置和较小容量表现更优,而ViT在大数据和跨域场景下更具优势。

Details Motivation: 探讨ViT和大规模CNN作为几何估计任务骨干网络在低数据环境下的有效性。 Method: 对比ResNet、EfficientNet、CLIP-ResNet等CNN与CLIP-ViT、DINO等ViT模型在不同数据量下的性能,特别是在少样本设置中。 Result: 在大数据场景中ViT优于CNN,但在小数据场景中CNN可达到与ViT相当的性能,且ViT在跨域评估中泛化能力更强。 Conclusion: 应根据数据规模谨慎选择模型架构,未来可研究平衡局部与全局表示的混合架构。 Abstract: Vision-transformers (ViTs) and large-scale convolution-neural-networks (CNNs) have reshaped computer vision through pretrained feature representations that enable strong transfer learning for diverse tasks. However, their efficiency as backbone architectures for geometric estimation tasks involving image deformations in low-data regimes remains an open question. This work considers two such tasks: 1) estimating 2D rigid transformations between pairs of images and 2) predicting the fundamental matrix for stereo image pairs, an important problem in various applications, such as autonomous mobility, robotics, and 3D scene reconstruction. Addressing this intriguing question, this work systematically compares large-scale CNNs (ResNet, EfficientNet, CLIP-ResNet) with ViT-based foundation models (CLIP-ViT variants and DINO) in various data size settings, including few-shot scenarios. These pretrained models are optimized for classification or contrastive learning, encouraging them to focus mostly on high-level semantics. The considered tasks require balancing local and global features differently, challenging the straightforward adoption of these models as the backbone. Empirical comparative analysis shows that, similar to training from scratch, ViTs outperform CNNs during refinement in large downstream-data scenarios. However, in small data scenarios, the inductive bias and smaller capacity of CNNs improve their performance, allowing them to match that of a ViT. Moreover, ViTs exhibit stronger generalization in cross-domain evaluation where the data distribution changes. These results emphasize the importance of carefully selecting model architectures for refinement, motivating future research towards hybrid architectures that balance local and global representations.

[251] DiT-VTON: Diffusion Transformer Framework for Unified Multi-Category Virtual Try-On and Virtual Try-All with Integrated Image Editing

Qi Li,Shuwen Qiu,Julien Han,Xingzi Xu,Mehmet Saygin Seyfioglu,Kee Kiat Koo,Karim Bouyarmane

Main category: cs.CV

TL;DR: 本文提出了一种基于Diffusion Transformer(DiT)的新型虚拟试穿框架DiT-VTON,通过多种配置探索和大规模多样化数据训练,实现了在细节保留、鲁棒性和跨类别泛化方面的显著提升,并扩展为支持多种产品类型和图像编辑功能的通用虚拟试用(VTA)系统。

Details Motivation: 现有虚拟试穿(VTO)模型在细粒度细节保持、对真实场景图像的鲁棒性、采样效率、图像编辑能力和跨商品类别的泛化方面存在不足,亟需更强大且通用的解决方案。 Method: 采用Diffusion Transformer(DiT)架构,探索了上下文token拼接、通道拼接和ControlNet集成等多种图像条件输入方式;在包含多样背景、非结构化参考和非服装类别的扩展数据集上进行训练,以增强模型鲁棒性和泛化能力。 Result: 在VITON-HD数据集上超越了当前最先进的方法,在无需额外条件编码器的情况下实现了更好的细节保留和鲁棒性;在涵盖数千个商品类别的多样化数据集上,也优于具备虚拟试用(VTA)和图像编辑能力的其他模型。 Conclusion: DiT-VTON不仅提升了虚拟试穿的性能,还将其扩展为多功能的虚拟试用(VTA)平台,展示了扩散Transformer在图像条件生成任务中的巨大潜力,推动了电商中个性化视觉体验的发展。 Abstract: The rapid growth of e-commerce has intensified the demand for Virtual Try-On (VTO) technologies, enabling customers to realistically visualize products overlaid on their own images. Despite recent advances, existing VTO models face challenges with fine-grained detail preservation, robustness to real-world imagery, efficient sampling, image editing capabilities, and generalization across diverse product categories. In this paper, we present DiT-VTON, a novel VTO framework that leverages a Diffusion Transformer (DiT), renowned for its performance on text-conditioned image generation, adapted here for the image-conditioned VTO task. We systematically explore multiple DiT configurations, including in-context token concatenation, channel concatenation, and ControlNet integration, to determine the best setup for VTO image conditioning. To enhance robustness, we train the model on an expanded dataset encompassing varied backgrounds, unstructured references, and non-garment categories, demonstrating the benefits of data scaling for VTO adaptability. DiT-VTON also redefines the VTO task beyond garment try-on, offering a versatile Virtual Try-All (VTA) solution capable of handling a wide range of product categories and supporting advanced image editing functionalities such as pose preservation, localized editing, texture transfer, and object-level customization. Experimental results show that our model surpasses state-of-the-art methods on VITON-HD, achieving superior detail preservation and robustness without reliance on additional condition encoders. It also outperforms models with VTA and image editing capabilities on a diverse dataset spanning thousands of product categories.

[252] Did you just see that? Arbitrary view synthesis for egocentric replay of operating room workflows from ambient sensors

Han Zhang,Lalithkumar Seenivasan,Jose L. Porras,Roger D. Soberanis-Mukul,Hao Ding,Hongchao Shu,Benjamin D. Killeen,Ankita Ghosh,Lonny Yarmus,Masaru Ishii,Angela Christine Argento,Mathias Unberath

Main category: cs.CV

TL;DR: EgoSurg 是首个从手术室固定摄像头视频中重建动态、自我中心视角回放的框架,无需干预临床流程,为手术数据科学提供了新基础。

Details Motivation: 传统手术观察依赖固定视角或记忆,无法记录指导临床决策的自我中心视觉视角,限制了对手术安全、培训和流程优化的理解。 Method: EgoSurg 结合几何驱动的神经渲染与基于扩散的视角增强技术,直接从壁挂式固定摄像头视频中合成任意时刻的高保真自我中心视角。 Result: 在多中心手术案例和对照研究中,EgoSurg 能以高质量和高保真度重建个体视觉场和任意视角。 Conclusion: EgoSurg 将现有手术室摄像基础设施转化为可导航的动态3D记录,实现了从任意角度可视化、体验和分析手术实践的新范式。 Abstract: Observing surgical practice has historically relied on fixed vantage points or recollections, leaving the egocentric visual perspectives that guide clinical decisions undocumented. Fixed-camera video can capture surgical workflows at the room-scale, but cannot reconstruct what each team member actually saw. Thus, these videos only provide limited insights into how decisions that affect surgical safety, training, and workflow optimization are made. Here we introduce EgoSurg, the first framework to reconstruct the dynamic, egocentric replays for any operating room (OR) staff directly from wall-mounted fixed-camera video, and thus, without intervention to clinical workflow. EgoSurg couples geometry-driven neural rendering with diffusion-based view enhancement, enabling high-visual fidelity synthesis of arbitrary and egocentric viewpoints at any moment. In evaluation across multi-site surgical cases and controlled studies, EgoSurg reconstructs person-specific visual fields and arbitrary viewpoints with high visual quality and fidelity. By transforming existing OR camera infrastructure into a navigable dynamic 3D record, EgoSurg establishes a new foundation for immersive surgical data science, enabling surgical practice to be visualized, experienced, and analyzed from every angle.

[253] AvatarVTON: 4D Virtual Try-On for Animatable Avatars

Zicheng Jiang,Jixin Gao,Shengfeng He,Xinzhe Li,Yulong Zheng,Zhaotong Yang,Junyu Dong,Yong Du

Main category: cs.CV

TL;DR: 提出AvatarVTON,首个4D虚拟试穿框架,能从单张商品图像生成逼真试穿效果,支持自由姿态控制、新视角渲染和多样化服装选择。

Details Motivation: 现有方法依赖多视角图像或物理先验,难以实现动态服装交互与高真实感,限制了在AR/VR等场景的应用。 Method: 提出两个关键模块:1)无需先验的双向光流校正策略(Reciprocal Flow Rectifier),提升时间一致性;2)非线性变形器(Non-Linear Deformer),分解高斯图以实现自适应非线性服装变形。 Result: 在统一基准上实验表明,AvatarVTON在保真度、多样性及动态服装真实感方面优于现有方法,支持自由姿态与新视角渲染。 Conclusion: AvatarVTON实现了高质量、动态且可控的4D虚拟试穿,适用于AR/VR、游戏和数字人应用。 Abstract: We propose AvatarVTON, the first 4D virtual try-on framework that generates realistic try-on results from a single in-shop garment image, enabling free pose control, novel-view rendering, and diverse garment choices. Unlike existing methods, AvatarVTON supports dynamic garment interactions under single-view supervision, without relying on multi-view garment captures or physics priors. The framework consists of two key modules: (1) a Reciprocal Flow Rectifier, a prior-free optical-flow correction strategy that stabilizes avatar fitting and ensures temporal coherence; and (2) a Non-Linear Deformer, which decomposes Gaussian maps into view-pose-invariant and view-pose-specific components, enabling adaptive, non-linear garment deformations. To establish a benchmark for 4D virtual try-on, we extend existing baselines with unified modules for fair qualitative and quantitative comparisons. Extensive experiments show that AvatarVTON achieves high fidelity, diversity, and dynamic garment realism, making it well-suited for AR/VR, gaming, and digital-human applications.

[254] Flow Matching for Conditional MRI-CT and CBCT-CT Image Synthesis

Arnela Hadzic,Simon Johannes Joham,Martin Urschler

Main category: cs.CV

TL;DR: 本文提出了一种基于3D流匹配(Flow Matching)框架的MRI或CBCT生成合成CT(sCT)的方法,用于支持无MRI和基于CBCT的自适应放射治疗。该方法通过学习速度场将高斯噪声转换为sCT,并在SynthRAD2025挑战赛的数据集上验证了其有效性,能够准确重建整体解剖结构,但在细节保留方面受限于训练分辨率。

Details Motivation: 为了实现MRI-only和CBCT-based的自适应放疗,需要从MRI或CBCT生成高质量的合成CT(sCT),以减少患者辐射暴露并提高治疗精度。现有方法在图像质量和细节恢复上仍有不足。 Method: 采用全3D流匹配(Flow Matching)框架,利用轻量级3D编码器从输入MRI或CBCT中提取特征,条件化地学习将高斯噪声体积积分到sCT图像的流速度场。分别训练MRI→sCT和CBCT→sCT模型,在腹部、头颈和胸部三个区域进行评估。 Result: 在SynthRAD2025挑战赛的基准上验证显示,该方法能准确重建全局解剖结构,但因内存和计算限制导致训练分辨率较低,细部结构保持有限。 Conclusion: 所提出的3D流匹配方法在生成sCT方面具有潜力,能有效重建主要解剖结构,未来将探索基于patch的训练和潜在空间流模型以提升分辨率和局部结构保真度。 Abstract: Generating synthetic CT (sCT) from MRI or CBCT plays a crucial role in enabling MRI-only and CBCT-based adaptive radiotherapy, improving treatment precision while reducing patient radiation exposure. To address this task, we adopt a fully 3D Flow Matching (FM) framework, motivated by recent work demonstrating FM's efficiency in producing high-quality images. In our approach, a Gaussian noise volume is transformed into an sCT image by integrating a learned FM velocity field, conditioned on features extracted from the input MRI or CBCT using a lightweight 3D encoder. We evaluated the method on the SynthRAD2025 Challenge benchmark, training separate models for MRI $\rightarrow$ sCT and CBCT $\rightarrow$ sCT across three anatomical regions: abdomen, head and neck, and thorax. Validation and testing were performed through the challenge submission system. The results indicate that the method accurately reconstructs global anatomical structures; however, preservation of fine details was limited, primarily due to the relatively low training resolution imposed by memory and runtime constraints. Future work will explore patch-based training and latent-space flow models to improve resolution and local structural fidelity.

[255] Beyond Random: Automatic Inner-loop Optimization in Dataset Distillation

Muquan Li,Hang Gou,Dongyang Zhang,Shuang Liang,Xiurui Xie,Deqiang Ouyang,Ke Qin

Main category: cs.CV

TL;DR: 提出了一种名为AT-BPTT的自动截断BPTT框架,通过动态调整截断位置和窗口大小,显著提升数据集蒸馏的效率和性能。

Details Motivation: 现有数据集蒸馏方法依赖随机截断策略,缺乏灵活性且效果不佳;而神经网络在不同训练阶段表现出不同的学习动态,需更智能的截断机制。 Method: 提出AT-BPTT框架,包含三个关键组件:基于概率的阶段感知时间步选择、基于梯度变化的自适应窗口大小策略、以及低秩Hessian近似以降低计算开销。 Result: 在CIFAR-10、CIFAR-100、Tiny-ImageNet和ImageNet-1K上实验表明,相比基线方法平均提升6.16%准确率,内循环优化速度加快3.9倍,并节省63%内存开销。 Conclusion: AT-BPTT通过动态适应神经网络的学习动态,实现了高效且高性能的数据集蒸馏,为深度学习中的优化提供了新思路。 Abstract: The growing demand for efficient deep learning has positioned dataset distillation as a pivotal technique for compressing training dataset while preserving model performance. However, existing inner-loop optimization methods for dataset distillation typically rely on random truncation strategies, which lack flexibility and often yield suboptimal results. In this work, we observe that neural networks exhibit distinct learning dynamics across different training stages-early, middle, and late-making random truncation ineffective. To address this limitation, we propose Automatic Truncated Backpropagation Through Time (AT-BPTT), a novel framework that dynamically adapts both truncation positions and window sizes according to intrinsic gradient behavior. AT-BPTT introduces three key components: (1) a probabilistic mechanism for stage-aware timestep selection, (2) an adaptive window sizing strategy based on gradient variation, and (3) a low-rank Hessian approximation to reduce computational overhead. Extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that AT-BPTT achieves state-of-the-art performance, improving accuracy by an average of 6.16% over baseline methods. Moreover, our approach accelerates inner-loop optimization by 3.9x while saving 63% memory cost.

[256] Detailed Aerial Mapping of Photovoltaic Power Plants Through Semantically Significant Keypoints

Viktor Kozák,Jan Chudoba,Libor Přeučil

Main category: cs.CV

TL;DR: 提出了一种基于航拍图像的光伏电站自动建模方法,通过视觉分割和结构推断实现对光伏组件的精确映射。

Details Motivation: 缺乏准确且最新的光伏电站模型,影响其运维效率。 Method: 利用航拍图像进行光伏组件的视觉分割,识别布局关键点,并融合多幅图像的检测结果以保持结构完整性。 Result: 在两个不同电站上验证了该方法,生成了包含3D位置和语义结构的紧凑地理参考模型。 Conclusion: 该方法可自动化构建高精度光伏电站模型,减少对外部数据依赖,适用于电站维护。 Abstract: An accurate and up-to-date model of a photovoltaic (PV) power plant is essential for its optimal operation and maintenance. However, such a model may not be easily available. This work introduces a novel approach for PV power plant mapping based on aerial overview images. It enables the automation of the mapping process while removing the reliance on third-party data. The presented mapping method takes advantage of the structural layout of the power plants to achieve detailed modeling down to the level of individual PV modules. The approach relies on visual segmentation of PV modules in overview images and the inference of structural information in each image, assigning modules to individual benches, rows, and columns. We identify visual keypoints related to the layout and use these to merge detections from multiple images while maintaining their structural integrity. The presented method was experimentally verified and evaluated on two different power plants. The final fusion of 3D positions and semantic structures results in a compact georeferenced model suitable for power plant maintenance.

[257] From Actions to Kinesics: Extracting Human Psychological States through Bodily Movements

Cheyu Lin,Katherine A. Flanigan

Main category: cs.CV

TL;DR: 提出一种基于3D骨架数据的kinesics识别框架,结合ST-GCN和CNN,通过迁移学习推断人类行为的心理状态,保护隐私且可扩展。

Details Motivation: 传统方法难以泛化且无法动态捕捉人类心理状态,缺乏兼顾隐私与通用性的建模手段。 Method: 结合空间-时间图卷积网络(ST-GCN)与卷积神经网络(CNN),利用迁移学习从3D骨架关节数据中直接推断kinesics功能。 Result: 在DUET数据集上实现了对人类行为的准确、可扩展建模,能有效揭示反映认知与情绪状态的身体运动潜在结构。 Conclusion: 该框架为增强强化学习驱动的人-环境交互模拟提供了新的、以人为本的建模路径。 Abstract: Understanding the dynamic relationship between humans and the built environment is a key challenge in disciplines ranging from environmental psychology to reinforcement learning (RL). A central obstacle in modeling these interactions is the inability to capture human psychological states in a way that is both generalizable and privacy preserving. Traditional methods rely on theoretical models or questionnaires, which are limited in scope, static, and labor intensive. We present a kinesics recognition framework that infers the communicative functions of human activity -- known as kinesics -- directly from 3D skeleton joint data. Combining a spatial-temporal graph convolutional network (ST-GCN) with a convolutional neural network (CNN), the framework leverages transfer learning to bypass the need for manually defined mappings between physical actions and psychological categories. The approach preserves user anonymity while uncovering latent structures in bodily movements that reflect cognitive and emotional states. Our results on the Dyadic User EngagemenT (DUET) dataset demonstrate that this method enables scalable, accurate, and human-centered modeling of behavior, offering a new pathway for enhancing RL-driven simulations of human-environment interaction.

[258] Read the Room: Inferring Social Context Through Dyadic Interaction Recognition in Cyber-physical-social Infrastructure Systems

Cheyu Lin,John Martins,Katherine A. Flanigan,Ph. D

Main category: cs.CV

TL;DR: 本文探讨了如何通过基于骨架的动作识别技术,在保护隐私的前提下,利用深度传感器识别现实世界中的二元人际互动,从而推动网络-物理-社会基础设施系统(CPSIS)实现社会性目标。

Details Motivation: 传统网络-物理系统(CPS)多关注经济与安全目标,忽视了人类中心的社会效益。本文旨在通过理解人与基础设施及人与人之间的互动,弥补这一空白,推动更具社会价值的智能系统发展。 Method: 研究采用深度传感器获取人体骨骼动作数据,比较了五种基于骨架的互动识别算法,并在一个包含12类二元互动的真实数据集上进行评估,这些互动涵盖象征性动作和情感表达等交流类型。 Result: 研究表明,基于骨架的动作识别方法在识别复杂的人际互动方面具有潜力,且相比RGB摄像头更有利于隐私保护,为后续建模社会行为与量化社会收益提供了可行路径。 Conclusion: 将骨架动作分析应用于二元互动识别是实现CPSIS中社会目标的关键步骤,未来系统应结合此类技术以促进积极的社会成果,同时兼顾隐私与文化差异。 Abstract: Cyber-physical systems (CPS) integrate sensing, computing, and control to improve infrastructure performance, focusing on economic goals like performance and safety. However, they often neglect potential human-centered (or ''social'') benefits. Cyber-physical-social infrastructure systems (CPSIS) aim to address this by aligning CPS with social objectives. This involves defining social benefits, understanding human interactions with each other and infrastructure, developing privacy-preserving measurement methods, modeling these interactions for prediction, linking them to social benefits, and actuating the physical environment to foster positive social outcomes. This paper delves into recognizing dyadic human interactions using real-world data, which is the backbone to measuring social behavior. This lays a foundation to address the need to enhance understanding of the deeper meanings and mutual responses inherent in human interactions. While RGB cameras are informative for interaction recognition, privacy concerns arise. Depth sensors offer a privacy-conscious alternative by analyzing skeletal movements. This study compares five skeleton-based interaction recognition algorithms on a dataset of 12 dyadic interactions. Unlike single-person datasets, these interactions, categorized into communication types like emblems and affect displays, offer insights into the cultural and emotional aspects of human interactions.

[259] ERDE: Entropy-Regularized Distillation for Early-exit

Martial Guidez,Stefan Duffner,Yannick Alpou,Oscar Röth,Christophe Garcia

Main category: cs.CV

TL;DR: 提出了一种结合早期退出和知识蒸馏的神经网络压缩方法,通过引入基于熵的损失函数优化学生模型训练,在保持分类性能的同时显著降低计算复杂度。

Details Motivation: 深度神经网络在图像分类中表现出色但计算成本高,难以应用于实时和边缘设备,因此需要有效的压缩技术来降低计算开销。 Method: 将早期退出与知识蒸馏相结合,训练一个简化的学生早期退出模型,从复杂的教师早期退出模型中学习,并引入一种新的基于熵的损失函数,用于处理教师分类错误的样本。 Result: 在CIFAR10、CIFAR100和SVHN数据集上的实验表明,该方法在不牺牲分类性能的前提下显著降低了计算复杂度,优于传统知识蒸馏方法。 Conclusion: 所提出的方法有效平衡了准确性与效率,为知识蒸馏在动态架构和其他应用场景中的研究提供了新方向。 Abstract: Although deep neural networks and in particular Convolutional Neural Networks have demonstrated state-of-the-art performance in image classification with relatively high efficiency, they still exhibit high computational costs, often rendering them impractical for real-time and edge applications. Therefore, a multitude of compression techniques have been developed to reduce these costs while maintaining accuracy. In addition, dynamic architectures have been introduced to modulate the level of compression at execution time, which is a desirable property in many resource-limited application scenarios. The proposed method effectively integrates two well-established optimization techniques: early exits and knowledge distillation, where a reduced student early-exit model is trained from a more complex teacher early-exit model. The primary contribution of this research lies in the approach for training the student early-exit model. In comparison to the conventional Knowledge Distillation loss, our approach incorporates a new entropy-based loss for images where the teacher's classification was incorrect. The proposed method optimizes the trade-off between accuracy and efficiency, thereby achieving significant reductions in computational complexity without compromising classification performance. The validity of this approach is substantiated by experimental results on image classification datasets CIFAR10, CIFAR100 and SVHN, which further opens new research perspectives for Knowledge Distillation in other contexts.

[260] μDeepIQA: deep learning-based fast and robust image quality assessment with local predictions for optical microscopy

Elena Corbetta,Thomas Bocklitz

Main category: cs.CV

TL;DR: 本文提出了一种基于深度学习的光学显微图像质量评估方法μDeepIQA,通过迁移学习重新训练卷积神经网络,实现对显微图像的快速、稳定且具有泛化能力的质量预测,并支持局部区域质量可视化。

Details Motivation: 传统图像质量评估方法在处理大规模数据时计算成本高,且对非理想条件下的图像稳定性差,难以满足光学显微成像研究中可靠分析的需求。 Method: 采用为自然图像设计的深度卷积神经网络架构,通过对该网络进行重训练,使其能够预测光学显微图像的单项质量指标和整体质量评分,并实现图像块级别的质量评估。 Result: μDeepIQA能够在标准方法失效的非理想范围内仍保持稳定的质量估计性能,具备快速预测、抗异常值干扰以及空间质量分布可视化的能力。 Conclusion: 深度学习模型在光学显微图像质量评估中展现出优越的泛化性与实用性,可显著提升显微成像数据分析的可靠性与效率。 Abstract: Optical microscopy is one of the most widely used techniques in research studies for life sciences and biomedicine. These applications require reliable experimental pipelines to extract valuable knowledge from the measured samples and must be supported by image quality assessment (IQA) to ensure correct processing and analysis of the image data. IQA methods are implemented with variable complexity. However, while most quality metrics have a straightforward implementation, they might be time consuming and computationally expensive when evaluating a large dataset. In addition, quality metrics are often designed for well-defined image features and may be unstable for images out of the ideal domain. To overcome these limitations, recent works have proposed deep learning-based IQA methods, which can provide superior performance, increased generalizability and fast prediction. Our method, named $\mathrm{\mu}$DeepIQA, is inspired by previous studies and applies a deep convolutional neural network designed for IQA on natural images to optical microscopy measurements. We retrained the same architecture to predict individual quality metrics and global quality scores for optical microscopy data. The resulting models provide fast and stable predictions of image quality by generalizing quality estimation even outside the ideal range of standard methods. In addition, $\mathrm{\mu}$DeepIQA provides patch-wise prediction of image quality and can be used to visualize spatially varying quality in a single image. Our study demonstrates that optical microscopy-based studies can benefit from the generalizability of deep learning models due to their stable performance in the presence of outliers, the ability to assess small image patches, and rapid predictions.

[261] In-Field Mapping of Grape Yield and Quality with Illumination-Invariant Deep Learning

Ciem Cornelissen,Sander De Coninck,Axel Willekens,Sam Leroux,Pieter Simoens

Main category: cs.CV

TL;DR: 本文提出了一种端到端、基于物联网的机器人系统,用于葡萄园中葡萄产量和品质(糖度、酸度)的非破坏性、实时、空间分辨映射。

Details Motivation: 解决田间高光谱成像中由于光照变化引起的“域偏移”问题,实现稳定可靠的葡萄品质评估。 Method: 系统结合了高性能的葡萄串检测与重量估计模型,以及基于高光谱数据的新型深度学习框架;采用Light-Invariant Spectral Autoencoder (LISA)这一领域对抗架构,从未经校准的数据中学习光照不变特征。 Result: 在包含三种不同光照条件的数据集上验证,系统实现了82%的葡萄串检测召回率和0.76的重量预测R²值;LISA模块使品质预测泛化性能相比基线提升超过20%。 Conclusion: 该系统能够生成高分辨率、地理配准的葡萄产量与品质数据,为精准葡萄栽培提供可操作的数据支持。 Abstract: This paper presents an end-to-end, IoT-enabled robotic system for the non-destructive, real-time, and spatially-resolved mapping of grape yield and quality (Brix, Acidity) in vineyards. The system features a comprehensive analytical pipeline that integrates two key modules: a high-performance model for grape bunch detection and weight estimation, and a novel deep learning framework for quality assessment from hyperspectral (HSI) data. A critical barrier to in-field HSI is the ``domain shift" caused by variable illumination. To overcome this, our quality assessment is powered by the Light-Invariant Spectral Autoencoder (LISA), a domain-adversarial framework that learns illumination-invariant features from uncalibrated data. We validated the system's robustness on a purpose-built HSI dataset spanning three distinct illumination domains: controlled artificial lighting (lab), and variable natural sunlight captured in the morning and afternoon. Results show the complete pipeline achieves a recall (0.82) for bunch detection and a $R^2$ (0.76) for weight prediction, while the LISA module improves quality prediction generalization by over 20% compared to the baselines. By combining these robust modules, the system successfully generates high-resolution, georeferenced data of both grape yield and quality, providing actionable, data-driven insights for precision viticulture.

[262] BenthiCat: An opti-acoustic dataset for advancing benthic classification and habitat mapping

Hayat Rajani,Valerio Franchi,Borja Martinez-Clavel Valles,Raimon Ramos,Rafael Garcia,Nuno Gracias

Main category: cs.CV

TL;DR: 本文介绍了一个大规模多模态海底栖息地映射数据集,包含约一百万张侧扫声呐图像、测深图和自主水下航行器(AUV)获取的共注册光学图像,并提供了约36,000个带分割标注的声呐图像,旨在推动水下栖息地分类与多传感器融合的机器学习研究。

Details Motivation: 由于缺乏大规模标注数据集,限制了机器学习在海底栖息地映射中的发展和基准测试,因此需要构建一个标准化、多模态且公开的数据资源。 Method: 收集了西班牙加泰罗尼亚海岸的大规模侧扫声呐(SSS)图像,并结合测深图和AUV采集的光学图像;对约36,000张SSS图像进行了人工分割标注,实现监督学习;通过空间配准实现多模态数据融合,并提供开源预处理与标注工具。 Result: 发布了一个包含原始传感器数据、拼接图像和标注的多模态数据集,支持监督学习与自监督跨模态表征学习,为水下栖息地分类提供了可公开访问的基准数据集。 Conclusion: 该数据集为海底栖息地映射建立了标准化基准,有助于推动自主海床分类和多传感器集成技术的发展。 Abstract: Benthic habitat mapping is fundamental for understanding marine ecosystems, guiding conservation efforts, and supporting sustainable resource management. Yet, the scarcity of large, annotated datasets limits the development and benchmarking of machine learning models in this domain. This paper introduces a thorough multi-modal dataset, comprising about a million side-scan sonar (SSS) tiles collected along the coast of Catalonia (Spain), complemented by bathymetric maps and a set of co-registered optical images from targeted surveys using an autonomous underwater vehicle (AUV). Approximately \num{36000} of the SSS tiles have been manually annotated with segmentation masks to enable supervised fine-tuning of classification models. All the raw sensor data, together with mosaics, are also released to support further exploration and algorithm development. To address challenges in multi-sensor data fusion for AUVs, we spatially associate optical images with corresponding SSS tiles, facilitating self-supervised, cross-modal representation learning. Accompanying open-source preprocessing and annotation tools are provided to enhance accessibility and encourage research. This resource aims to establish a standardized benchmark for underwater habitat mapping, promoting advancements in autonomous seafloor classification and multi-sensor integration.

[263] Comparative Analysis of YOLOv5, Faster R-CNN, SSD, and RetinaNet for Motorbike Detection in Kigali Autonomous Driving Context

Ngeyen Yinkfu,Sunday Nwovu,Jonathan Kayizzi,Angelique Uwamahoro

Main category: cs.CV

TL;DR: 本研究在卢旺达基加利使用198张自建图像数据集比较了YOLOv5、Faster R-CNN、SSD和RetinaNet四种目标检测模型在摩托车检测上的性能,旨在评估其在资源受限环境下自动驾驶系统的适用性。

Details Motivation: 由于基加利的摩托车出租车行为不可预测且常无视交通规则,给自动驾驶系统带来挑战,因此需要有效检测摩托车以提升安全性与导航能力。 Method: 采用PyTorch框架并结合迁移学习方法,在自建数据集上训练和评估四种目标检测模型(YOLOv5、Faster R-CNN、SSD、RetinaNet),评估指标包括准确率、定位能力和推理速度。 Result: 论文未明确给出各模型的具体性能排序,但分析了实现中的挑战,如数据集规模有限和模型复杂度高,影响了在资源受限环境下的实时应用。 Conclusion: 建议未来工作采用简化模型架构,以提高在卢旺达等发展中国家自动驾驶系统的可及性和实用性。 Abstract: In Kigali, Rwanda, motorcycle taxis are a primary mode of transportation, often navigating unpredictably and disregarding traffic rules, posing significant challenges for autonomous driving systems. This study compares four object detection models--YOLOv5, Faster R-CNN, SSD, and RetinaNet--for motorbike detection using a custom dataset of 198 images collected in Kigali. Implemented in PyTorch with transfer learning, the models were evaluated for accuracy, localization, and inference speed to assess their suitability for real-time navigation in resource-constrained settings. We identify implementation challenges, including dataset limitations and model complexities, and recommend simplified architectures for future work to enhance accessibility for autonomous systems in developing countries like Rwanda.

[264] A Semantics-Aware Hierarchical Self-Supervised Approach to Classification of Remote Sensing Images

Giulio Weikmann,Gianmarco Perantoni,Lorenzo Bruzzone

Main category: cs.CV

TL;DR: 提出了一种语义感知的层次共识方法(SAHC),用于遥感图像分类,通过整合层次特定的分类头和可训练的层次矩阵,在深度网络中学习层次特征与关系,并引入层次共识机制确保跨层次的概率一致性。

Details Motivation: 现有方法多忽略预定义的标签层次结构,仅关注细粒度分类,未能充分利用类间的语义关系。 Method: 设计了SAHC方法,包含多个针对不同粒度级别的分类头,引入可学习的层次矩阵以自监督方式建模层次结构,并采用层次共识机制作为加权集成,保证跨层级预测的一致性。 Result: 在三个具有不同层次复杂度的基准数据集上验证了方法的有效性,结合多种主干网络展示了其适应性,实验结果表明该方法能有效引导网络学习并提升分类性能。 Conclusion: SAHC能够有效利用标签层次结构,在遥感图像分类任务中表现出良好的鲁棒性和泛化能力,为层次化分类提供了新的解决方案。 Abstract: Deep learning has become increasingly important in remote sensing image classification due to its ability to extract semantic information from complex data. Classification tasks often include predefined label hierarchies that represent the semantic relationships among classes. However, these hierarchies are frequently overlooked, and most approaches focus only on fine-grained classification schemes. In this paper, we present a novel Semantics-Aware Hierarchical Consensus (SAHC) method for learning hierarchical features and relationships by integrating hierarchy-specific classification heads within a deep network architecture, each specialized in different degrees of class granularity. The proposed approach employs trainable hierarchy matrices, which guide the network through the learning of the hierarchical structure in a self-supervised manner. Furthermore, we introduce a hierarchical consensus mechanism to ensure consistent probability distributions across different hierarchical levels. This mechanism acts as a weighted ensemble being able to effectively leverage the inherent structure of the hierarchical classification task. The proposed SAHC method is evaluated on three benchmark datasets with different degrees of hierarchical complexity on different tasks, using distinct backbone architectures to effectively emphasize its adaptability. Experimental results show both the effectiveness of the proposed approach in guiding network learning and the robustness of the hierarchical consensus for remote sensing image classification tasks.

[265] REN: Anatomically-Informed Mixture-of-Experts for Interstitial Lung Disease Diagnosis

Alec K. Peltekian,Halil Ertugrul Aktas,Gorkem Durak,Kevin Grudzinski,Bradford C. Bemiss,Carrie Richardson,Jane E. Dematte,G. R. Scott Budinger,Anthony J. Esposito,Alexander Misharin,Alok Choudhary,Ankit Agrawal,Ulas Bagci

Main category: cs.CV

TL;DR: 提出了一种新的解剖学信息引导的混合专家模型(REN),用于间质性肺病分类,结合放射组学和深度学习特征,显著提升了性能与可解释性。

Details Motivation: 传统MoE模型缺乏医学影像中所需的解剖结构先验知识,难以有效建模区域病理变化,因此需要一种结合解剖约束的专用框架。 Method: 设计了七个针对不同肺叶和双侧肺组合的专家网络,并采用多模态门控机制融合放射组学生物标志物与深度学习特征(CNN、ViT、Mamba)动态加权专家输出。 Result: 在ILD分类任务中,REN平均AUC达0.8646,比SwinUNETR基线提升12.5%(p=0.031);下叶专家AUC达0.88–0.90,优于传统DL模型,且符合已知疾病进展模式。 Conclusion: REN是一种可扩展、具临床可解释性的解剖引导模型,在患者级交叉验证中表现出强泛化能力,可推广至其他结构化医学影像任务。 Abstract: Mixture-of-Experts (MoE) architectures have significantly contributed to scalable machine learning by enabling specialized subnetworks to tackle complex tasks efficiently. However, traditional MoE systems lack domain-specific constraints essential for medical imaging, where anatomical structure and regional disease heterogeneity strongly influence pathological patterns. Here, we introduce Regional Expert Networks (REN), the first anatomically-informed MoE framework tailored specifically for medical image classification. REN leverages anatomical priors to train seven specialized experts, each dedicated to distinct lung lobes and bilateral lung combinations, enabling precise modeling of region-specific pathological variations. Multi-modal gating mechanisms dynamically integrate radiomics biomarkers and deep learning (DL) features (CNN, ViT, Mamba) to weight expert contributions optimally. Applied to interstitial lung disease (ILD) classification, REN achieves consistently superior performance: the radiomics-guided ensemble reached an average AUC of 0.8646 +/- 0.0467, a +12.5 percent improvement over the SwinUNETR baseline (AUC 0.7685, p = 0.031). Region-specific experts further revealed that lower-lobe models achieved AUCs of 0.88-0.90, surpassing DL counterparts (CNN: 0.76-0.79) and aligning with known disease progression patterns. Through rigorous patient-level cross-validation, REN demonstrates strong generalizability and clinical interpretability, presenting a scalable, anatomically-guided approach readily extensible to other structured medical imaging applications.

[266] Unsupervised Active Learning via Natural Feature Progressive Framework

Yuxi Liu,Catherine Lalman,Yimin Yang

Main category: cs.CV

TL;DR: 提出了一种新的无监督主动学习方法NFPF,通过特定特征学习机(SFLM)和重构差异度量来更有效地选择重要样本,在性能上优于现有UAL方法并媲美监督主动学习方法。

Details Motivation: 现有无监督主动学习(UAL)方法依赖局部梯度评分和浅层线性选择,难以准确评估样本重要性,且对噪声数据敏感,无法充分覆盖数据分布。 Method: 提出自然特征渐进框架(NFPF),核心是使用特定特征学习机(SFLM)量化样本对模型性能的贡献,并利用SFLM定义重构差异度量进行初始样本选择,实现更优的数据代表性与鲁棒性。 Result: 实验表明NFPF在多个视觉数据集上显著优于现有UAL方法,性能媲美监督主动学习方法,消融研究和可视化结果验证了其优越性、鲁棒性和更好的数据分布覆盖能力。 Conclusion: NFPF为无监督主动学习提供了新范式,有效解决了传统方法在样本重要性评估和选择机制上的局限,具有更强的实用潜力。 Abstract: The effectiveness of modern deep learning models is predicated on the availability of large-scale, human-annotated datasets, a process that is notoriously expensive and time-consuming. While Active Learning (AL) offers a strategic solution by labeling only the most informative and representative data, its iterative nature still necessitates significant human involvement. Unsupervised Active Learning (UAL) presents an alternative by shifting the annotation burden to a single, post-selection step. Unfortunately, prevailing UAL methods struggle to achieve state-of-the-art performance. These approaches typically rely on local, gradient-based scoring for sample importance estimation, which not only makes them vulnerable to ambiguous and noisy data but also hinders their capacity to select samples that adequately represent the full data distribution. Moreover, their use of shallow, one-shot linear selection falls short of a true UAL paradigm. In this paper, we propose the Natural Feature Progressive Framework (NFPF), a UAL method that revolutionizes how sample importance is measured. At its core, NFPF employs a Specific Feature Learning Machine (SFLM) to effectively quantify each sample's contribution to model performance. We further utilize the SFLM to define a powerful Reconstruction Difference metric for initial sample selection. Our comprehensive experiments show that NFPF significantly outperforms all established UAL methods and achieves performance on par with supervised AL methods on vision datasets. Detailed ablation studies and qualitative visualizations provide compelling evidence for NFPF's superior performance, enhanced robustness, and improved data distribution coverage.

[267] Bidirectional Mammogram View Translation with Column-Aware and Implicit 3D Conditional Diffusion

Xin Li,Kaixiang Yang,Qiang Li,Zhiwei Wang

Main category: cs.CV

TL;DR: 提出了一种基于条件扩散模型的双视图乳腺X线图像转换框架CA3D-Diff,结合列感知交叉注意力和隐式3D结构重建,有效解决跨视图结构错位问题,在视觉保真度和结构一致性上优于现有方法,并提升单视图恶性肿瘤分类性能。

Details Motivation: 在实际临床中,乳腺X线双视图(CC和MLO)常因采集错误或伪影导致某一视图缺失或质量下降,影响诊断效果,因此需要一种鲁棒的方法来恢复缺失视图并保持解剖结构一致性。 Method: 提出CA3D-Diff框架:1)设计列感知交叉注意力机制,利用解剖结构在不同视图中多位于相似列位置的特性,通过高斯衰减偏置增强局部列相关性;2)引入隐式3D结构重建模块,将噪声2D潜在表示反投影为粗略3D特征体,结合乳腺投影几何结构,在去噪UNet中注入优化后的3D信息以指导生成。 Result: 实验表明CA3D-Diff在双向视图转换任务中优于现有最先进方法,具有更高的视觉保真度和结构一致性;合成视图可显著提升筛查场景下单视图恶性肿瘤分类性能。 Conclusion: CA3D-Diff通过引入列感知注意力和隐式3D结构建模,有效解决了乳腺X线图像跨视图转换中的非刚性形变和组织重叠挑战,具备良好的临床应用潜力。 Abstract: Dual-view mammography, including craniocaudal (CC) and mediolateral oblique (MLO) projections, offers complementary anatomical views crucial for breast cancer diagnosis. However, in real-world clinical workflows, one view may be missing, corrupted, or degraded due to acquisition errors or compression artifacts, limiting the effectiveness of downstream analysis. View-to-view translation can help recover missing views and improve lesion alignment. Unlike natural images, this task in mammography is highly challenging due to large non-rigid deformations and severe tissue overlap in X-ray projections, which obscure pixel-level correspondences. In this paper, we propose Column-Aware and Implicit 3D Diffusion (CA3D-Diff), a novel bidirectional mammogram view translation framework based on conditional diffusion model. To address cross-view structural misalignment, we first design a column-aware cross-attention mechanism that leverages the geometric property that anatomically corresponding regions tend to lie in similar column positions across views. A Gaussian-decayed bias is applied to emphasize local column-wise correlations while suppressing distant mismatches. Furthermore, we introduce an implicit 3D structure reconstruction module that back-projects noisy 2D latents into a coarse 3D feature volume based on breast-view projection geometry. The reconstructed 3D structure is refined and injected into the denoising UNet to guide cross-view generation with enhanced anatomical awareness. Extensive experiments demonstrate that CA3D-Diff achieves superior performance in bidirectional tasks, outperforming state-of-the-art methods in visual fidelity and structural consistency. Furthermore, the synthesized views effectively improve single-view malignancy classification in screening settings, demonstrating the practical value of our method in real-world diagnostics.

[268] SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization

Théophane Vallaeys,Jakob Verbeek,Matthieu Cord

Main category: cs.CV

TL;DR: 本文提出了一种新的单步扩散解码器SSDD,无需对抗损失即可实现比KL-VAE更高的重建质量和更快的采样速度,可作为KL-VAE的即插即用替代方案。

Details Motivation: 现有基于KL-VAE的tokenizer依赖对抗损失且解码较慢,扩散解码器虽更优但训练不稳定且需迭代采样,因此需要一种高效、稳定且快速的替代方案。 Method: 设计了一种新的像素扩散解码器架构,结合Transformer组件并采用无GAN的训练方式,通过知识蒸馏将多步扩散模型性能迁移到单步解码器上。 Result: SSDD在无对抗损失的情况下,将重建FID从0.87提升至0.50,吞吐量提高1.4倍,在生成质量不变的前提下采样速度提升3.8倍。 Conclusion: SSDD是首个无需对抗损失、专为单步重建优化的扩散解码器,兼具高质量重建与高速采样,可广泛用于构建更优的生成模型。 Abstract: Tokenizers are a key component of state-of-the-art generative image models, extracting the most important features from the signal while reducing data dimension and redundancy. Most current tokenizers are based on KL-regularized variational autoencoders (KL-VAE), trained with reconstruction, perceptual and adversarial losses. Diffusion decoders have been proposed as a more principled alternative to model the distribution over images conditioned on the latent. However, matching the performance of KL-VAE still requires adversarial losses, as well as a higher decoding time due to iterative sampling. To address these limitations, we introduce a new pixel diffusion decoder architecture for improved scaling and training stability, benefiting from transformer components and GAN-free training. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction trained without adversarial losses, reaching higher reconstruction quality and faster sampling than KL-VAE. In particular, SSDD improves reconstruction FID from $0.87$ to $0.50$ with $1.4\times$ higher throughput and preserve generation quality of DiTs with $3.8\times$ faster sampling. As such, SSDD can be used as a drop-in replacement for KL-VAE, and for building higher-quality and faster generative models.

[269] ActiveMark: on watermarking of visual foundation models via massive activations

Anna Chistyakova,Mikhail Pautov

Main category: cs.CV

TL;DR: 本文提出一种针对视觉基础模型(VFM)的版权验证方法,通过在少量表达性层和编码器-解码器网络中嵌入数字水印来实现所有权验证,且水印在下游任务微调后仍可检测。

Details Motivation: 由于视觉基础模型训练成本高,其开发者需要保护知识产权,防止用户非法 redistribution,因此亟需可靠的版权验证技术。 Method: 通过微调VFM的一小组表达性层,并结合一个小型编码器-解码器网络,将数字水印嵌入到一组保留图像的内部表示中,以实现所有权验证。 Result: 理论和实验表明,该方法对非水印模型的误检率低,对水印模型的漏检率也低,且水印在功能副本(如下游任务微调后)中仍可检测。 Conclusion: 所提出的方法能有效验证视觉基础模型的所有权,具有低误检和低漏检的优势,适用于保护VFMs的知识产权。 Abstract: Being trained on large and vast datasets, visual foundation models (VFMs) can be fine-tuned for diverse downstream tasks, achieving remarkable performance and efficiency in various computer vision applications. The high computation cost of data collection and training motivates the owners of some VFMs to distribute them alongside the license to protect their intellectual property rights. However, a dishonest user of the protected model's copy may illegally redistribute it, for example, to make a profit. As a consequence, the development of reliable ownership verification tools is of great importance today, since such methods can be used to differentiate between a redistributed copy of the protected model and an independent model. In this paper, we propose an approach to ownership verification of visual foundation models by fine-tuning a small set of expressive layers of a VFM along with a small encoder-decoder network to embed digital watermarks into an internal representation of a hold-out set of input images. Importantly, the watermarks embedded remain detectable in the functional copies of the protected model, obtained, for example, by fine-tuning the VFM for a particular downstream task. Theoretically and experimentally, we demonstrate that the proposed method yields a low probability of false detection of a non-watermarked model and a low probability of false misdetection of a watermarked model.

[270] Latent Uncertainty Representations for Video-based Driver Action and Intention Recognition

Koen Vellenga,H. Joe Steinhauer,Jonas Andersson,Anders Sjögren

Main category: cs.CV

TL;DR: 提出了一种基于潜在表示的不确定性估计方法(LUR和RLUR),在保持分类性能的同时,高效地进行OOD检测。

Details Motivation: 现有的最后一层概率深度学习方法在检测分布外样本时性能不稳定,且训练复杂、调参困难。 Method: 在预训练DNN后添加变换层,生成多个潜在表示以估计不确定性,并引入排斥训练改进LUR。 Result: LUR和RLUR在分类性能上与其他方法相当,在OOD检测上表现优异,且训练更高效、更易调参;同时贡献了NuScenes数据集的新标注。 Conclusion: LUR是一种高效、易用的不确定性估计方法,适用于资源受限的安全关键场景。 Abstract: Deep neural networks (DNNs) are increasingly applied to safety-critical tasks in resource-constrained environments, such as video-based driver action and intention recognition. While last layer probabilistic deep learning (LL-PDL) methods can detect out-of-distribution (OOD) instances, their performance varies. As an alternative to last layer approaches, we propose extending pre-trained DNNs with transformation layers to produce multiple latent representations to estimate the uncertainty. We evaluate our latent uncertainty representation (LUR) and repulsively trained LUR (RLUR) approaches against eight PDL methods across four video-based driver action and intention recognition datasets, comparing classification performance, calibration, and uncertainty-based OOD detection. We also contribute 28,000 frame-level action labels and 1,194 video-level intention labels for the NuScenes dataset. Our results show that LUR and RLUR achieve comparable in-distribution classification performance to other LL-PDL approaches. For uncertainty-based OOD detection, LUR matches top-performing PDL methods while being more efficient to train and easier to tune than approaches that require Markov-Chain Monte Carlo sampling or repulsive training procedures.

[271] Exploring the Efficacy of Modified Transfer Learning in Identifying Parkinson's Disease Through Drawn Image Patterns

Nabil Daiyan,Md Rakibul Haque

Main category: cs.CV

TL;DR: 提出了一种基于机器学习的方法,利用手绘螺旋和波浪图像作为生物标志物来检测帕金森病。

Details Motivation: 早期诊断帕金森病至关重要,但传统方法繁琐且昂贵,因此需要一种非侵入性、低成本的诊断方法。 Method: 采用卷积神经网络、迁移学习和注意力机制,并通过数据增强增加训练样本多样性;模型包含预训练CNN、自定义卷积层和集成硬投票三个阶段。 Result: 螺旋图像的精确率、召回率和F1分数加权平均为90%;波浪图像为96.67%;通过集成硬投票后整体准确率达到93.3%。 Conclusion: 该研究表明机器学习在帕金森病早期诊断中具有潜力,提供了一种非侵入性且经济有效的解决方案。 Abstract: Parkinson's disease (PD) is a progressive neurodegenerative condition characterized by the death of dopaminergic neurons, leading to various movement disorder symptoms. Early diagnosis of PD is crucial to prevent adverse effects, yet traditional diagnostic methods are often cumbersome and costly. In this study, a machine learning-based approach is proposed using hand-drawn spiral and wave images as potential biomarkers for PD detection. Our methodology leverages convolutional neural networks (CNNs), transfer learning, and attention mechanisms to improve model performance and resilience against overfitting. To enhance the diversity and richness of both spiral and wave categories, the training dataset undergoes augmentation to increase the number of images. The proposed architecture comprises three phases: utilizing pre-trained CNNs, incorporating custom convolutional layers, and ensemble voting. Employing hard voting further enhances performance by aggregating predictions from multiple models. Experimental results show promising accuracy rates. For spiral images, weighted average precision, recall, and F1-score are 90%, and for wave images, they are 96.67%. After combining the predictions through ensemble hard voting, the overall accuracy is 93.3%. These findings underscore the potential of machine learning in early PD diagnosis, offering a non-invasive and cost-effective solution to improve patient outcomes.

[272] Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Yunlong Tang,Jing Bi,Pinxin Liu,Zhenyu Pan,Zhangyun Tan,Qianxiang Shen,Jiani Liu,Hang Hua,Junjia Guo,Yunzhong Xiao,Chao Huang,Zhiyuan Wang,Susan Liang,Xinyi Liu,Yizhi Song,Yuhe Nie,Jia-Xing Zhong,Bozheng Li,Daiqing Qi,Ziyun Zeng,Ali Vosoughi,Luchuan Song,Zeliang Zhang,Daiki Shimada,Han Liu,Jiebo Luo,Chenliang Xu

Main category: cs.CV

TL;DR: 本文综述了视频大视觉语言模型(Video-LMMs)的后训练方法,提出了一个涵盖监督微调、强化学习和测试时扩展的系统性框架,旨在提升模型在复杂时空推理与多模态理解方面的能力。

Details Motivation: 尽管Video-LMM在视频理解上取得进展,但其后训练阶段的方法分散且缺乏统一框架,亟需系统性梳理以推动进一步发展。 Method: 提出一个包含监督微调(SFT)、基于可验证目标的强化学习(RL)和测试时扩展(TTS)的三支柱分类体系,并对各技术的角色、关联及针对视频特性的适应性进行结构化分析。 Result: 建立了首个关于Video-LMM后训练方法的全面综述,明确了关键设计原则、评估协议,并整理了常用数据集与基准;指出了奖励设计、可扩展性和成本性能优化等开放问题。 Conclusion: 该研究为Video-LMM的后训练提供了统一框架,有助于研究人员系统理解现有方法并推动未来在高效、强推理能力视频理解模型上的发展。 Abstract: Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training

[273] SegMASt3R: Geometry Grounded Segment Matching

Rohit Jayanti,Swayam Agrawal,Vansh Garg,Siddharth Tourani,Muhammad Haris Khan,Sourav Garg,Madhava Krishna

Main category: cs.CV

TL;DR: 本文提出了一种利用3D基础模型的空间理解能力进行宽基线段匹配的新方法,能够在极端视角变化下(最大180度)有效匹配图像间的语义区域,显著优于现有方法,并在3D实例分割和图像目标导航等下游任务中展现出优势。

Details Motivation: 为了克服传统关键点匹配在遮挡、光照变化和大视角变换下的局限性,需要一种更鲁棒的跨图像区域对应方法。 Method: 利用3D基础模型的归纳偏置设计了一个新架构,用于匹配具有极大视角差异的图像对中的段落。 Result: 在ScanNet++和Replica数据集上,该方法在AUPRC指标上比现有最先进方法(如SAM2视频传播器和局部特征匹配方法)高出最多30%,并在3D实例分割和图像目标导航任务中表现优异。 Conclusion: 所提出的方法在宽基线段匹配中表现出色,具备强鲁棒性,且能有效提升多种下游视觉任务的性能。 Abstract: Segment matching is an important intermediate task in computer vision that establishes correspondences between semantically or geometrically coherent regions across images. Unlike keypoint matching, which focuses on localized features, segment matching captures structured regions, offering greater robustness to occlusions, lighting variations, and viewpoint changes. In this paper, we leverage the spatial understanding of 3D foundation models to tackle wide-baseline segment matching, a challenging setting involving extreme viewpoint shifts. We propose an architecture that uses the inductive bias of these 3D foundation models to match segments across image pairs with up to 180 degree view-point change. Extensive experiments show that our approach outperforms state-of-the-art methods, including the SAM2 video propagator and local feature matching methods, by upto 30% on the AUPRC metric, on ScanNet++ and Replica datasets. We further demonstrate benefits of the proposed model on relevant downstream tasks, including 3D instance segmentation and image-goal navigation. Project Page: https://segmast3r.github.io/

[274] No-reference Quality Assessment of Contrast-distorted Images using Contrast-enhanced Pseudo Reference

Mohammad-Ali Mahmoudpour,Saeed Mahmoudpour

Main category: cs.CV

TL;DR: 本文提出了一种针对对比度失真图像的无参考图像质量评估(NR-IQA)指标,通过生成内容适配的伪参考图像,将NR问题转化为全参考(FR)评估,显著提升了评估精度。

Details Motivation: 对比度变化严重影响图像质量,但现有图像质量评估方法大多忽视了对比度失真,因其视觉影响和特性不同于模糊、噪声等传统失真类型。因此,亟需一种专门针对对比度失真的有效评估方法。 Method: 提出一种基于伪参考图像的无参考质量评估方法:首先利用多种对比度增强算法生成伪参考图像;构建大规模增强图像数据集,训练分类网络以根据图像内容和失真类型选择最优增强算法;最后采用全参考方式比较伪参考图像与失真图像的质量差异。 Result: 在包含对比度失真的三个数据库(CCID2014、TID2013、CSIQ)上的实验表明,所提方法相较于现有主流IQA指标具有更优的评估性能,相关系数更高,表现出良好的预测准确性和稳定性。 Conclusion: 该方法通过内容自适应的伪参考图像生成策略,成功将无参考评估转化为更精确的全参考评估,为对比度失真图像的质量评估提供了一个有效且性能优越的新方案。 Abstract: Contrast change is an important factor that affects the quality of images. During image capturing, unfavorable lighting conditions can cause contrast change and visual quality loss. While various methods have been proposed to assess the quality of images under different distortions such as blur and noise, contrast distortion has been largely overlooked as its visual impact and properties are different from other conventional types of distortions. In this paper, we propose a no-reference image quality assessment (NR-IQA) metric for contrast-distorted images. Using a set of contrast enhancement algorithms, we aim to generate pseudo-reference images that are visually close to the actual reference image, such that the NR problem is transformed to a Full-reference (FR) assessment with higher accuracy. To this end, a large dataset of contrast-enhanced images is produced to train a classification network that can select the most suitable contrast enhancement algorithm based on image content and distortion for pseudo-reference image generation. Finally, the evaluation is performed in the FR manner to assess the quality difference between the contrast-enhanced (pseudoreference) and degraded images. Performance evaluation of the proposed method on three databases containing contrast distortions (CCID2014, TID2013, and CSIQ), indicates the promising performance of the proposed method.

[275] Neuroplastic Modular Framework: Cross-Domain Image Classification of Garbage and Industrial Surfaces

Debojyoti Ghosh,Soumya K Ghosh,Adrijit Goswami

Main category: cs.CV

TL;DR: 提出了一种名为Neuroplastic Modular Classifier的新型混合架构,结合ResNet-50、Vision Transformer和FAISS相似性检索,具有动态扩展的模块化设计,显著提升了垃圾分类和工业表面缺陷检测的准确性和适应性。

Details Motivation: 为了在动态环境中实现高效且准确的废物和工业表面缺陷分类,提升模型对复杂数据的适应能力和泛化性能。 Method: 采用ResNet-50进行局部特征提取,结合Vision Transformer捕获全局语义信息,并引入FAISS-based相似性检索增强特征空间;模型采用受生物学习系统启发的神经可塑性模块化设计,可在训练中动态增长。 Result: 在垃圾分类和KolektorSDD2工业缺陷检测数据集上均优于传统静态模型,展现出更高的准确性和适应性。 Conclusion: Neuroplastic Modular Classifier为现实世界的图像分类提供了可扩展、高性能的解决方案,在环境与工业领域均有广泛应用前景。 Abstract: Efficient and accurate classification of waste and industrial surface defects is essential for ensuring sustainable waste management and maintaining high standards in quality control. This paper introduces the Neuroplastic Modular Classifier, a novel hybrid architecture designed for robust and adaptive image classification in dynamic environments. The model combines a ResNet-50 backbone for localized feature extraction with a Vision Transformer (ViT) to capture global semantic context. Additionally, FAISS-based similarity retrieval is incorporated to provide a memory-like reference to previously encountered data, enriching the model's feature space. A key innovation of our architecture is the neuroplastic modular design composed of expandable, learnable blocks that dynamically grow during training when performance plateaus. Inspired by biological learning systems, this mechanism allows the model to adapt to data complexity over time, improving generalization. Beyond garbage classification, we validate the model on the Kolektor Surface Defect Dataset 2 (KolektorSDD2), which involves industrial defect detection on metal surfaces. Experimental results across domains show that the proposed architecture outperforms traditional static models in both accuracy and adaptability. The Neuroplastic Modular Classifier offers a scalable, high-performance solution for real-world image classification, with strong applicability in both environmental and industrial domains.

[276] Factuality Matters: When Image Generation and Editing Meet Structured Visuals

Le Zhuo,Songhao Han,Yuandong Pu,Boxiang Qiu,Sayak Paul,Yue Liao,Yihao Liu,Jie Shao,Xi Chen,Si Liu,Hongsheng Li

Main category: cs.CV

TL;DR: 本文提出了首个针对结构化视觉内容(如图表、数学图形)生成与编辑的系统性研究,包括大规模数据集构建、统一模型训练及新评估基准StructBench和StructScore。

Details Motivation: 现有视觉生成模型在生成自然图像方面表现优异,但在生成或编辑需要逻辑结构、文本渲染和多模态推理的结构化图像时存在困难,缺乏事实准确性。 Method: 构建了包含130万高质量结构化图像对的数据集,并结合VLM与FLUX.1 Kontext通过轻量级连接器进行融合;采用三阶段训练策略并结合推理时外部推理器增强。 Result: 提出了StructBench(含1700多个挑战性实例)和StructScore评估指标;实验显示当前主流模型表现不佳,而所提模型在编辑任务上表现优异,推理时增强有效提升性能。 Conclusion: 该工作推动了面向结构化视觉内容的统一多模态基础模型发展,且公开了数据集、模型与基准以促进后续研究。 Abstract: While modern visual generation models excel at creating aesthetically pleasing natural images, they struggle with producing or editing structured visuals like charts, diagrams, and mathematical figures, which demand composition planning, text rendering, and multimodal reasoning for factual fidelity. To address this, we present the first comprehensive, systematic investigation of this domain, encompassing data construction, model training, and an evaluation benchmark. First, we construct a large-scale dataset of 1.3 million high-quality structured image pairs derived from executable drawing programs and augmented with chain-of-thought reasoning annotations. Building on it, we train a unified model that integrates a VLM with FLUX.1 Kontext via a lightweight connector for enhanced multimodal understanding. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation, further boosted by an external reasoner at inference time. Finally, we introduce StructBench, a novel benchmark for generation and editing with over 1,700 challenging instances, and an accompanying evaluation metric, StructScore, which employs a multi-round Q\&A protocol to assess fine-grained factual accuracy. Evaluations of 15 models reveal that even leading closed-source systems remain far from satisfactory. Our model attains strong editing performance, and inference-time reasoning yields consistent gains across diverse architectures. By releasing the dataset, model, and benchmark, we aim to advance unified multimodal foundations for structured visuals.

[277] Character Mixing for Video Generation

Tingting Liao,Chongjian Ge,Guangyi Liu,Hao Li,Yi Zhou

Main category: cs.CV

TL;DR: 本文提出了一种新的文本到视频生成框架,通过跨角色嵌入(CCE)和跨角色增强(CCA)技术,实现不同风格角色间的自然交互,同时保持其身份、行为和风格的一致性。

Details Motivation: 研究在文本到视频生成中角色跨语境交互的挑战,特别是角色身份与行为的保持以及混合风格导致的风格错乱问题。 Method: 提出Cross-Character Embedding(CCE)来学习多模态数据中的角色身份和行为逻辑,并设计Cross-Character Augmentation(CCA)以合成共存和混合风格数据增强训练。 Result: 在包含10个卡通与真人角色的基准上实验表明,该方法显著提升了身份保持、交互质量和对风格错乱的鲁棒性。 Conclusion: 所提框架有效支持了异构角色间的自然交互,推动了生成式叙事的新形式。 Abstract: Imagine Mr. Bean stepping into Tom and Jerry--can we generate videos where characters interact naturally across different worlds? We study inter-character interaction in text-to-video generation, where the key challenge is to preserve each character's identity and behaviors while enabling coherent cross-context interaction. This is difficult because characters may never have coexisted and because mixing styles often causes style delusion, where realistic characters appear cartoonish or vice versa. We introduce a framework that tackles these issues with Cross-Character Embedding (CCE), which learns identity and behavioral logic across multimodal sources, and Cross-Character Augmentation (CCA), which enriches training with synthetic co-existence and mixed-style data. Together, these techniques allow natural interactions between previously uncoexistent characters without losing stylistic fidelity. Experiments on a curated benchmark of cartoons and live-action series with 10 characters show clear improvements in identity preservation, interaction quality, and robustness to style delusion, enabling new forms of generative storytelling.Additional results and videos are available on our project page: https://tingtingliao.github.io/mimix/.

[278] VChain: Chain-of-Visual-Thought for Reasoning in Video Generation

Ziqi Huang,Ning Yu,Gordon Chen,Haonan Qiu,Paul Debevec,Ziwei Liu

Main category: cs.CV

TL;DR: 提出VChain,一种在推理时通过多模态大模型生成关键帧来引导视频生成的链式视觉思维框架,提升复杂动态视频的生成质量。

Details Motivation: 现有视频生成模型难以合成具有连贯因果链的复杂动态,而多模态模型具备较强的视觉状态推理能力,因此需要结合二者优势。 Method: 构建VChain框架,利用多模态大模型生成稀疏关键帧作为视觉推理信号,并在这些关键时间点对预训练视频生成器进行稀疏的推理时调优。 Result: 在复杂多步场景中实验表明,VChain显著提升了生成视频的质量,且调优高效、开销小、无需密集监督。 Conclusion: VChain有效融合了多模态模型的推理能力与视频生成模型的渲染能力,为复杂动态视频生成提供了高效可行的新思路。 Abstract: Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.