cs.CL

[1] Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs

Pranav Bhandari,Nicolas Fay,Sanjeevan Selvaganapathy,Amitava Datta,Usman Naseem,Mehwish Nasim

Main category: cs.CL

TL;DR: Proposes a new pipeline that uses the Big Five personality traits to extract hidden states from transformer models and align personality. Low-rank subspace discovery identifies the optimal layers in different architectures, and dynamic layer selection then enables flexible behavioural steering, giving effective control over personality expression in LLM outputs.

Details Motivation: LLMs exhibit implicit personalities during generation, but reliably controlling or aligning these traits remains an open problem. Existing work lacks mechanisms for effectively manipulating model behaviour during generation, and the relationship between psychological personality constructs and the model's internal representations is still unclear. Method: Based on the Big Five traits (OCEAN), hidden-state activations are extracted from each transformer layer, low-rank subspace discovery is applied to identify the key layers associated with specific traits in different architectures, and a flexible steering framework with dynamic layer selection is built for precise control of personality features in model outputs. Result: Personality traits are found to lie in a low-rank shared subspace, and these latent structures can be turned into actionable steering mechanisms through careful perturbations, controlling personality expression without affecting fluency, variance, or general capabilities. Conclusion: The work bridges the gap between psychological theory and practical model alignment, supports the feasibility of personality-aware LLMs, and offers a scalable and robust route to personality-controllable LLMs. Abstract: Large Language Models exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge. The need for effective mechanisms for behavioural manipulation of the model during generation is a critical gap in the literature that needs to be filled. Personality-aware LLMs hold a promising direction towards this objective. However, the relationship between these psychological constructs and their representations within LLMs remains underexplored and requires further investigation. Moreover, it is intriguing to understand and study the use of these representations to steer the models' behaviour. We propose a novel pipeline that extracts hidden state activations from transformer layers using the Big Five Personality Traits (Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism), a comprehensive and empirically validated framework for modelling human personality, applies low-rank subspace discovery methods, and identifies trait-specific optimal layers across different model architectures for robust injection. The resulting personality-aligned directions are then operationalised through a flexible steering framework with dynamic layer selection, enabling precise control of trait expression in LLM outputs. Our findings reveal that personality traits occupy a low-rank shared subspace, and that these latent structures can be transformed into actionable mechanisms for effective steering through careful perturbations without impacting the fluency, variance and general capabilities, helping to bridge the gap between psychological theory and practical model alignment.
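The low-rank subspace and steering steps in this entry can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `low_rank_subspace` and `steer`, the plain SVD over paired activation differences, and the additive update are all hypothetical choices made for the example.

```python
import numpy as np


def low_rank_subspace(diffs, rank):
    """Return an orthonormal basis (rank x d) for the dominant directions.

    diffs: (n_pairs, d) array of activation differences, e.g. hidden states
    for high-trait prompts minus hidden states for low-trait prompts
    (an assumed data layout, not taken from the paper).
    """
    # Rows of vt are right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:rank]


def steer(hidden, basis, coeffs):
    """Add a weighted combination of basis directions to a hidden state."""
    return hidden + coeffs @ basis


rng = np.random.default_rng(0)
diffs = rng.normal(size=(20, 8))        # synthetic stand-in for real activations
basis = low_rank_subspace(diffs, rank=2)
steered = steer(np.zeros(8), basis, np.array([1.0, 0.0]))
```

In a real model the perturbation would be applied to one layer's hidden state during the forward pass; which layer to use is exactly what the paper's layer selection is meant to decide.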

[2] TextualVerifier: Verify TextGrad Step-by-Step

Eugenius Mario Situmorang,Adila Alfa Krisnadhi,Ari Wibisono

Main category: cs.CL

TL;DR: Proposes TextualVerifier, a textual self-verification framework for TextGrad that uses LLM chain-of-thought reasoning and majority voting to improve the validity of text-based decisions.

Details Motivation: TextGrad lacks a self-verification mechanism that ensures text-based decisions are sound, which limits its reliability in complex AI systems. Method: A four-stage workflow (chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation) is integrated non-invasively into TextGrad at the loss function and optimization result verification stages. Result: In standalone evaluation on PRM800K, the validity of reasoning steps improves by 29 percent; integrated with TextGrad, accuracy rises on several benchmarks, with gains between 2.2 and 10.71 percentage points and a moderate overhead in LLM calls. Conclusion: TextualVerifier is the first framework to provide self-verification for TextGrad without numerical gradients; it makes text-based optimization more reliable and opens new directions for verification research. Abstract: TextGrad is a novel approach to text-based automatic differentiation that enables composite AI systems to perform optimization without explicit numerical equations. However, it currently lacks self-verification mechanisms that ensure reasoning validity in text-based decision making. This research introduces TextualVerifier, a verification framework that leverages chain-of-thought reasoning and majority voting with large language models to address this verification gap. TextualVerifier implements a four-stage workflow: chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation. It integrates non-invasively with TextGrad at both the loss function and optimization result verification stages. Experimental evaluation using the Gemini 1.5 Pro model is conducted in two phases: (1) standalone evaluation on PRM800K, and (2) integrated evaluation with TextGrad on GPQA-Diamond, MMLU-ML, and MMLU-CP benchmarks. Results show statistically significant improvements (p < 0.001). In phase one, TextualVerifier improves the validity of reasoning steps by 29 percent. In phase two, integration into TextGrad loss function yields a 2.2 percentage point gain from 68.2 to 70.4 percent with a moderate overhead of 5.9 LLM calls on average. Further evaluations of TextualVerifier versioning yield 8.08, 10.71, and 3.92 percentage point improvements on GPQA, MMLU-ML, and MMLU-CP respectively. TextualVerifier thus presents the first self-verification framework for TextGrad through LLM-based techniques without requiring numerical gradients, enabling more reliable reasoning and opening new directions for verification in text-based optimization.
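The voting stage of the workflow can be illustrated with a small sketch. It assumes each reasoning step has already been rewritten into several candidate variants; the names `majority_vote` and `verify_steps` and the `generate_variants` callable are hypothetical, and a real system would call an LLM where the stub is used here.

```python
from collections import Counter


def majority_vote(variants):
    """Return the most common variant and the fraction of votes it received.

    Ties are broken by first appearance, which is how Counter.most_common
    orders equal counts.
    """
    if not variants:
        raise ValueError("need at least one variant to vote on")
    winner, votes = Counter(variants).most_common(1)[0]
    return winner, votes / len(variants)


def verify_steps(steps, generate_variants, n_variants=3):
    """Vote on variants of each step; generate_variants is a stand-in for an LLM call."""
    verified = []
    for step in steps:
        variants = generate_variants(step, n_variants)
        verified.append(majority_vote(variants))
    return verified


winner, agreement = majority_vote(["valid", "invalid", "valid"])
```

The agreement fraction is one possible signal for the consensus aggregation stage; how the paper aggregates across steps is not specified in the abstract.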

[3] GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation

Stergios Chatzikyriakidis,Dimitris Papadakis,Sevasti-Ioanna Papaioannou,Erofili Psaltaki

Main category: cs.CL

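TL;DR: Presents GRDD+, an extended Greek dialectal dataset that adds several Greek varieties to the original GRDD and is the largest and most varied dataset of its kind to date. The study also evaluates the effect of high-quality dialectal data on several LLMs.

Details Motivation: To address the limited variety coverage and size of existing Greek dialectal datasets and to advance dialectal NLP research. Method: More Greek dialectal data was collected and merged to build GRDD+, which covers 10 varieties and about 6.37 million words; several LLMs were then fine-tuned on it to evaluate performance. Result: GRDD+ is currently the largest and most dialect-rich Greek dataset; the fine-tuning experiments indicate that high-quality dialectal data significantly improves LLM performance on dialect tasks. Conclusion: GRDD+ is an important resource for research on Greek dialects and confirms the key role of dialectal data quality in LLM performance. Abstract: We present an extended Greek Dialectal Dataset (GRDD+) that complements the existing GRDD dataset with more data from Cretan, Cypriot, Pontic and Northern Greek, while we add six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevusa Greek. The result is a dataset with total size 6,374,939 words and 10 varieties. This is the first dataset with such variation and size to date. We conduct a number of fine-tuning experiments to see the effect of good quality dialectal data on a number of LLMs. We fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results to frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).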

[4] PLLuM: A Family of Polish Large Language Models

Jan Kocoń,Maciej Piasecki,Arkadiusz Janz,Teddy Ferdinan,Łukasz Radliński,Bartłomiej Koptyra,Marcin Oleksy,Stanisław Woźniak,Paweł Walkowiak,Konrad Wojtasik,Julia Moska,Tomasz Naskręt,Bartosz Walkowiak,Mateusz Gniewkowski,Kamil Szyc,Dawid Motyka,Dawid Banach,Jonatan Dalasiński,Ewa Rudnicka,Bartłomiej Alberski,Tomasz Walkowiak,Aleksander Szczęsny,Maciej Markiewicz,Tomasz Bernaś,Hubert Mazur,Kamil Żyta,Mateusz Tykierko,Grzegorz Chodak,Tomasz Kajdanowicz,Przemysław Kazienko,Agnieszka Karlińska,Karolina Seweryn,Anna Kołos,Maciej Chrabąszcz,Katarzyna Lorenc,Aleksandra Krasnodębska,Artur Wilczek,Katarzyna Dziewulska,Paula Betscher,Zofia Cieślińska,Katarzyna Kowol,Daria Mikoś,Maciej Trzciński,Dawid Krutul,Marek Kozłowski,Sławomir Dadas,Rafał Poświata,Michał Perełkiewicz,Małgorzata Grębowiec,Maciej Kazuła,Marcin Białas,Roman Roszko,Danuta Roszko,Jurgita Vaičenonienė,Andrius Utka,Paweł Levchuk,Paweł Kowalski,Irena Prawdzic-Jankowska,Maciej Ogrodniczuk,Monika Borys,Anna Bulińska,Wiktoria Gumienna,Witold Kieraś,Dorota Komosińska,Katarzyna Krasnowska-Kieraś,Łukasz Kobyliński,Martyna Lewandowska,Marek Łaziński,Mikołaj Łątkowski,Dawid Mastalerz,Beata Milewicz,Agnieszka Anna Mykowiecka,Angelika Peljak-Łapińska,Sandra Penno,Zuzanna Przybysz,Michał Rudolf,Piotr Rybak,Karolina Saputa,Aleksandra Tomaszewska,Aleksander Wawer,Marcin Woliński,Joanna Wołoszyn,Alina Wróblewska,Bartosz Żuk,Filip Żarnecki,Konrad Kaczyński,Anna Cichosz,Zuzanna Deckert,Monika Garnys,Izabela Grabarczyk,Wojciech Janowski,Sylwia Karasińska,Aleksandra Kujawiak,Piotr Misztela,Maria Szymańska,Karolina Walkusz,Igor Siek,Jakub Kwiatkowski,Piotr Pęzik

Main category: cs.CL

TL;DR: PLLuM is the largest open-source family of large language models for Polish, developed jointly by major Polish research institutions to address the limited support for non-English languages in LLM development.

Details Motivation: Current LLMs are predominantly English-centric and offer limited support for other languages, so there is a need for high-quality, transparent non-English models that meet local cultural needs. Method: A 140-billion-token Polish text corpus was built for pre-training, together with a 77k custom instructions dataset and a 100k preference optimization dataset; a Responsible AI framework adds strict data governance and a hybrid module for output correction and safety filtering. Result: Base and instruction-tuned models were developed, shown to be useful in a downstream task within public administration, and released publicly to foster open research and sovereign AI technologies in Poland. Conclusion: PLLuM fills the gap in Polish LLMs and advances multilingual AI, while underlining the importance of data security and cultural fit. Abstract: Large Language Models (LLMs) play a central role in modern artificial intelligence, yet their development has been primarily focused on English, resulting in limited support for other languages. We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. Developed by a consortium of major Polish research institutions, PLLuM addresses the need for high-quality, transparent, and culturally relevant language models beyond the English-centric commercial landscape. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training, a 77k custom instructions dataset, and a 100k preference optimization dataset. A key component is a Responsible AI framework that incorporates strict data governance and a hybrid module for output correction and safety filtering. We detail the models' architecture, training procedures, and alignment techniques for both base and instruction-tuned variants, and demonstrate their utility in a downstream task within public administration. By releasing these models publicly, PLLuM aims to foster open research and strengthen sovereign AI technologies in Poland.

[5] STARS: Segment-level Token Alignment with Rejection Sampling in Large Language Models

Mohammad Atif Quamar,Mohammad Areeb,Mikhail Kuznetsov,Muslum Ozgur Ozmen,Z. Berkay Celik

Main category: cs.CL

TL;DR: Proposes STARS, a decoding-time algorithm that samples, scores, and accepts or rejects token segments during generation to align language models efficiently and with high quality.

Details Motivation: Existing alignment methods such as fine-tuning are computationally expensive, while inference-time methods such as Best-of-N need impractically large amounts of computation to reach optimal alignment. Method: STARS is a segment-level token alignment method based on rejection sampling: at decoding time it iteratively samples, scores, and accepts or rejects short, fixed-size token segments, which corrects the generation path early. Result: Across six LLMs, STARS improves win-rates over supervised fine-tuning by up to 14.9 percentage points and over direct preference optimization by up to 4.3 percentage points, while remaining competitive with strong Best-of-N baselines. Conclusion: STARS offers a generalizable, robust, and efficient approach to LLM alignment and an effective alternative to traditional fine-tuning and full-sequence ranking. Abstract: Aligning large language models with human values is crucial for their safe deployment; however, existing methods, such as fine-tuning, are computationally expensive and suboptimal. In contrast, inference-time approaches like Best-of-N sampling require practically infeasible computation to achieve optimal alignment. We propose STARS: Segment-level Token Alignment with Rejection Sampling, a decoding-time algorithm that steers model generation by iteratively sampling, scoring, and rejecting/accepting short, fixed-size token segments. This allows for early correction of the generation path, significantly improving computational efficiency and boosting alignment quality. Across a suite of six LLMs, we show that STARS outperforms Supervised Fine-Tuning (SFT) by up to 14.9 percentage points and Direct Preference Optimization (DPO) by up to 4.3 percentage points on win-rates, while remaining highly competitive with strong Best-of-N baselines. Our work establishes granular, reward-guided sampling as a generalizable, robust, and efficient alternative to traditional fine-tuning and full-sequence ranking methods for aligning LLMs.
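The sample, score, accept-or-reject loop can be sketched with toy components. The abstract does not give the acceptance rule, so this example assumes a fixed reward threshold with a fallback to the best segment seen after `max_tries` attempts; `segment_rejection_decode`, `sample_segment`, and `reward` are hypothetical names standing in for a language model and a reward model.

```python
import random


def segment_rejection_decode(sample_segment, reward, threshold,
                             n_segments, max_tries=8, seed=0):
    """Build a sequence segment by segment, rejecting low-reward segments.

    sample_segment(prefix, rng) -> list of tokens (stand-in for an LLM sampler)
    reward(tokens) -> float      (stand-in for a reward model)
    """
    rng = random.Random(seed)
    prefix = []
    for _ in range(n_segments):
        best, best_score = None, float("-inf")
        for _ in range(max_tries):
            segment = sample_segment(prefix, rng)
            score = reward(prefix + segment)
            if score > best_score:
                best, best_score = segment, score
            if score >= threshold:   # accept: stop resampling this segment
                break
        prefix = prefix + best       # fall back to the best segment seen
    return prefix


def toy_sampler(prefix, rng):
    """Toy sampler that emits a one-token segment, 'good' or 'bad'."""
    return [rng.choice(["good", "bad"])]


def toy_reward(tokens):
    """Toy reward: fraction of 'good' tokens in the sequence so far."""
    return tokens.count("good") / len(tokens)


output = segment_rejection_decode(toy_sampler, toy_reward,
                                  threshold=0.5, n_segments=4)
```

Because rejection happens per segment rather than per full sequence, a poor continuation costs only one short resample, which is the efficiency argument the entry makes against full-sequence Best-of-N.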

[6] Divide, Cache, Conquer: Dichotomic Prompting for Efficient Multi-Label LLM-Based Classification

Mikołaj Langner,Jan Eliasz,Ewa Rudnicka,Jan Kocoń

Main category: cs.CL

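TL;DR: Proposes an efficient multi-label text classification method based on binary decisions: the task is split into independent yes/no queries which, combined with prefix caching, substantially improve inference efficiency without loss of accuracy.

Details Motivation: Conventional multi-label classification with LLMs is inefficient, especially when the output structure is complex, so a more efficient and scalable classification framework is needed. Method: Multi-label classification is reformulated as a series of independent binary (yes/no) questions, with each label dimension queried separately and inference optimized through prefix caching; LLM-to-SLM distillation uses a large model to produce multiple annotations that are then used to fine-tune smaller models. Result: The method is validated on 24 affective dimensions; the fine-tuned small models significantly outperform zero-shot baselines, particularly on labels seen during training, and inference becomes faster while accuracy is maintained. Conclusion: Decomposing multi-label classification into binary queries, combined with distillation and cache-aware inference, yields a scalable and efficient framework for LLM-based classification that applies across domains. Abstract: We introduce a method for efficient multi-label text classification with large language models (LLMs), built on reformulating classification tasks as sequences of dichotomic (yes/no) decisions. Instead of generating all labels in a single structured response, each target dimension is queried independently, which, combined with a prefix caching mechanism, yields substantial efficiency gains for short-text inference without loss of accuracy. To demonstrate the approach, we focus on affective text analysis, covering 24 dimensions including emotions and sentiment. Using LLM-to-SLM distillation, a powerful annotator model (DeepSeek-V3) provides multiple annotations per text, which are aggregated to fine-tune smaller models (HerBERT-Large, CLARIN-1B, PLLuM-8B, Gemma3-1B). The fine-tuned models show significant improvements over zero-shot baselines, particularly on the dimensions seen during training. Our findings suggest that decomposing multi-label classification into dichotomic queries, combined with distillation and cache-aware inference, offers a scalable and effective framework for LLM-based classification. While we validate the method on affective states, the approach is general and applicable across domains.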
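The decomposition into yes/no queries can be sketched as prompt construction in which every query shares the same prefix, which is what lets a serving stack with prefix caching reuse the work done on the shared text. The prompt wording and the names `build_dichotomic_prompts`, `classify`, and `ask` are hypothetical; `ask` stands in for a model call.

```python
def build_dichotomic_prompts(text, labels):
    """Return the shared prefix and one yes/no prompt per label.

    Keeping the input text at the start of every prompt means the
    label-specific part is only the final question.
    """
    prefix = f"Text: {text}\nAnswer yes or no.\n"
    prompts = [f"{prefix}Does the text express {label}?" for label in labels]
    return prefix, prompts


def classify(text, labels, ask):
    """Query each label independently; ask(prompt) -> str is a stand-in for an LLM."""
    _, prompts = build_dichotomic_prompts(text, labels)
    return {
        label: ask(prompt).strip().lower().startswith("yes")
        for label, prompt in zip(labels, prompts)
    }


def fake_ask(prompt):
    """Stub model used only for illustration: says yes to the 'joy' question."""
    return "Yes" if prompt.endswith("joy?") else "No"


prefix, prompts = build_dichotomic_prompts("I won the match!", ["joy", "anger"])
labels_found = classify("I won the match!", ["joy", "anger"], fake_ask)
```

Each label costs one short query, so the total number of calls grows linearly with the number of dimensions; the entry's claim is that caching the shared prefix keeps this cheap for short texts.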

[7] Evaluating Machine Translation Datasets for Low-Web Data Languages: A Gendered Lens

Hellina Hailu Nigatu,Bethelhem Yemane Mamo,Bontu Fufa Balcha,Debora Taye Tesfaye,Elbethel Daniel Zewdie,Ikram Behiru Nesiru,Jitu Ewnetu Hailu,Senait Mengesha Yayo

Main category: cs.CL

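TL;DR: Examines the quality of machine translation datasets for three low-resourced languages (Afan Oromo, Amharic, and Tigrinya) with a focus on gender representation. The datasets show a strong male skew and harmful, toxic depictions of women, underscoring that quantity does not equal quality and that content quality and social impact deserve attention when building datasets for low-resourced languages.

Details Motivation: As low-resourced languages are increasingly included in NLP research, collecting large-scale datasets has become a priority, but quantity is often favoured over quality, which can lead to poorly performing technology and the spread of societal biases. The paper aims to expose the data quality problems behind this trend, especially gender representation bias. Method: Machine translation training data and benchmark datasets for three low-resourced languages (Afan Oromo, Amharic, Tigrinya) are analysed: their domain distributions are compared, and gender skew and stereotypes are examined systematically in names of persons, the grammatical gender of verbs, and textual depictions. Result: Training data comes largely from political and religious domains, whereas benchmark datasets focus on news, health, and sports; the data is male-dominated throughout and contains harmful and toxic depictions of women, and these problems are more severe for the language with the most data. Conclusion: Quantity of data does not guarantee quality; current datasets for low-resourced languages show significant gender bias and potential for harm, so quality review and mitigation of harmful content should begin at the data collection stage. Abstract: As low-resourced languages are increasingly incorporated into NLP research, there is an emphasis on collecting large-scale datasets. But in prioritizing quantity over quality, we risk 1) building language technologies that perform poorly for these languages and 2) producing harmful content that perpetuates societal biases. In this paper, we investigate the quality of Machine Translation (MT) datasets for three low-resourced languages--Afan Oromo, Amharic, and Tigrinya, with a focus on the gender representation in the datasets. Our findings demonstrate that while training data has a large representation of political and religious domain text, benchmark datasets are focused on news, health, and sports. We also found a large skew towards the male gender--in names of persons, the grammatical gender of verbs, and in stereotypical depictions in the datasets. Further, we found harmful and toxic depictions against women, which were more prominent for the language with the largest amount of data, underscoring that quantity does not guarantee quality. We hope that our work inspires further inquiry into the datasets collected for low-resourced languages and prompts early mitigation of harmful content. WARNING: This paper contains discussion of NSFW content that some may find disturbing.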

[8] GRAD: Graph-Retrieved Adaptive Decoding for Hallucination Mitigation

Manh Nguyen,Sunil Gupta,Dai Do,Hung Le

Main category: cs.CL

TL;DR: Proposes GRAD, a training-free decoding-time method that builds a sparse token transition graph and uses statistical evidence from a corpus to reduce hallucination in LLMs.

Details Motivation: Existing methods based on prompting or retrieving external knowledge are fragile, domain-sensitive, or costly, so a lightweight and effective way to bring evidence into generation is needed to reduce hallucination. Method: GRAD builds a sparse token transition graph by accumulating next-token logits over a small retrieved corpus in a single forward pass; during decoding, the graph-retrieved logits are max-normalized and adaptively fused with the model's own logits to favour high-evidence continuations while preserving fluency. Result: Experiments on three models and a range of question-answering benchmarks (covering intrinsic and extrinsic hallucination and factuality tasks) show that, compared with greedy decoding, GRAD achieves up to 9.7% higher intrinsic accuracy, 8.6% lower hallucination rates, and 6.9% greater correctness, and attains the highest truth-informativeness product score among all methods. Conclusion: GRAD is a lightweight, plug-and-play method showing that statistical evidence from corpus-level token transitions can steer generation toward more truthful and verifiable outputs, making it a strong alternative to contrastive decoding and knowledge graph augmentation. Abstract: Hallucination mitigation remains a persistent challenge for large language models (LLMs), even as model scales grow. Existing approaches often rely on external knowledge sources, such as structured databases or knowledge graphs, accessed through prompting or retrieval. However, prompt-based grounding is fragile and domain-sensitive, while symbolic knowledge integration incurs heavy retrieval and formatting costs. Motivated by knowledge graphs, we introduce Graph-Retrieved Adaptive Decoding (GRAD), a decoding-time method that grounds generation in corpus-derived evidence without retraining. GRAD constructs a sparse token transition graph by accumulating next-token logits across a small retrieved corpus in a single forward pass. During decoding, graph-retrieved logits are max-normalized and adaptively fused with model logits to favor high-evidence continuations while preserving fluency. Across three models and a range of question-answering benchmarks spanning intrinsic, extrinsic hallucination, and factuality tasks, GRAD consistently surpasses baselines, achieving up to 9.7% higher intrinsic accuracy, 8.6% lower hallucination rates, and 6.9% greater correctness compared to greedy decoding, while attaining the highest truth--informativeness product score among all methods. GRAD offers a lightweight, plug-and-play alternative to contrastive decoding and knowledge graph augmentation, demonstrating that statistical evidence from corpus-level token transitions can effectively steer generation toward more truthful and verifiable outputs.
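The fusion step at decoding time can be sketched as follows. The abstract says graph-retrieved logits are max-normalized and adaptively fused with model logits but does not give the formula, so this example assumes division by the largest absolute value and a constant weight `alpha` in place of the adaptive weighting; `max_normalize` and `fuse_logits` are hypothetical names.

```python
import numpy as np


def max_normalize(graph_logits):
    """Scale graph evidence so its largest absolute value is 1 (assumed form)."""
    graph_logits = np.asarray(graph_logits, dtype=float)
    peak = np.max(np.abs(graph_logits))
    if peak == 0:
        return np.zeros_like(graph_logits)   # no evidence for this step
    return graph_logits / peak


def fuse_logits(model_logits, graph_logits, alpha=1.0):
    """Add normalized graph evidence to model logits with a fixed weight.

    A constant alpha is a simplification; the paper describes the fusion
    as adaptive.
    """
    return np.asarray(model_logits, dtype=float) + alpha * max_normalize(graph_logits)


model_logits = np.array([1.0, 1.0, 0.0])
graph_logits = np.array([0.0, 4.0, 2.0])     # accumulated corpus evidence (toy values)
fused = fuse_logits(model_logits, graph_logits)
next_token = int(np.argmax(fused))
```

Here the model alone is undecided between tokens 0 and 1, and the corpus evidence breaks the tie in favour of token 1.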

[9] Context informs pragmatic interpretation in vision-language models

Alvin Wei Ming Tan,Ben Prystawski,Veronica Boyce,Michael C. Frank

Main category: cs.CL

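TL;DR: Studies how humans and vision-language models perform in iterated reference games; relevant context greatly improves model performance, but few-shot tasks with abstract referents remain challenging for models.

Details Motivation: To explore agents' ability to perform context-sensitive pragmatic reasoning in multi-turn linguistic environments. Method: Humans and vision-language models are tested on iterated reference games while the amount, order, and relevance of the given context are varied. Result: Without relevant context, models perform above chance but far worse than humans; with relevant context, model performance increases markedly over trials. Conclusion: Relevant context is crucial to model performance, but current models remain limited on few-shot tasks with abstract referents. Abstract: Iterated reference games - in which players repeatedly pick out novel referents using language - present a test case for agents' ability to perform context-sensitive pragmatic reasoning in multi-turn linguistic environments. We tested humans and vision-language models on trials from iterated reference games, varying the given context in terms of amount, order, and relevance. Without relevant context, models were above chance but substantially worse than humans. However, with relevant context, model performance increased dramatically over trials. Few-shot reference games with abstract referents remain a difficult task for machine learning models.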

[10] The Human Flourishing Geographic Index: A County-Level Dataset for the United States, 2013--2023

Stefano M. Iacus,Devika Jain,Andrea Nasuto,Giuseppe Porro,Marcello Carammia,Andrea Vezzulli

Main category: cs.CL

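TL;DR: Introduces the Human Flourishing Geographic Index (HFGI), built from 2.6 billion U.S. tweets and analysed with large language models across multiple dimensions of well-being, giving insight into societal well-being at high spatial and temporal resolution.

Details Motivation: Existing measures of human flourishing lack fine spatial and temporal resolution and rely largely on traditional economic indicators, so they do not fully reflect the real state of societal well-being. Method: About 2.6 billion geolocated U.S. tweets from 2013-2023 are classified with fine-tuned large language models across 48 indicators aligned with Harvard's Global Flourishing Study framework, plus attitudes towards migration and perception of corruption, producing monthly and yearly indices at county and state level. Result: The resulting HFGI has high spatial and temporal resolution; validation shows that it accurately represents the underlying constructs and correlates as expected with established indicators. Conclusion: HFGI is a new high-resolution resource for multidisciplinary research on well-being, inequality, and social change in the United States, and reveals the dynamics of human flourishing in social media discourse over the past decade. Abstract: Quantifying human flourishing, a multidimensional construct including happiness, health, purpose, virtue, relationships, and financial stability, is critical for understanding societal well-being beyond economic indicators. Existing measures often lack fine spatial and temporal resolution. Here we introduce the Human Flourishing Geographic Index (HFGI), derived from analyzing approximately 2.6 billion geolocated U.S. tweets (2013-2023) using fine-tuned large language models to classify expressions across 48 indicators aligned with Harvard's Global Flourishing Study framework plus attitudes towards migration and perception of corruption. The dataset offers monthly and yearly county- and state-level indicators of flourishing-related discourse, validated to confirm that the measures accurately represent the underlying constructs and show expected correlations with established indicators. This resource enables multidisciplinary analyses of well-being, inequality, and social change at unprecedented resolution, offering insights into the dynamics of human flourishing as reflected in social media discourse across the United States over the past decade.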

[11] Direct Semantic Communication Between Large Language Models via Vector Translation

Fu-Chun Yang,Jason Eshraghian

Main category: cs.CL

TL;DR: Proposes building a latent semantic bridge between large language models through vector translation, enabling cross-model transfer of meaning and more efficient information exchange in multi-agent systems.

Details Motivation: Existing multi-agent communication passes only tokens, discarding most latent semantics, which limits information transfer and adds computational overhead. Method: A dual-encoder translator learns a mapping between Llama-2-7B and Mistral-7B-Instruct, translating vectors between their latent spaces; the translated vectors are injected into the target model at 30 percent blending strength. Result: The average cosine alignment reaches 0.538, and bidirectional evaluation shows a 2.01:1 transfer asymmetry, indicating that general-purpose models yield more transferable representations than instruction-tuned ones. Conclusion: Cross-model latent communication is feasible and allows semantic-level collaboration while preserving computational stability, opening a new path for multi-agent AI systems. Abstract: In multi-agent settings, such as debate, reflection, or tool-calling, large language models (LLMs) pass messages as plain tokens, discarding most latent semantics. This constrains information transfer and adds unnecessary computational overhead. We form a latent bridge via vector translations, which use learned mappings that enable direct semantic exchange between representation spaces. A dual-encoder translator trained between Llama-2-7B and Mistral-7B-Instruct attains an average cosine alignment of 0.538. Injecting the translated vectors at 30 percent blending strength steers the target model's generation without destabilizing logits. Bidirectional evaluation shows a 2.01:1 transfer asymmetry, indicating that general-purpose models yield more transferable representations than instruction-tuned variants. This conservative injection preserves computational stability while demonstrating that cross-model latent communication is feasible, enabling collaborative AI systems that share meaning rather than tokens.
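The two quantities reported in this entry, cosine alignment and blending strength, can be made concrete with a short sketch. It assumes that "30 percent blending strength" means a convex combination of the target model's hidden state and the translated vector; that reading, and the names `cosine` and `blend`, are assumptions for illustration, and the translator itself is not modelled.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def blend(target_hidden, translated, strength=0.3):
    """Mix a translated vector into the target hidden state.

    strength=0.3 keeps 70% of the original state (assumed interpretation
    of the blending strength quoted in the abstract).
    """
    return (1.0 - strength) * target_hidden + strength * translated


target_hidden = np.array([1.0, 0.0, 0.0])
translated = np.array([0.0, 1.0, 0.0])       # toy output of a translator
mixed = blend(target_hidden, translated)
alignment = cosine(mixed, target_hidden)
```

A low strength keeps the blended state close to the original, which matches the abstract's description of the injection as conservative.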

[12] Abductive Inference in Retrieval-Augmented Language Models: Generating and Validating Missing Premises

Shiyin Lin

Main category: cs.CL

TL;DR: Proposes a framework that integrates abductive inference into retrieval-augmented LLMs: it detects insufficient evidence, generates candidate premises, and validates them for consistency and plausibility, improving answer accuracy and reasoning faithfulness.

Details Motivation: Existing retrieval-augmented generation (RAG) systems tend to fail when the retrieved evidence is incomplete, leaving gaps in the reasoning process, so a mechanism that fills these gaps is needed to make such systems more robust. Method: The framework has three steps: detecting insufficient evidence, generating plausible missing premises, and validating those premises through consistency and plausibility checks. It combines retrieval augmentation with abductive inference. Result: Experiments on abductive reasoning and multi-hop QA benchmarks show improvements in both answer accuracy and reasoning faithfulness. Conclusion: Bringing abductive inference into RAG systems is a promising way to improve robustness and explainability in knowledge-intensive tasks. Abstract: Large Language Models (LLMs) enhanced with retrieval -- commonly referred to as Retrieval-Augmented Generation (RAG) -- have demonstrated strong performance in knowledge-intensive tasks. However, RAG pipelines often fail when retrieved evidence is incomplete, leaving gaps in the reasoning process. In such cases, abductive inference -- the process of generating plausible missing premises to explain observations -- offers a principled approach to bridge these gaps. In this paper, we propose a framework that integrates abductive inference into retrieval-augmented LLMs. Our method detects insufficient evidence, generates candidate missing premises, and validates them through consistency and plausibility checks. Experimental results on abductive reasoning and multi-hop QA benchmarks show that our approach improves both answer accuracy and reasoning faithfulness. This work highlights abductive inference as a promising direction for enhancing the robustness and explainability of RAG systems.

[13] WST: Weakly Supervised Transducer for Automatic Speech Recognition

Dongji Gao,Chenda Liao,Changliang Liu,Matthew Wiesner,Leibny Paola Garcia,Daniel Povey,Sanjeev Khudanpur,Jian Wu

Main category: cs.CL

TL;DR: Proposes WST, a weakly supervised transducer that maintains speech recognition performance when trained on transcripts with high error rates, without additional confidence estimation or pre-trained models.

Details Motivation: RNN-T depends on large amounts of high-quality annotated data, which is costly to obtain, so a method that can train on weakly labelled or erroneous transcripts is needed. Method: A flexible training graph is integrated into the Transducer framework so that it handles transcription errors robustly, without auxiliary models or confidence calibration. Result: On synthetic and industrial datasets, WST maintains good performance even at transcription error rates of up to 70% and outperforms existing CTC-based weakly supervised methods such as BTC and OTC. Conclusion: WST is robust and practically useful for end-to-end speech recognition in realistic settings. Abstract: The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical utility and robustness of WST in realistic ASR settings. The implementation will be publicly available.

[14] T-FIX: Text-Based Explanations with Features Interpretable to eXperts

Shreya Havaldar,Helen Jin,Chaehyeon Kim,Anton Xue,Weiqiu You,Marco Gatti,Bhuvnesh Jain,Helen Qu,Daniel A Hashimoto,Amin Madani,Rajat Deo,Sameed Ahmed M. Khatana,Gary E. Weissman,Lyle Ungar,Eric Wong

Main category: cs.CL

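TL;DR: Introduces T-FIX, a benchmark for evaluating how well LLM explanations in knowledge-intensive domains agree with expert judgment.

Details Motivation: Existing explanation evaluation focuses on plausibility or internal faithfulness and does not capture whether an explanation matches the intuition of domain experts, so a new criterion for expert alignment is needed. Method: In collaboration with experts from several knowledge-intensive fields, the authors build the T-FIX benchmark spanning seven domains and design new metrics that quantify the alignment between LLM explanations and expert judgment. Result: The T-FIX benchmark and its new metrics measure how well LLM-generated explanations align with the expert point of view. Conclusion: T-FIX provides an evaluation standard closer to what real experts need, shifting explanation evaluation from "seemingly plausible" toward "consistent with expert reasoning". Abstract: As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users expect not just answers, but also meaningful explanations for those answers. In these settings, users are often domain experts (e.g., doctors, astrophysicists, psychologists) who require explanations that reflect expert-level reasoning. However, current evaluation schemes primarily emphasize plausibility or internal faithfulness of the explanation, which fail to capture whether the content of the explanation truly aligns with expert intuition. We formalize expert alignment as a criterion for evaluating explanations with T-FIX, a benchmark spanning seven knowledge-intensive domains. In collaboration with domain experts, we develop novel metrics to measure the alignment of LLM explanations with expert judgment.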

[15] Plan of Knowledge: Retrieval-Augmented Large Language Models for Temporal Knowledge Graph Question Answering

Xinying Qian,Ying Zhang,Yu Zhao,Baohang Zhou,Xuhui Sui,Xiaojie Yuan

Main category: cs.CL

TL;DR: Proposes PoK, a framework that combines knowledge planning with contrastive temporal retrieval to improve the reasoning accuracy and interpretability of LLMs in temporal knowledge graph question answering.

Details Motivation: Existing methods struggle to fully understand the semantics of complex time constraints in time-sensitive questions, and LLMs suffer from hallucination and missing knowledge, which limits their temporal reasoning. Method: The Plan of Knowledge module decomposes a complex temporal question into a sequence of sub-objectives, and a Temporal Knowledge Store (TKS) with a contrastive retrieval mechanism retrieves facts that are semantically and temporally aligned. Result: Experiments on four benchmark datasets show that PoK significantly improves retrieval precision and reasoning accuracy, surpassing state-of-the-art methods by up to 56.0%. Conclusion: Combining structured planning with temporal knowledge retrieval effectively improves the factual consistency and reasoning performance of LLMs on TKGQA. Abstract: Temporal Knowledge Graph Question Answering (TKGQA) aims to answer time-sensitive questions by leveraging factual information from Temporal Knowledge Graphs (TKGs). While previous studies have employed pre-trained TKG embeddings or graph neural networks to inject temporal knowledge, they fail to fully understand the complex semantic information of time constraints. Recently, Large Language Models (LLMs) have shown remarkable progress, benefiting from their strong semantic understanding and reasoning generalization capabilities. However, their temporal reasoning ability remains limited. LLMs frequently suffer from hallucination and a lack of knowledge. To address these limitations, we propose the Plan of Knowledge framework with a contrastive temporal retriever, which is named PoK. Specifically, the proposed Plan of Knowledge module decomposes a complex temporal question into a sequence of sub-objectives from the pre-defined tools, serving as intermediate guidance for reasoning exploration. In parallel, we construct a Temporal Knowledge Store (TKS) with a contrastive retrieval framework, enabling the model to selectively retrieve semantically and temporally aligned facts from TKGs. By combining structured planning with temporal knowledge retrieval, PoK effectively enhances the interpretability and factual consistency of temporal reasoning. Extensive experiments on four benchmark TKGQA datasets demonstrate that PoK significantly improves the retrieval precision and reasoning accuracy of LLMs, surpassing the performance of the state-of-the-art TKGQA methods by 56.0% at most.

[16] The truth is no diaper: Human and AI-generated associations to emotional words

Špela Vintar,Jan Jona Javoršek

Main category: cs.CL

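TL;DR: Compares how humans and large language models form associations to emotional words; the overlap is moderate, but LLM associations are more predictable, less creative, and tend to amplify the emotional load.

Details Motivation: To investigate whether large language models generate word associations in a way similar to humans, particularly for emotional words, in order to understand how well they simulate creativity and psychological mechanisms. Method: The associative responses of human participants and large language models to emotional words are compared, analysing association patterns, changes in emotional intensity, and level of creativity. Result: The overlap between human and LLM associations is moderate; LLM associations are more predictable but less creative, and tend to amplify the emotional colouring of the input word. Conclusion: Large language models show some human-like behaviour in word association but differ markedly in creativity and emotional regulation, which points to the mechanical nature and limits of their associative process. Abstract: Human word associations are a well-known method of gaining insight into the internal mental lexicon, but the responses spontaneously offered by human participants to word cues are not always predictable as they may be influenced by personal experience, emotions or individual cognitive styles. The ability to form associative links between seemingly unrelated concepts can be a driving mechanism of creativity. We compare the associative behaviour of humans with that of large language models. More specifically, we explore associations to emotionally loaded words and try to determine whether large language models generate associations in a similar way to humans. We find that the overlap between humans and LLMs is moderate, but also that the associations of LLMs tend to amplify the underlying emotional load of the stimulus, and that they tend to be more predictable and less creative than human ones.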

[17] Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods

Eva Prakash,Maayane Attias,Pierre Chambon,Justin Xu,Steven Truong,Jean-Benoit Delbrouck,Tessa Cook,Curtis Langlotz

Main category: cs.CL

TL;DR: Using large-scale training data and cross-institutional validation, this study improves a transformer-based de-identification model for radiology reports and surpasses existing academic and commercial systems at detecting protected health information (PHI).

Details Motivation: To improve the accuracy and generalization of automated de-identification of protected health information (PHI) in radiology reports and to address the limits of existing methods on cross-institutional data. Method: A state-of-the-art transformer model is fine-tuned on two large annotated radiology corpora from Stanford University covering several imaging types, and a new AGE category is added; a "hide-in-plain-sight" method generates synthetic PHI to assess stability, and the model is tested on a University of Pennsylvania dataset and compared with commercial cloud services. Result: Overall F1 scores are 0.973 on the Penn test set and 0.996 on the Stanford test set, matching or exceeding the previous best model; synthetic PHI detection reaches an F1 of 0.959 and is stable across 50 independently de-identified datasets; on synthetic data the model clearly outperforms all commercial systems (F1: 0.960 vs. 0.632-0.754). Conclusion: A transformer-based de-identification model trained on diverse radiology data performs strongly at PHI detection, generalizes well across institutions, protects privacy, and sets a new benchmark for secure clinical text processing. Abstract: Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models through extensive training datasets and benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection. Materials and Methods: In this retrospective study, we built upon a state-of-the-art, transformer-based, PHI de-identification pipeline by fine-tuning on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports and introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a "hide-in-plain-sight" method and (2) performance against commercial systems. Precision, recall, and F1 scores were computed across all PHI categories. Results: Our model achieved overall F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset, outperforming or maintaining the previous state-of-the-art model performance. Synthetic PHI evaluation showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50 independently de-identified Penn datasets. Our model outperformed all vendor systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754). Discussion: Large-scale, multimodal training improved cross-institutional generalization and robustness. Synthetic PHI generation preserved data utility while ensuring privacy. Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms prior academic and commercial systems in PHI detection and establishes a new benchmark for secure clinical text processing.

[18] A Characterization of List Language Identification in the Limit

Moses Charikar,Chirag Pabbaraju,Ambuj Tewari

Main category: cs.CL

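TL;DR: This paper studies language identification in the limit when the learner may output a list of k guesses. It gives an exact characterization of the language collections that are k-list identifiable and proves that this is equivalent to decomposing the collection into k identifiable sub-collections. In the statistical setting, identifiable collections admit an exponential rate, while non-identifiable collections admit no rate that goes to zero.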

Details Motivation: Inspired by recent positive results for language generation, the paper revisits the classic language identification problem and asks whether letting the learner output k guesses at each step can get around the limits of Gold's theorem and make broader classes of languages identifiable. Method: Building on Angluin's characterization of single-guess identification, the authors formulate a recursive version of the condition that is necessary and sufficient for k-list identification in the limit; they then give a conceptually simple equivalent characterization in terms of decomposing the collection into k identifiable sub-collections, and extend the analysis to identification rates in the statistical setting. Result: (1) An exact characterization of the language collections that are k-list identifiable in the limit; (2) a proof that this condition is equivalent to decomposability into k identifiable sub-collections; (3) in the i.i.d. statistical setting, a k-list identifiable collection can be identified at an exponential rate, and this is optimal; (4) a collection that is not k-list identifiable cannot be identified at any rate that goes to zero. Conclusion: Allowing a list of k guesses substantially widens the scope of classical language identification, yields clear theoretical boundaries, and reveals a fundamental link between identifiability and convergence rate. Abstract: We study the problem of language identification in the limit, where given a sequence of examples from a target language, the goal of the learner is to output a sequence of guesses for the target language such that all the guesses beyond some finite time are correct. Classical results of Gold showed that language identification in the limit is impossible for essentially any interesting collection of languages. Later, Angluin gave a precise characterization of language collections for which this task is possible. Motivated by recent positive results for the related problem of language generation, we revisit the classic language identification problem in the setting where the learner is given the additional power of producing a list of $k$ guesses at each time step. The goal is to ensure that beyond some finite time, one of the guesses is correct at each time step. We give an exact characterization of collections of languages that can be $k$-list identified in the limit, based on a recursive version of Angluin's characterization (for language identification with a list of size $1$). This further leads to a conceptually appealing characterization: A language collection can be $k$-list identified in the limit if and only if the collection can be decomposed into $k$ collections of languages, each of which can be identified in the limit (with a list of size $1$). We also use our characterization to establish rates for list identification in the statistical setting where the input is drawn as an i.i.d. stream from a distribution supported on some language in the collection. Our results show that if a collection is $k$-list identifiable in the limit, then the collection can be $k$-list identified at an exponential rate, and this is best possible. On the other hand, if a collection is not $k$-list identifiable in the limit, then it cannot be $k$-list identified at any rate that goes to zero.

[19] Batch Prompting Suppresses Overthinking Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models

Wenmo Qiu,Saurabh Srivastava

Main category: cs.CL

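TL;DR: Batching not only makes inference with large reasoning models more efficient, it also acts as an inference-time regularizer: it suppresses overthinking, reduces hedging language, and encourages more decisive answers, while improving accuracy across multiple benchmarks and cutting reasoning-token usage by 3-5x.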

Details Motivation: To explore the potential benefits of batch prompting in large language models, in particular its regularizing effect on model behavior during multi-step reasoning, rather than treating it only as a way to speed up inference. Method: A comprehensive study across 13 diverse benchmarks analyzes how batching affects reasoning accuracy, token usage, and model behavior (such as overthinking and self-correction), and examines collective generalization effects within a batch. Result: Batching improves reasoning accuracy, reduces reasoning-token consumption by 3-5x, suppresses overthinking and hedging language, and makes answers more decisive; models are also observed to generalize from earlier examples in a batch to solve harder ones in the same batch. Conclusion: Batching is not just a throughput optimization but an effective inference-time regularizer that improves the efficiency and reliability of large reasoning models. Abstract: Recent work has explored batch prompting as a strategy to amortize inference cost in large language models (LLMs). In this paper, we show that batching offers an additional, underappreciated benefit: it regularizes model behavior during multi-step reasoning for Large Reasoning Models (LRMs). We conduct a comprehensive study across 13 diverse benchmarks and observe that batching improves accuracy while substantially reducing reasoning token usage, often by 3x-5x. Through detailed behavioral analysis, we find that batching suppresses overthinking, reduces hedging language (e.g., repetitive self-corrections), and encourages more decisive answers. Surprisingly, we also observe emergent collective effects in batched inference: models often generalize patterns from earlier examples to solve harder ones in the same batch. These findings position batching not just as a throughput optimization, but as a powerful inference-time regularizer for more efficient and reliable LLM reasoning.

[20] RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning

Xinyuan Li,Murong Xu,Wenbiao Tao,Hanlun Zhu,Yike Zhao,Jipeng Zhang,Yunshi Lan

Main category: cs.CL

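TL;DR: Proposes RIDE, a framework that uses Item Response Theory (IRT) and reinforcement learning to generate more challenging variants of mathematical problems, enabling a stricter evaluation of the mathematical reasoning ability of large language models.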

Details Motivation: Existing rule-based adversarial perturbation methods often produce ill-posed questions, make it hard to evaluate question difficulty systematically, and may overestimate a model's true mathematical reasoning ability. Method: The RIDE framework uses 35 LLMs to simulate student responses, builds an IRT-based difficulty ranker from them, and uses reinforcement learning to guide a question-rewriting model that generates new questions across difficulty levels. Result: After applying RIDE to competition-level mathematical benchmarks, the performance of 26 advanced LLMs drops by 21.73% on average, indicating limited robustness in their mathematical reasoning. Conclusion: RIDE generates high-quality, more challenging questions and offers a reliable, scalable adversarial evaluation method for assessing the true mathematical reasoning ability of large language models. Abstract: Large language models (LLMs) achieve high performance on mathematical reasoning, but these results can be inflated by training data leakage or superficial pattern matching rather than genuine reasoning. To this end, an adversarial perturbation-based evaluation is needed to measure true mathematical reasoning ability. Current rule-based perturbation methods often generate ill-posed questions and impede the systematic evaluation of question difficulty and the evolution of benchmarks. To bridge this gap, we propose RIDE, a novel adversarial question-rewriting framework that leverages Item Response Theory (IRT) to rigorously measure question difficulty and to generate intrinsically more challenging, well-posed variations of mathematical problems. We employ 35 LLMs to simulate students and build a difficulty ranker from their responses. This ranker provides a reward signal during reinforcement learning and guides a question-rewriting model to reformulate existing questions across difficulty levels. Applying RIDE to competition-level mathematical benchmarks yields perturbed versions that degrade advanced LLM performance, with experiments showing an average 21.73% drop across 26 models, thereby exposing limited robustness in mathematical reasoning and confirming the validity of our evaluation approach.
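
To make the Item Response Theory idea more concrete, the following is a minimal sketch of how responses from simulated students could be turned into a difficulty ranking. The abstract does not say which IRT variant RIDE uses or how it is fitted, so the Rasch (one-parameter) form, the function names `rasch_prob` and `rank_by_difficulty`, the logit-based difficulty approximation, and the toy response matrix are all assumptions made for illustration.

```python
import math


def rasch_prob(ability: float, difficulty: float) -> float:
    """P(correct) under the Rasch (1PL) model: sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def rank_by_difficulty(responses: list[list[int]]) -> list[int]:
    """Return question indices ordered from easiest to hardest.

    responses[s][q] is 1 if simulated student s answered question q correctly.
    Difficulty is approximated by the negative logit of the smoothed
    proportion correct, a crude stand-in for a properly fitted IRT parameter.
    """
    n_students = len(responses)
    n_questions = len(responses[0])
    difficulties = []
    for q in range(n_questions):
        correct = sum(row[q] for row in responses)
        p = (correct + 0.5) / (n_students + 1.0)  # smoothing avoids log(0)
        difficulties.append(-math.log(p / (1.0 - p)))
    return sorted(range(n_questions), key=lambda q: difficulties[q])


# Hypothetical toy data: 4 simulated students, 3 questions.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]
ranking = rank_by_difficulty(responses)
print(ranking)  # question indices, easiest first
```

In a pipeline like the one described, a ranker of this kind would supply the reward signal for the rewriting model; a real system would fit abilities and difficulties jointly rather than use the per-question shortcut above.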

[21] CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese

Dazhong Chen,Yi-Cheng Lin,Yuchen Huang,Ziwei Gong,Di Jiang,Zeying Xie,Yi R. Fung

Main category: cs.CL

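TL;DR: Proposes CantoASR, a collaborative framework that combines ASR with a large audio-language model. It corrects Cantonese speech recognition errors using forced alignment, a LoRA-finetuned Whisper, and an instruction-tuned Qwen-Audio, substantially improving recognition accuracy in a low-resource dialect setting.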

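Details Motivation: Cantonese is a low-resource language with little annotated data, six lexical tones, tone sandhi, and accent variation, so existing ASR systems such as Whisper have high word error rates and struggle to handle tonal and prosodic information. Method: The CantoASR framework (1) uses forced alignment to extract acoustic features; (2) finetunes Whisper with LoRA to improve tone discrimination; and (3) uses an instruction-tuned Qwen-Audio for prosody-aware error correction, so that the ASR model and the large audio-language model correct errors collaboratively. Result: Experiments on spontaneous Cantonese data show that CantoASR substantially lowers the character error rate (CER) relative to Whisper-Large-V3, supporting the value of combining acoustic cues with the contextual reasoning of a large model. Conclusion: Combining explicit acoustic features (such as tone and prosody) with the contextual reasoning of large audio-language models offers a scalable and effective approach to speech recognition for low-resource tonal languages and dialects. Abstract: Automatic speech recognition (ASR) is critical for language accessibility, yet low-resource Cantonese remains challenging due to limited annotated data, six lexical tones, tone sandhi, and accent variation. Existing ASR models, such as Whisper, often suffer from high word error rates. Large audio-language models (LALMs), in contrast, can leverage broader contextual reasoning but still require explicit tonal and prosodic acoustic cues. We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. Evaluations on spontaneous Cantonese data show substantial CER gains over Whisper-Large-V3. These findings suggest that integrating acoustic cues with LALM reasoning provides a scalable strategy for low-resource tonal and dialectal ASR.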

[22] BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation

Fahim Ahmed,Md Mubtasim Ahasan,Jahir Sadik Monon,Muntasir Wahed,M Ashraful Amin,A K M Mahbubur Rahman,Amin Ahsan Ali

Main category: cs.CL

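TL;DR: This paper explores three multi-agent LLM pipelines for improving the Text-to-SQL performance of small models; experiments show that multi-agent collaboration can raise execution accuracy.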

Details Motivation: Existing large language models struggle to generate SQL from natural language, especially with large schemas and complex reasoning, and most prior work focuses on complex, impractical pipelines while overlooking the potential of small, efficient models. Method: Three multi-agent LLM pipelines are proposed (a multi-agent discussion pipeline, a Planner-Coder pipeline, and a Coder-Aggregator pipeline) and benchmarked systematically across several open-source models. Result: Experiments on the Bird-Bench Mini-Dev set show that multi-agent discussion raises the execution accuracy of small models such as Qwen2.5-7b-Instruct by up to 10.6%; the LLM Reasoner-Coder pipeline performs best, with DeepSeek-R1-32B and QwQ-32B as planners lifting Gemma 3 27B IT accuracy from 52.4% to 56.4%. Conclusion: Multi-agent collaboration can effectively improve small models on Text-to-SQL and offers a practical route to efficient text-to-SQL systems. Abstract: Text-to-SQL systems provide a natural language interface that can enable even laymen to access information stored in databases. However, existing Large Language Models (LLMs) struggle with SQL generation from natural instructions due to large schema sizes and complex reasoning. Prior work often focuses on complex, somewhat impractical pipelines using flagship models, while smaller, efficient models remain overlooked. In this work, we explore three multi-agent LLM pipelines, with systematic performance benchmarking across a range of small to large open-source models: (1) Multi-agent discussion pipeline, where agents iteratively critique and refine SQL queries, and a judge synthesizes the final answer; (2) Planner-Coder pipeline, where a thinking model planner generates stepwise SQL generation plans and a coder synthesizes queries; and (3) Coder-Aggregator pipeline, where multiple coders independently generate SQL queries, and a reasoning agent selects the best query. Experiments on the Bird-Bench Mini-Dev set reveal that Multi-Agent discussion can improve small model performance, with up to 10.6% increase in Execution Accuracy for Qwen2.5-7b-Instruct seen after three rounds of discussion. Among the pipelines, the LLM Reasoner-Coder pipeline yields the best results, with DeepSeek-R1-32B and QwQ-32B planners boosting Gemma 3 27B IT accuracy from 52.4% to the highest score of 56.4%. Codes are available at https://github.com/treeDweller98/bappa-sql.
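
Execution Accuracy, the metric quoted above, is usually understood as the fraction of predicted queries whose results match those of the reference query when both are run against the database. The sketch below illustrates that idea with Python's standard `sqlite3` module. It is not the benchmark's official scorer: the helper names `execution_match` and `execution_accuracy`, the order-insensitive row comparison, and the toy `players` table are assumptions for illustration.

```python
import sqlite3


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if the predicted query runs and returns the same rows as the gold query."""
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare rows regardless of order; repr gives a total order over mixed types.
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)


def execution_accuracy(conn: sqlite3.Connection, pairs: list[tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs whose execution results match."""
    matches = [execution_match(conn, pred, gold) for pred, gold in pairs]
    return sum(matches) / len(matches)


# Hypothetical toy database and query pairs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("Ann", 30), ("Bob", 12), ("Cy", 25)],
)
pairs = [
    # Different SQL text, same result: counted as correct.
    ("SELECT name FROM players WHERE points > 20",
     "SELECT name FROM players WHERE points >= 25"),
    # Runs, but returns a different count: counted as wrong.
    ("SELECT COUNT(*) FROM players",
     "SELECT COUNT(*) FROM players WHERE points > 20"),
    # Misspelled column, fails to execute: counted as wrong.
    ("SELECT nme FROM players", "SELECT name FROM players"),
]
accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.3f}")
```

Comparing results rather than SQL text is what lets a metric of this kind credit queries that are written differently but answer the question correctly.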

[23] Trustworthy LLM-Mediated Communication: Evaluating Information Fidelity in LLM as a Communicator (LAAC) Framework in Multiple Application Domains

Mohammed Musthafa Rafi,Adarsh Krishnamurthy,Aditya Balu

Main category: cs.CL

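TL;DR: This paper proposes the LAAC (LLM as a Communicator) framework, which uses structured dialogue to capture the sender's intent and support genuine knowledge exchange, breaking the cycle of AI-generated inflation and compression. Before deployment in high-stakes settings, however, trust issues around information fidelity, reproducibility, and response reliability still need to be resolved.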

Details Motivation: As AI-generated content proliferates, communication falls into a hollow loop of verbose generation followed by automatic summarization in which neither side engages with authentic content; a new, trustworthy communication paradigm that conveys intent accurately is needed. Method: The paper proposes the LAAC multi-agent architecture and systematically evaluates three trust dimensions (information capture fidelity, reproducibility, and query response integrity) through controlled experiments across several communication scenarios. Result: The experiments reveal trust gaps in the current LAAC across communication types, with notable challenges in the accuracy of intent extraction, the consistency of knowledge representation, and the avoidance of hallucination. Conclusion: LAAC has the potential to reshape communication in the AI era, but it can serve as a trusted intermediary only if rigorous trust safeguards are in place; further work is needed on reliability and stability in critical scenarios. Abstract: The proliferation of AI-generated content has created an absurd communication theater where senders use LLMs to inflate simple ideas into verbose content, recipients use LLMs to compress them back into summaries, and as a consequence neither party engages with authentic content. LAAC (LLM as a Communicator) proposes a paradigm shift - positioning LLMs as intelligent communication intermediaries that capture the sender's intent through structured dialogue and facilitate genuine knowledge exchange with recipients. Rather than perpetuating cycles of AI-generated inflation and compression, LAAC enables authentic communication across diverse contexts including academic papers, proposals, professional emails, and cross-platform content generation. However, deploying LLMs as trusted communication intermediaries raises critical questions about information fidelity, consistency, and reliability. This position paper systematically evaluates the trustworthiness requirements for LAAC's deployment across multiple communication domains. We investigate three fundamental dimensions: (1) Information Capture Fidelity - accuracy of intent extraction during sender interviews across different communication types, (2) Reproducibility - consistency of structured knowledge across multiple interaction instances, and (3) Query Response Integrity - reliability of recipient-facing responses without hallucination, source conflation, or fabrication. Through controlled experiments spanning multiple LAAC use cases, we assess these trust dimensions using LAAC's multi-agent architecture. Preliminary findings reveal measurable trust gaps that must be addressed before LAAC can be reliably deployed in high-stakes communication scenarios.

[24] Computational Turing Test Reveals Systematic Differences Between Human and AI Language

Nicolò Pagan,Petter Törnberg,Christopher A. Bail,Anikó Hannák,Christopher Barrie

Main category: cs.CL

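TL;DR: This paper proposes a computational Turing test framework for measuring how closely LLM-generated text approximates human language, and compares nine open-weight LLMs under different calibration strategies. Even after calibration, LLM output remains clearly distinguishable from human text, especially in emotional expression.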

Details Motivation: Existing research on LLMs simulating human behavior rests on insufficiently validated assumptions and relies mainly on unreliable human judgment for evaluation; effective quantitative tools for measuring the realism of LLM-generated text are lacking. Method: The paper proposes a computational Turing test framework that combines aggregate metrics (such as BERT-based detectability and semantic similarity) with interpretable linguistic features (such as stylistic markers and topical patterns), and systematically evaluates nine open-weight LLMs under five calibration strategies on user-interaction data from X, Bluesky, and Reddit. Result: Calibrated LLMs remain clearly distinguishable from human text, especially in affective tone and emotional expression; instruction-tuned models underperform base models, larger model size does not improve human-likeness, and there is a trade-off between human-likeness and semantic fidelity. Conclusion: Current LLMs have limitations in simulating human communication and should be used with caution; the study provides a scalable validation and calibration framework that lays groundwork for future use of LLMs in the social sciences. Abstract: Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, based on the assumption that they can generate realistic, human-like text. Yet this assumption remains largely untested. Existing validation efforts rely heavily on human-judgment-based evaluations -- testing whether humans can distinguish AI from human output -- despite evidence that such judgments are blunt and unreliable. As a result, the field lacks robust tools for assessing the realism of LLM-generated text or for calibrating models to real-world data. This paper makes two contributions. First, we introduce a computational Turing test: a validation framework that integrates aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely LLMs approximate human language within a given dataset. Second, we systematically compare nine open-weight LLMs across five calibration strategies -- including fine-tuning, stylistic prompting, and context retrieval -- benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the literature. Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not enhance human-likeness. Crucially, we identify a trade-off: optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa. These results provide a much-needed scalable framework for validation and calibration in LLM simulations -- and offer a cautionary note about their current limitations in capturing human communication.

[25] LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal

Michał Karp,Anna Kubaszewska,Magdalena Król,Robert Król,Aleksander Smywiński-Pohl,Mateusz Szymański,Witold Wydmański

Main category: cs.CL

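TL;DR: This study assesses whether current large language models (LLMs) can pass the official qualifying examination for Poland's National Appeal Chamber. The models score acceptably on the knowledge test, but none reaches the passing threshold on the written judgment, and automatic "LLM-as-a-judge" scoring diverges from the official grading.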

Details Motivation: To explore the potential of large language models in legal qualifying examinations and to assess their reliability and limitations on real legal decision-making tasks. Method: LLMs sit the exam as candidates, and the 'LLM-as-a-judge' approach is used to score answers automatically; a hybrid information retrieval and extraction pipeline is built, and several LLMs are tested in closed-book and various Retrieval-Augmented Generation (RAG) settings. Result: The LLMs achieve satisfactory scores on the multiple-choice knowledge test, but none meets the passing standard on the written judgment, and automatic LLM scoring often diverges from the official committee's grading. Conclusion: Current large language models cannot yet replace human judges or independent examiners in Polish public procurement adjudication, mainly because of hallucinations, incorrect citation of legal provisions, and weak logical argumentation; close collaboration between legal experts and technical teams is needed. Abstract: This study provides an empirical assessment of whether current large language models (LLMs) can pass the official qualifying examination for membership in Poland's National Appeal Chamber (Krajowa Izba Odwoławcza). The authors examine two related ideas: using LLMs as actual exam candidates and applying the 'LLM-as-a-judge' approach, in which model-generated answers are automatically evaluated by other models. The paper describes the structure of the exam, which includes a multiple-choice knowledge test on public procurement law and a written judgment, and presents the hybrid information recovery and extraction pipeline built to support the models. Several LLMs (including GPT-4.1, Claude 4 Sonnet and Bielik-11B-v2.6) were tested in closed-book and various Retrieval-Augmented Generation settings. The results show that although the models achieved satisfactory scores in the knowledge test, none met the passing threshold in the practical written part, and the evaluations of the 'LLM-as-a-judge' often diverged from the judgments of the official examining committee. The authors highlight key limitations: susceptibility to hallucinations, incorrect citation of legal provisions, weaknesses in logical argumentation, and the need for close collaboration between legal experts and technical teams. The findings indicate that, despite rapid technological progress, current LLMs cannot yet replace human judges or independent examiners in Polish public procurement adjudication.

[26] REMIND: Input Loss Landscapes Reveal Residual Memorization in Post-Unlearning LLMs

Liran Cohen,Yaniv Nemcovesky,Avi Mendelson

Main category: cs.CL

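TL;DR: This paper proposes REMIND, a new method for evaluating machine unlearning. It detects the residual influence of unlearned data by analyzing how the model's loss behaves under small input variations, and is more sensitive and robust than traditional single-point evaluation.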

Details Motivation: Existing unlearning evaluations typically operate at the level of individual inputs and may miss residual memorization in semantically similar examples, which can leak private information; a finer-grained, more reliable evaluation is needed. Method: REMIND queries the model's loss over the neighborhood of the target data and uses the shape of the loss landscape around it to classify whether the data has been effectively forgotten; it needs only query-based access and applies across models and datasets. Result: Experiments show that unlearned data yield flatter loss landscapes, while retained or unrelated data show sharper, more volatile patterns; REMIND outperforms existing methods across models, datasets, and paraphrased inputs and is robust. Conclusion: REMIND provides a more sensitive, interpretable, and practical framework for evaluating unlearning, helping to make language models more trustworthy with respect to privacy and compliance. Abstract: Machine unlearning aims to remove the influence of specific training data from a model without requiring full retraining. This capability is crucial for ensuring privacy, safety, and regulatory compliance. Therefore, verifying whether a model has truly forgotten target data is essential for maintaining reliability and trustworthiness. However, existing evaluation methods often assess forgetting at the level of individual inputs. This approach may overlook residual influence present in semantically similar examples. Such influence can compromise privacy and lead to indirect information leakage. We propose REMIND (Residual Memorization In Neighborhood Dynamics), a novel evaluation method aiming to detect the subtle remaining influence of unlearned data and classify whether the data has been effectively forgotten. REMIND analyzes the model's loss over small input variations and reveals patterns unnoticed by single-point evaluations. We show that unlearned data yield flatter, less steep loss landscapes, while retained or unrelated data exhibit sharper, more volatile patterns. REMIND requires only query-based access, outperforms existing methods under similar constraints, and demonstrates robustness across different models, datasets, and paraphrased inputs, making it practical for real-world deployment. By providing a more sensitive and interpretable measure of unlearning effectiveness, REMIND provides a reliable framework to assess unlearning in language models. As a result, REMIND offers a novel perspective on memorization and unlearning.

[27] Reusing Pre-Training Data at Test Time is a Compute Multiplier

Alex Fang,Thomas Voice,Ruoming Pang,Ludwig Schmidt,Tom Gunter

Main category: cs.CL

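TL;DR: The study shows that current pre-training methods do not fully exploit the information in existing datasets: combining retrieval-augmented generation with test-time compute yields significant gains, indicating substantial room to improve pre-training efficiency.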

Details Motivation: Although large language models acquire the ability to solve many tasks through large-scale pre-training, there is little understanding of how efficiently pre-training extracts knowledge from the data, so the dataset value left behind by pre-training needs to be quantified. Method: Retrieval-augmented generation (RAG) combined with test-time compute is used to estimate, at different scales, how much dataset value pre-training fails to use; results are validated on MMLU, Math-500, and SimpleQA, with a decontamination analysis to confirm they hold. Result: On standard, largely open-source datasets, retrieval after pre-training significantly improves accuracy on MMLU, Math-500, and SimpleQA; on MMLU, retrieval acts as a roughly 5x compute multiplier; using additional test-time compute to parse the retrieved context gives the LLaMA 3.1 8B model a 10 percentage point improvement on MMLU. Conclusion: Current pre-training methods do not fully extract the knowledge in existing datasets; combining retrieval with test-time compute recovers some of the value left behind, suggesting that better training or inference strategies could yield significant further gains. Abstract: Large language models learn from their vast pre-training corpora, gaining the ability to solve an ever increasing variety of tasks; yet although researchers work to improve these datasets, there is little effort to understand how efficient the pre-training apparatus is at extracting ideas and knowledge from the data. In this work, we use retrieval augmented generation along with test-time compute as a way to quantify how much dataset value was left behind by the process of pre-training, and how this changes across scale. We demonstrate that pre-training then retrieving from standard and largely open-sourced datasets results in significant accuracy gains in MMLU, Math-500, and SimpleQA, which persist through decontamination. For MMLU we observe that retrieval acts as a ~5x compute multiplier versus pre-training alone. We show that these results can be further improved by leveraging additional compute at test time to parse the retrieved context, demonstrating a 10 percentage point improvement on MMLU for the public LLaMA 3.1 8B model. Overall, our results suggest that today's pre-training methods do not make full use of the information in existing pre-training datasets, leaving significant room for progress.

[28] Efficient Topic Extraction via Graph-Based Labeling: A Lightweight Alternative to Deep Models

Salma Mekaoui,Hiba Sofyan,Imane Amaaz,Imane Benchrif,Arsalane Zarghili,Ilham Chaker,Nikola S. Nikolov

Main category: cs.CL

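TL;DR: This paper proposes an efficient graph-based topic labeling method that generates interpretable labels for topic models through semantic expansion and relationship analysis. It outperforms traditional benchmarks and comes close to ChatGPT-3.5 while keeping computational cost low.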

Details Motivation: The topic words produced by existing topic models lack interpretability, and many advanced methods are computationally expensive; a low-resource yet effective topic labeling approach is needed. Method: A graph-based approach expands topic words with semantically related terms, builds a graph from the relationships among the words, and extracts meaningful topic labels from it. Result: Compared with several benchmarks (including ChatGPT-3.5) on two datasets, the method beats traditional benchmarks on BERTScore and cosine similarity, produces results comparable to ChatGPT-3.5, and is more computationally efficient. Conclusion: The proposed graph method improves the interpretability and accuracy of topic labels at low computational cost, has practical potential, and points to directions for future research. Abstract: Extracting topics from text has become an essential task, especially with the rapid growth of unstructured textual data. Most existing works rely on highly computational methods to address this challenge. In this paper, we argue that probabilistic and statistical approaches, such as topic modeling (TM), can offer effective alternatives that require fewer computational resources. TM is a statistical method that automatically discovers topics in large collections of unlabeled text; however, it produces topics as distributions of representative words, which often lack clear interpretability. Our objective is to perform topic labeling by assigning meaningful labels to these sets of words. To achieve this without relying on computationally expensive models, we propose a graph-based approach that not only enriches topic words with semantically related terms but also explores the relationships among them. By analyzing these connections within the graph, we derive suitable labels that accurately capture each topic's meaning. We present a comparative study between our proposed method and several benchmarks, including ChatGPT-3.5, across two different datasets. Our method achieved consistently better results than traditional benchmarks in terms of BERTScore and cosine similarity and produced results comparable to ChatGPT-3.5, while remaining computationally efficient. Finally, we discuss future directions for topic labeling and highlight potential research avenues for enhancing interpretability and automation.

[29] SSPO: Subsentence-level Policy Optimization

Kun Yang,Zikang chen,Yanmeng Wang,Zhigen Li

Main category: cs.CL

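TL;DR: This paper proposes SSPO, a new reinforcement learning method for improving the reasoning ability of large language models. By using a sentence-level importance ratio, it strikes a balance between GRPO and GSPO, avoiding training collapse, high variance, and poor utilization of sampled data. It also uses sentence entropy to adjust the PPO-CLIP clipping bounds dynamically, and outperforms existing methods across several datasets.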

Details Motivation: Existing RLVR methods such as GRPO and GSPO fall short in either update stability or sample efficiency: GRPO's token-level importance ratio is sensitive to outliers and can cause training collapse, while GSPO lowers variance but shares one ratio across the whole response, so extreme values skew the overall judgment and sampled data are wasted. A more balanced method is needed. Method: SSPO uses a sentence-level importance ratio, which avoids the high variance of token-level ratios while preventing an entire response from being discarded because of a single extreme value; it also introduces sentence entropy into PPO-CLIP to adapt the clipping bounds, encouraging high-entropy tokens to explore and narrowing the update range of low-entropy tokens. Result: SSPO achieves an average score of 46.57 across five datasets, ahead of GRPO (43.01) and GSPO (44.42), and reaches state-of-the-art performance on three of them, supporting its advantages in data utilization and training stability. Conclusion: With a sentence-level importance ratio and entropy-based dynamic clipping, SSPO keeps the strengths of GSPO while overcoming its weaknesses, improving the utilization of sampled data while keeping training stable; it is an effective way to improve LLM reasoning within the RLVR framework. Abstract: As a significant part of post-training of the Large Language Models (LLMs), Reinforcement Learning from Verifiable Reward (RLVR) has greatly improved LLMs' reasoning skills. However, some RLVR algorithms, such as GRPO (Group Relative Policy Optimization) and GSPO (Group Sequence Policy Optimization), are observed to suffer from unstable policy updates and low usage of sampling data, respectively. The importance ratio of GRPO is calculated at the token level, which focuses more on optimizing a single token. This will be easily affected by outliers, leading to model training collapse. GSPO proposed the calculation of the response level importance ratio, which solves the problem of high variance and training noise accumulation in the calculation of the GRPO importance ratio. However, since all the response tokens share a common importance ratio, extreme values can easily raise or lower the overall mean, leading to the entire response being mistakenly discarded, resulting in a decrease in the utilization of sampled data. This paper introduces SSPO, which applies a sentence-level importance ratio, striking a balance between GRPO and GSPO. SSPO not only avoids training collapse and high variance, but also prevents the whole response from being abandoned by the clipping mechanism. Furthermore, we apply sentence entropy to PPO-CLIP to steadily adjust the clipping bounds, encouraging high-entropy tokens to explore and narrowing the clipping range of low-entropy tokens. In particular, SSPO achieves an average score of 46.57 across five datasets, surpassing GRPO (43.01) and GSPO (44.42), and achieves state-of-the-art performance on three datasets. These results highlight SSPO's effectiveness in leveraging generated data by keeping the strengths of GSPO while avoiding its shortcomings.
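
The difference between token-level, response-level, and sentence-level importance ratios can be illustrated with a small sketch. The abstract does not give SSPO's exact formula, so the version below is an assumption: each sentence gets the geometric mean of its token-level ratios (the length-normalized form commonly used for sequence-level ratios), and that value is shared by every token in the sentence. The function name `sentence_level_ratios`, the span representation, the clipping bounds, and the toy log-probabilities are hypothetical.

```python
import math


def sentence_level_ratios(logp_new, logp_old, sentence_spans):
    """Return one importance ratio per token, shared within each sentence.

    For each (start, end) span the ratio is exp(mean(logp_new - logp_old))
    over the tokens in that span, i.e. the geometric mean of the token-level
    ratios. This is an assumed form, not the paper's confirmed definition.
    """
    ratios = [0.0] * len(logp_new)
    for start, end in sentence_spans:
        diffs = [logp_new[t] - logp_old[t] for t in range(start, end)]
        shared = math.exp(sum(diffs) / len(diffs))
        for t in range(start, end):
            ratios[t] = shared
    return ratios


def clip(value, low, high):
    """Clamp value into [low, high], as in PPO-style ratio clipping."""
    return max(low, min(high, value))


# Hypothetical 5-token response split into two sentences: tokens 0-2 and 3-4.
logp_old = [-1.0, -1.0, -1.0, -1.0, -1.0]
logp_new = [-1.0, -0.8, -1.2, -3.0, -1.0]  # token 3 is an outlier
spans = [(0, 3), (3, 5)]

ratios = sentence_level_ratios(logp_new, logp_old, spans)
clipped = [clip(r, 0.8, 1.2) for r in ratios]  # bounds chosen for illustration
print(ratios)
print(clipped)
```

In this toy case the outlier at token 3 pushes only the second sentence's ratio to about exp(-1) ≈ 0.37, while the first sentence keeps a ratio of about 1.0 and stays inside the clipping range. A single response-level ratio over all five tokens would be exp(-0.4) ≈ 0.67, below the lower bound for every token, which is the data-utilization problem the abstract attributes to GSPO.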

[30] Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning

Mohammad Amin Ghanizadeh,Mohammad Javad Dousti

Main category: cs.CL

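TL;DR: Proposes a data selection method for fine-tuning machine translation systems, based on a learnability score and a batch selection strategy, that substantially improves data efficiency and computational efficiency.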

Details Motivation: Improving machine translation models requires high-quality, efficient data selection, and existing methods fall short in data utilization and training effectiveness. Method: A learner model and a pre-trained reference model are combined to define a learnability score that measures the training value of each data point, and a batch selection strategy that accounts for interdependencies among data points is used to filter the data. Result: Experiments on several language pairs (such as English to Persian) show up to a fivefold improvement in data efficiency over a random-selection baseline, a 24-fold gain in computational efficiency when using cached embeddings, and better generalization and translation quality. Conclusion: The proposed data selection method improves data utilization, computational efficiency, and translation performance during machine translation fine-tuning, and is practical and broadly applicable. Abstract: Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for achieving robust and reliable translation systems. This paper presents a data selection methodology specifically designed for fine-tuning machine translation systems, which leverages the synergy between a learner model and a pre-trained reference model to enhance overall training effectiveness. By defining a learnability score, our approach systematically evaluates the utility of data points for training, ensuring that only the most relevant and impactful examples contribute to the fine-tuning process. Furthermore, our method employs a batch selection strategy which considers interdependencies among data points, optimizing the efficiency of the training process while maintaining a focus on data relevance. Experiments on English to Persian and several other language pairs using an mBART model fine-tuned on the CCMatrix dataset demonstrate that our method can achieve up to a fivefold improvement in data efficiency compared to an iid baseline. Experimental results indicate that our approach improves computational efficiency by 24 when utilizing cached embeddings, as it requires fewer training data points. Additionally, it enhances generalization, resulting in superior translation performance compared to the random selection method.
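
One common way to define a learnability score from a learner and a reference model is as the gap between their per-example losses: examples the learner still gets wrong but the reference handles well are treated as the most useful to train on. The abstract does not state the paper's formula, so the sketch below assumes that loss-difference form and a simple independent top-k selection; it does not reproduce the joint batch selection that accounts for interdependencies among data points. The function names and loss values are hypothetical.

```python
def learnability_scores(learner_losses, reference_losses):
    """Assumed score: learner loss minus reference loss, per example.

    High values mean the learner has not yet fit the example while the
    reference model finds it easy, suggesting it is learnable and clean.
    """
    return [l - r for l, r in zip(learner_losses, reference_losses)]


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring examples, in index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# Hypothetical per-example losses for a candidate pool of four sentence pairs.
learner_losses = [2.5, 0.4, 3.0, 1.2]
reference_losses = [0.5, 0.3, 2.9, 0.2]

scores = learnability_scores(learner_losses, reference_losses)
selected = select_top_k(scores, k=2)
print(selected)  # indices of the examples kept for the next fine-tuning step
```

Example 2 shows why the reference model matters under this assumption: its learner loss is the highest in the pool, but the reference loss is almost as high, so it scores low and is skipped as likely noisy or unlearnable.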

[31] If I Could Turn Back Time: Temporal Reframing as a Historical Reasoning Task for LLMs

Lars Bungum,Charles Yijia Huang,Abeer Kashar

Main category: cs.CL

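TL;DR: This study examines how well large language models (LLMs) can reason as if it were 1940, using questions from a Norwegian trivia book, prompting in both English and Norwegian, and grading with an LLM judge plus human spot checks. Prompting in English works better than prompting in Norwegian, and larger models perform better.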

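Details Motivation: To explore whether large language models can reason accurately within a historical time frame, and in particular how well they can reconstruct the state of knowledge of a specific era when their knowledge is shaped by modern training data. Method: Questions from a Norwegian trivia book published in 1940 are posed in English and in Norwegian to several LLMs (DeepSeek-R1, Gemma3, Qwen3, Llama3.1, and the largest available LLM built specifically for Norwegian), which are asked to answer as if it were 1940; grading uses LLM-as-judge, with sampled checks by a native speaker. Result: Prompting in English consistently yields higher accuracy than prompting in Norwegian, which was unexpected; larger models perform better. Conclusion: Although the models are trained on modern data, they can simulate an earlier time frame to some extent; prompt language (English outperforming Norwegian) and model size are key factors in temporal reasoning performance. Abstract: In this study, we experiment with the ability of LLMs to do temporal reasoning. Using a Norwegian book from 1940 containing trivia questions, we prompt the LLMs to answer the questions as if it were 1940. We also pose the questions in both English and Norwegian. Correct answers are often presented as sentences, and grading is done by means of LLM-as-judge, with sampled checks by a native speaker. Prompting in English consistently gave better results than in Norwegian, an unexpected result. In contrast, using larger LLMs improved results. We tested the DeepSeek-R1, Gemma3, Qwen3, and Llama3.1 model families, and also the largest available LLM especially crafted for Norwegian.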

[32] Probabilistic Textual Time Series Depression Detection

Fabian Schmidt,Seyedehmoniba Ravan,Vladimir Vlassov

Main category: cs.CL

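TL;DR: Proposes PTTSD, a framework that predicts depression severity (PHQ-8 scores) from clinical interview text by combining LSTMs, self-attention, and residual connections, achieving high accuracy and well-calibrated uncertainty estimates.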

Details Motivation: Existing depression severity models lack uncertainty estimates and temporal modeling, which limits their interpretability and reliability for clinical decision-making. Method: PTTSD comes in sequence-to-sequence and sequence-to-one variants that combine bidirectional LSTMs, self-attention, and residual connections, with Gaussian or Student-t output heads trained via negative log-likelihood, producing time-series predictions with uncertainty estimates. Result: On E-DAIC and DAIC-WOZ the framework achieves the best performance among text-only models (MAE = 3.85 on E-DAIC, 3.55 on DAIC), with well-calibrated prediction intervals; ablations confirm the value of attention and probabilistic modeling. Conclusion: PTTSD delivers accurate and interpretable depression severity predictions, and its uncertainty modeling can make clinical decision support systems more trustworthy and useful. Abstract: Accurate and interpretable predictions of depression severity are essential for clinical decision support, yet existing models often lack uncertainty estimates and temporal modeling. We propose PTTSD, a Probabilistic Textual Time Series Depression Detection framework that predicts PHQ-8 scores from utterance-level clinical interviews while modeling uncertainty over time. PTTSD includes sequence-to-sequence and sequence-to-one variants, both combining bidirectional LSTMs, self-attention, and residual connections with Gaussian or Student-t output heads trained via negative log-likelihood. Evaluated on E-DAIC and DAIC-WOZ, PTTSD achieves state-of-the-art performance among text-only systems (e.g., MAE = 3.85 on E-DAIC, 3.55 on DAIC) and produces well-calibrated prediction intervals. Ablations confirm the value of attention and probabilistic modeling, while comparisons with MentalBERT establish generality. A three-part calibration analysis and qualitative case studies further highlight the interpretability and clinical relevance of uncertainty-aware forecasting.
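
A Gaussian output head trained via negative log-likelihood predicts a mean and a standard deviation for each score, and the loss penalizes both inaccurate means and badly sized uncertainty. The sketch below writes out the standard Gaussian negative log-likelihood for a single prediction and the corresponding central prediction interval. It is a plain-Python illustration of the loss family named in the abstract, not the PTTSD implementation; the function names and example numbers are hypothetical.

```python
import math


def gaussian_nll(target: float, mean: float, std: float) -> float:
    """Negative log-likelihood of target under N(mean, std**2)."""
    var = std * std
    return 0.5 * math.log(2.0 * math.pi * var) + (target - mean) ** 2 / (2.0 * var)


def prediction_interval(mean: float, std: float, z: float = 1.96):
    """Central interval mean ± z*std (z = 1.96 gives roughly 95% for a Gaussian)."""
    return (mean - z * std, mean + z * std)


# Hypothetical case: the true PHQ-8 score is 10 but the model predicts 4.
overconfident = gaussian_nll(target=10.0, mean=4.0, std=1.0)
honest = gaussian_nll(target=10.0, mean=4.0, std=3.0)
print(overconfident, honest)           # the overconfident prediction costs more
print(prediction_interval(4.0, 3.0))   # interval implied by the wider prediction
```

For the same wrong mean, the prediction with the larger standard deviation receives the lower loss, which is how this objective pushes a model toward calibrated intervals rather than point estimates alone.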

[33] ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai

Surapon Nonesung,Teetouch Jaknamon,Sirinya Chaiophat,Natapong Nitarach,Chanakan Wittayasakpan,Warit Sirichotedumrong,Adisai Na-Thalang,Kunat Pipatanakul

Main category: cs.CL

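TL;DR: ThaiOCRBench is the first comprehensive benchmark for Thai text-rich visual understanding tasks. It contains 2,808 human-annotated samples across 13 task categories and is used to evaluate multimodal models in a low-resource, script-complex setting.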

Details Motivation: Existing vision-language benchmarks focus mainly on high-resource languages; Thai is underrepresented in tasks such as document structure understanding, so a dedicated benchmark is needed. Method: The authors build ThaiOCRBench, a diverse, human-annotated dataset, and systematically evaluate a range of state-of-the-art vision-language models (both proprietary and open-source) in a zero-shot setting. Result: Proprietary models (e.g., Gemini 2.5 Pro) significantly outperform open-source models; open-source models show their steepest drops on fine-grained text recognition and handwritten content extraction; error analysis identifies language bias, structural mismatch, and hallucinated content as key challenges. Conclusion: ThaiOCRBench provides a standardized evaluation framework for low-resource languages such as Thai, exposes the limitations of current models, and offers actionable insights for improving Thai document understanding. Abstract: We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and provides actionable insights for improving Thai-language document understanding.

[34] RUST-BENCH: Benchmarking LLM Reasoning on Unstructured Text within Structured Tables

Nikhil Abhyankar,Purvi Chaurasia,Sanchit Kabra,Ananya Srivastava,Vivek Gupta,Chandan K. Reddy

Main category: cs.CL

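TL;DR: RUST-BENCH is a new benchmark for evaluating LLM reasoning over realistic, complex tabular data, covering scale, heterogeneity, domain specificity, and multi-hop reasoning complexity.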

Details Motivation: Existing tabular reasoning benchmarks mostly test small, uniform tables and fail to reflect the complexity of real-world data, so they cannot fully assess LLM reasoning. Method: Builds a dataset of 7966 questions over 2031 real-world tables from two domains, NSF grant records and NBA statistics, and evaluates models in large-scale, heterogeneous, domain-specific settings that require multi-hop reasoning. Result: Current open-source and proprietary LLMs handle heterogeneous schemas and complex multi-hop reasoning poorly, exposing persistent weaknesses in architectures and prompting strategies. Conclusion: RUST-BENCH provides a challenging new testbed for advancing tabular reasoning research. Abstract: Existing tabular reasoning benchmarks mostly test models on small, uniform tables, underrepresenting the complexity of real-world data and giving an incomplete view of Large Language Models' (LLMs) reasoning abilities. Real tables are long, heterogeneous, and domain-specific, mixing structured fields with free text and requiring multi-hop reasoning across thousands of tokens. To address this gap, we introduce RUST-BENCH, a benchmark of 7966 questions from 2031 real-world tables spanning two domains: i) RB-Science (NSF grant records) and ii) RB-Sports (NBA statistics). Unlike prior work, RUST-BENCH evaluates LLMs jointly across scale, heterogeneity, domain specificity, and reasoning complexity. Experiments with open-source and proprietary models show that LLMs struggle with heterogeneous schemas and complex multi-hop inference, revealing persistent weaknesses in current architectures and prompting strategies. RUST-BENCH establishes a challenging new testbed for advancing tabular reasoning research.

[35] OUNLP at TSAR 2025 Shared Task: Multi-Round Text Simplifier via Code Generation

Cuong Huynh,Jie Cao

Main category: cs.CL

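TL;DR: This paper proposes two multi-round text simplification methods (MRS-Rule and MRS-Joint) generated with GPT-4o. It finds that the gap between the source and target CEFR levels strongly affects simplification quality, and that the improved MRS-Joint performs better.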

Details Motivation: Prompt-based text simplification performance is closely tied to the gap between the source and target CEFR levels, which motivates a multi-round simplification strategy. Method: Proposes two multi-round simplification methods, the rule-based MRS-Rule and MRS-Joint, which combines rules with an LLM; both are generated via GPT-4o. Result: The system ranked 7th out of 20 teams in the TSAR-2025 Shared Task; follow-up experiments show that MRS-Joint improves further when it starts from LLM-simplified candidates. Conclusion: Multi-round simplification is effective, especially the MRS-Joint framework, which supports the potential of combining stepwise simplification with LLMs. Abstract: This paper describes the OUNLP system submitted to the TSAR-2025 Shared Task (Alva-Manchego et al., 2025), designed for readability-controlled text simplification using LLM-prompting-based generation. Based on the analysis of prompt-based text simplification methods, we discovered an interesting finding that text simplification performance is highly related to the gap between the source CEFR (Arase et al., 2022) level and the target CEFR level. Inspired by this finding, we propose two multi-round simplification methods and generate them via GPT-4o: rule-based simplification (MRS-Rule) and jointly rule-based LLM simplification (MRS-Joint). Our submitted systems ranked 7 out of 20 teams. Later improvements with MRS-Joint show that taking the LLM simplified candidates as the starting point could further boost the multi-round simplification performance.

[36] Decoding Emergent Big Five Traits in Large Language Models: Temperature-Dependent Expression and Architectural Clustering

Christos-Nikolaos Zacharopoulos,Revekka Kyriakoglou

Main category: cs.CL

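TL;DR: This study uses the Big Five Inventory-2 (BFI-2) to assess the trait expression of six LLMs under different sampling temperatures. Four of the five dimensions differ significantly, with Neuroticism and Extraversion notably sensitive to temperature; cluster analysis suggests that model architecture may shape stable trait profiles.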

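Details Motivation: As LLMs spread through human-centered applications, understanding their personality-like behavior is essential for responsible development and deployment of AI systems. Method: Uses the Big Five Inventory-2 (BFI-2) framework to systematically assess the trait expression of six LLMs under different sampling temperatures, and applies hierarchical clustering to analyze similarities between models. Result: Four of the five personality dimensions show significant differences; Neuroticism and Extraversion are clearly affected by sampling temperature; hierarchical clustering reveals model clusters that follow architectural features, suggesting that architecture may shape stable trait expression. Conclusion: LLMs exhibit measurable personality-like traits that are influenced by model architecture and generation parameters such as temperature, offering a new perspective on model tuning, model selection, and the ethical governance of AI. Abstract: As Large Language Models (LLMs) become integral to human-centered applications, understanding their personality-like behaviors is increasingly important for responsible development and deployment. This paper systematically evaluates six LLMs, applying the Big Five Inventory-2 (BFI-2) framework, to assess trait expressions under varying sampling temperatures. We find significant differences across four of the five personality dimensions, with Neuroticism and Extraversion susceptible to temperature adjustments. Further, hierarchical clustering reveals distinct model clusters, suggesting that architectural features may predispose certain models toward stable trait profiles. Taken together, these results offer new insights into the emergence of personality-like patterns in LLMs and provide a new perspective on model tuning, selection, and the ethical governance of AI systems. We share the data and code for this analysis here: https://osf.io/bsvzc/?view_only=6672219bede24b4e875097426dc3fac1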

[37] RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG

Joshua Gao,Quoc Huy Pham,Subin Varghese,Silwal Saurav,Vedhus Hoskere

Main category: cs.CL

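TL;DR: This paper presents RAGalyst, an automated, human-aligned agentic framework for rigorous evaluation of domain-specific Retrieval-Augmented Generation (RAG) systems. The framework generates high-quality synthetic QA datasets and optimizes LLM-as-a-Judge metrics to correlate strongly with human annotations, and it is validated in three domains: military operations, cybersecurity, and bridge engineering.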

Details Motivation: Existing RAG evaluation frameworks rely on heuristic metrics or on LLM-as-a-Judge methods that lack validated agreement with human judgment, so they struggle to capture the nuances of specialized, safety-critical domains. A more reliable and generalizable evaluation method is needed. Method: Proposes the RAGalyst framework, which includes an agentic pipeline that automatically generates and filters high-quality synthetic QA datasets, and uses prompt optimization to align two LLM-as-a-Judge metrics, Answer Correctness and Answerability, with human annotations. Result: Experiments in three domains (military operations, cybersecurity, bridge engineering) show that RAG performance is highly context-dependent, with no single best embedding model, LLM, or hyperparameter configuration; the paper also analyzes the main causes of low Answer Correctness. Conclusion: RAGalyst offers a systematic way to evaluate domain-specific RAG systems, helping practitioners uncover domain-specific trade-offs and make better-informed design decisions for building reliable and effective RAG systems. Abstract: Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in specialized, safety-critical domains remains a significant challenge. Existing evaluation frameworks often rely on heuristic-based metrics that fail to capture domain-specific nuances, and other works utilize LLM-as-a-Judge approaches that lack validated alignment with human judgment. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems. RAGalyst features an agentic pipeline that generates high-quality, synthetic question-answering (QA) datasets from source documents, incorporating an agentic filtering step to ensure data fidelity. The framework refines two key LLM-as-a-Judge metrics (Answer Correctness and Answerability) using prompt optimization to achieve a strong correlation with human annotations. Applying this framework to evaluate various RAG components across three distinct domains (military operations, cybersecurity, and bridge engineering), we find that performance is highly context-dependent. No single embedding model, LLM, or hyperparameter configuration proves universally optimal. Additionally, we provide an analysis of the most common reasons for low Answer Correctness in RAG. These findings highlight the necessity of a systematic evaluation framework like RAGalyst, which empowers practitioners to uncover domain-specific trade-offs and make informed design choices for building reliable and effective RAG systems.
RAGalyst is available on our Github.

[38] Modeling Clinical Uncertainty in Radiology Reports: from Explicit Uncertainty Markers to Implicit Reasoning Pathways

Paloma Rabaey,Jong Hak Moon,Jung-Oh Lee,Min Gwan Kim,Hangyul Yoon,Thomas Demeester,Edward Choi

Main category: cs.CL

TL;DR: Proposes a two-part framework for quantifying explicit and implicit uncertainty in radiology reports, and releases the uncertainty-aware Lunguage++ dataset.

Details Motivation: Radiology reports contain both explicit and implicit uncertainty, which affects the accuracy of automated analysis, and existing methods handle neither type well. Method: Quantifies explicit uncertainty by building an expert-validated, LLM-based reference ranking of hedging phrases; models implicit uncertainty with an expansion framework driven by expert-defined diagnostic pathways. Result: Produces the Lunguage++ dataset, an uncertainty-aware structuring of fine-grained radiology reports. Conclusion: The framework supports more accurate image classification and faithful diagnostic reasoning, and enables research into the clinical impact of diagnostic uncertainty. Abstract: Radiology reports are invaluable for clinical decision-making and hold great potential for automated analysis when structured into machine-readable formats. These reports often contain uncertainty, which we categorize into two distinct types: (i) Explicit uncertainty reflects doubt about the presence or absence of findings, conveyed through hedging phrases. These vary in meaning depending on the context, making rule-based systems insufficient to quantify the level of uncertainty for specific findings; (ii) Implicit uncertainty arises when radiologists omit parts of their reasoning, recording only key findings or diagnoses. Here, it is often unclear whether omitted findings are truly absent or simply unmentioned for brevity. We address these challenges with a two-part framework. We quantify explicit uncertainty by creating an expert-validated, LLM-based reference ranking of common hedging phrases, and mapping each finding to a probability value based on this reference. In addition, we model implicit uncertainty through an expansion framework that systematically adds characteristic sub-findings derived from expert-defined diagnostic pathways for 14 common diagnoses. Using these methods, we release Lunguage++, an expanded, uncertainty-aware version of the Lunguage benchmark of fine-grained structured radiology reports. This enriched resource enables uncertainty-aware image classification, faithful diagnostic reasoning, and new investigations into the clinical impact of diagnostic uncertainty.

[39] Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics

Amir Zur,Atticus Geiger,Ekdeep Singh Lubana,Eric Bigelow

Main category: cs.CL

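TL;DR: This paper studies whether language models represent possible reasoning paths in their hidden activations while generating text. It finds that the effectiveness of activation interventions correlates with the model's uncertainty, indicating that models implicitly represent the space of possible paths.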

Details Motivation: Quantifying a language model's uncertainty during generation is hard because different token choices can lead to entirely different reasoning paths. The paper asks whether hidden activations reflect these potential alternative paths. Method: Analyzes hidden activations during chain-of-thought reasoning, using activation-based control and prediction to test the relationship between the model's uncertainty at different tokens and how steerable it is. Result: There is a clear correlation between the model's uncertainty at different tokens and how easily it can be steered by manipulating its activations; hidden activations also predict the model's future outcome distribution, showing that the model implicitly represents the space of possible reasoning paths. Conclusion: Hidden activations reflect not only the current reasoning state but also encode multiple paths that are not yet decided, which opens a new route to understanding and controlling model uncertainty. Abstract: When a language model generates text, the selection of individual tokens might lead it down very different reasoning paths, making uncertainty difficult to quantify. In this work, we consider whether reasoning language models represent the alternate paths that they could take during generation. To test this hypothesis, we use hidden activations to control and predict a language model's uncertainty during chain-of-thought reasoning. In our experiments, we find a clear correlation between how uncertain a model is at different tokens, and how easily the model can be steered by controlling its activations. This suggests that activation interventions are most effective when there are alternate paths available to the model -- in other words, when it has not yet committed to a particular final answer. We also find that hidden activations can predict a model's future outcome distribution, demonstrating that models implicitly represent the space of possible paths.

[40] IntelliProof: An Argumentation Network-based Conversational Helper for Organized Reflection

Kaveh Eskandari Miandoab,Katharine Kowalyshyn,Kabir Pamnani,Anesu Gavhera,Vasanth Sarathy,Matthias Scheutz

Main category: cs.CL

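TL;DR: IntelliProof is an interactive system that uses LLMs to analyze argumentative essays. It structures an essay as an argumentation graph and emphasizes the user experience, providing visualization, justifications for classifications, and quantitative coherence measures.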

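Details Motivation: Existing automated essay scoring systems lack in-depth analysis of argument structure and do not bridge that structure to the user's understanding. IntelliProof aims to improve the quality and interpretability of argumentative essay analysis through structured representation and interactive visualization. Method: Models an argumentative essay as an argumentation graph in which nodes are claims and edges are supporting or attacking relations; an LLM classifies and scores each relation, and a visual interface plus natural-language tools aid understanding. Result: The system produces a visualization of the argumentation graph, gives justifications for classifications and coherence scores, and supports rapid assessment of argument quality while retaining human oversight. Conclusion: IntelliProof combines LLM capabilities with human-computer interaction design, improving the transparency and usability of argument structure analysis, with practical uses in areas such as education. Abstract: We present IntelliProof, an interactive system for analyzing argumentative essays through LLMs. IntelliProof structures an essay as an argumentation graph, where claims are represented as nodes, supporting evidence is attached as node properties, and edges encode supporting or attacking relations. Unlike existing automated essay scoring systems, IntelliProof emphasizes the user experience: each relation is initially classified and scored by an LLM, then visualized for enhanced understanding. The system provides justifications for classifications and produces quantitative measures for essay coherence. It enables rapid exploration of argumentative quality while retaining human oversight. In addition, IntelliProof provides a set of tools for a better understanding of an argumentative essay and its corresponding graph in natural language, bridging the gap between the structural semantics of argumentative essays and the user's understanding of a given text. A live demo and the system are available here to try: https://intelliproof.vercel.app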
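The graph structure described in the abstract (claims as nodes, evidence as node properties, support or attack relations as scored edges) can be sketched as plain data classes. Everything below is a hypothetical illustration: the class names, the fixed scores, and the `net_support` measure are assumptions, and this is not IntelliProof's actual coherence metric or API.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    evidence: list = field(default_factory=list)  # evidence kept as a node property

@dataclass
class Relation:
    source: int   # index of the claim that supports or attacks
    target: int   # index of the claim being supported or attacked
    kind: str     # "support" or "attack"
    score: float  # strength in [0, 1]; an LLM would supply this in the real system

def net_support(claims, relations):
    """Toy per-claim measure: summed support scores minus summed attack scores."""
    totals = [0.0] * len(claims)
    for rel in relations:
        sign = 1.0 if rel.kind == "support" else -1.0
        totals[rel.target] += sign * rel.score
    return totals

claims = [Claim("Main thesis"), Claim("Supporting point"), Claim("Counterpoint")]
relations = [Relation(1, 0, "support", 0.8), Relation(2, 0, "attack", 0.3)]
totals = net_support(claims, relations)
```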

[41] From Model to Breach: Towards Actionable LLM-Generated Vulnerabilities Reporting

Cyril Vallez,Alexander Sternfeld,Andrei Kucharavy,Ljiljana Dolamic

Main category: cs.CL

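TL;DR: This paper studies the security vulnerabilities that LLM-based coding assistants introduce into software development. It shows that even the latest open-weight models remain vulnerable in the earliest reported vulnerability scenarios, and it proposes a new severity metric, Prompt Exposure (PE), together with a Model Exposure (ME) score, to assess and mitigate the risk of LLM-generated vulnerabilities.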

Details Motivation: As LLMs become widespread in coding assistants, vulnerabilities in the code they generate pose a growing cybersecurity threat. The effect of existing benchmarks and improvement methods on models in actual use remains unclear, so effective assessment and mitigation mechanisms are needed. Method: Proposes Prompt Exposure (PE), a metric that accounts for vulnerability severity, generation chance, and the formulation of the prompt that induces the vulnerability; PE is then used to define the Model Exposure (ME) score, which measures a model's overall risk of generating vulnerabilities. Result: Current mainstream open-weight LLMs still produce early, well-known vulnerability types in realistic use, suggesting that the safety-functionality trade-off has prevented effective patching; PE and ME quantify how exposed different models are. Conclusion: Current LLM coding assistants still carry significant security risk; metrics such as PE and ME are needed to drive more effective mitigation strategies that balance safety and functionality. Abstract: As the role of Large Language Model (LLM)-based coding assistants in software development becomes more critical, so does the role of the bugs they generate in the overall cybersecurity landscape. While a number of LLM code security benchmarks have been proposed alongside approaches to improve the security of generated code, it remains unclear to what extent they have impacted widely used coding LLMs. Here, we show that even the latest open-weight models are vulnerable in the earliest reported vulnerability scenarios in a realistic use setting, suggesting that the safety-functionality trade-off has until now prevented effective patching of vulnerabilities. To help address this issue, we introduce a new severity metric, Prompt Exposure (PE), that reflects the risk posed by an LLM-generated vulnerability, accounting for vulnerability severity, generation chance, and the formulation of the prompt that induces vulnerable code generation. To encourage the mitigation of the most serious and prevalent vulnerabilities, we use PE to define the Model Exposure (ME) score, which indicates the severity and prevalence of vulnerabilities a model generates.

[42] BanglaMedQA and BanglaMMedBench: Evaluating Retrieval-Augmented Generation Strategies for Bangla Biomedical Question Answering

Sadia Sultana,Saiyma Sittul Muna,Mosammat Zannatul Samarukh,Ajwad Abrar,Tareque Mohmud Chowdhury

Main category: cs.CL

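TL;DR: This paper introduces BanglaMedQA and BanglaMMedBench, the first large-scale Bangla biomedical multiple-choice question datasets, and evaluates several Retrieval-Augmented Generation (RAG) strategies. An Agentic RAG approach that combines textbook and web retrieval reaches 89.54% accuracy with openai/gpt-oss-120b, substantially improving Bangla medical question answering.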

Details Motivation: Biomedical question answering in low-resource languages is underdeveloped, which limits equitable access to reliable medical knowledge. High-quality evaluation datasets and more accurate QA are needed for such languages. Method: Builds two Bangla biomedical multiple-choice datasets and applies several RAG strategies (Traditional, Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG), combining a corpus built by OCR from textbooks with web retrieval and using generative reasoning to improve factual accuracy; it also designs an Agentic RAG framework that dynamically selects between retrieval and reasoning strategies. Result: Agentic RAG achieves the highest accuracy, 89.54%, with openai/gpt-oss-120b, outperforming the other configurations and producing higher-quality rationales. Conclusion: RAG-based methods can improve the reliability and accessibility of Bangla medical QA, and the proposed datasets and methods lay a foundation for multilingual medical AI research. Abstract: Developing accurate biomedical Question Answering (QA) systems in low-resource languages remains a major challenge, limiting equitable access to reliable medical knowledge. This paper introduces BanglaMedQA and BanglaMMedBench, the first large-scale Bangla biomedical Multiple Choice Question (MCQ) datasets designed to evaluate reasoning and retrieval in medical artificial intelligence (AI). The study applies and benchmarks several Retrieval-Augmented Generation (RAG) strategies, including Traditional, Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG, combining textbook-based and web retrieval with generative reasoning to improve factual accuracy. A key novelty lies in integrating a Bangla medical textbook corpus through Optical Character Recognition (OCR) and implementing an Agentic RAG pipeline that dynamically selects between retrieval and reasoning strategies. Experimental results show that the Agentic RAG achieved the highest accuracy, 89.54%, with openai/gpt-oss-120b, outperforming other configurations and demonstrating superior rationale quality. These findings highlight the potential of RAG-based methods to enhance the reliability and accessibility of Bangla medical QA, establishing a foundation for future research in multilingual medical artificial intelligence.

[43] When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection

Alamgir Munir Qazi,John P. McCrae,Jamal Abdul Nasir

Main category: cs.CL

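TL;DR: This paper proposes DeReC, a lightweight fact verification framework that combines dense retrieval with specialized classification and uses general-purpose text embeddings in place of autoregressive LLMs, achieving higher accuracy at much lower computational cost.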

Details Motivation: Existing LLM-based fact verification methods are computationally expensive and prone to hallucination, which makes them hard to deploy in practice; a more efficient and reliable approach is needed. Method: Proposes the DeReC framework, which uses general-purpose text embeddings for dense retrieval and a specialized classifier for the verdict, avoiding generative reasoning to improve both efficiency and accuracy. Result: DeReC reaches an F1 score of 65.58% on RAWFC, surpassing the state-of-the-art method L-Defense (61.20%); compared with explanation-generating LLMs, runtime drops by 95% on RAWFC (from 454 minutes 12 seconds to 23 minutes 36 seconds) and by 92% on LIAR-RAW. Conclusion: Carefully engineered retrieval-based systems can match or exceed LLM performance on specialized tasks while being more practical for real-world deployment. Abstract: The proliferation of misinformation necessitates robust yet computationally efficient fact verification systems. While current state-of-the-art approaches leverage Large Language Models (LLMs) for generating explanatory rationales, these methods face significant computational barriers and hallucination risks in real-world deployments. We present DeReC (Dense Retrieval Classification), a lightweight framework that demonstrates how general-purpose text embeddings can effectively replace autoregressive LLM-based approaches in fact verification tasks. By combining dense retrieval with specialized classification, our system achieves better accuracy while being significantly more efficient. DeReC outperforms explanation-generating LLMs in efficiency, reducing runtime by 95% on RAWFC (23 minutes 36 seconds compared to 454 minutes 12 seconds) and by 92% on LIAR-RAW (134 minutes 14 seconds compared to 1692 minutes 23 seconds), showcasing its effectiveness across varying dataset sizes. On the RAWFC dataset, DeReC achieves an F1 score of 65.58%, surpassing the state-of-the-art method L-Defense (61.20%). Our results demonstrate that carefully engineered retrieval-based systems can match or exceed LLM performance in specialized tasks while being significantly more practical for real-world deployment.
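The quoted runtime reductions can be checked directly from the times given in the abstract. The short calculation below is only a worked arithmetic check; the helper names are made up for this example.

```python
def to_seconds(minutes, seconds):
    """Convert a 'minutes and seconds' duration to seconds."""
    return minutes * 60 + seconds

def reduction_percent(new_seconds, old_seconds):
    """Percentage by which the runtime fell, relative to the old runtime."""
    return 100.0 * (1.0 - new_seconds / old_seconds)

# Times as reported in the abstract: DeReC vs. explanation-generating LLMs.
rawfc = reduction_percent(to_seconds(23, 36), to_seconds(454, 12))
liar_raw = reduction_percent(to_seconds(134, 14), to_seconds(1692, 23))
```

Both values round to the reported figures: roughly 95% on RAWFC and roughly 92% on LIAR-RAW.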

[44] Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning

Mohammad Atif Quamar,Mohammad Areeb

Main category: cs.CL

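TL;DR: LEASH is a training-free decoding algorithm that adaptively stops rationale generation by monitoring the slope of token-level entropy and the improvement in the top-logit margin. It cuts token usage by 30-35% and latency by 27%, at the cost of a 10 percentage point accuracy drop relative to CoT.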

Details Motivation: Standard Chain-of-Thought (CoT) prompting generates full, fixed-length rationales for complex reasoning in large language models, which wastes compute, inflates token usage, and increases latency. Method: Proposes LEASH (Logit-Entropy Adaptive Stopping Heuristic), which monitors two intrinsic signals, the slope of token-level entropy and the improvement in the top-logit margin, and terminates generation once both plateau. Result: Across four instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces average token generation by 30-35% and latency by 27%, with a 10 percentage point accuracy drop relative to CoT. Conclusion: LEASH is model-agnostic, requires no additional training or supervision, and offers a simple and efficient alternative for reasoning generation in language models. Abstract: Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in large language models. However, generating full, fixed-length rationales is computationally wasteful, inflating both token usage and latency. We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free decoding algorithm that adaptively halts rationale generation. LEASH monitors two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin. It terminates the generation once both signals plateau, indicating the model has reached a stable reasoning state. Across four instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces average token generation by 30-35% and latency by 27%, while incurring a 10 p.p. accuracy drop relative to CoT. LEASH is model-agnostic and requires no additional training or supervision, offering a simple and efficient alternative to CoT decoding.
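A stopping rule of the kind the abstract describes might look like the sketch below: track per-token entropy and top-logit margin, then stop once both have flattened over a recent window. The window size, the thresholds, and the exact plateau test are assumptions made for illustration; the abstract does not specify them, so this is not the authors' algorithm.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def top_margin(logits):
    """Gap between the largest and second-largest logit (needs >= 2 logits)."""
    first, second = sorted(logits, reverse=True)[:2]
    return first - second

def should_stop(entropies, margins, window=4, eps_slope=0.02, eps_margin=0.05):
    """Assumed plateau test: entropy slope near zero and little margin gain
    over the last `window` generation steps."""
    if len(entropies) < window + 1:
        return False
    slope = (entropies[-1] - entropies[-1 - window]) / window
    margin_gain = margins[-1] - margins[-1 - window]
    return abs(slope) < eps_slope and margin_gain < eps_margin
```

In a decoding loop, one would append `entropy(step_logits)` and `top_margin(step_logits)` after each generated token and end the rationale as soon as `should_stop` returns `True`.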

cs.CV [Back]

[45] LoRA-Edge: Tensor-Train-Assisted LoRA for Practical CNN Fine-Tuning on Edge Devices

Hyunseok Kwak,Kyeongwon Lee,Jae-Jin Lee,Woojoo Lee

Main category: cs.CV

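TL;DR: LoRA-Edge is a parameter-efficient fine-tuning method based on tensor decomposition for adapting CNNs on edge devices. It leaves inference cost unchanged while sharply reducing the number of trainable parameters.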

Details Motivation: On resource-constrained edge devices, full fine-tuning is infeasible under memory, compute, and energy limits, so an efficient fine-tuning method is needed to cope with domain shift. Method: Proposes LoRA-Edge, which decomposes pre-trained convolutional layers with TT-SVD, selectively updates only the zero-initialized output-side core, and fuses the update back into the original convolution kernels so that inference is unchanged. Result: Compared with full fine-tuning, trainable parameters drop by up to two orders of magnitude; across multiple HAR datasets and CNN backbones, accuracy stays within 4.7% of full fine-tuning, and convergence is 1.4-3.8x faster. Conclusion: LoRA-Edge enables structure-aligned, parameter-efficient on-device fine-tuning of CNNs, making model adaptation on edge platforms more practical. Abstract: On-device fine-tuning of CNNs is essential to withstand domain shift in edge applications such as Human Activity Recognition (HAR), yet full fine-tuning is infeasible under strict memory, compute, and energy budgets. We present LoRA-Edge, a parameter-efficient fine-tuning (PEFT) method that builds on Low-Rank Adaptation (LoRA) with tensor-train assistance. LoRA-Edge (i) applies Tensor-Train Singular Value Decomposition (TT-SVD) to pre-trained convolutional layers, (ii) selectively updates only the output-side core with zero-initialization to keep the auxiliary path inactive at the start, and (iii) fuses the update back into dense kernels, leaving inference cost unchanged. This design preserves convolutional structure and reduces the number of trainable parameters by up to two orders of magnitude compared to full fine-tuning. Across diverse HAR datasets and CNN backbones, LoRA-Edge achieves accuracy within 4.7% of full fine-tuning while updating at most 1.49% of parameters, consistently outperforming prior parameter-efficient baselines under similar budgets. On a Jetson Orin Nano, TT-SVD initialization and selective-core training yield 1.4-3.8x faster convergence to target F1. LoRA-Edge thus makes structure-aligned, parameter-efficient on-device CNN adaptation practical for edge platforms.
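Two properties from the abstract, a zero-initialized trainable factor that keeps the auxiliary path inactive at the start and an update fused back into the dense weight, can be illustrated with a plain two-factor matrix analogue. This is a deliberate simplification: it uses an ordinary low-rank product instead of TT-SVD cores and a matrix instead of a convolution kernel, and all names and shapes are hypothetical.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def fuse(weight, out_factor, in_factor):
    """Fold the low-rank update into the dense weight: W + out_factor @ in_factor."""
    delta = matmul(out_factor, in_factor)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

rng = random.Random(0)
out_dim, in_dim, rank = 4, 6, 2
weight = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]   # frozen
in_factor = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(rank)]   # frozen
out_factor = [[0.0] * rank for _ in range(out_dim)]  # trainable, zero-initialized

# With a zero output-side factor the fused weight equals the original weight,
# so the adapted model starts out identical to the pre-trained one.
fused_at_start = fuse(weight, out_factor, in_factor)
trainable, full = out_dim * rank, out_dim * in_dim  # 8 vs. 24 parameters here
```

After training, only the fused dense weight is needed at inference time, which is why this style of adaptation adds no inference cost.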

[46] SILVI: Simple Interface for Labeling Video Interactions

Ozan Kanbertay,Richard Vogg,Elif Karakoc,Peter M. Kappeler,Claudia Fichtel,Alexander S. Ecker

Main category: cs.CV

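TL;DR: SILVI is an open-source annotation tool that integrates video labeling of animal behaviors and interactions, filling the gap between existing tools that either localize individuals or capture interactions.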

Details Motivation: Existing open-source annotation tools cannot support both individual behavior labeling and interaction capture, which limits the understanding of social and individualized animal behavior. Method: Develops SILVI, an open-source tool that integrates behavior labeling with localization of individuals, lets users annotate behaviors and interactions directly in video data, and generates structured outputs suitable for training and validating computer vision models. Result: SILVI supports joint annotation of behaviors and interactions, facilitates computer-vision-based fine-grained behavioral analysis, and could be extended to annotating human interactions in video. Conclusion: SILVI links behavioral ecology with computer vision and provides a practical tool for automated fine-grained behavioral analysis, for animals and potentially for human interaction studies. Abstract: Computer vision methods are increasingly used for the automated analysis of large volumes of video data collected through camera traps, drones, or direct observations of animals in the wild. While recent advances have focused primarily on detecting individual actions, much less work has addressed the detection and annotation of interactions -- a crucial aspect for understanding social and individualized animal behavior. Existing open-source annotation tools support either behavioral labeling without localization of individuals, or localization without the capacity to capture interactions. To bridge this gap, we present SILVI, an open-source labeling software that integrates both functionalities. SILVI enables researchers to annotate behaviors and interactions directly within video data, generating structured outputs suitable for training and validating computer vision models. By linking behavioral ecology with computer vision, SILVI facilitates the development of automated approaches for fine-grained behavioral analyses. Although developed primarily in the context of animal behavior, SILVI could be useful more broadly to annotate human interactions in other videos that require extracting dynamic scene graphs. The software, along with documentation and download instructions, is available at: https://gitlab.gwdg.de/kanbertay/interaction-labelling-app.

[47] Noise Injection: Improving Out-of-Distribution Generalization for Limited Size Datasets

Duong Mai,Lawrence Hall

Main category: cs.CV

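TL;DR: This study investigates injecting basic noise (Gaussian, Speckle, Poisson, and Salt and Pepper) during training to improve how well deep learning models generalize across data distributions, with a focus on detecting COVID-19 in chest X-rays.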

Details Motivation: Deep learning models often fail to generalize to out-of-distribution (OOD) data from different devices or populations, because they tend to rely on source-specific artifacts in the training data instead of genuine biomarkers. Models therefore need to be more robust to distribution shift. Method: Injects several basic noise types (Gaussian, Speckle, Poisson, Salt and Pepper) during training to weaken the model's reliance on source-specific features and improve its adaptation to OOD data. Result: The method substantially narrows the performance gap between ID and OOD data, reducing the gap on key metrics such as AUC, F1, accuracy, recall, and specificity from 0.10-0.20 to 0.01-0.06 (averaged over ten random seeds). Conclusion: Noise injection is a simple and effective way to improve cross-distribution generalization of deep learning models for medical image recognition, particularly in clinically urgent settings such as COVID-19. Abstract: Deep learned (DL) models for image recognition have been shown to fail to generalize to data from different devices, populations, etc. COVID-19 detection from Chest X-rays (CXRs), in particular, has been shown to fail to generalize to out-of-distribution (OOD) data from new clinical sources not covered in the training set. This occurs because models learn to exploit shortcuts (source-specific artifacts that do not translate to new distributions) rather than reasonable biomarkers to maximize performance on in-distribution (ID) data. To render the models more robust to distribution shifts, our study investigates the use of fundamental noise injection techniques (Gaussian, Speckle, Poisson, and Salt and Pepper) during training. Our empirical results demonstrate that this technique can significantly reduce the performance gap between ID and OOD evaluation from 0.10-0.20 to 0.01-0.06, based on results averaged over ten random seeds across key metrics such as AUC, F1, accuracy, recall and specificity. Our source code is publicly available at https://github.com/Duongmai127/Noisy-ood
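The four noise types named in the abstract can be sketched on a flat list of pixel intensities in [0, 1]. The parameter values (`sigma`, `peak`, `amount`) and function names are assumptions chosen for illustration; the authors' actual settings live in their linked repository and are not reproduced here.

```python
import math
import random

def clip01(x):
    """Keep a pixel intensity inside [0, 1]."""
    return min(1.0, max(0.0, x))

def gaussian_noise(pixels, rng, sigma=0.05):
    """Additive zero-mean Gaussian noise."""
    return [clip01(p + rng.gauss(0.0, sigma)) for p in pixels]

def speckle_noise(pixels, rng, sigma=0.05):
    """Multiplicative noise: each pixel is scaled by (1 + Gaussian)."""
    return [clip01(p * (1.0 + rng.gauss(0.0, sigma))) for p in pixels]

def poisson_sample(lam, rng):
    """Knuth's method for drawing a Poisson count; adequate for small lam."""
    limit, count, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def poisson_noise(pixels, rng, peak=30.0):
    """Shot noise: treat each intensity as an expected photon count."""
    return [clip01(poisson_sample(p * peak, rng) / peak) for p in pixels]

def salt_and_pepper(pixels, rng, amount=0.05):
    """Set a fraction `amount` of pixels to 0 (pepper) or 1 (salt)."""
    out = []
    for p in pixels:
        u = rng.random()
        out.append(0.0 if u < amount / 2 else 1.0 if u < amount else p)
    return out
```

During training, one of these functions would be applied to each image (or each batch) before it is fed to the model, using a seeded `random.Random` for reproducibility.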

[48] Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures

Florence Klitzner,Blanca Inigo,Benjamin D. Killeen,Lalithkumar Seenivasan,Michelle Song,Axel Krieger,Mathias Unberath

Main category: cs.CV

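TL;DR: This paper examines the potential and challenges of imitation-learning-based robot control policies for cannula insertion under bi-plane X-ray guidance. The team develops a highly realistic in silico simulation environment and builds a dataset of correct trajectories with corresponding X-ray sequences, which is used to train imitation learning policies that align the cannula iteratively from visual information alone. The policy succeeds on the first attempt in 68.5% of cases, maintains safe trajectories across vertebral levels, and generalizes well to complex anatomy such as fractures. Although trained only on simulated data, the model also behaves plausibly on real X-ray images. Entry point precision remains limited, however, and closed-loop control needs a better feedback mechanism. With stronger priors, such models may enable lightweight, CT-free intra-operative spinal navigation.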

Details Motivation: Because interpreting multi-view X-rays is complex, it is unclear whether imitation-learning-based robot control policies apply to X-ray-guided spine procedures. The paper explores the feasibility, strengths, and limitations of the approach for bi-plane X-ray-guided cannula insertion. Method: Develops a high-fidelity in silico sandbox for automated simulation of X-ray-guided spine procedures; curates a dataset of correct trajectories and corresponding bi-planar X-ray sequences; and trains vision-based imitation learning policies for planning and open-loop control of stepwise cannula alignment. Result: The policy succeeds on the first attempt in 68.5% of cases, maintains safe intra-pedicular trajectories across vertebral levels, and is robust and generalizes to complex anatomy such as fractures; rollouts on real X-ray images show that the model produces plausible trajectories even though it was trained only in simulation; entry point precision is still lacking, and closed-loop control requires more frequent feedback. Conclusion: Imitation learning shows promise for X-ray-guided spine procedures and could enable lightweight, CT-free intra-operative navigation; the current method performs well on trajectory safety and generalization but needs better entry point precision and closed-loop feedback design; stronger domain priors may help move it toward clinical use. Abstract: Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether this approach applies to X-ray-guided procedures, such as spine instrumentation. This is because interpretation of multi-view X-rays is complex. We examine opportunities and challenges for imitation policy learning in bi-plane-guided cannula insertion. We develop an in silico sandbox for scalable, automated simulation of X-ray-guided spine procedures with a high degree of realism. We curate a dataset of correct trajectories and corresponding bi-planar X-ray sequences that emulate the stepwise alignment of providers. We then train imitation learning policies for planning and open-loop control that iteratively align a cannula solely based on visual information. This precisely controlled setup offers insights into limitations and capabilities of this method. Our policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories across diverse vertebral levels. The policy generalized to complex anatomy, including fractures, and remained robust to varied initializations. Rollouts on real bi-planar X-rays further suggest that the model can produce plausible trajectories, despite training exclusively in simulation. While these preliminary results are promising, we also identify limitations, especially in entry point precision.
Full closed-loop control will require additional considerations around how to provide sufficiently frequent feedback. With more robust priors and domain knowledge, such models may provide a foundation for future efforts toward lightweight and CT-free robotic intra-operative spinal navigation.

[49] Desert Waste Detection and Classification Using Data-Based and Model-Based Enhanced YOLOv12 DL Model

Abdulmumin Sa'ad,Sulaimon Oyeniyi Adebayo,Abdul Jabbar Siddiqui

Main category: cs.CV

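TL;DR: Proposes a real-time waste detection framework based on a lightweight YOLOv12 and Self-Adversarial Training, suited to drones that must detect many kinds of waste efficiently and accurately in harsh environments such as deserts.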

Details Motivation: Traditional waste collection is inefficient and hazardous in remote or harsh environments such as deserts, and existing research focuses on urban settings and recyclables, overlooking organic and hazardous waste and underexplored terrain. Method: Uses a pruned, lightweight YOLOv12 model combined with Self-Adversarial Training (SAT) and specialized data augmentation, trained and validated on the DroneTrashNet dataset. Result: Precision, recall, and mAP all improve significantly, with low latency and a small model size suitable for resource-constrained drones; in comparisons with other lightweight YOLO variants, the model offers a better balance of accuracy and efficiency. Conclusion: Combining data-centric and model-centric enhancements improves the robustness and real-time performance of waste detection in desert environments and offers a workable approach to automated waste management in remote areas. Abstract: The global waste crisis is escalating, with solid waste generation expected to increase by 70% by 2050. Traditional waste collection methods, particularly in remote or harsh environments like deserts, are labor-intensive, inefficient, and often hazardous. Recent advances in computer vision and deep learning have opened the door to automated waste detection systems, yet most research focuses on urban environments and recyclable materials, overlooking organic and hazardous waste and underexplored terrains such as deserts. In this work, we propose an enhanced real-time object detection framework based on a pruned, lightweight version of YOLOv12 integrated with Self-Adversarial Training (SAT) and specialized data augmentation strategies. Using the DroneTrashNet dataset, we demonstrate significant improvements in precision, recall, and mean average precision (mAP), while achieving low latency and compact model size suitable for deployment on resource-constrained aerial drones. Benchmarking our model against state-of-the-art lightweight YOLO variants further highlights its optimal balance of accuracy and efficiency. Our results validate the effectiveness of combining data-centric and model-centric enhancements for robust, real-time waste detection in desert environments.

[50] Improving Diagnostic Performance on Small and Imbalanced Datasets Using Class-Based Input Image Composition

Hlali Azzeddine,Majid Ben Yakhlef,Soulaiman El Hazzat

Main category: cs.CV

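TL;DR: This paper proposes Class-Based Image Composition, which fuses multiple images of the same class into a Composite Input Image (CoImg) to improve deep learning diagnostic performance on small, class-imbalanced datasets.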

Details Motivation: Small, class-imbalanced datasets and low-quality input images lead to high false prediction rates in deep learning models. The paper aims to raise the information density and intra-class variance of training samples so that models can better distinguish subtle disease patterns. Method: Proposes Class-Based Image Composition, which fuses same-class images into composites with a 3x1 layout, builds a class-balanced dataset named Co-OCTDL, and compares it against the original dataset on a VGG16 model with identical architecture and hyperparameters. Result: On OCTDL, the model trained on Co-OCTDL reaches 99.6% accuracy, an F1-score of 0.995, and an AUC of 0.9996, clearly outperforming the baseline trained on the original dataset, with a markedly lower false prediction rate. Conclusion: The method improves diagnostic performance on small and class-imbalanced medical image data and produces high-quality predictions, showing promise under weak data conditions. Abstract: Small, imbalanced datasets and poor input image quality can lead to high false prediction rates with deep learning models. This paper introduces Class-Based Image Composition, an approach that allows us to reformulate training inputs through a fusion of multiple images of the same class into combined visual composites, named Composite Input Images (CoImg). This enhances intra-class variance, improves the information density per training sample, and increases the model's ability to distinguish between subtle disease patterns. Our method was evaluated on the Optical Coherence Tomography Dataset for Image-Based Deep Learning Methods (OCTDL) (Kulyabin et al., 2024), which contains 2,064 high-resolution optical coherence tomography (OCT) scans of the human retina, representing seven distinct diseases with a significant class imbalance. We constructed a perfectly class-balanced version of this dataset, named Co-OCTDL, where each scan is represented as a 3x1 layout composite image. To assess the effectiveness of this new representation, we conducted a comparative analysis between the original dataset and its variant using a VGG16 model. A fair comparison was ensured by utilizing the identical model architecture and hyperparameters for all experiments. The proposed approach markedly improved diagnostic results. The enhanced dataset achieved near-perfect accuracy (99.6%), with an F1-score of 0.995 and an AUC of 0.9996, compared to a baseline model trained on the raw dataset.
The false prediction rate was also significantly lower, which demonstrates that the method can produce high-quality predictions even for weak datasets affected by class imbalance or small sample size.
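A composite input of the kind described here could be built by stacking three same-class images into one. The sketch below assumes that "3x1" means three images stacked vertically (three rows, one column) and that images are lists of pixel rows; both are assumptions, and the sketch does not reproduce how the authors select images or balance classes.

```python
def compose_3x1(images):
    """Stack three same-width images (each a list of pixel rows) top to bottom."""
    if len(images) != 3:
        raise ValueError("expected exactly three images")
    width = len(images[0][0])
    if any(len(row) != width for img in images for row in img):
        raise ValueError("all images must share the same width")
    return [list(row) for img in images for row in img]

def composites_for_class(images):
    """Group consecutive same-class images into triples and compose each one.
    Leftover images that do not fill a triple are dropped in this sketch."""
    usable = len(images) - len(images) % 3
    return [compose_3x1(images[i:i + 3]) for i in range(0, usable, 3)]

# Hypothetical 2x2 grayscale images that all belong to one class.
same_class = [[[i, i], [i, i]] for i in range(7)]
composites = composites_for_class(same_class)
```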

[51] I Detect What I Don't Know: Incremental Anomaly Learning with Stochastic Weight Averaging-Gaussian for Oracle-Free Medical Imaging

Nand Kumar Yadav,Rodrigue Rizk,William CW Chen,KC Santosh

Main category: cs.CV

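TL;DR: Proposes a label-free, oracle-free framework that incrementally expands the set of normal samples for unknown anomaly detection in medical images, improving detection performance through lightweight adapter updates and an uncertainty gating mechanism.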

Details Motivation: Because labeled anomalies are scarce in medical imaging and expert supervision is costly, conventional methods struggle to detect unknown anomalies, which calls for an unsupervised method that needs no anomaly labels. Method: Small convolutional adapters are added to a pretrained vision backbone for fast domain adaptation; features are stored in a compact coreset, and a dual gate combining a z-score distance threshold with SWAG-estimated epistemic uncertainty controls the incremental expansion of the normal set. Result: Substantially outperforms baselines on several medical imaging datasets: ROC-AUC rises from 0.9489 to 0.9982 on COVID-CXR and from 0.6834 to 0.8968 on Pneumonia CXR; on Brain MRI ND-5, ROC-AUC rises from 0.6041 to 0.7269 and PR-AUC from 0.7539 to 0.8211. Conclusion: The framework expands the normal set efficiently and stably, prevents model drift and the false inclusion of anomalous samples, and is highly applicable to real-world, label-scarce medical settings. Abstract: Unknown anomaly detection in medical imaging remains a fundamental challenge due to the scarcity of labeled anomalies and the high cost of expert supervision. We introduce an unsupervised, oracle-free framework that incrementally expands a trusted set of normal samples without any anomaly labels. Starting from a small, verified seed of normal images, our method alternates between lightweight adapter updates and uncertainty-gated sample admission. A frozen pretrained vision backbone is augmented with tiny convolutional adapters, ensuring rapid domain adaptation with negligible computational overhead. Extracted embeddings are stored in a compact coreset enabling efficient k-nearest neighbor (k-NN) anomaly scoring. Safety during incremental expansion is enforced by dual probabilistic gates: a sample is admitted into the normal memory only if its distance to the existing coreset lies within a calibrated z-score threshold, and its SWAG-based epistemic uncertainty remains below a seed-calibrated bound. This mechanism prevents drift and false inclusions without relying on generative reconstruction or replay buffers. Empirically, our system steadily refines the notion of normality as unlabeled data arrive, producing substantial gains over baselines. On COVID-CXR, ROC-AUC improves from 0.9489 to 0.9982 (F1: 0.8048 to 0.9746); on Pneumonia CXR, ROC-AUC rises from 0.6834 to 0.8968; and on Brain MRI ND-5, ROC-AUC increases from 0.6041 to 0.7269 and PR-AUC from 0.7539 to 0.8211. 
These results highlight the effectiveness and efficiency of the proposed framework for real-world, label-scarce medical imaging applications.
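
The dual-gate admission rule in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `nearest_distance`, `calibrate`, and `admit` are hypothetical, the k-NN score is simplified to k = 1, the thresholds `z_max` and `u_max` are placeholder values, and the SWAG-based uncertainty is assumed to be computed elsewhere and passed in as a number.

```python
import math
import statistics


def nearest_distance(x, coreset):
    """Euclidean distance from x to its nearest coreset member (k-NN with k = 1)."""
    return min(math.dist(x, c) for c in coreset)


def calibrate(seed_embeddings):
    """Mean and standard deviation of leave-one-out nearest distances in the seed set."""
    dists = []
    for i, x in enumerate(seed_embeddings):
        others = seed_embeddings[:i] + seed_embeddings[i + 1:]
        dists.append(nearest_distance(x, others))
    return statistics.mean(dists), statistics.stdev(dists)


def admit(x, coreset, mu, sigma, uncertainty, z_max=2.0, u_max=0.1):
    """Admit x only if both gates pass.

    Gate 1: the z-score of its coreset distance is within z_max.
    Gate 2: its (externally estimated) epistemic uncertainty is within u_max.
    """
    z = (nearest_distance(x, coreset) - mu) / sigma
    return z <= z_max and uncertainty <= u_max
```

Requiring both gates means a sample that looks close to the coreset but on which the model is uncertain is still rejected, which is the property the abstract credits with preventing drift.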

[52] Adaptive Temporal Refinement: Continuous Depth Allocation and Distance Regression for Efficient Action Localization

Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma

Main category: cs.CV

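TL;DR: This paper proposes two complementary methods, Boundary Distance Regression (BDR) and Adaptive Temporal Refinement (ATR), to improve the accuracy and computational efficiency of temporal action localization. BDR achieves better boundary detection through signed-distance regression, while ATR adaptively allocates computation through continuous depth selection.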

Details Motivation: Existing methods apply uniform computation to boundaries of varying difficulty, ignoring the large differences in how hard boundaries are to detect, which limits both efficiency and accuracy. Method: Proposes Boundary Distance Regression (BDR) for information-theoretically optimal localization and Adaptive Temporal Refinement (ATR) for end-to-end differentiable allocation of computation; knowledge distillation is used to reduce training cost. Result: BDR yields mAP@0.7 improvements of 1.8% to 3.1% across architectures; on THUMOS14, ATR achieves a 2.9% improvement with 18% less compute, and up to 4.2% on short actions; lightweight student models retain 99% of the performance. Conclusion: BDR and ATR effectively improve boundary accuracy and computational efficiency in temporal action localization, apply broadly across architectures, and are rigorously validated on multiple benchmarks. Abstract: Temporal action localization requires precise boundary detection; however, current methods apply uniform computation despite significant variations in difficulty across boundaries. We present two complementary contributions. First, Boundary Distance Regression (BDR) provides information-theoretically optimal localization through signed-distance regression rather than classification, achieving 43\% sharper boundary peaks. BDR retrofits to existing methods with approximately 50 lines of code, yielding consistent 1.8 to 3.1\% mAP@0.7 improvements across diverse architectures. Second, Adaptive Temporal Refinement (ATR) allocates computation via continuous depth selection $\tau \in [0,1]$, enabling end-to-end differentiable optimization without reinforcement learning. On THUMOS14, ATR achieves 56.5\% mAP@0.7 at 162G FLOPs, compared to 53.6\% at 198G for uniform processing, providing a 2.9\% improvement with 18\% less compute. Gains scale with boundary heterogeneity, showing 4.2\% improvement on short actions. Training cost is mitigated via knowledge distillation, with lightweight students retaining 99\% performance at baseline cost. Results are validated across four benchmarks with rigorous statistical testing.
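
As a rough illustration of why signed-distance regression can localize a boundary between frames, the sketch below builds signed-distance targets for one boundary and recovers it from the zero crossing. It is a toy under stated assumptions, not the BDR implementation: the function names are hypothetical, and the sign convention (negative before the boundary, positive after) is chosen for illustration. The last helper simply re-derives the "18% less compute" figure from the FLOPs quoted in the abstract.

```python
def signed_distance_targets(num_frames, boundary):
    """Per-frame signed distance to a boundary at a (possibly fractional) position.

    Assumed convention: negative before the boundary, positive after it.
    """
    return [t - boundary for t in range(num_frames)]


def locate_boundary(pred):
    """Recover the boundary as the zero crossing, linearly interpolated."""
    for t in range(len(pred) - 1):
        a, b = pred[t], pred[t + 1]
        if a <= 0 < b:
            return t + (-a) / (b - a)
    return None  # no sign change, so no boundary in this window


def compute_saving(baseline_flops, reduced_flops):
    """Relative compute reduction, e.g. 198G -> 162G."""
    return 1 - reduced_flops / baseline_flops


targets = signed_distance_targets(8, boundary=3.4)
print(locate_boundary(targets))   # close to 3.4, i.e. sub-frame precision
print(compute_saving(198, 162))   # about 0.18, matching the quoted 18%
```

A per-frame classifier can only vote for integer frame indices, whereas the regressed distance carries the sub-frame offset directly, which is the intuition behind regressing distances rather than classifying frames.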

[53] Improving Multi-View Reconstruction via Texture-Guided Gaussian-Mesh Joint Optimization

Zhejia Cai,Puhua Jiang,Shiwei Mao,Hongkun Cao,Ruqi Huang

Main category: cs.CV

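TL;DR: Proposes a framework that jointly optimizes mesh geometry and vertex colors, unifying geometry and appearance optimization through Gaussian-guided differentiable rendering; it improves the quality of 3D reconstruction from multi-view images and supports downstream tasks such as relighting and shape editing.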

Details Motivation: Existing methods usually separate geometric accuracy from appearance rendering, which hinders subsequent editing tasks, so geometry and appearance need to be optimized together to make reconstructions more usable. Method: Proposes a new framework that uses Gaussian-guided differentiable rendering, combined with photometric consistency from the input images and geometric regularization from normal and depth maps, to simultaneously optimize mesh vertex positions, face structure, and vertex colors. Result: Achieves high-quality 3D reconstruction that balances geometric accuracy and realistic appearance, and works well in editing tasks such as relighting and shape deformation. Conclusion: The method achieves seamless joint optimization of geometry and appearance, offering a multi-view 3D reconstruction solution better suited to downstream editing tasks. Abstract: Reconstructing real-world objects from multi-view images is essential for applications in 3D editing, AR/VR, and digital content creation. Existing methods typically prioritize either geometric accuracy (Multi-View Stereo) or photorealistic rendering (Novel View Synthesis), often decoupling geometry and appearance optimization, which hinders downstream editing tasks. This paper advocates a unified treatment of geometry and appearance optimization for seamless Gaussian-mesh joint optimization. More specifically, we propose a novel framework that simultaneously optimizes mesh geometry (vertex positions and faces) and vertex colors via Gaussian-guided mesh differentiable rendering, leveraging photometric consistency from input images and geometric regularization from normal and depth maps. The obtained high-quality 3D reconstruction can be further exploited in downstream editing tasks, such as relighting and shape deformation. The code will be publicly available upon acceptance.

[54] A Linear Fractional Transformation Model and Calibration Method for Light Field Camera

Zhong Chen,Changfeng Chen

Main category: cs.CV

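TL;DR: Proposes a method for calibrating the intrinsic parameters of light field cameras based on a linear fractional transformation parameter α; by decoupling the main lens from the micro lens array and combining an analytical solution with nonlinear optimization, it improves calibration accuracy and simulation speed.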

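Details Motivation: Accurately calibrating the internal parameters of a light field camera is key to 3D reconstruction, but conventional methods struggle to decouple the effects of the main lens and the micro lens array, which limits calibration accuracy. Method: Introduces a linear fractional transformation parameter α to decouple the main lens and the micro lens array (MLA), uses a least-squares analytical solution followed by nonlinear optimization, and also presents a method for detecting features in raw images. Result: Experiments on real and simulated data verify the method's effectiveness; calibration accuracy is high, and simulating light field images with the model is faster, which benefits data-driven deep learning methods. Conclusion: The proposed method calibrates light field camera intrinsics efficiently and accurately and speeds up simulation, supporting the combination of light field imaging and deep learning. Abstract: Accurate calibration of internal parameters is a crucial yet challenging prerequisite for 3D reconstruction using light field cameras. In this paper, we propose a linear fractional transformation (LFT) parameter $\alpha$ to decouple the main lens and micro lens array (MLA). The proposed method includes an analytical solution based on least squares, followed by nonlinear refinement. The method for detecting features from the raw images is also introduced. Experimental results on both physical and simulated data have verified the performance of the proposed method. Based on the proposed model, the simulation of raw light field images becomes faster, which is crucial for data-driven deep learning methods. The corresponding code can be obtained from the author's website.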

[55] Room Envelopes: A Synthetic Dataset for Indoor Layout Reconstruction from Images

Sam Bahrami,Dylan Campbell

Main category: cs.CV

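TL;DR: Presents Room Envelopes, a synthetic dataset for supervising monocular geometry estimators to predict both the visible surface and the structural layout surface, enabling an understanding of the scene's extent and of the shape and location of its objects.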

Details Motivation: Existing scene reconstruction methods cannot recover occluded surfaces, and the structural elements of a scene (such as walls, floors, and ceilings), although important, have received little attention; these elements are typically planar, repetitive, and simple, and therefore relatively easy to predict. Method: Builds a synthetic dataset named Room Envelopes containing RGB images and two associated pointmaps per image, one for the visible surface and one for the structural layout surface once fittings and fixtures are removed, providing direct supervision for feed-forward monocular geometry estimators. Result: The dataset supports directly supervised training of monocular geometry estimation models so that they predict both the visible surface and the structural layout surface, improving their understanding of the complete scene structure. Conclusion: The Room Envelopes dataset can advance the complete reconstruction of indoor structural layouts and offers a feasible, low-cost route to generating complete 3D scenes. Abstract: Modern scene reconstruction methods are able to accurately recover 3D surfaces that are visible in one or more images. However, this leads to incomplete reconstructions, missing all occluded surfaces. While much progress has been made on reconstructing entire objects given partial observations using generative models, the structural elements of a scene, like the walls, floors and ceilings, have received less attention. We argue that these scene elements should be relatively easy to predict, since they are typically planar, repetitive and simple, and so less costly approaches may be suitable. In this work, we present a synthetic dataset -- Room Envelopes -- that facilitates progress on this task by providing a set of RGB images and two associated pointmaps for each image: one capturing the visible surface and one capturing the first surface once fittings and fixtures are removed, that is, the structural layout. As we show, this enables direct supervision for feed-forward monocular geometry estimators that predict both the first visible surface and the first layout surface. This confers an understanding of the scene's extent, as well as the shape and location of its objects.

[56] Simple 3D Pose Features Support Human and Machine Social Scene Understanding

Wenshuo Qin,Leyla Isik

Main category: cs.CV

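TL;DR: The study finds that humans make social interaction judgments from 3D pose information, shows that compact 3D social pose features (such as face position and orientation) suffice to predict human judgments and improve existing AI vision models, and concludes that explicit 3D pose representations are crucial for social scene understanding.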

Details Motivation: To understand how humans rapidly extract social interaction information from visual input, to expose the shortcomings of current AI vision systems in recognizing social interactions, and to explore the key role of 3D pose information. Method: Combines state-of-the-art pose and depth estimation algorithms to extract 3D joint positions of people in video clips, derives simplified 3D social pose features (face position and direction only), and compares them with existing AI vision models on their ability to predict human social judgments. Result: 3D joint positions outperform most existing AI vision models; the simplified 3D social pose features match the predictive power of the full set of 3D joints and significantly improve off-the-shelf AI models; the degree to which a model represents 3D social pose features correlates positively with its ability to match human judgments. Conclusion: Human understanding of social scenes relies on explicit 3D pose representations and can be supported by simple, structured visuospatial primitives, which has important implications for improving social understanding in AI systems. Abstract: Humans can quickly and effortlessly extract a variety of information about others' social interactions from visual input, ranging from visuospatial cues like whether two people are facing each other to higher-level information. Yet, the computations supporting these abilities remain poorly understood, and social interaction recognition continues to challenge even the most advanced AI vision systems. Here, we hypothesized that humans rely on 3D visuospatial pose information to make social interaction judgments, which is absent in most AI vision models. To test this, we combined state-of-the-art pose and depth estimation algorithms to extract 3D joint positions of people in short video clips depicting everyday human actions and compared their ability to predict human social interaction judgments with current AI vision models. Strikingly, 3D joint positions outperformed most current AI vision models, revealing that key social information is available in explicit body position but not in the learned features of most vision models, including even the layer-wise embeddings of the pose models used to extract joint positions. To uncover the critical pose features humans use to make social judgments, we derived a compact set of 3D social pose features describing only the 3D position and direction of faces in the videos. We found that these minimal descriptors matched the predictive strength of the full set of 3D joints and significantly improved the performance of off-the-shelf AI vision models when combined with their embeddings. 
Moreover, the degree to which 3D social pose features were represented in each off-the-shelf AI vision model predicted the model's ability to match human social judgments. Together, our findings provide strong evidence that human social scene understanding relies on explicit representations of 3D pose and can be supported by simple, structured visuospatial primitives.

[57] CaRF: Enhancing Multi-View Consistency in Referring 3D Gaussian Splatting Segmentation

Yuwen Tao,Kanglei Zhou,Xin Tan,Yuan Xie

Main category: cs.CV

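TL;DR: Proposes the CaRF framework, which achieves multi-view-consistent referring segmentation by operating directly in 3D Gaussian space; it introduces GFCE and ITPVS and significantly outperforms existing methods on several benchmarks.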

Details Motivation: Existing methods rely on 2D rendered pseudo supervision and view-specific feature learning, which causes cross-view inconsistency and makes accurate language-referred 3D segmentation difficult. Method: Proposes Camera Aware Referring Field (CaRF), comprising Gaussian Field Camera Encoding (GFCE), which incorporates camera geometry into Gaussian-text interactions, and In Training Paired View Supervision (ITPVS), which aligns per-Gaussian logits across calibrated views during training. Result: On the Ref LERF, LERF OVS, and 3D OVS benchmarks, mIoU improves on average by 16.8%, 4.3%, and 2.0% respectively, significantly outperforming existing methods. Conclusion: CaRF delivers more reliable and view-consistent 3D scene understanding, which can help advance embodied AI, AR/VR interaction, and autonomous perception. Abstract: Referring 3D Gaussian Splatting Segmentation (R3DGS) aims to interpret free-form language expressions and localize the corresponding 3D regions in Gaussian fields. While recent advances have introduced cross-modal alignment between language and 3D geometry, existing pipelines still struggle with cross-view consistency due to their reliance on 2D rendered pseudo supervision and view specific feature learning. In this work, we present Camera Aware Referring Field (CaRF), a fully differentiable framework that operates directly in the 3D Gaussian space and achieves multi view consistency. Specifically, CaRF introduces Gaussian Field Camera Encoding (GFCE), which incorporates camera geometry into Gaussian text interactions to explicitly model view dependent variations and enhance geometric reasoning. Building on this, In Training Paired View Supervision (ITPVS) is proposed to align per Gaussian logits across calibrated views during training, effectively mitigating single view overfitting and exposing inter view discrepancies for optimization. Extensive experiments on three representative benchmarks demonstrate that CaRF achieves average improvements of 16.8%, 4.3%, and 2.0% in mIoU over state of the art methods on the Ref LERF, LERF OVS, and 3D OVS datasets, respectively. Moreover, this work promotes more reliable and view consistent 3D scene understanding, with potential benefits for embodied AI, AR/VR interaction, and autonomous perception.

[58] PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection

Peiyao Wang,Weining Wang,Qi Li

Main category: cs.CV

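TL;DR: This paper proposes the PhysCorr framework, which improves physical consistency in video generation through the PhysicsRM reward model and the PhyDPO optimization method, significantly enhancing physical realism while preserving visual quality.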

Details Motivation: Existing text-to-video generation models fall short on physical plausibility, with unrealistic object motion and incoherent interactions, which limits their use in embodied AI, robotics, and simulation. Method: Proposes PhysicsRM, the first dual-dimensional reward model that quantifies intra-object stability and inter-object interactions, and builds on it the PhyDPO preference optimization framework, which combines contrastive feedback with physics-aware reweighting to guide generation. Result: Experiments on multiple benchmarks show that PhysCorr significantly improves the physical realism of generated videos while maintaining good visual quality and semantic consistency. Conclusion: PhysCorr provides an effective and scalable solution for physically plausible video generation and advances the use of video generation in real-world scenarios that demand physical plausibility. Abstract: Recent advances in text-to-video generation have achieved impressive perceptual quality, yet generated content often violates fundamental principles of physical plausibility - manifesting as implausible object dynamics, incoherent interactions, and unrealistic motion patterns. Such failures hinder the deployment of video generation models in embodied AI, robotics, and simulation-intensive domains. To bridge this gap, we propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. On this foundation, we develop PhyDPO, a novel direct preference optimization pipeline that leverages contrastive feedback and physics-aware reweighting to guide generation toward physically coherent outputs. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones. Extensive experiments across multiple benchmarks demonstrate that PhysCorr achieves significant improvements in physical realism while preserving visual fidelity and semantic alignment. This work takes a critical step toward physically grounded and trustworthy video generation.

[59] GNN-MoE: Context-Aware Patch Routing using GNNs for Parameter-Efficient Domain Generalization

Mahmoud Soliman,Omar Abdelaziz,Ahmed Radwan,Anand,Mohamed Shehata

Main category: cs.CV

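TL;DR: Proposes GNN-MoE, which combines graph neural network routing with a Mixture-of-Experts framework and uses Kronecker adapters for parameter-efficient fine-tuning, improving the generalization of Vision Transformers to unseen domains.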

Details Motivation: To address the high cost of standard fine-tuning and its harm to generalization, and to improve the parameter efficiency and robustness of pretrained ViTs in domain generalization. Method: Designs a graph neural network (GNN) router (e.g. GCN, GAT, SAGE) that operates on inter-patch graphs and dynamically assigns image patches to specialized experts, with Kronecker adapters providing parameter-efficient fine-tuning. Result: Achieves state-of-the-art or competitive performance on several domain generalization benchmarks while maintaining high parameter efficiency. Conclusion: Graph-structured, context-aware routing effectively improves ViT adaptation in domain generalization, and GNN-MoE offers a new direction for lightweight, robust domain generalization. Abstract: Domain generalization (DG) seeks robust Vision Transformer (ViT) performance on unseen domains. Efficiently adapting pretrained ViTs for DG is challenging; standard fine-tuning is costly and can impair generalization. We propose GNN-MoE, enhancing Parameter-Efficient Fine-Tuning (PEFT) for DG with a Mixture-of-Experts (MoE) framework using efficient Kronecker adapters. Instead of token-based routing, a novel Graph Neural Network (GNN) router (GCN, GAT, SAGE) operates on inter-patch graphs to dynamically assign patches to specialized experts. This context-aware GNN routing leverages inter-patch relationships for better adaptation to domain shifts. GNN-MoE achieves state-of-the-art or competitive DG benchmark performance with high parameter efficiency, highlighting the utility of graph-based contextual routing for robust, lightweight DG.
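
To give a sense of why Kronecker-factored adapters are parameter-efficient, here is a small sketch of the Kronecker product and the resulting parameter counts. It is a generic illustration, not the GNN-MoE adapter itself: the abstract does not give the factor shapes, so the sizes below (two factors of 32x32 and 24x24 standing in for a 768x768 update) are assumptions, and the helper names are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def adapter_params(m1, n1, m2, n2):
    """Parameters of a dense (m1*m2) x (n1*n2) update vs. its two Kronecker factors."""
    dense = (m1 * m2) * (n1 * n2)
    factored = m1 * n1 + m2 * n2
    return dense, factored


small = kron([[1, 2], [3, 4]], [[0, 1], [1, 0]])
print(len(small), len(small[0]))       # 4 4: a 2x2 (x) 2x2 product is 4x4
print(adapter_params(32, 32, 24, 24))  # (589824, 1600)
```

Under these assumed shapes, a full 768x768 weight update needs 589,824 parameters, while storing only the two factors needs 1,600, which is the kind of saving that makes per-expert adapters affordable in a Mixture-of-Experts setting.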

[60] MedDChest: A Content-Aware Multimodal Foundational Vision Model for Thoracic Imaging

Mahmoud Soliman,Islam Osman,Mohamed S. Shehata,Rasika Rajapakshe

Main category: cs.CV

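TL;DR: MedDChest is a new Vision Transformer model designed for thoracic imaging; pre-trained from scratch on large-scale medical images with a content-aware data augmentation strategy, it significantly outperforms ImageNet-pretrained models.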

Details Motivation: Existing vision models underperform in medical imaging, mainly because of the domain gap between their natural-image-pretrained backbones and medical images. Method: Proposes MedDChest, pre-trained from scratch on more than 1.2 million multimodal thoracic images (X-ray and CT), and introduces a content-aware data augmentation method called Guided Random Resized Crops that focuses on anatomically relevant regions. Result: After fine-tuning on a variety of downstream diagnostic tasks, MedDChest significantly outperforms existing ImageNet-pretrained models, confirming its effectiveness and robustness as a feature extractor for thoracic imaging. Conclusion: Large-scale, in-domain pre-training combined with domain-specific data augmentation effectively improves medical image analysis; MedDChest provides a better starting point for thoracic diagnostic tasks, and the model weights will be released to support follow-up research. Abstract: The performance of vision models in medical imaging is often hindered by the prevailing paradigm of fine-tuning backbones pre-trained on out-of-domain natural images. To address this fundamental domain gap, we propose MedDChest, a new foundational Vision Transformer (ViT) model optimized specifically for thoracic imaging. We pre-trained MedDChest from scratch on a massive, curated, multimodal dataset of over 1.2 million images, encompassing different modalities including Chest X-ray and Computed Tomography (CT) compiled from 10 public sources. A core technical contribution of our work is Guided Random Resized Crops, a novel content-aware data augmentation strategy that biases sampling towards anatomically relevant regions, overcoming the inefficiency of standard cropping techniques on medical scans. We validate our model's effectiveness by fine-tuning it on a diverse set of downstream diagnostic tasks. Comprehensive experiments empirically demonstrate that MedDChest significantly outperforms strong, publicly available ImageNet-pretrained models. By establishing the superiority of large-scale, in-domain pre-training combined with domain-specific data augmentation, MedDChest provides a powerful and robust feature extractor that serves as a significantly better starting point for a wide array of thoracic diagnostic tasks. The model weights will be made publicly available to foster future research and applications.

[61] Near-Lossless 3D Voxel Representation Free from Iso-surface

Yihao Luo,Xianglong He,Chuanyu Pan,Yiwen Chen,Jiaqi Wu,Yangguang Li,Wanli Ouyang,Yuanming Hu,Guang Yang,ChoonHwai Yap

Main category: cs.CV

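TL;DR: Proposes Faithful Contouring, a sparse voxelized representation that supports near-lossless representation and reconstruction of arbitrary meshes at high resolution (2048+) without field-function conversion or iso-surface extraction, and significantly outperforms existing methods in geometric fidelity and efficiency.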

Details Motivation: Existing iso-surface-based voxelized representations depend on water-tightening or rendering optimization, which distorts geometry and makes it hard to represent complex geometry and topology accurately, so a higher-fidelity, more efficient 3D mesh representation is needed. Method: Proposes Faithful Contouring, a sparse voxelized representation that requires neither converting meshes to field functions nor extracting iso-surfaces, and designs a dual-mode autoencoder for scalable, detail-preserving shape reconstruction. Result: Reaches distance errors at the 10^{-5} level for direct representation; for mesh reconstruction, Chamfer Distance drops by 93% and F-score improves by 35%; sharp features and internal structures are preserved, and high resolutions (2048+) are supported. Conclusion: Faithful Contouring outperforms existing methods in both accuracy and efficiency and is a high-fidelity 3D mesh representation suited to 3D reconstruction and generation tasks. Abstract: Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surface heavily rely on water-tightening or rendering optimization, which inevitably compromise geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the isosurface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiments show that Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction. For direct representation, it achieves distance errors at the $10^{-5}$ level; for mesh reconstruction, it yields a 93\% reduction in Chamfer Distance and a 35\% improvement in F-score over strong baselines, confirming superior fidelity as a representation for 3D learning tasks.
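
For readers unfamiliar with the Chamfer Distance quoted above, a minimal brute-force sketch follows. Conventions differ between papers (squared vs. unsquared distances, sum vs. mean), and the abstract does not say which one is used; the version below, with unsquared Euclidean distances averaged in each direction, is one common choice and is an assumption here.

```python
import math


def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two point sets (lists of coordinate tuples).

    Assumed convention: mean unsquared nearest-neighbor distance, summed over
    both directions. This is O(len(p) * len(q)) and only suitable for small sets.
    """
    def one_way(src, dst):
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)

    return one_way(p, q) + one_way(q, p)


surface = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(surface, surface))  # 0.0 for identical point sets
print(chamfer_distance(surface, shifted))  # 2.0: each direction contributes 1.0
```

A lower value means the reconstructed surface samples lie closer to the reference samples, so a 93% reduction indicates a much tighter geometric match under whatever convention the authors adopt.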

[62] A Hybrid Deep Learning Model for Robust Biometric Authentication from Low-Frame-Rate PPG Signals

Arfina Rahman,Mahesh Banavar

Main category: cs.CV

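TL;DR: Proposes a lightweight biometric authentication framework based on PPG signals extracted from low-frame-rate fingertip videos; combined with a hybrid CVT-ConvMixer-LSTM deep learning model, it reaches 98% authentication accuracy on the CFIHSR dataset.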

Details Motivation: PPG signals are non-invasive, offer inherent liveness detection, and suit low-cost wearable devices, but in practice they are affected by motion artifacts, illumination changes, and inter-subject physiological variability, so signal robustness and authentication performance need improvement. Method: Uses the CFIHSR dataset sampled at a low rate of 14 Hz; the PPG signals undergo baseline drift removal, PCA-based motion artifact suppression, bandpass filtering, Fourier-based resampling, and amplitude normalization; the one-dimensional PPG signal is converted into a two-dimensional time-frequency scalogram via the Continuous Wavelet Transform (CWT), and a hybrid CVT-ConvMixer-LSTM model extracts spatiotemporal features for authentication. Result: Experiments on 46 subjects show an authentication accuracy of 98%, with strong robustness to noise and inter-subject variability. Conclusion: The proposed framework is lightweight and efficient, scales well, provides liveness detection, and is suitable for mobile and embedded biometric security applications. Abstract: Photoplethysmography (PPG) signals, which measure changes in blood volume in the skin using light, have recently gained attention in biometric authentication because of their non-invasive acquisition, inherent liveness detection, and suitability for low-cost wearable devices. However, PPG signal quality is challenged by motion artifacts, illumination changes, and inter-subject physiological variability, making robust feature extraction and classification crucial. This study proposes a lightweight and cost-effective biometric authentication framework based on PPG signals extracted from low-frame-rate fingertip videos. The CFIHSR dataset, comprising PPG recordings from 46 subjects at a sampling rate of 14 Hz, is employed for evaluation. The raw PPG signals undergo a standard preprocessing pipeline involving baseline drift removal, motion artifact suppression using Principal Component Analysis (PCA), bandpass filtering, Fourier-based resampling, and amplitude normalization. To generate robust representations, each one-dimensional PPG segment is converted into a two-dimensional time-frequency scalogram via the Continuous Wavelet Transform (CWT), effectively capturing transient cardiovascular dynamics. We developed a hybrid deep learning model, termed CVT-ConvMixer-LSTM, by combining spatial features from the Convolutional Vision Transformer (CVT) and ConvMixer branches with temporal features from a Long Short-Term Memory network (LSTM). The experimental results on 46 subjects demonstrate an authentication accuracy of 98%, validating the robustness of the model to noise and variability between subjects. 
Due to its efficiency, scalability, and inherent liveness detection capability, the proposed system is well-suited for real-world mobile and embedded biometric security applications.

[63] Unveiling Deep Semantic Uncertainty Perception for Language-Anchored Multi-modal Vision-Brain Alignment

Zehui Feng,Chenqi Zhang,Mingru Wang,Minuo Wei,Shiwei Cheng,Cuntai Guan,Ting Han

Main category: cs.CV

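TL;DR: Proposes Bratrix, the first end-to-end framework for language-anchored vision-brain alignment; by decoupling visual stimuli and introducing an uncertainty perception module, it significantly improves retrieval, reconstruction, and captioning on EEG, MEG, and fMRI tasks.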

Details Motivation: Existing methods align neural activity directly with visual embeddings but fail to capture latent semantic dimensions, which limits interpretability and robustness. Method: Bratrix decouples visual stimuli into hierarchical visual and linguistic semantic components and maps visual and brain signals into a shared latent space; it adds an uncertainty perception module for weighted alignment, uses language-anchored semantic matrices to strengthen cross-modal correlations, and refines alignment precision with a two-stage training strategy (single-modality pretraining followed by multimodal fine-tuning). Result: On EEG, MEG, and fMRI benchmarks, Bratrix outperforms state-of-the-art methods in retrieval, reconstruction, and image captioning, with a 14.3% improvement on the 200-way EEG retrieval task. Conclusion: Through language-anchored multimodal alignment, Bratrix effectively disentangles visual semantics in neural signals, improves the precision and robustness of cross-modal alignment, and offers a more interpretable framework for brain signal decoding. Abstract: Unveiling visual semantics from neural signals such as EEG, MEG, and fMRI remains a fundamental challenge due to subject variability and the entangled nature of visual features. Existing approaches primarily align neural activity directly with visual embeddings, but visual-only representations often fail to capture latent semantic dimensions, limiting interpretability and deep robustness. To address these limitations, we propose Bratrix, the first end-to-end framework to achieve multimodal Language-Anchored Vision-Brain alignment. Bratrix decouples visual stimuli into hierarchical visual and linguistic semantic components, and projects both visual and brain representations into a shared latent space, enabling the formation of aligned visual-language and brain-language embeddings. To emulate human-like perceptual reliability and handle noisy neural signals, Bratrix incorporates a novel uncertainty perception module that applies uncertainty-aware weighting during alignment. By leveraging learnable language-anchored semantic matrices to enhance cross-modal correlations and employing a two-stage training strategy of single-modality pretraining followed by multimodal fine-tuning, Bratrix-M improves alignment precision. Extensive experiments on EEG, MEG, and fMRI benchmarks demonstrate that Bratrix improves retrieval, reconstruction, and captioning performance compared to state-of-the-art methods, specifically surpassing them by 14.3% on the 200-way EEG retrieval task. Code and model are available.

[64] Adversarial and Score-Based CT Denoising: CycleGAN vs Noise2Score

Abu Hanif Muhammad Syarubany

Main category: cs.CV

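TL;DR: This paper studies two training-data-efficient approaches to CT image denoising in the unpaired and self-supervised settings: a CycleGAN-based residual translator and a Noise2Score (N2S) score-matching denoiser. Experiments show that CycleGAN gives the best final image quality, while Noise2Score still denoises substantially without clean paired data, making it a robust and well-performing alternative.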

Details Motivation: Improving CT image denoising without paired training data is a key challenge. This paper evaluates two denoising paradigms that need no paired data, CycleGAN and Noise2Score, to explore their applicability and strengths in realistic settings. Method: Uses a CycleGAN-based residual translator and a Noise2Score score-matching model for CT denoising. Under a common evaluation protocol, a parameter sweep over CycleGAN settings (such as lambda_cycle, lambda_iden, and network width) identifies the best configuration, which uses a U-Net backbone and is then trained to convergence. Noise2Score uses score matching to learn the noise distribution without paired data. Result: The selected CycleGAN improves the input from 34.66 dB / 0.9234 SSIM to 38.913 dB / 0.971 SSIM and scores 1.9343 on the unseen Kaggle set; Noise2Score is slightly lower in PSNR / SSIM but improves very noisy inputs substantially, showing its value in the pair-free setting. Conclusion: CycleGAN delivers better final denoising quality than Noise2Score and is the preferred choice here, while Noise2Score, a self-supervised method requiring no paired data, performs competitively and robustly, making it suitable for clinical settings where clean paired images are unavailable. Abstract: We study CT image denoising in the unpaired and self-supervised regimes by evaluating two strong, training-data-efficient paradigms: a CycleGAN-based residual translator and a Noise2Score (N2S) score-matching denoiser. Under a common evaluation protocol, a configuration sweep identifies a simple standard U-Net backbone within CycleGAN (lambda_cycle = 30, lambda_iden = 2, ngf = ndf = 64) as the most reliable setting; we then train it to convergence with a longer schedule. The selected CycleGAN improves the noisy input from 34.66 dB / 0.9234 SSIM to 38.913 dB / 0.971 SSIM and attains an estimated score of 1.9441 and an unseen-set (Kaggle leaderboard) score of 1.9343. Noise2Score, while slightly behind in absolute PSNR / SSIM, achieves large gains over very noisy inputs, highlighting its utility when clean pairs are unavailable. Overall, CycleGAN offers the strongest final image quality, whereas Noise2Score provides a robust pair-free alternative with competitive performance. Source code is available at https://github.com/hanifsyarubany/CT-Scan-Image-Denoising-using-CycleGAN-and-Noise2Score.
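
To put the quoted PSNR figures in context, the sketch below computes PSNR from its standard definition, 10 * log10(MAX^2 / MSE), and converts the reported gain (34.66 dB to 38.913 dB) into the implied reduction in mean squared error. The function name and the flat-list image format are illustrative assumptions; the paper's own evaluation code and intensity range are not given in the abstract.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally long pixel sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


# Gain reported in the abstract, and the MSE ratio it implies.
gain_db = 38.913 - 34.66
mse_ratio = 10 ** (gain_db / 10)
print(round(gain_db, 3))  # 4.253
print(mse_ratio)          # factor by which the mean squared error shrinks
```

Because PSNR is logarithmic, a gain of a few decibels corresponds to a multiplicative drop in mean squared error, which is why differences that look small in dB can be visually meaningful.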

[65] When Swin Transformer Meets KANs: An Improved Transformer Architecture for Medical Image Segmentation

Nishchal Sapkota,Haoyan Shi,Yejia Zhang,Xianshi Ma,Bofang Zheng,Danny Z. Chen

Main category: cs.CV

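TL;DR: This paper proposes UKAST, a new medical image segmentation architecture that combines the Swin Transformer with rational-function-based Kolmogorov-Arnold Networks (KANs), achieving higher data efficiency and performance with less computation, especially when annotated data is scarce.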

Details Motivation: Medical image segmentation is challenged by complex anatomical structures and limited annotated data; conventional CNNs struggle to capture long-range dependencies, while Transformers model global context but are data-hungry and computationally expensive, so a more efficient and data-friendly model is needed. Method: Proposes UKAST, a U-Net-like architecture that integrates rational-function-based Kolmogorov-Arnold Networks (GR-KANs) into Swin Transformer encoders, using the rational base functions from KAT to improve expressiveness and data efficiency and to reduce FLOPs with only a slight increase in parameter count. Result: Reaches state-of-the-art performance on four 2D and 3D medical image segmentation benchmarks, clearly outperforming CNN and Transformer baselines, and is notably more robust and accurate when data is scarce. Conclusion: The KAN-enhanced Transformer architecture (UKAST) shows strong data efficiency and computational advantages in medical image segmentation, offers an effective route to easing the data hunger of Transformers, and has broad application prospects. Abstract: Medical image segmentation is critical for accurate diagnostics and treatment planning, but remains challenging due to complex anatomical structures and limited annotated training data. CNN-based segmentation methods excel at local feature extraction, but struggle with modeling long-range dependencies. Transformers, on the other hand, capture global context more effectively, but are inherently data-hungry and computationally expensive. In this work, we introduce UKAST, a U-Net like architecture that integrates rational-function based Kolmogorov-Arnold Networks (KANs) into Swin Transformer encoders. By leveraging rational base functions and Group Rational KANs (GR-KANs) from the Kolmogorov-Arnold Transformer (KAT), our architecture addresses the inefficiencies of vanilla spline-based KANs, yielding a more expressive and data-efficient framework with reduced FLOPs and only a very small increase in parameter count compared to SwinUNETR. UKAST achieves state-of-the-art performance on four diverse 2D and 3D medical image segmentation benchmarks, consistently surpassing both CNN- and Transformer-based baselines. Notably, it attains superior accuracy in data-scarce settings, alleviating the data-hungry limitations of standard Vision Transformers. These results show the potential of KAN-enhanced Transformers to advance data-efficient medical image segmentation. Code is available at: https://github.com/nsapkota417/UKAST

[66] SpatialLock: Precise Spatial Control in Text-to-Image Synthesis

Biao Liu,Yuanzhi Liang

Main category: cs.CV

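TL;DR: Proposes SpatialLock, a new framework that improves the precision of object localization in text-to-image generation by combining perception signals and grounding information to jointly control the generation of spatial locations.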

Details Motivation: Existing text-to-image generation methods do not make full use of positional information, which leads to an inadequate understanding of object spatial layouts. Method: SpatialLock has two components: Position-Engaged Injection (PoI) and Position-Guided Learning (PoG). PoI integrates spatial information directly through an attention layer so that the model learns the grounding information effectively; PoG uses perception-based supervision to further refine object localization. Result: Experiments show that SpatialLock sets a new state of the art for precise object positioning, achieving IOU scores above 0.9 on multiple datasets and improving the visual quality of generated images. Conclusion: SpatialLock effectively improves both the precision of object spatial arrangement and the visual quality of images in text-to-image generation. Abstract: Text-to-Image (T2I) synthesis has made significant advancements in recent years, driving applications such as generating datasets automatically. However, precise control over object localization in generated images remains a challenge. Existing methods fail to fully utilize positional information, leading to an inadequate understanding of object spatial layouts. To address this issue, we propose SpatialLock, a novel framework that leverages perception signals and grounding information to jointly control the generation of spatial locations. SpatialLock incorporates two components: Position-Engaged Injection (PoI) and Position-Guided Learning (PoG). PoI directly integrates spatial information through an attention layer, encouraging the model to learn the grounding information effectively. PoG employs perception-based supervision to further refine object localization. Together, these components enable the model to generate objects with precise spatial arrangements and improve the visual quality of the generated images. Experiments show that SpatialLock sets a new state-of-the-art for precise object positioning, achieving IOU scores above 0.9 across multiple datasets.
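
The IOU metric quoted above measures how well a generated object's bounding box overlaps the requested one. A minimal sketch for axis-aligned boxes follows; the `(x1, y1, x2, y2)` box format and the function name are assumptions for illustration, and the abstract does not describe how the authors detect boxes in generated images.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap extents clamp to zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


requested = (0, 0, 10, 10)
generated = (1, 0, 10, 10)
print(box_iou(requested, generated))  # 0.9
```

In this toy case the generated box misses a one-unit strip of a ten-unit-wide target, which already lowers the score to 0.9, so scores above 0.9 imply tight agreement with the requested layout.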

[67] Tortoise and Hare Guidance: Accelerating Diffusion Model Inference with Multirate Integration

Yunghee Lee,Byeonghyun Pak,Junwha Hong,Hoseong Kim

Main category: cs.CV

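TL;DR: Proposes Tortoise and Hare Guidance (THG), a training-free method for accelerating diffusion sampling that uses a multirate ODE system to cut computation, significantly reducing the number of function evaluations while preserving generation quality.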

Details Motivation: Existing diffusion sampling methods are inefficient when maintaining high-fidelity generation, and conventional solvers fail to exploit the redundancy in the guidance branch. Method: Rewrites the classifier-free guidance (CFG) ODE as a multirate system of ODEs and analyzes how differently the noise estimate and the additional guidance term respond to numerical error; THG updates the noise estimate on the fine-grained timestep grid (the tortoise equation) and the additional guidance term on a coarse grid (the hare equation), and adds an error-bound-aware timestep sampler and a guidance-scale scheduler. Result: THG reduces the number of function evaluations (NFE) by up to 30% with almost no loss in generation quality (ΔImageReward ≤ 0.032) and outperforms state-of-the-art training-free CFG accelerators under the same compute budget. Conclusion: Multirate formulations hold great potential for diffusion solvers, enabling efficient, high-quality real-time image synthesis without retraining the model. Abstract: In this paper, we propose Tortoise and Hare Guidance (THG), a training-free strategy that accelerates diffusion sampling while maintaining high-fidelity generation. We demonstrate that the noise estimate and the additional guidance term exhibit markedly different sensitivity to numerical error by reformulating the classifier-free guidance (CFG) ODE as a multirate system of ODEs. Our error-bound analysis shows that the additional guidance branch is more robust to approximation, revealing substantial redundancy that conventional solvers fail to exploit. Building on this insight, THG significantly reduces the computation of the additional guidance: the noise estimate is integrated with the tortoise equation on the original, fine-grained timestep grid, while the additional guidance is integrated with the hare equation only on a coarse grid. We also introduce (i) an error-bound-aware timestep sampler that adaptively selects step sizes and (ii) a guidance-scale scheduler that stabilizes large extrapolation spans. THG reduces the number of function evaluations (NFE) by up to 30% with virtually no loss in generation fidelity ($\Delta$ImageReward $\leq$ 0.032) and outperforms state-of-the-art CFG-based training-free accelerators under identical computation budgets. Our findings highlight the potential of multirate formulations for diffusion solvers, paving the way for real-time high-quality image synthesis without any model retraining. The source code is available at https://github.com/yhlee-add/THG.
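
The source of the NFE saving can be seen in a deliberately simplified toy: evaluate the conditional noise estimate at every step, but refresh the guidance term only every few steps and reuse it in between. This is not the THG solver (which integrates separate tortoise and hare equations and adds a timestep sampler and scale scheduler); it only illustrates coarse-grid reuse of the guidance term. The scalar state, the Euler update, the split of CFG into `e_c + (scale - 1) * (e_c - e_u)`, and all names are assumptions made for the sketch.

```python
def sample_with_coarse_guidance(x, timesteps, eps_cond, eps_uncond, scale, stride):
    """Toy Euler sampler that refreshes the CFG guidance term every `stride` steps.

    Returns the final state and the number of model evaluations (NFE).
    """
    nfe = 0
    guidance = 0.0
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        e_c = eps_cond(x, t)  # fine grid: evaluated at every step
        nfe += 1
        if i % stride == 0:
            e_u = eps_uncond(x, t)  # coarse grid: evaluated only occasionally
            nfe += 1
            guidance = (scale - 1.0) * (e_c - e_u)
        # Standard CFG, e_u + scale * (e_c - e_u), rewritten as e_c + guidance.
        eps = e_c + guidance
        x = x + (t_next - t) * eps  # Euler step on a toy ODE dx/dt = eps
    return x, nfe


steps = [1 - i / 10 for i in range(11)]  # 10 steps from t = 1 to t = 0


def cond(x, t):
    return x  # stand-in for the conditional model


def uncond(x, t):
    return 0.5 * x  # stand-in for the unconditional model


_, nfe_full = sample_with_coarse_guidance(1.0, steps, cond, uncond, scale=3.0, stride=1)
_, nfe_coarse = sample_with_coarse_guidance(1.0, steps, cond, uncond, scale=3.0, stride=2)
print(nfe_full, nfe_coarse)  # 20 15
```

With a stride of 2 the toy drops from 20 to 15 evaluations over 10 steps; whether quality survives such reuse is exactly what the paper's error-bound analysis addresses, and this sketch makes no claim about that.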

[68] Text to Sketch Generation with Multi-Styles

Tengjie Li,Shikui Tu,Lei Xu

Main category: cs.CV

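TL;DR: Proposes a training-free diffusion framework that provides explicit style control through textual prompts and reference style sketches, reducing content leakage, improving generation quality, and supporting controllable multi-style generation.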

Details Motivation: Existing sketch generation methods lack precise control over style and struggle to deliver diverse style transfer while keeping content independent of the reference. Method: Built on diffusion models, the approach blends reference features in with linear smoothing and adds a style-content guidance mechanism; a joint AdaIN module integrates features from multiple reference sketches to enable multi-style generation. Result: Experiments show that the method outperforms existing approaches in style-alignment accuracy, generation quality, and low-structural-similarity scenarios, and supports flexible multi-style control. Conclusion: The framework delivers high-quality, highly flexible sketch style generation, separates style from content without any training, and advances controllable visual generation. Abstract: Recent advances in vision-language models have facilitated progress in sketch generation. However, existing specialized methods primarily focus on generic synthesis and lack mechanisms for precise control over sketch styles. In this work, we propose a training-free framework based on diffusion models that enables explicit style guidance via textual prompts and referenced style sketches. Unlike previous style transfer methods that overwrite key and value matrices in self-attention, we incorporate the reference features as auxiliary information with linear smoothing and leverage a style-content guidance mechanism. This design effectively reduces content leakage from reference sketches and enhances synthesis quality, especially in cases with low structural similarity between reference and target sketches. Furthermore, we extend our framework to support controllable multi-style generation by integrating features from multiple reference sketches, coordinated via a joint AdaIN module. Extensive experiments demonstrate that our approach achieves high-quality sketch generation with accurate style alignment and improved flexibility in style control. The official implementation of M3S is available at https://github.com/CMACH508/M3S.
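For readers unfamiliar with AdaIN, the sketch below shows the basic operation on a single feature channel: the content features are normalised and then rescaled to the style statistics. The `joint_style_stats` helper, which blends the statistics of several references with user-chosen weights, is an assumption made here for illustration; the abstract does not specify how the joint AdaIN module coordinates multiple references.

```python
from statistics import fmean, pstdev


def adain(content, style_mean, style_std, eps=1e-8):
    """Adaptive instance normalisation of one feature channel (a flat list)."""
    mu, sigma = fmean(content), pstdev(content)
    return [(v - mu) / (sigma + eps) * style_std + style_mean for v in content]


def joint_style_stats(styles, weights):
    """Hypothetical blend: weighted average of per-reference mean and std."""
    total = sum(weights)
    mean = sum(w * fmean(s) for s, w in zip(styles, weights)) / total
    std = sum(w * pstdev(s) for s, w in zip(styles, weights)) / total
    return mean, std


content = [0.0, 1.0, 2.0, 3.0]
style_a = [10.0, 12.0, 14.0, 16.0]
style_b = [-1.0, 0.0, 1.0, 2.0]

mean, std = joint_style_stats([style_a, style_b], [0.5, 0.5])
stylised = adain(content, mean, std)
```

After the call, `stylised` keeps the ordering of the content values but carries the blended mean and standard deviation of the two references.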

[69] Automated Tennis Player and Ball Tracking with Court Keypoints Detection (Hawk Eye System)

Venkata Manikanta Desu,Syed Fawaz Ali

Main category: cs.CV

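TL;DR: Presents a deep-learning pipeline for automated tennis match analysis that integrates player and ball detection with court keypoint recognition, producing annotated video and detailed metrics in real time.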

Details Motivation: To raise the level of automation in tennis match analysis and provide more precise performance-assessment tools that support coaches, broadcasters, and players in tactical analysis and training optimization. Method: YOLOv8 is used for player detection, a custom-trained YOLOv5 model for ball tracking, and a ResNet50-based architecture for court keypoint detection; the outputs of these models are fused to produce spatiotemporal analytics. Result: The system is robust across varying court conditions and match scenarios and accurately extracts key metrics such as player movement patterns, ball speed, shot accuracy, and reaction time. Conclusion: The framework offers an efficient and accurate automated analysis solution for tennis matches, with practical value for sports training, broadcasting, and sports science research. Abstract: This study presents a complete pipeline for automated tennis match analysis. Our framework integrates multiple deep learning models to detect and track players and the tennis ball in real time, while also identifying court keypoints for spatial reference. Using YOLOv8 for player detection, a custom-trained YOLOv5 model for ball tracking, and a ResNet50-based architecture for court keypoint detection, our system provides detailed analytics including player movement patterns, ball speed, shot accuracy, and player reaction times. The experimental results demonstrate robust performance in varying court conditions and match scenarios. The model outputs an annotated video along with detailed performance metrics, enabling coaches, broadcasters, and players to gain actionable insights into the dynamics of the game.

[70] DMSORT: An efficient parallel maritime multi-object tracking architecture for unmanned vessel platforms

Shengyu Tang,Zeyuan Lu,Jiazhi Dong,Changdong Yu,Xiaoyu Wang,Yaohui Lyu,Weihao Xia

Main category: cs.CV

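TL;DR: Proposes DMSORT, an efficient dual-branch maritime multi-object tracking method that combines a detection and re-identification branch with a camera motion estimation branch, achieving robust vessel tracking in complex sea conditions with high accuracy and real-time speed.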

Details Motivation: The complicated maritime environment causes camera motion and visual degradation that severely hurt existing maritime multi-object tracking (MOT); stability and identity consistency under jitter, occlusion, and similar disturbances need to improve. Method: A dual-branch parallel tracking framework. One branch performs detection and re-identification using a Reversible Columnar Detection Network (RCDN) and a lightweight Transformer-based appearance extractor (Li-TAE); the other estimates dynamic camera motion by constructing a projective transformation and compensating for platform motion within the Kalman filter. A clustering-optimized feature fusion module combines motion and appearance cues. Result: The method achieves state-of-the-art performance on the Singapore Maritime Dataset and has the fastest runtime among existing ReID-based MOT methods, while maintaining high identity consistency and robustness to jitter and occlusion. Conclusion: The dual-branch design of DMSORT handles the challenges of camera motion at sea, improving the accuracy and stability of maritime MOT while remaining real-time, and is suited to practical navigation safety and surveillance applications. Abstract: Accurate perception of the marine environment through robust multi-object tracking (MOT) is essential for ensuring safe vessel navigation and effective maritime surveillance. However, the complicated maritime environment often causes camera motion and subsequent visual degradation, posing significant challenges to MOT. To address this challenge, we propose an efficient Dual-branch Maritime SORT (DMSORT) method for maritime MOT. The core of the framework is a parallel tracker with affine compensation, which incorporates an object detection and re-identification (ReID) branch, along with a dedicated branch for dynamic camera motion estimation. Specifically, a Reversible Columnar Detection Network (RCDN) is integrated into the detection module to leverage multi-level visual features for robust object detection. Furthermore, a lightweight Transformer-based appearance extractor (Li-TAE) is designed to capture global contextual information and generate robust appearance features. Another branch decouples platform-induced and target-intrinsic motion by constructing a projective transformation, applying platform-motion compensation within the Kalman filter, and thereby stabilizing true object trajectories. Finally, a clustering-optimized feature fusion module effectively combines motion and appearance cues to ensure identity consistency under noise, occlusion, and drift. Extensive evaluations on the Singapore Maritime Dataset demonstrate that DMSORT achieves state-of-the-art performance. 
Notably, DMSORT attains the fastest runtime among existing ReID-based MOT frameworks while maintaining high identity consistency and robustness to jitter and occlusion. Code is available at: https://github.com/BiscuitsLzy/DMSORT-An-efficient-parallel-maritime-multi-object-tracking-architecture-.
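The platform-motion compensation step can be pictured as warping each track's predicted position with a frame-to-frame projective transformation before association. The sketch below is an illustration under that assumption only: the 3x3 matrix `H` is taken as given (how DMSORT estimates it is not described here), and `compensate_prediction` is a hypothetical helper name.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 projective transformation."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xh / w, yh / w)   # back from homogeneous coordinates


def compensate_prediction(H, predicted_center):
    """Move a Kalman-predicted box centre from the previous frame's
    coordinates into the current frame's coordinates."""
    return apply_homography(H, predicted_center)


# Example: the camera shifted so that image content moved 5 px right and 3 px up.
H_shift = [[1.0, 0.0, 5.0],
           [0.0, 1.0, -3.0],
           [0.0, 0.0, 1.0]]
compensated = compensate_prediction(H_shift, (10.0, 20.0))
print(compensated)   # (15.0, 17.0)
```

A full tracker would also transform the velocity components and the covariance of the Kalman state; only the position is shown to keep the idea visible.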

[71] Learning from Online Videos at Inference Time for Computer-Use Agents

Yujian Liu,Ze Wang,Hao Chen,Ximeng Sun,Xiaodong Yu,Jialian Wu,Jiang Liu,Emad Barsoum,Zicheng Liu,Shiyu Chang

Main category: cs.CV

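TL;DR: Proposes a framework that lets computer-use agents learn from online videos at inference time: tutorial videos are retrieved, filtered, and converted into structured demonstration trajectories, and the best trajectory is dynamically selected as in-context guidance, improving the agent's ability to carry out complex tasks.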

Details Motivation: Computer-use agents still lag behind humans on tasks that require domain-specific procedural knowledge, whereas humans learn quickly by watching video tutorials. Studying how agents can make effective use of online videos at inference time is therefore worthwhile. Method: A framework comprising video retrieval and filtering, conversion of videos into structured action trajectories, and a two-stage selection mechanism based on a vision-language model (VLM) that dynamically supplies in-context guidance. The VLM infers UI actions, segments each video into short action subsequences, and assigns each subsequence a textual objective. Result: Experiments on two widely used benchmarks show that the framework consistently outperforms strong base agents and variants that use only textual tutorials or transcripts, confirming the importance of trajectory segmentation and selection, action filtering, and visual information. Conclusion: Online videos can be systematically distilled into actionable guidance that improves computer-use agents at inference time, opening a new path for agents to learn on the fly. Abstract: Computer-use agents can operate computers and automate laborious tasks, but despite recent rapid progress, they still lag behind human users, especially when tasks require domain-specific procedural knowledge about particular applications, platforms, and multi-step workflows. Humans can bridge this gap by watching video tutorials: we search, skim, and selectively imitate short segments that match our current subgoal. In this paper, we study how to enable computer-use agents to learn from online videos at inference time effectively. We propose a framework that retrieves and filters tutorial videos, converts them into structured demonstration trajectories, and dynamically selects trajectories as in-context guidance during execution. Particularly, using a VLM, we infer UI actions, segment videos into short subsequences of actions, and assign each subsequence a textual objective. At inference time, a two-stage selection mechanism dynamically chooses a single trajectory to add in context at each step, focusing the agent on the most helpful local guidance for its next decision. Experiments on two widely used benchmarks show that our framework consistently outperforms strong base agents and variants that use only textual tutorials or transcripts. Analyses highlight the importance of trajectory segmentation and selection, action filtering, and visual information, suggesting that abundant online videos can be systematically distilled into actionable guidance that improves computer-use agents at inference time. Our code is available at https://github.com/UCSB-NLP-Chang/video_demo.

[72] Seeing Straight: Document Orientation Detection for Efficient OCR

Suranjan Goswami,Abhinav Ravi,Raja Kolla,Ali Faraz,Shaharukh Khan,Akash,Chandra Khatri,Shubham Agarwal

Main category: cs.CV

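TL;DR: Introduces OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotation, and builds a fast, robust, lightweight rotation classification pipeline on the Phi-3.5-Vision model that reaches 96% and 92% accuracy on ORB-En and ORB-Indic respectively and markedly improves OCR performance.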

Details Motivation: Orientation errors are common when documents are scanned or photographed and degrade downstream tasks such as OCR. Rotation correction in real-world settings remains challenging for existing methods, so a dedicated benchmark and an efficient, accurate rotation classifier are needed. Method: The authors build the OCR-Rotation-Bench (ORB) benchmark covering English and 11 Indic languages, and design a lightweight pipeline on the vision encoder of Phi-3.5-Vision with dynamic image cropping, fine-tuned in a standalone fashion for the 4-class rotation task. Result: The method reaches 96% and 92% rotation classification accuracy on ORB-En and ORB-Indic respectively; in a simulated real-world setting it improves closed-source OCR models by up to 14% and open-weights models by up to 4x. Conclusion: ORB provides an effective tool for evaluating the rotation robustness of OCR systems, and the lightweight rotation classification module is both accurate and able to markedly improve the practical performance of a range of OCR models. Abstract: Despite significant advances in document understanding, determining the correct orientation of scanned or photographed documents remains a critical pre-processing step in real-world settings. Accurate rotation correction is essential for enhancing the performance of downstream tasks such as Optical Character Recognition (OCR), where misalignment commonly arises due to user errors, particularly incorrect base orientations of the camera during capture. In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from rotation-transformed structured and free-form English OCR datasets, and (ii) ORB-Indic, a novel multilingual set spanning 11 Indic mid- to low-resource languages. We also present a fast, robust and lightweight rotation classification pipeline built on the vision encoder of the Phi-3.5-Vision model with dynamic image cropping, fine-tuned specifically for the 4-class rotation task in a standalone fashion. Our method achieves near-perfect accuracy of 96% and 92% in identifying rotations on the two datasets, respectively. Beyond classification, we demonstrate the critical role of our module in boosting OCR performance: closed-source (up to 14%) and open-weights models (up to 4x) in the simulated real-world setting.
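Once a 4-class rotation label is available, undoing the rotation before OCR is mechanical. The sketch below shows only that correction step on a tiny image stored as nested lists; the class convention (label `k` means the page was rotated `k` quarter-turns clockwise) is an assumption made for this example, and the classifier itself is not modelled.

```python
def rotate90_cw(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def rotate_cw(img, quarter_turns):
    """Rotate clockwise by a multiple of 90 degrees."""
    for _ in range(quarter_turns % 4):
        img = rotate90_cw(img)
    return img


def correct_orientation(img, predicted_class):
    """Undo a rotation of predicted_class quarter-turns clockwise
    (assumed label convention: 0, 1, 2, 3 -> 0, 90, 180, 270 degrees)."""
    return rotate_cw(img, 4 - predicted_class)


page = [[1, 2, 3],
        [4, 5, 6]]

# Simulate every misorientation and check that the correction restores the page.
for k in range(4):
    rotated = rotate_cw(page, k)
    assert correct_orientation(rotated, k) == page
```

In a real pipeline the same idea would be applied to image arrays, with the predicted class coming from the rotation classifier and the corrected image passed on to the OCR model.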

[73] Systematic Evaluation of Preprocessing Techniques for Accurate Image Registration in Digital Pathology

Fatemehzahra Darzi,Rodrigo Escobar Diaz Guerrero,Thomas Bocklitz

Main category: cs.CV

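TL;DR: Investigates how different color transformation techniques affect registration between H&E-stained images and non-linear multimodal images, finding that CycleGAN color transformation significantly reduces registration error in both scenarios and improves image alignment accuracy in digital pathology.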

Details Motivation: To improve the accuracy of registering pathology images across imaging modalities, in support of applications such as biomarker analysis and tissue reconstruction. Method: Using a dataset of 20 tissue sample pairs, the study compares the CycleGAN, Macenko, Reinhard, and Vahadane color transformations, performs rigid and non-rigid registration with the VALIS method, and evaluates the results with the relative Target Registration Error (rTRE) and manually selected key points. Result: CycleGAN color transformation achieved the lowest median registration errors (MMrTRE and AMrTRE) in both scenarios, with original and with inverted multimodal images, clearly outperforming the other methods. Conclusion: Applying color transformation during preprocessing, CycleGAN in particular, improves cross-modal pathology image registration and makes digital pathology analysis more reliable. Abstract: Image registration refers to the process of spatially aligning two or more images by mapping them into a common coordinate system, so that corresponding anatomical or tissue structures are matched across images. In digital pathology, registration enables direct comparison and integration of information from different stains or imaging modalities, supporting applications such as biomarker analysis and tissue reconstruction. Accurate registration of images from different modalities is an essential step in digital pathology. In this study, we investigated how various color transformation techniques affect image registration between hematoxylin and eosin (H&E) stained images and non-linear multimodal images. We used a dataset of 20 tissue sample pairs, with each pair undergoing several preprocessing steps, including different color transformations (CycleGAN, Macenko, Reinhard, Vahadane), inversion, contrast adjustment, intensity normalization, and denoising. All images were registered using the VALIS registration method, which first applies rigid registration and then performs non-rigid registration in two steps on both low and high-resolution images. Registration performance was evaluated using the relative Target Registration Error (rTRE). We reported the median of median rTRE values (MMrTRE) and the average of median rTRE values (AMrTRE) for each method. In addition, we performed a custom point-based evaluation using ten manually selected key points. Registration was done separately for two scenarios, using either the original or inverted multimodal images. 
In both scenarios, CycleGAN color transformation achieved the lowest registration errors, while the other methods showed higher errors. These findings show that applying color transformation before registration improves alignment between images from different modalities and supports more reliable analysis in digital pathology.
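The aggregate metrics named above can be computed as follows. This is a minimal sketch assuming the common definition of rTRE, the Euclidean landmark error divided by the image diagonal; the landmark coordinates and image size are made-up values for illustration, not data from the study.

```python
from math import hypot
from statistics import fmean, median


def rtre(warped, target, width, height):
    """Relative target registration error per landmark:
    Euclidean distance normalised by the image diagonal."""
    diagonal = hypot(width, height)
    return [hypot(wx - tx, wy - ty) / diagonal
            for (wx, wy), (tx, ty) in zip(warped, target)]


def summarise(per_pair_rtre):
    """Return (MMrTRE, AMrTRE): median and mean of the per-pair median rTRE."""
    medians = [median(values) for values in per_pair_rtre]
    return median(medians), fmean(medians)


# Two hypothetical image pairs on a 3 x 4 image (diagonal = 5).
pair_1 = rtre([(0, 0), (3, 4), (3, 4)], [(3, 4), (3, 4), (3, 1.5)], 3, 4)
pair_2 = rtre([(0, 0)], [(0, 1)], 3, 4)
mm_rtre, am_rtre = summarise([pair_1, pair_2])
```

Here `pair_1` has landmark errors 1.0, 0.0 and 0.5 (median 0.5) and `pair_2` has a single error of 0.2, so both summaries are taken over the per-pair medians 0.5 and 0.2.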

[74] Covariance Descriptors Meet General Vision Encoders: Riemannian Deep Learning for Medical Image Classification

Josef Mayr,Anna Reithmeir,Maxime Di Folco,Julia A. Schnabel

Main category: cs.CV

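TL;DR: Examines the effectiveness of covariance descriptors built from features of pre-trained vision encoders (such as DINOv2 and MedSAM) for medical image classification; combined with SPDNet, they outperform existing methods on several datasets from the MedMNIST benchmark.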

Details Motivation: Covariance descriptors perform well in general computer vision but remain underexplored in medical imaging; this work evaluates them for both conventional and learning-based medical image classification. Method: Features are extracted from pre-trained general vision encoders (GVEs) and used to build covariance descriptors, which are compared with handcrafted descriptors; SPDNet, a network that operates on symmetric positive definite matrices, performs the classification. Result: Covariance descriptors derived from GVE features consistently outperform those derived from handcrafted features, and DINOv2 combined with SPDNet achieves the best performance on several datasets. Conclusion: Combining powerful pre-trained vision encoders with covariance descriptors has considerable potential for improving medical image analysis. Abstract: Covariance descriptors capture second-order statistics of image features. They have shown strong performance in general computer vision tasks, but remain underexplored in medical imaging. We investigate their effectiveness for both conventional and learning-based medical image classification, with a particular focus on SPDNet, a classification network specifically designed for symmetric positive definite (SPD) matrices. We propose constructing covariance descriptors from features extracted by pre-trained general vision encoders (GVEs) and comparing them with handcrafted descriptors. Two GVEs - DINOv2 and MedSAM - are evaluated across eleven binary and multi-class datasets from the MedMNIST benchmark. Our results show that covariance descriptors derived from GVE features consistently outperform those derived from handcrafted features. Moreover, SPDNet yields superior performance to state-of-the-art methods when combined with DINOv2 features. Our findings highlight the potential of combining covariance descriptors with powerful pretrained vision encoders for medical image analysis.
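A covariance descriptor is simply the sample covariance matrix of the feature vectors collected over an image, for example one vector per patch token. The sketch below computes it in plain Python; the small diagonal term `eps`, added so that the matrix is strictly positive definite, is a common regularisation assumed here rather than a detail taken from the abstract, and the feature values are made up.

```python
def covariance_descriptor(features, eps=1e-6):
    """Sample covariance of n feature vectors (n x d), plus eps on the diagonal."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    cov = [[sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in features) / (n - 1)
            for j in range(d)]
           for i in range(d)]
    for i in range(d):
        cov[i][i] += eps   # regularise so the matrix is SPD even if rank-deficient
    return cov


# Three hypothetical 2-dimensional patch features.
patch_features = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
descriptor = covariance_descriptor(patch_features)
```

The resulting d x d matrix is symmetric, which is the property that SPD-specific networks such as SPDNet rely on; for encoder features d would be the embedding dimension rather than 2.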

[75] AStF: Motion Style Transfer via Adaptive Statistics Fusor

Hanmo Chen,Chenghao Xu,Jiexi Yan,Cheng Deng

Main category: cs.CV

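TL;DR: Proposes a new Adaptive Statistics Fusor (AStF) that brings skewness and kurtosis into motion style transfer, modelling the spatiotemporal statistics of dynamic styles better than existing methods.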

Details Motivation: Conventional motion style transfer methods based on mean and variance cannot fully capture the complex dynamic patterns and spatiotemporal coherence of motion data, so a more comprehensive statistical model is needed. Method: The proposed AStF model consists of a Style Disentanglement Module (SDM) and High-Order Multi-Statistics Attention (HOS-Attn), and is trained together with a Motion Consistency Regularization (MCR) discriminator. Result: Experiments show that AStF outperforms current state-of-the-art methods on motion style transfer and better models the spatiotemporal statistical patterns of dynamic styles. Conclusion: Introducing skewness and kurtosis as higher-order statistics improves the quality of motion style transfer, and AStF provides a more comprehensive and effective framework for motion style modelling. Abstract: Human motion style transfer allows characters to appear less rigid and more realistic in a specific style. Traditional arbitrary image style transfer typically processes mean and variance, which has proved effective. Meanwhile, similar methods have been adapted for motion style transfer. However, due to the fundamental differences between images and motion, relying on mean and variance is insufficient to fully capture the complex dynamic patterns and spatiotemporal coherence properties of motion data. Building upon this, our key insight is to bring two more coefficients, skewness and kurtosis, into the analysis of motion style. Specifically, we propose a novel Adaptive Statistics Fusor (AStF) which consists of a Style Disentanglement Module (SDM) and High-Order Multi-Statistics Attention (HOS-Attn). We trained our AStF in conjunction with a Motion Consistency Regularization (MCR) discriminator. Experimental results show that, by providing a more comprehensive model of the spatiotemporal statistical patterns inherent in dynamic styles, our proposed AStF outperforms state-of-the-art methods in motion style transfer. Our code and model are available at https://github.com/CHMimilanlan/AStF.
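The four statistics involved (mean, variance, skewness, kurtosis) are standardised moments and can be computed per feature channel as below. This is only a sketch of the statistics themselves, using population moments on made-up values; how AStF fuses them through attention is not shown.

```python
from statistics import fmean


def channel_statistics(values):
    """Return (mean, variance, skewness, kurtosis) using population moments."""
    mu = fmean(values)
    centered = [v - mu for v in values]
    var = fmean([c ** 2 for c in centered])
    std = var ** 0.5
    skewness = fmean([c ** 3 for c in centered]) / std ** 3   # third standardised moment
    kurtosis = fmean([c ** 4 for c in centered]) / std ** 4   # fourth standardised moment
    return mu, var, skewness, kurtosis


symmetric = channel_statistics([-1.0, 1.0, -1.0, 1.0])
right_tailed = channel_statistics([0.0, 0.0, 0.0, 4.0])
```

The two example channels show what the higher moments add: `right_tailed` has positive skewness because of its single large value, a difference that mean and variance alone do not describe.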

[76] MedSapiens: Taking a Pose to Rethink Medical Imaging Landmark Detection

Marawan Elbatel,Anbang Wang,Keyuan Liu,Kaouther Mouheb,Enrique Almar-Munoz,Lizhuo Lin,Yanqi Yang,Karim Lekadir,Xiaomeng Li

Main category: cs.CV

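TL;DR: Studies the transfer of Sapiens, a human-centric foundation model, to anatomical landmark detection in medical images. The resulting model, MedSapiens, reaches state-of-the-art performance on multiple benchmarks after multi-dataset pretraining, clearly outperforms existing generalist and specialist models in average success detection rate (SDR), and adapts well in few-shot settings.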

Details Motivation: Anatomical landmark detection has traditionally relied on domain-specific models, while large-scale pre-trained vision models offer new opportunities. This work explores the potential of human-centric foundation models (such as Sapiens, designed for pose estimation) in medical imaging and tests whether they provide strong priors for anatomical landmark detection. Method: Sapiens, originally built for human pose estimation, is adapted to anatomical landmark detection through multi-dataset pretraining to form MedSapiens, which is evaluated on multiple public medical imaging datasets and in few-shot settings. Result: In average success detection rate (SDR), MedSapiens improves by up to 5.26% over generalist models and up to 21.81% over specialist models; in the few-shot setting it improves by 2.69% over the current best method. Conclusion: Suitably adapted human-centric foundation models are a strong baseline for anatomical landmark detection, indicating considerable and previously underestimated potential for medical image analysis. Abstract: This paper does not introduce a novel architecture; instead, it revisits a fundamental yet overlooked baseline: adapting human-centric foundation models for anatomical landmark detection in medical imaging. While landmark detection has traditionally relied on domain-specific models, the emergence of large-scale pre-trained vision models presents new opportunities. In this study, we investigate the adaptation of Sapiens, a human-centric foundation model designed for pose estimation, to medical imaging through multi-dataset pretraining, establishing a new state of the art across multiple datasets. Our proposed model, MedSapiens, demonstrates that human-centric foundation models, inherently optimized for spatial pose localization, provide strong priors for anatomical landmark detection, yet this potential has remained largely untapped. We benchmark MedSapiens against existing state-of-the-art models, achieving up to 5.26% improvement over generalist models and up to 21.81% improvement over specialist models in the average success detection rate (SDR). To further assess MedSapiens' adaptability to novel downstream tasks with few annotations, we evaluate its performance in limited-data settings, achieving 2.69% improvement over the few-shot state of the art in SDR. Code and model weights are available at https://github.com/xmed-lab/MedSapiens.
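The success detection rate counts a predicted landmark as a hit when it falls within a given radius of the ground truth. The sketch below assumes that usual definition and averages over a hypothetical set of radii; the coordinates and thresholds are illustrative values, not the evaluation protocol of the paper.

```python
from math import dist


def success_detection_rate(predicted, ground_truth, radius):
    """Fraction of landmarks whose prediction lies within `radius` of the truth."""
    hits = sum(1 for p, g in zip(predicted, ground_truth) if dist(p, g) <= radius)
    return hits / len(ground_truth)


def average_sdr(predicted, ground_truth, radii):
    """Mean SDR over several distance thresholds (thresholds are an assumption)."""
    return sum(success_detection_rate(predicted, ground_truth, r)
               for r in radii) / len(radii)


predicted = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
ground_truth = [(0.0, 1.0), (0.0, 0.0), (10.0, 10.0)]

sdr_2 = success_detection_rate(predicted, ground_truth, 2.0)   # errors: 1, 5, 0
sdr_5 = success_detection_rate(predicted, ground_truth, 5.0)
avg = average_sdr(predicted, ground_truth, [2.0, 5.0])
```

With landmark errors of 1, 5 and 0 units, two of three landmarks are hits at radius 2 and all three at radius 5.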

[77] Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Claudio Giusti,Luca Guarnera,Sebastiano Battiato

Main category: cs.CV

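TL;DR: Proposes Proto-LeakNet, an interpretable attribution framework for AI-generated images and deepfakes that exploits the signal-leak traces diffusion models leave in latent space, enabling efficient classification of known generators and open-set recognition of unseen ones without retraining, while remaining robust under post-processing.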

Details Motivation: As synthetic image and deepfake generation grows more sophisticated, conventional attribution methods are challenged. Prior studies show that diffusion models leave statistical traces (signal leaks) in their outputs, but how to use these traces for reliable and interpretable source attribution remains open. Method: Proto-LeakNet re-simulates part of the forward diffusion process in the latent space of diffusion models to expose generator-specific residual cues. A temporal attention encoder aggregates multi-step latent features, and a feature-weighted prototype head structures the embedding space; closed-set classification is combined with a density-based open-set evaluation, allowing unseen generators to be analysed without retraining. Result: Trained only on closed-set data, Proto-LeakNet reaches a Macro AUC of 98.13%, surpasses state-of-the-art methods, keeps a robust embedding space under post-processing, and separates known from unseen generators well. Conclusion: Modelling signal-leak bias in latent space enables reliable and interpretable detection of AI-generated images and deepfakes, pointing to a new direction for digital content authentication. Abstract: The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates closed-set classification with a density-based open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Operating in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability between known and unseen generators. These results demonstrate that modeling signal-leak bias in latent space enables reliable and interpretable AI-image and deepfake forensics. The code for the whole work will be available upon submission.

[78] DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification

Yujie Yang,Shuang Li,Jun Ye,Neng Dong,Fan Li,Huafeng Li

Main category: cs.CV

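TL;DR: Proposes the DinoGRL framework, which uses DINOv2-driven gait representation learning to combine appearance and gait features and improve video-based visible-infrared person re-identification.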

Details Motivation: Existing methods focus mostly on modality-invariant appearance features and overlook gait features, which are also modality-invariant and carry rich temporal dynamics; this limits their ability to model spatiotemporal consistency in cross-modal video matching. Method: The DinoGRL framework comprises a Semantic-Aware Silhouette and Gait Learning (SASGL) model and a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module. SASGL uses the semantic priors of DINOv2 to enhance silhouette and gait representations, and PBMGE fuses gait and appearance features through bidirectional interaction at multiple granularities. Result: Experiments on the HITSZ-VCM and BUPT datasets show that the method significantly outperforms existing state-of-the-art methods in cross-modal video person re-identification. Conclusion: DinoGRL fuses gait and appearance features effectively, improving the robustness and discriminative power of cross-modal video person re-identification and confirming the value of DINOv2 priors for gait representation learning. Abstract: Video-based Visible-Infrared person re-identification (VVI-ReID) aims to retrieve the same pedestrian across visible and infrared modalities from video sequences. Existing methods tend to exploit modality-invariant visual features but largely overlook gait features, which are not only modality-invariant but also rich in temporal dynamics, thus limiting their ability to model the spatiotemporal consistency essential for cross-modal video matching. To address these challenges, we propose a DINOv2-Driven Gait Representation Learning (DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues, facilitating robust sequence-level representations for cross-modal retrieval. Specifically, we introduce a Semantic-Aware Silhouette and Gait Learning (SASGL) model, which generates and enhances silhouette representations with general-purpose semantic priors from DINOv2 and jointly optimizes them with the ReID objective to achieve semantically enriched and task-adaptive gait feature learning. Furthermore, we develop a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module, which progressively refines feature representations by enabling bidirectional interactions between gait and appearance streams across multiple spatial granularities, fully leveraging their complementarity to enhance global representations with rich local details and produce highly discriminative features. Extensive experiments on HITSZ-VCM and BUPT datasets demonstrate the superiority of our approach, significantly outperforming existing state-of-the-art methods.

[79] FastGS: Training 3D Gaussian Splatting in 100 Seconds

Shiwei Ren,Tianci Wen,Yongchun Fang,Biao Lu

Main category: cs.CV

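TL;DR: Proposes FastGS, a novel, simple, and general acceleration framework for 3D Gaussian splatting whose densification and pruning strategy is based on multi-view consistency, substantially speeding up training while maintaining good rendering quality.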

Details Motivation: Existing 3D Gaussian splatting acceleration methods do not effectively control the number of Gaussians during training, causing redundant computation and excessive training time. Method: A mechanism that assesses the importance of each Gaussian based on multi-view consistency, together with a densification and pruning strategy that needs no budgeting mechanism and dynamically optimizes the set of Gaussians. Result: FastGS achieves a 3.32x speed-up on the Mip-NeRF 360 dataset (relative to DashGaussian) and a 15.45x speed-up over vanilla 3DGS on the Deep Blending dataset, with comparable rendering quality, and delivers 2-7x acceleration across a variety of tasks. Conclusion: Through optimization driven by multi-view consistency, FastGS balances training efficiency and rendering quality and applies broadly across tasks. Abstract: The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.32$\times$ training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45$\times$ acceleration compared with vanilla 3DGS on the Deep Blending dataset. We demonstrate that FastGS exhibits strong generality, delivering 2-7$\times$ training acceleration across various tasks, including dynamic scene reconstruction, surface reconstruction, sparse-view reconstruction, large-scale reconstruction, and simultaneous localization and mapping. The project page is available at https://fastgs.github.io/

[80] Vision Foundation Models in Agriculture: Toward Domain-Specific Adaptation for Weed Herbicide Trials Assessment

Leire Benito-Del-Valle,Artzai Picón,Daniel Mugica,Manuel Ramos,Eva Portillo,Javier Romero,Carlos Javier Jimenez,Ramón Navarra-Mestre

Main category: cs.CV

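TL;DR: Adapts a general-purpose vision foundation model to herbicide field trial phenotyping through self-supervised learning on a large agricultural dataset, markedly improving species identification and damage classification, especially under domain shift, while needing 80% fewer labeled samples.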

Details Motivation: General-purpose vision models are limited in agriculture by the fine-grained differences between species and damage types, and fall short of the precise identification and damage assessment that herbicide field trials require. Method: A general-purpose vision foundation model receives domain-specific pretraining with self-supervised learning on a large, curated agricultural dataset, so that it learns rich, transferable representations suited to herbicide trial images. Result: The domain-specific model significantly outperforms the general-purpose model in species identification (F1 from 0.91 to 0.94) and damage classification (from 0.26 to 0.33), with larger gains under unseen conditions; it stays stronger in domain-shift scenarios such as drone imagery; and it segments more accurately with few annotations, surpassing the general-purpose model with only 20% of the labeled data. Conclusion: Domain-specific foundation models generalize better and can substantially reduce manual annotation effort, offering a scalable, automated solution for herbicide field trial analysis. Abstract: Herbicide field trials require accurate identification of plant species and assessment of herbicide-induced damage across diverse environments. While general-purpose vision foundation models have shown promising results in complex visual domains, their performance can be limited in agriculture, where fine-grained distinctions between species and damage types are critical. In this work, we adapt a general-purpose vision foundation model to herbicide trial characterization. Trained using a self-supervised learning approach on a large, curated agricultural dataset, the model learns rich and transferable representations optimized for herbicide trial images. Our domain-specific model significantly outperforms the best general-purpose foundation model in both species identification (F1 score improvement from 0.91 to 0.94) and damage classification (from 0.26 to 0.33). Under unseen conditions (new locations and other times), it achieves even greater gains (species identification from 0.56 to 0.66; damage classification from 0.17 to 0.27). In domain-shift scenarios, such as drone imagery, it maintains strong performance (species classification from 0.49 to 0.60). Additionally, we show that domain-specific pretraining enhances segmentation accuracy, particularly in low-annotation regimes. An annotation-efficiency analysis reveals that, under unseen conditions, the domain-specific model achieves 5.4% higher F1 score than the general-purpose model, while using 80% fewer labeled samples. 
These results demonstrate the generalization capabilities of domain-specific foundation models and their potential to significantly reduce manual annotation efforts, offering a scalable and automated solution for herbicide trial analysis.

[81] Deep learning-based object detection of offshore platforms on Sentinel-1 Imagery and the impact of synthetic training data

Robin Spanier,Thorsten Hoeser,Claudia Kuenzer

Main category: cs.CV

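TL;DR: Trains YOLOv10 models on synthetic and real Sentinel-1 satellite imagery to improve offshore infrastructure detection, particularly when samples are scarce, and confirms that the model generalizes well to unseen regions.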

Details Motivation: Samples of offshore infrastructure are scarce and imbalanced across classes, shapes, and sizes, which limits conventional detection models; better generalization and more balanced data are needed. Method: YOLOv10 object detection models are trained on a combination of synthetic and real Sentinel-1 satellite images and validated across regions on three areas excluded from training (Gulf of Mexico, North Sea, Persian Gulf). Result: A total of 3,529 offshore platforms were detected, and the model's F1 score rose from 0.85 to 0.90, showing that synthetic data mitigates class imbalance and improves performance. Conclusion: Synthetic data is an effective strategy against sample scarcity and class imbalance in remote sensing and supports scalable, globally applicable monitoring of offshore infrastructure. Abstract: The recent and ongoing expansion of marine infrastructure, including offshore wind farms, oil and gas platforms, artificial islands, and aquaculture facilities, highlights the need for effective monitoring systems. The development of robust models for offshore infrastructure detection relies on comprehensive, balanced datasets, but falls short when samples are scarce, particularly for underrepresented object classes, shapes, and sizes. By training deep learning-based YOLOv10 object detection models with a combination of synthetic and real Sentinel-1 satellite imagery acquired in the fourth quarter of 2023 from four regions (Caspian Sea, South China Sea, Gulf of Guinea, and Coast of Brazil), this study investigates the use of synthetic training data to enhance model performance. We evaluated this approach by applying the model to detect offshore platforms in three unseen regions (Gulf of Mexico, North Sea, Persian Gulf) and thereby assess geographic transferability. This region-holdout evaluation demonstrated that the model generalises beyond the training areas. In total, 3,529 offshore platforms were detected, including 411 in the North Sea, 1,519 in the Gulf of Mexico, and 1,593 in the Persian Gulf. The model achieved an F1 score of 0.85, which improved to 0.90 upon incorporating synthetic data. We analysed how synthetic data enhances the representation of unbalanced classes and overall model performance, taking a first step toward globally transferable detection of offshore infrastructure. 
This study underscores the importance of balanced datasets and highlights synthetic data generation as an effective strategy to address common challenges in remote sensing, demonstrating the potential of deep learning for scalable, global offshore infrastructure monitoring.

[82] RISE-T2V: Rephrasing and Injecting Semantics with LLM for Expansive Text-to-Video Generation

Xiangjun Zhang,Litong Gong,Yinglin Zheng,Yansong Liu,Wentao Jiang,Mingyi Xu,Biao Wang,Tiezheng Ge,Ming Zeng

Main category: cs.CV

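TL;DR: Proposes the RISE-T2V framework, which integrates prompt rephrasing with semantic feature extraction to improve how text-to-video models understand user intent and the quality of what they generate.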

Details Motivation: Existing T2V models depend on pre-trained text encoders that understand concise prompts poorly and cannot rephrase prompts online to better match user intent, which limits scalability and usability. Method: RISE-T2V introduces a Rephrasing Adapter that merges prompt rephrasing and semantic feature extraction into a single step, using the LLM's hidden states during next-token prediction as the condition for video generation and thereby expanding prompts implicitly. Result: Experiments show that RISE-T2V applies to a wide range of video diffusion model architectures and significantly improves video quality and alignment with user intent across T2V tasks. Conclusion: By drawing on the semantic understanding of large language models, RISE-T2V makes text-to-video generation more flexible and accurate, and more general and practical. Abstract: Most text-to-video (T2V) diffusion models depend on pre-trained text encoders for semantic alignment, yet they often fail to maintain video quality when provided with concise prompts rather than well-designed ones. The primary issue lies in their limited textual semantics understanding. Moreover, these text encoders cannot rephrase prompts online to better align with user intentions, which limits both the scalability and usability of the models. To address these challenges, we introduce RISE-T2V, which uniquely integrates the processes of prompt rephrasing and semantic feature extraction into a single and seamless step instead of two separate steps. RISE-T2V is universal and can be applied to various pre-trained LLMs and video diffusion models (VDMs), significantly enhancing their capabilities for T2V tasks. We propose an innovative module called the Rephrasing Adapter, enabling diffusion models to utilize text hidden states during the next token prediction of the LLM as a condition for video generation. By employing a Rephrasing Adapter, the video generation model can implicitly rephrase basic prompts into more comprehensive representations that better match the user's intent. Furthermore, we leverage the powerful capabilities of LLMs to enable video generation models to accomplish a broader range of T2V tasks. Extensive experiments demonstrate that RISE-T2V is a versatile framework applicable to different video diffusion model architectures, significantly enhancing the ability of T2V models to generate high-quality videos that align with user intent. Visual results are available on the webpage at https://rise-t2v.github.io.

[83] Submanifold Sparse Convolutional Networks for Automated 3D Segmentation of Kidneys and Kidney Tumours in Computed Tomography

Saúl Alonso-Monsalve,Leigh H. Whitehead,Adam Aurisano,Lorena Escudero Sanchez

Main category: cs.CV

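TL;DR: Proposes a two-stage method based on voxel sparsification and submanifold sparse convolutional networks for automatically segmenting kidney tumours in high-resolution 3D CT images, matching the performance of the KiTS23 challenge winners while substantially reducing computational cost.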

Details Motivation: 准确且高效地分割CT图像中的肿瘤是临床定量分析的关键瓶颈,传统卷积神经网络因3D扫描数据量大而难以直接处理高分辨率图像,需降采样或使用图像块,限制了精度和效率。 Method: 采用两阶段方法:首先进行体素稀疏化以保留重要信息并减少冗余数据,然后利用子流形稀疏卷积网络在高分辨率3D图像上进行端到端分割,避免降采样并提升计算效率。 Result: 在KiTS23挑战赛数据上实现了95.8%(肾脏+病灶)、85.7%(肿瘤+囊肿)和80.3%(仅肿瘤)的Dice相似系数,性能与优胜模型相当;同时推理时间最多减少60%,显存占用最多降低75%。 Conclusion: 该方法在保持高精度的同时显著提升了计算效率,适用于高分辨率3D医学图像的自动化肿瘤分割,具有良好的临床应用前景。 Abstract: The accurate delineation of tumours in radiological images like Computed Tomography is a very specialised and time-consuming task, and currently a bottleneck preventing quantitative analyses to be performed routinely in the clinical setting. For this reason, developing methods for the automated segmentation of tumours in medical imaging is of the utmost importance and has driven significant efforts in recent years. However, challenges regarding the impracticality of 3D scans, given the large amount of voxels to be analysed, usually requires the downsampling of such images or using patches thereof when applying traditional convolutional neural networks. To overcome this problem, in this paper we propose a new methodology that uses, divided into two stages, voxel sparsification and submanifold sparse convolutional networks. This method allows segmentations to be performed with high-resolution inputs and a native 3D model architecture, obtaining state-of-the-art accuracies while significantly reducing the computational resources needed in terms of GPU memory and time. We studied the deployment of this methodology in the context of Computed Tomography images of renal cancer patients from the KiTS23 challenge, and our method achieved results competitive with the challenge winners, with Dice similarity coefficients of 95.8% for kidneys + masses, 85.7% for tumours + cysts, and 80.3% for tumours alone. 
Crucially, our method also offers significant computational improvements, achieving up to a 60% reduction in inference time and up to a 75% reduction in VRAM usage compared to an equivalent dense architecture, across both CPU and various GPU cards tested.
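The Dice similarity coefficient quoted above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `dice_coefficient` and the toy masks are assumptions made for illustration, and masks are stored as sets of voxel coordinates only because that mirrors the sparse representation the abstract describes.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two sets of voxel coordinates."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        # Both masks empty: treated as perfect agreement (a common convention,
        # assumed here rather than taken from the paper).
        return 1.0
    overlap = len(pred & truth)
    return 2.0 * overlap / (len(pred) + len(truth))


# Toy 3D masks stored sparsely as (z, y, x) voxel coordinates (hypothetical data).
predicted = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
reference = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}

score = dice_coefficient(predicted, reference)
print(f"Dice = {score:.3f}")
```

Here two of the three voxels in each mask coincide, so the score is 2·2/(3+3).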

[84] Comparative Study of CNN Architectures for Binary Classification of Horses and Motorcycles in the VOC 2008 Dataset

Muhammad Annas Shaikh,Hamza Zaman,Arbaz Asif

Main category: cs.CV

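TL;DR: This paper evaluates nine convolutional neural networks on binary classification of horses and motorcycles in the VOC 2008 dataset, focusing on how data augmentation mitigates class imbalance, and finds that ConvNeXt-Tiny performs best.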

Details Motivation: To address class imbalance in the horse-versus-motorcycle binary classification task on the VOC 2008 dataset and to compare how different CNN architectures perform on it. Method: Modern architectures including ResNet-50, ConvNeXt-Tiny, DenseNet-121, and Vision Transformer are combined with minority-class data augmentation and compared across multiple performance metrics. Result: ConvNeXt-Tiny reaches an Average Precision (AP) of 95.53% for horse detection and 89.12% for motorcycle detection; data augmentation significantly improves minority-class detection and particularly benefits deeper architectures. Conclusion: ConvNeXt-Tiny performs best on this imbalanced binary classification task, and data augmentation effectively improves model performance, offering practical guidance on architecture choice and data strategy for similar tasks. Abstract: This paper presents a comprehensive evaluation of nine convolutional neural network architectures for binary classification of horses and motorcycles in the VOC 2008 dataset. We address the significant class imbalance problem by implementing minority-class augmentation techniques. Our experiments compare modern architectures including ResNet-50, ConvNeXt-Tiny, DenseNet-121, and Vision Transformer across multiple performance metrics. Results demonstrate substantial performance variations, with ConvNeXt-Tiny achieving the highest Average Precision (AP) of 95.53% for horse detection and 89.12% for motorcycle detection. We observe that data augmentation significantly improves minority class detection, particularly benefiting deeper architectures. This study provides insights into architecture selection for imbalanced binary classification tasks and quantifies the impact of data augmentation strategies in mitigating class imbalance issues in object detection.

[85] Evaluating the Impact of Weather-Induced Sensor Occlusion on BEVFusion for 3D Object Detection

Sanjay Kumar,Tim Brophy,Eoin Martino Grua,Ganesh Sistu,Valentina Donzella,Ciaran Eising

Main category: cs.CV

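TL;DR: This paper studies how 3D object detection performance degrades in a bird's-eye-view (BEV) fusion architecture when the camera and LiDAR sensors are occluded to varying degrees. It finds that the model relies more heavily on LiDAR: LiDAR performance drops sharply under heavy occlusion, whereas camera occlusion has little effect on the fused result.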

Details Motivation: Adverse conditions (such as fog, haze, or physical obstructions) can occlude sensors, yet existing work lacks an in-depth analysis of how occlusion affects multimodal BEV fusion detection. This paper therefore aims to quantify how different degrees of occlusion affect camera, LiDAR, and fused performance. Method: Using the BEVFusion architecture on the nuScenes dataset, the authors evaluate 3D detection performance for camera and LiDAR under varying degrees of occlusion, with mAP and NDS as metrics, simulating camera and LiDAR occlusion separately and analysing the impact of each. Result: With the camera alone, moderate occlusion causes mAP to drop by 41.3% (from 35.6% to 20.9%); with LiDAR alone, performance falls sharply only under heavy occlusion, by 47.3% (from 64.7% to 34.1%), with a severe impact on long-range detection; in the fused setting, camera occlusion causes a minor 4.1% drop (68.5% → 65.7%), whereas LiDAR occlusion causes a large 26.8% drop (to 50.1%). Conclusion: Current BEV fusion models depend heavily on LiDAR, and their fragility under occlusion underlines the importance of occlusion-robust fusion methods and more robust evaluation standards; future work should study perception systems that maintain accuracy when sensors degrade. Abstract: Accurate 3D object detection is essential for automated vehicles to navigate safely in complex real-world environments. Bird's Eye View (BEV) representations, which project multi-sensor data into a top-down spatial format, have emerged as a powerful approach for robust perception. Although BEV-based fusion architectures have demonstrated strong performance through multimodal integration, the effects of sensor occlusions, caused by environmental conditions such as fog, haze, or physical obstructions, on 3D detection accuracy remain underexplored. In this work, we investigate the impact of occlusions on both camera and Light Detection and Ranging (LiDAR) outputs using the BEVFusion architecture, evaluated on the nuScenes dataset. Detection performance is measured using mean Average Precision (mAP) and the nuScenes Detection Score (NDS). Our results show that moderate camera occlusions lead to a 41.3% drop in mAP (from 35.6% to 20.9%) when detection is based only on the camera. On the other hand, LiDAR sharply drops in performance only under heavy occlusion, with mAP falling by 47.3% (from 64.7% to 34.1%), with a severe impact on long-range detection. In fused settings, the effect depends on which sensor is occluded: occluding the camera leads to a minor 4.1% drop (from 68.5% to 65.7%), while occluding LiDAR results in a larger 26.8% drop (to 50.1%), revealing the model's stronger reliance on LiDAR for the task of 3D object detection. 
Our results highlight the need for future research into occlusion-aware evaluation methods and improved sensor fusion techniques that can maintain detection accuracy in the presence of partial sensor failure or degradation due to adverse environmental conditions.
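The percentage drops quoted in this entry are relative drops (change divided by the unoccluded baseline), not differences in percentage points. A small worked check, using only the mAP values stated in the abstract, makes that explicit; the helper name `relative_drop` and the dictionary labels are illustrative assumptions.

```python
def relative_drop(before, after):
    """Relative decrease from `before` to `after`, expressed in percent."""
    return 100.0 * (before - after) / before


# (baseline mAP, occluded mAP) pairs as quoted in the abstract, in percent.
cases = {
    "camera only, moderate occlusion": (35.6, 20.9),
    "LiDAR only, heavy occlusion": (64.7, 34.1),
    "fusion, camera occluded": (68.5, 65.7),
    "fusion, LiDAR occluded": (68.5, 50.1),
}

for name, (before, after) in cases.items():
    print(f"{name}: {relative_drop(before, after):.1f}% relative drop")
```

The first three cases reproduce the quoted 41.3%, 47.3%, and 4.1%. The last comes out within about 0.1 of the quoted 26.8%, presumably because the published endpoints are themselves rounded.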

[86] A MATLAB tutorial on deep feature extraction combined with chemometrics for analytical applications

Puneet Mishra,Martijntje Vollebregt,Yizhou Ma,Maria Font-i-Furnols

Main category: cs.CV

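TL;DR: This tutorial provides step-by-step guidance to help analytical chemistry researchers use existing open-source deep learning models to extract spatial information from imaging data and combine it with other data sources, such as spectra, to improve data analysis.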

Details Motivation: Although deep learning has made major advances in image processing, its use in analytical chemistry remains limited for lack of systematic implementation guidance. Traditional chemometric methods struggle to extract and analyze the spatial information in complex imaging data. Method: Rather than training deep learning models, the tutorial focuses on using existing open-source models to extract deep features from imaging data, with MATLAB code demonstrating the processing workflow for several common imaging modalities. Result: The tutorial provides MATLAB code examples that can be run on real datasets, showing how to extract spatial information from different imaging techniques and integrate multiple data sources, strengthening exploratory and predictive analysis of complex chemical materials. Conclusion: The tutorial fills a practical gap in applying deep learning to analytical chemistry, giving researchers usable tools and a clear implementation path that should help broaden adoption in the field. Abstract: Background: In analytical chemistry, spatial information about materials is commonly captured through imaging techniques, such as traditional color cameras or advanced hyperspectral cameras and microscopes. However, efficiently extracting and analyzing this spatial information for exploratory and predictive purposes remains a challenge, especially when using traditional chemometric methods. Recent advances in deep learning and artificial intelligence have significantly enhanced image processing capabilities, enabling the extraction of multiscale deep features that are otherwise challenging to capture with conventional image processing techniques. Despite the wide availability of open-source deep learning models, adoption in analytical chemistry remains limited because of the absence of structured, step-by-step guidance for implementing these models. Results: This tutorial aims to bridge this gap by providing a step-by-step guide for applying deep learning approaches to extract spatial information from imaging data and integrating it with other data sources, such as spectral information. Importantly, the focus of this work is not on training deep learning models for image processing but on using existing open source models to extract deep features from imaging data. Significance: The tutorial provides MATLAB code demonstrations, showcasing the processing of imaging data from various imaging modalities commonly encountered in analytical chemistry. Readers must run the tutorial steps on their own datasets using the codes presented in this tutorial.

[87] Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Jingqi Tong,Yurong Mou,Hangcheng Li,Mingzhe Li,Yongzhuo Yang,Ming Zhang,Qiguang Chen,Tianyi Liang,Xiaomeng Hu,Yining Zheng,Xinchi Chen,Jun Zhao,Xuanjing Huang,Xipeng Qiu

Main category: cs.CV

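TL;DR: Proposes a new "Thinking with Video" paradigm that uses video generation models (such as Sora-2) to unify visual and textual reasoning, overcoming the separation of text and vision and the limits of static images, and performs well on both vision and text tasks.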

Details Motivation: The existing "Thinking with Text" and "Thinking with Images" paradigms cannot handle dynamic processes effectively and keep text and vision separate, which limits unified multimodal understanding and generation. Method: Proposes the "Thinking with Video" paradigm, which uses video generation models (such as Sora-2) for cross-modal reasoning, and builds the VideoThinkBench benchmark with two task categories, vision-centric and text-centric. Result: Sora-2 matches or even surpasses SOTA VLMs on vision tasks and reaches 92% accuracy on MATH and 75.53% accuracy on MMMU among text tasks; self-consistency and in-context learning improve performance further. Conclusion: Video generation models have the potential to become unified multimodal understanding and generation models, and "Thinking with Video" is a promising unified multimodal reasoning paradigm. Abstract: The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision as distinct modalities hinders unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that the video generation model is a potential unified multimodal understanding and generation model, positioning "Thinking with Video" as a unified multimodal reasoning paradigm.

[88] Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA

Itbaan Safwan,Muhammad Annas Shaikh,Muhammad Haaris,Ramail Khan,Muhammad Atif Tahir

Main category: cs.CV

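TL;DR: Proposes a multi-task framework built on a LoRA-tuned Florence-2 model that performs visual question answering, explanation generation, and visual grounding simultaneously, and substantially outperforms single-task baselines in the MediaEval Medico 2025 challenge.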

Details Motivation: To improve the accuracy and interpretability of medical visual question answering systems and to address the limits of single-task models in cross-modal understanding and visual localization. Method: A Florence-2 model is fine-tuned with LoRA on three curated datasets (Kvasir-VQA-x1, a synthetically enriched explanation dataset, and a text-to-region pairing dataset) to enable joint multi-task learning. Result: The approach substantially outperforms single-task baselines in both answer accuracy and visual localization, confirming the effectiveness of multi-task learning for medical VQA. Conclusion: By jointly learning visual grounding, reasoning, and interpretation, the multi-task framework improves the performance and interpretability of medical VQA systems and suits clinical settings that need accurate, interpretable responses. Abstract: We present a multi-task framework for the MediaEval Medico 2025 challenge, leveraging a LoRA-tuned Florence-2 model for simultaneous visual question answering (VQA), explanation generation, and visual grounding. The proposed system integrates three curated datasets: (1) Kvasir-VQA-x1 for question-answer learning, (2) a synthetically enriched explanation dataset offering structured medical reasoning, and (3) text-to-region pairs linking visual features with segmentation masks. This multi-task setup enables the model to jointly learn visual grounding, reasoning, and interpretation, producing responses that are both accurate and interpretable. Extensive evaluation demonstrates that our approach substantially improves over single-task baselines in both answer accuracy and visual localization, highlighting the effectiveness of grounded multi-task learning for medical VQA applications.

[89] BoRe-Depth: Self-supervised Monocular Depth Estimation with Boundary Refinement for Embedded Systems

Chang Liu,Juan Li,Sheng Zhang,Chang Liu,Jie Li,Xu Zhang

Main category: cs.CV

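TL;DR: This paper proposes BoRe-Depth, a new monocular depth estimation model with only 8.7M parameters. An Enhanced Feature Adaptive Fusion module and the integration of semantic knowledge significantly improve depth estimation performance and object boundary quality on embedded systems, and the model runs efficiently at 50.7 FPS on NVIDIA Jetson Orin.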

Details Motivation: Existing monocular depth estimation methods suffer from poor depth estimation performance and blurred object boundaries on embedded systems, so a lightweight yet accurate solution is needed. Method: An Enhanced Feature Adaptive Fusion Module (EFAF) is designed to improve boundary detail representation, and semantic knowledge is integrated into the encoder to strengthen object recognition and boundary perception. Result: BoRe-Depth significantly outperforms previous lightweight models on multiple challenging datasets, reaches 50.7 FPS on NVIDIA Jetson Orin, and shows clearly improved boundary quality. Conclusion: BoRe-Depth improves depth estimation accuracy and boundary sharpness while remaining lightweight, making it suitable for 3D perception in resource-constrained unmanned systems. Abstract: Depth estimation is one of the key technologies for realizing 3D perception in unmanned systems. Monocular depth estimation has been widely researched because of its low-cost advantage, but the existing methods face the challenges of poor depth estimation performance and blurred object boundaries on embedded systems. In this paper, we propose a novel monocular depth estimation model, BoRe-Depth, which contains only 8.7M parameters. It can accurately estimate depth maps on embedded systems and significantly improves boundary quality. Firstly, we design an Enhanced Feature Adaptive Fusion Module (EFAF) which adaptively fuses depth features to enhance boundary detail representation. Secondly, we integrate semantic knowledge into the encoder to improve the object recognition and boundary perception capabilities. Finally, BoRe-Depth is deployed on NVIDIA Jetson Orin, and runs efficiently at 50.7 FPS. We demonstrate that the proposed model significantly outperforms previous lightweight models on multiple challenging datasets, and we provide detailed ablation studies for the proposed methods. The code is available at https://github.com/liangxiansheng093/BoRe-Depth.

[90] DORAEMON: A Unified Library for Visual Object Modeling and Representation Learning at Scale

Ke Du,Yimin Peng,Chao Gao,Fan Zhou,Siqiao Xue

Main category: cs.CV

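TL;DR: DORAEMON is an open-source PyTorch library that unifies visual object modeling and representation learning across scales. It supports classification, retrieval, and metric learning, offers more than 1000 pretrained backbones, and exports to ONNX or HuggingFace with a single command, speeding the path from research to deployment.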

Details Motivation: To unify visual object modeling and representation learning across scales, simplify the path from research to deployment, and improve experimental efficiency and reproducibility. Method: A YAML-driven workflow integrates several tasks (classification, retrieval, metric learning), exposes a large set of pretrained models through a timm-compatible interface, and combines them with modular losses, data augmentations, and distributed-training utilities. Result: The library matches or exceeds reference results on benchmarks such as ImageNet-1K, MS-Celeb-1M, and Stanford Online Products, and supports one-command export to ONNX or HuggingFace. Conclusion: DORAEMON provides a scalable, unified platform for visual recognition and representation learning that accelerates the transfer of research advances into real-world applications. Abstract: DORAEMON is an open-source PyTorch library that unifies visual object modeling and representation learning across diverse scales. A single YAML-driven workflow covers classification, retrieval and metric learning; more than 1000 pretrained backbones are exposed through a timm-compatible interface, together with modular losses, augmentations and distributed-training utilities. Reproducible recipes match or exceed reference results on ImageNet-1K, MS-Celeb-1M and Stanford Online Products, while one-command export to ONNX or HuggingFace bridges research and deployment. By consolidating datasets, models, and training techniques into one platform, DORAEMON offers a scalable foundation for rapid experimentation in visual recognition and representation learning, enabling efficient transfer of research advances to real-world applications. The repository is available at https://github.com/wuji3/DORAEMON.

[91] HideAndSeg: an AI-based tool with automated prompting for octopus segmentation in natural habitats

Alan de Aguiar,Michaella Pereira Andrade,Charles Morphy D. Santos,João Paulo Gois

Main category: cs.CV

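TL;DR: This paper presents HideAndSeg, a semi-automated AI tool for segmenting octopuses in videos of natural habitats. It combines SAM2 with a custom-trained YOLOv11 detector and introduces two unsupervised metrics for assessing segmentation quality, substantially reducing manual intervention and improving robustness to occlusion and complex environments.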

Details Motivation: Octopuses camouflage themselves, change color rapidly, deform non-rigidly, and are frequently occluded, and underwater lighting and turbidity vary, all of which make analysis in natural habitats very challenging; existing methods lack large-scale annotated datasets and cannot segment efficiently and automatically. Method: SAM2 is combined with a custom-trained YOLOv11 object detector: the user first supplies point coordinates to generate initial segmentation masks, which serve as training data for YOLO; bounding-box prompts then drive a fully automated SAM2 pipeline with no further manual intervention. Two unsupervised metrics, temporal consistency DICE_t and new component count NC_t, are introduced for quality assessment and mask refinement. Result: Compared with the manually prompted approach, HideAndSeg markedly reduces segmentation noise, can re-identify and segment the octopus after long periods of complete occlusion, is more robust in real natural scenes, and achieves satisfactory segmentation performance. Conclusion: The method greatly reduces the reliance on manual annotation and gives behavioral studies of wild cephalopods an efficient, practical video analysis tool, advancing marine-animal video analysis with minimal human supervision. Abstract: Analyzing octopuses in their natural habitats is challenging due to their camouflage capability, rapid changes in skin texture and color, non-rigid body deformations, and frequent occlusions, all of which are compounded by variable underwater lighting and turbidity. Addressing the lack of large-scale annotated datasets, this paper introduces HideAndSeg, a novel, minimally supervised AI-based tool for segmenting videos of octopuses. It establishes a quantitative baseline for this task. HideAndSeg integrates SAM2 with a custom-trained YOLOv11 object detector. First, the user provides point coordinates to generate the initial segmentation masks with SAM2. These masks serve as training data for the YOLO model. After that, our approach fully automates the pipeline by providing a bounding box prompt to SAM2, eliminating the need for further manual intervention. We introduce two unsupervised metrics - temporal consistency $DICE_t$ and new component count $NC_t$ - to quantitatively evaluate segmentation quality and guide mask refinement in the absence of ground-truth data, i.e., real-world information that serves to train, validate, and test AI models. Results show that HideAndSeg achieves satisfactory performance, reducing segmentation noise compared to the manually prompted approach. Our method can re-identify and segment the octopus even after periods of complete occlusion in natural environments, a scenario in which the manually prompted model fails. 
By reducing the need for manual analysis in real-world scenarios, this work provides a practical tool that paves the way for more efficient behavioral studies of wild cephalopods.
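The abstract names a "new component count" metric $NC_t$ without defining it. The sketch below shows one plausible reading, which is an assumption and not the paper's definition: count the connected regions in the current frame's mask that share no pixel with the previous frame's mask, on the idea that such regions are likely segmentation noise. The function names and the toy masks are hypothetical.

```python
def connected_components(pixels):
    """Group foreground pixels (row, col) into 4-connected components."""
    remaining = set(pixels)
    components = []
    while remaining:
        seed = remaining.pop()
        component = {seed}
        stack = [seed]
        while stack:  # iterative flood fill from the seed pixel
            r, c = stack.pop()
            for neighbour in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def new_component_count(prev_mask, curr_mask):
    """Count components in curr_mask that do not overlap prev_mask at all.

    Assumed interpretation of NC_t, for illustration only.
    """
    prev = set(prev_mask)
    return sum(1 for comp in connected_components(curr_mask) if not (comp & prev))


# Hypothetical masks for two consecutive frames, as sets of (row, col) pixels.
prev_frame = {(0, 0), (0, 1), (1, 1)}
curr_frame = {(0, 1), (1, 1), (1, 2), (4, 4), (4, 5)}

print(new_component_count(prev_frame, curr_frame))
```

In this toy case the current frame has two components; the one around (4, 4) has no overlap with the previous frame and is the only one counted as new.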

[92] Solving Convex Partition Visual Jigsaw Puzzles

Yaniv Ohayon,Ofir Itzhak Shahar,Ohad Ben-Shahar

Main category: cs.CV

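TL;DR: This paper proposes an automatic solver for jigsaw puzzles with convex polygonal pieces, designing a greedy solver that uses geometric and pictorial compatibility, and releases the first benchmark dataset for such puzzles.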

Details Motivation: Existing work on automatic puzzle solving concentrates on square-piece puzzles, which limits practical use, so solvers need to be extended to a broader range of puzzle types. Method: A greedy solving algorithm that combines geometric and pictorial compatibility features, together with a benchmark dataset of convex-partition puzzles. Result: The method handles convex polygonal puzzles, reports several performance measures, and provides the first public dataset of convex-partition puzzles. Conclusion: The method significantly expands the types of puzzles that can be solved computationally, laying groundwork for broader practical applications. Abstract: Jigsaw puzzle solving requires the rearrangement of unordered pieces into their original pose in order to reconstruct a coherent whole, often an image, and is known to be an intractable problem. While the possible impact of automatic puzzle solvers can be disruptive in various application domains, most of the literature has focused on developing solvers for square jigsaw puzzles, severely limiting their practical use. In this work, we significantly expand the types of puzzles handled computationally, focusing on what is known as Convex Partitions, a major subset of polygonal puzzles whose pieces are convex. We utilize both geometrical and pictorial compatibilities, introduce a greedy solver, and report several performance measures next to the first benchmark dataset of such puzzles.

[93] V-Thinker: Interactive Thinking with Images

Runqi Qiao,Qiuna Tan,Minghan Yang,Guanting Dong,Peiqing Yang,Shiqiang Lang,Enhui Wan,Xiaowan Wang,Yida Xu,Lan Yang,Chong Sun,Chen Li,Honggang Zhang

Main category: cs.CV

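TL;DR: This paper presents V-Thinker, a general-purpose multimodal reasoning assistant that achieves image-interactive reasoning through end-to-end reinforcement learning, and builds a Data Evolution Flywheel and a Visual Progressive Training Curriculum to improve performance on diverse, high-quality, and difficult tasks.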

Details Motivation: Existing Large Multimodal Models (LMMs) are limited in deeply combining image interaction with long-horizon reasoning, and are constrained by limited visual tool spaces and task-specific workflow designs, which makes truly image-centric reasoning hard to achieve. Method: V-Thinker has two core components: a Data Evolution Flywheel (which automatically synthesizes, evolves, and verifies interactive reasoning datasets) and a Visual Progressive Training Curriculum (which first aligns perception through point-level supervision, then integrates interactive reasoning through two-stage reinforcement learning). The authors also build the expert-verified VTBench benchmark. Result: Experiments show that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios. Conclusion: V-Thinker advances the "Thinking with Images" paradigm, offers an effective framework for deep image interaction and long-horizon reasoning, and demonstrates the potential of reinforcement learning for multimodal reasoning. Abstract: Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions: diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.

[94] Landslide Hazard Mapping with Geospatial Foundation Models: Geographical Generalizability, Data Scarcity, and Band Adaptability

Wenwen Li,Sizhe Wang,Hyunho Lee,Chenyan Lu,Sujit Roy,Rahul Ramachandran,Chia-Yu Hsu

Main category: cs.CV

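TL;DR: This study proposes a three-axis analytical framework (sensor, label, domain) based on geospatial foundation models (GeoFMs) for landslide mapping, and shows that Prithvi-EO-2.0 outperforms conventional deep learning models across sensors, regions, and label-scarce conditions.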

Details Motivation: Conventional deep learning models generalize poorly across sensors, regions, or limited labeled data, and so fall short of the need for accurate and timely landslide hazard mapping. Method: Building on Prithvi-EO-2.0, a geospatial foundation model with global pretraining, self-supervised learning, and adaptable fine-tuning, the authors construct a three-axis analytical framework of sensor, label, and domain, and compare performance against models such as U-Net and Segformer in a series of experiments. Result: Prithvi-EO-2.0 outperforms task-specific CNNs, vision transformers, and other GeoFMs under a range of conditions, showing robustness to spectral variation, high accuracy under label scarcity, and good cross-region generalization, though it also has high computational cost and suffers from a shortage of AI-ready training data. Conclusion: Geospatial foundation models offer more robust and scalable solutions for landslide risk reduction and environmental monitoring, and are an important step toward general-purpose remote sensing intelligence. Abstract: Landslides cause severe damage to lives, infrastructure, and the environment, making accurate and timely mapping essential for disaster preparedness and response. However, conventional deep learning models often struggle when applied across different sensors, regions, or under conditions of limited training data. To address these challenges, we present a three-axis analytical framework of sensor, label, and domain for adapting geospatial foundation models (GeoFMs), focusing on Prithvi-EO-2.0 for landslide mapping. Through a series of experiments, we show that it consistently outperforms task-specific CNNs (U-Net, U-Net++), vision transformers (Segformer, SwinV2-B), and other GeoFMs (TerraMind, SatMAE). The model, built on global pretraining, self-supervision, and adaptable fine-tuning, proved resilient to spectral variation, maintained accuracy under label scarcity, and generalized more reliably across diverse datasets and geographic settings. Alongside these strengths, we also highlight remaining challenges such as computational cost and the limited availability of reusable AI-ready training data for landslide research. Overall, our study positions GeoFMs as a step toward more robust and scalable approaches for landslide risk reduction and environmental monitoring.

[95] THEval. Evaluation Framework for Talking Head Video Generation

Nabyl Quignon,Baptiste Chopin,Yaohui Wang,Antitza Dantcheva

Main category: cs.CV

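TL;DR: Proposes a new evaluation framework with 8 metrics covering quality, naturalness, and synchronization, for a more comprehensive assessment of generated talking-head videos.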

Details Motivation: Existing evaluation metrics are limited, concentrating on video quality, lip synchronization, and user studies, and cannot adequately assess rapidly advancing video generation techniques. Method: A new evaluation framework with 8 metrics that focuses on the fine-grained dynamics of the head, mouth, and eyebrows as well as face quality, and that emphasizes efficiency and alignment with human preferences. Result: Experiments on 85,000 videos generated by 17 state-of-the-art models show that although many algorithms excel at lip synchronization, they still struggle to generate expressiveness and artifact-free details. Conclusion: The proposed benchmark framework helps assess progress in generative methods; the code, dataset, and leaderboards will be publicly released and regularly updated to reflect progress in the field. Abstract: Video generation has achieved remarkable progress, with generated videos increasingly resembling real ones. However, the rapid advance in generation has outpaced the development of adequate evaluation metrics. Currently, the assessment of talking head generation primarily relies on limited metrics that evaluate general video quality and lip synchronization, and on user studies. Motivated by this, we propose a new evaluation framework comprising 8 metrics related to three dimensions: (i) quality, (ii) naturalness, and (iii) synchronization. In selecting the metrics, we place emphasis on efficiency, as well as alignment with human preferences. Based on these considerations, we focus on analyzing fine-grained dynamics of the head, mouth, and eyebrows, as well as face quality. Our extensive experiments on 85,000 videos generated by 17 state-of-the-art models suggest that while many algorithms excel in lip synchronization, they face challenges with generating expressiveness and artifact-free details. These videos were generated based on a novel real dataset that we curated in order to mitigate training-data bias. Our proposed benchmark framework is aimed at evaluating the improvement of generative methods. Original code, dataset and leaderboards will be publicly released and regularly updated with new methods, in order to reflect progress in the field.

[96] Learning from Single Timestamps: Complexity Estimation in Laparoscopic Cholecystectomy

Dimitrios Anastasiou,Santiago Barbarisi,Lucy Culshaw,Jayna Patel,Evangelos B. Mazomenos,Imanol Luengo,Danail Stoyanov

Main category: cs.CV

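TL;DR: This paper proposes STC-Net, a new framework for automatically assessing surgical complexity in Laparoscopic Cholecystectomy (LC) according to the Parkland Grading Scale (PGS). It processes full surgical videos directly under weak temporal supervision and outperforms baseline methods.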

Details Motivation: Accurately assessing LC surgical complexity is essential for clinical decision-making, but existing methods mostly rely on static images or manually trimmed clips and are hard to apply to full videos in realistic settings, so a method that handles complete videos automatically and efficiently is needed. Method: The STC-Net framework comprises localization, window proposal, and grading modules and performs temporal localization and grading jointly. It introduces a new loss that combines hard and soft localization objectives with background-aware grading supervision, enabling single-timestamp complexity estimation under weak temporal supervision. Result: On a private dataset of 1,859 LC videos, STC-Net achieves an accuracy of 62.11% and an F1-score of 61.42%, more than 10% above non-localized baselines. Conclusion: STC-Net demonstrates a scalable and effective approach for automated PGS-based surgical complexity assessment from full LC videos, with potential for post-operative analysis and surgical training. Abstract: Purpose: Accurate assessment of surgical complexity is essential in Laparoscopic Cholecystectomy (LC), where severe inflammation is associated with longer operative times and increased risk of postoperative complications. The Parkland Grading Scale (PGS) provides a clinically validated framework for stratifying inflammation severity; however, its automation in surgical videos remains largely unexplored, particularly in realistic scenarios where complete videos must be analyzed without prior manual curation. Methods: In this work, we introduce STC-Net, a novel framework for Single-Timestamp-based Complexity estimation in LC via the PGS, designed to operate under weak temporal supervision. Unlike prior methods limited to static images or manually trimmed clips, STC-Net operates directly on full videos. It jointly performs temporal localization and grading through a localization, window proposal, and grading module. We introduce a novel loss formulation combining hard and soft localization objectives and background-aware grading supervision. Results: Evaluated on a private dataset of 1,859 LC videos, STC-Net achieves an accuracy of 62.11% and an F1-score of 61.42%, outperforming non-localized baselines by over 10% in both metrics and highlighting the effectiveness of weak supervision for surgical complexity assessment. Conclusion: STC-Net demonstrates a scalable and effective approach for automated PGS-based surgical complexity estimation from full LC videos, making it promising for post-operative analysis and surgical training.

[97] UniSplat: Unified Spatio-Temporal Fusion via 3D Latent Scaffolds for Dynamic Driving Scene Reconstruction

Chen Shi,Shaoshuai Shi,Xiaoyang Lyu,Chunyang Liu,Kehua Sheng,Bo Zhang,Li Jiang

Main category: cs.CV

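TL;DR: UniSplat is a general feed-forward framework for 3D reconstruction of dynamic scenes in autonomous driving, achieving robust reconstruction through unified latent spatio-temporal fusion.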

Details Motivation: Existing methods perform poorly with sparse, non-overlapping camera views and complex scene dynamics, and struggle to reconstruct dynamic scenes at high quality. Method: A 3D latent scaffold is built that captures geometric and semantic context using pretrained foundation models; an efficient fusion mechanism integrates information across space and time within the scaffold; and a dual-branch decoder generates dynamic-aware Gaussians while maintaining a persistent memory of static Gaussians to support streaming scene completion. Result: Experiments on real-world datasets show that UniSplat reaches state-of-the-art novel view synthesis and still delivers robust, high-quality renderings for viewpoints outside the original camera coverage. Conclusion: Through unified latent spatio-temporal fusion and dual-branch decoding, UniSplat effectively addresses 3D reconstruction under sparse multi-view input and scene dynamics, and is suited to real-time scene understanding and completion in autonomous driving. Abstract: Feed-forward 3D reconstruction for autonomous driving has advanced rapidly, yet existing methods struggle with the joint challenges of sparse, non-overlapping camera views and complex scene dynamics. We present UniSplat, a general feed-forward framework that learns robust dynamic scene reconstruction through unified latent spatio-temporal fusion. UniSplat constructs a 3D latent scaffold, a structured representation that captures geometric and semantic scene context by leveraging pretrained foundation models. To effectively integrate information across spatial views and temporal frames, we introduce an efficient fusion mechanism that operates directly within the 3D scaffold, enabling consistent spatio-temporal alignment. To ensure complete and detailed reconstructions, we design a dual-branch decoder that generates dynamic-aware Gaussians from the fused scaffold by combining point-anchored refinement with voxel-based generation, and maintain a persistent memory of static Gaussians to enable streaming scene completion beyond current camera coverage. Extensive experiments on real-world datasets demonstrate that UniSplat achieves state-of-the-art performance in novel view synthesis, while providing robust and high-quality renderings even for viewpoints outside the original camera coverage.

[98] PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning

Yicheng Xiao,Yu Chen,Haoxuan Ma,Jiale Hong,Caorui Li,Lingxiang Wu,Haiyun Guo,Jinqiao Wang

Main category: cs.CV

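TL;DR: This paper proposes PixCLIP, a new framework that handles visual prompts and long textual descriptions at the same time. By building the LongGRIT dataset of nearly 1.5 million samples and adopting a three-branch pixel-text alignment learning framework, it achieves breakthrough performance on fine-grained image-text alignment.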

Details Motivation: Although CLIP performs well across many vision-language understanding tasks, the token length limit of its text encoder prevents it from processing finer-grained textual information, which restricts improvements in fine-grained image-text alignment. Method: The PixCLIP framework (1) establishes an annotation pipeline that automatically generates pixel-level localized long-form descriptions of images and uses it to build the LongGRIT dataset; (2) replaces CLIP's original text encoder with a large language model (LLM); and (3) designs a three-branch pixel-text alignment learning framework that aligns image regions with textual descriptions at arbitrary granularity. Result: PixCLIP shows clear advantages in pixel-level interaction and long-text handling and achieves state-of-the-art performance on multiple benchmarks, confirming the value of increasing visual and textual processing granularity together. Conclusion: By jointly raising the processing granularity of the visual and textual modalities, PixCLIP effectively addresses CLIP's limitations in fine-grained image-text alignment and points to a new direction for future multimodal models. Abstract: While the Contrastive Language-Image Pretraining (CLIP) model has achieved remarkable success in a variety of downstream vision-language understanding tasks, enhancing its capability for fine-grained image-text alignment remains an active research focus. To this end, most existing works adopt the strategy of explicitly increasing the granularity of visual information processing, e.g., incorporating visual prompts to guide the model to focus on specific local regions within the image. Meanwhile, research on Multimodal Large Language Models (MLLMs) has demonstrated that training with long and detailed textual descriptions can effectively improve the model's fine-grained vision-language alignment. However, the inherent token length limitation of CLIP's text encoder fundamentally prevents CLIP from processing the more granular textual information embedded in long text sequences. To synergistically leverage the advantages of enhancing both visual and textual content processing granularity, we propose PixCLIP, a novel framework designed to concurrently accommodate visual prompt inputs and process lengthy textual descriptions. Specifically, we first establish an automated annotation pipeline capable of generating pixel-level localized, long-form textual descriptions for images. Utilizing this pipeline, we construct LongGRIT, a high-quality dataset comprising nearly 1.5 million samples. 
Secondly, we replace CLIP's original text encoder with an LLM and propose a three-branch pixel-text alignment learning framework, facilitating fine-grained alignment between image regions and corresponding textual descriptions at arbitrary granularity. Experiments demonstrate that PixCLIP delivers breakthroughs in pixel-level interaction and in handling long-form texts, achieving state-of-the-art performance.

[99] Building Trust in Virtual Immunohistochemistry: Automated Assessment of Image Quality

Tushar Kataria,Shikha Dubey,Mary Bronner,Jolanta Jedrzkiewicz,Ben J. Brintz,Shireen Y. Elhabian,Beatrice S. Knudsen

Main category: cs.CV

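TL;DR: Proposes an automated, accuracy-grounded framework for assessing the image quality of virtual immunohistochemistry (IHC) stains, measuring model performance with pixel-level stain accuracy metrics (such as Dice and IoU). It finds that conventional image fidelity metrics correlate poorly with actual stain accuracy, that paired models perform better, and that whole-slide image evaluation matters.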

Details Motivation: Existing image quality metrics (such as FID, PSNR, and SSIM) measure only image fidelity and do not reflect the accuracy of virtual IHC staining, and there is no objective evaluation method that avoids manual annotation. Method: Color deconvolution produces masks of brown-stained pixels for real and virtual IHC; Dice, IoU, and Hausdorff distance quantify stain accuracy; 16 paired or unpaired image translation models are evaluated, and performance is also checked on whole-slide images. Result: Conventional fidelity metrics correlate poorly with stain accuracy and pathologist assessment; paired models (such as PyramidPix2Pix and AdaptiveNCE) achieve the highest stain accuracy, while unpaired diffusion- and GAN-based models are less reliable; whole-slide evaluation reveals performance declines that patch-based evaluation misses. Conclusion: The proposed framework gives virtual IHC models a reproducible, accuracy-grounded quality assessment, an important step toward practical use in pathology. Abstract: Deep learning models can generate virtual immunohistochemistry (IHC) stains from hematoxylin and eosin (H&E) images, offering a scalable and low-cost alternative to laboratory IHC. However, reliable evaluation of image quality remains a challenge as current texture- and distribution-based metrics quantify image fidelity rather than the accuracy of IHC staining. Here, we introduce an automated and accuracy-grounded framework to determine image quality across sixteen paired or unpaired image translation models. Using color deconvolution, we generate masks of pixels stained brown (i.e., IHC-positive) as predicted by each virtual IHC model. We use the segmented masks of real and virtual IHC to compute stain accuracy metrics (Dice, IoU, Hausdorff distance) that directly quantify correct pixel-level labeling without needing expert manual annotations. Our results demonstrate that conventional image fidelity metrics, including Fréchet Inception Distance (FID), peak signal-to-noise ratio (PSNR), and structural similarity (SSIM), correlate poorly with stain accuracy and pathologist assessment. Paired models such as PyramidPix2Pix and AdaptiveNCE achieve the highest stain accuracy, whereas unpaired diffusion- and GAN-based models are less reliable in providing accurate IHC positive pixel labels. Moreover, whole-slide images (WSI) reveal performance declines that are invisible in patch-based evaluations, emphasizing the need for WSI-level benchmarks. 
Together, this framework defines a reproducible approach for assessing the quality of virtual IHC models, a critical step to accelerate translation towards routine use by pathologists.
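
To make the stain accuracy metrics concrete, here is a minimal sketch of Dice and IoU on two binary masks. It assumes the masks have already been produced (for example by color deconvolution and thresholding, which is not shown), and the function name `dice_and_iou` and the empty-mask convention are illustrative choices, not taken from the paper.

```python
def dice_and_iou(real_mask, virtual_mask):
    """Return (Dice, IoU) for two equally sized binary masks given as nested lists."""
    # Flatten both masks to 1-D lists of booleans (True = IHC-positive pixel).
    real = [bool(p) for row in real_mask for p in row]
    virt = [bool(p) for row in virtual_mask for p in row]
    if len(real) != len(virt):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(r and v for r, v in zip(real, virt))
    union = sum(r or v for r, v in zip(real, virt))
    total = sum(real) + sum(virt)

    # Assumed convention: two empty masks count as perfect agreement.
    dice = 1.0 if total == 0 else 2.0 * intersection / total
    iou = 1.0 if union == 0 else intersection / union
    return dice, iou


# Toy 2x3 masks: 2 pixels overlap, each mask has 3 positives, union is 4.
real = [[1, 1, 0], [0, 1, 0]]
virtual = [[1, 0, 0], [0, 1, 1]]
dice, iou = dice_and_iou(real, virtual)
```

Both scores lie in [0, 1], and Dice is never smaller than IoU for the same pair of masks, which is why the two are usually reported together rather than compared against each other.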

[100] NovisVQ: A Streaming Convolutional Neural Network for No-Reference Opinion-Unaware Frame Quality Assessment

Kylie Cancilla,Alexander Moore,Amar Saini,Carmen Carrano

Main category: cs.CV

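TL;DR: Proposes a temporally aware, streaming, no-reference video quality assessment model that needs neither human labels nor pristine reference videos. Trained on synthetically degraded data, it predicts full-reference metrics and outperforms traditional baselines.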

Details Motivation: Existing video quality assessment methods depend on reference videos or on costly human opinion scores, and most no-reference methods ignore critical temporal context, which makes them hard to apply to scalable video analysis in real-world settings. Method: Using synthetically degraded versions of the DAVIS dataset, a temporally aware convolutional architecture is trained to predict full-reference metrics (LPIPS, PSNR, SSIM) directly from degraded video, with no reference video needed at inference. Result: The streaming approach outperforms an image-based baseline across multiple degradation types and correlates more strongly with full-reference metrics than traditional methods such as BRISQUE, supporting the value of temporal modeling. Conclusion: By adding temporal modeling, the proposed no-reference, opinion-unaware streaming VQA model assesses video quality efficiently and accurately, and suits scalable use in practical vision systems. Abstract: Video quality assessment (VQA) is vital for computer vision tasks, but existing approaches face major limitations: full-reference (FR) metrics require clean reference videos, and most no-reference (NR) models depend on training on costly human opinion labels. Moreover, most opinion-unaware NR methods are image-based, ignoring temporal context critical for video object detection. In this work, we present a scalable, streaming-based VQA model that is both no-reference and opinion-unaware. Our model leverages synthetic degradations of the DAVIS dataset, training a temporal-aware convolutional architecture to predict FR metrics (LPIPS, PSNR, SSIM) directly from degraded video, without references at inference. We show that our streaming approach outperforms our own image-based baseline by generalizing across diverse degradations, underscoring the value of temporal modeling for scalable VQA in real-world vision systems. Additionally, we demonstrate that our model achieves higher correlation with full-reference metrics compared to BRISQUE, a widely used opinion-aware image quality assessment baseline, validating the effectiveness of our temporal, opinion-unaware approach.
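
As an illustration of what a full-reference training target looks like, the sketch below computes PSNR between a clean frame and a degraded copy. It assumes 8-bit grayscale frames stored as nested lists; the function name `psnr` and the toy frames are hypothetical, and the paper's actual target-generation code may differ.

```python
import math


def psnr(reference, degraded, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized grayscale frames."""
    ref = [p for row in reference for p in row]
    deg = [p for row in degraded for p in row]
    if len(ref) != len(deg):
        raise ValueError("frames must have the same number of pixels")

    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref, deg)) / len(ref)
    if mse == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


clean = [[0, 255], [128, 64]]
noisy = [[0, 250], [128, 64]]  # one pixel off by 5, so MSE = 25 / 4
target = psnr(clean, noisy)
```

At training time a label like `target` would be computed from each clean/degraded pair, while at inference the network sees only the degraded frames, which is what makes the model no-reference.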

[101] Polarization-resolved imaging improves eye tracking

Mantas Žurauskas,Tom Bu,Sanaz Alali,Beyza Kalkanli,Derek Shi,Fernando Alamos,Gauresh Pandit,Christopher Mei,Ali Behrooz,Ramin Mirjalili,Dave Stronks,Alexander Fix,Dmitri Model

Main category: cs.CV

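TL;DR: Presents a polarization-enabled eye tracking (PET) system based on polarization-resolved near-infrared imaging. Pairing a polarization-filter-array camera with a linearly polarized near-infrared illuminator makes ocular features more visible and improves eye tracking accuracy under a range of disturbances.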

Details Motivation: Conventional intensity imaging limits eye tracking because it offers few features and is sensitive to external disturbances (eyelid occlusion, pupil changes, and so on), so a more robust optical contrast mechanism is needed. Method: A PET system is built from a polarization-filter-array camera and a linearly polarized near-infrared illuminator; it exploits how ocular tissues reflect polarized light to obtain additional contrast, and convolutional neural network models are trained on PET data to estimate gaze. Result: On data from 346 participants, PET reduces the median 95th-percentile absolute gaze error by 10-16% relative to intensity baselines, under both nominal and disturbed conditions. Conclusion: Polarization imaging reveals trackable ocular features that are missing from conventional intensity images, improves the robustness and accuracy of eye tracking, and shows potential for wearable devices. Abstract: Polarization-resolved near-infrared imaging adds a useful optical contrast mechanism to eye tracking by measuring the polarization state of light reflected by ocular tissues in addition to its intensity. In this paper we demonstrate how this contrast can be used to enable eye tracking. Specifically, we demonstrate that a polarization-enabled eye tracking (PET) system composed of a polarization-filter-array camera paired with a linearly polarized near-infrared illuminator can reveal trackable features across the sclera and gaze-informative patterns on the cornea, largely absent in intensity-only images. Across a cohort of 346 participants, convolutional neural network based machine learning models trained on data from PET reduced the median 95th-percentile absolute gaze error by 10-16% relative to capacity-matched intensity baselines under nominal conditions and in the presence of eyelid occlusions, eye-relief changes, and pupil-size variation. These results link light-tissue polarization effects to practical gains in human-computer interaction and position PET as a simple, robust sensing modality for future wearable devices.

[102] Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts

Ellis Brown,Jihan Yang,Shusheng Yang,Rob Fergus,Saining Xie

Main category: cs.CV

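TL;DR: Proposes a framework for diagnosing and debiasing multimodal LLM benchmarks. A Test-set Stress-Test (TsT) on text-only inputs and Iterative Bias Pruning (IBP) expose and reduce non-visual biases, so the benchmark better measures genuine visual understanding.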

Details Motivation: Existing vision-centric multimodal benchmarks can be "gamed" through linguistic priors and superficial patterns, letting models score highly without real visual understanding, so more robust benchmark design principles are needed. Method: The Test-set Stress-Test (TsT) fine-tunes a large language model with k-fold cross-validation on the text-only inputs of the test set to detect shortcut learning; a Random Forest on hand-crafted features adds interpretable auditing; Iterative Bias Pruning (IBP) then removes high-bias samples. Result: Pervasive non-visual biases are found in four mainstream benchmarks (VSI-Bench, CV-Bench, MMMU, VideoMME); the debiased VSI-Bench-Debiased substantially lowers the share of questions solvable from text alone and widens the performance gap between vision-blind and vision-enabled models. Conclusion: Benchmark designers should proactively "attack" their own test sets before release, using systematic diagnosis and debiasing to make multimodal benchmarks more reliable and valid. Abstract: Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to "game" their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set": probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a "Test-set Stress-Test" (TsT) methodology. Our primary diagnostic tool involves fine-tuning a powerful Large Language Model via $k$-fold cross-validation on exclusively the non-visual, textual inputs of the test set to reveal shortcut performance and assign each sample a bias score $s(x)$. We complement this with a lightweight Random Forest-based diagnostic operating on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering high-bias samples using an "Iterative Bias Pruning" (IBP) procedure. Applying this framework to four benchmarks (VSI-Bench, CV-Bench, MMMU, and VideoMME), we uncover pervasive non-visual biases. As a case study, we apply our full framework to create VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider vision-blind performance gap than the original.
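
The k-fold idea behind TsT and the pruning loop behind IBP can be sketched in a few lines. This is a deliberately simplified stand-in, not the authors' method: instead of a fine-tuned LLM or a Random Forest, the "blind" model below just guesses the most frequent answer per question template, the bias score is binary, and every name (`blind_bias_scores`, `iterative_bias_pruning`, the toy samples) is hypothetical.

```python
from collections import Counter, defaultdict


def blind_bias_scores(samples, k=3):
    """Assign each (template, answer) sample a bias score using k-fold held-out guesses.

    A score of 1.0 means a text-only guesser trained on the other folds
    answered the held-out sample correctly without seeing any image.
    """
    scores = [0.0] * len(samples)
    for fold in range(k):
        held_out = [i for i in range(len(samples)) if i % k == fold]
        held = set(held_out)

        # "Train" on the remaining folds: count answers per question template.
        per_template = defaultdict(Counter)
        for i, (template, answer) in enumerate(samples):
            if i not in held:
                per_template[template][answer] += 1

        # "Test" on the held-out fold with the majority answer for each template.
        for i in held_out:
            template, answer = samples[i]
            counts = per_template.get(template)
            if counts:
                guess = counts.most_common(1)[0][0]
                scores[i] = 1.0 if guess == answer else 0.0
    return scores


def iterative_bias_pruning(samples, rounds=2, k=3):
    """Repeatedly drop samples the blind guesser solves, re-scoring after each round."""
    kept = list(samples)
    for _ in range(rounds):
        scores = blind_bias_scores(kept, k)
        pruned = [s for s, score in zip(kept, scores) if score < 1.0]
        if len(pruned) == len(kept):
            break  # nothing left to prune
        kept = pruned
    return kept


# Toy benchmark: "count" questions always have answer "2" (a text-only shortcut),
# while "dir" questions have answers that cannot be guessed from the template.
samples = [("count", "2")] * 6 + [
    ("dir", "left"), ("dir", "right"), ("dir", "back"),
    ("dir", "front"), ("dir", "up"), ("dir", "down"),
]
scores = blind_bias_scores(samples)
debiased = iterative_bias_pruning(samples)
```

Re-scoring after each pruning round matters because removing biased samples changes the answer statistics the guesser can exploit; a one-shot filter would miss shortcuts that only become dominant after the first pass.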

[103] SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

Ellis Brown,Arijit Ray,Ranjay Krishna,Ross Girshick,Rob Fergus,Saining Xie

Main category: cs.CV

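TL;DR: Presents SIMS-V, a framework that uses 3D simulators to generate spatially rich video training data and improve the real-world spatial reasoning of multimodal language models.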

Details Motivation: Existing multimodal language models are weak at spatial reasoning across time and space, and real video with precise spatial annotations is hard to obtain, which limits training. Method: The SIMS-V framework uses the privileged information of 3D simulators to generate video data with fine-grained spatial annotations, and systematic ablations study how question types, mixes, and scale affect real-world transfer. Result: Just three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) transfer spatial intelligence most effectively; a 7B-parameter video LLM fine-tuned on 25K simulated examples outperforms the larger 72B baseline and is competitive with proprietary models on real-world spatial reasoning benchmarks. Conclusion: SIMS-V efficiently trains video LLMs that generalize well, keeping general video understanding performance while substantially improving embodied and real-world spatial tasks. Abstract: Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V, a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.

[104] Cambrian-S: Towards Spatial Supersensing in Video

Shusheng Yang,Jihan Yang,Pinzhi Huang,Ellis Brown,Zihao Yang,Yue Yu,Shengbang Tong,Zihan Zheng,Yifan Xu,Muhan Wang,Daohan Lu,Rob Fergus,Yann LeCun,Li Fei-Fei,Saining Xie

Main category: cs.CV

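TL;DR: Proposes "supersensing" as a new paradigm for advancing multimodal intelligence, spanning four stages: semantic perception, event cognition, implicit 3D spatial understanding, and predictive world modeling. It also releases the VSI-SUPER benchmark to evaluate spatial recall and counting over long video inputs. Experiments show that scaling data alone does not solve the challenge, whereas a self-supervised predictive latent-frame model that uses prediction error to drive memory and event segmentation substantially outperforms existing methods.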

Details Motivation: Current vision-language models rely mainly on reactive tasks and brute-force context expansion, lack a deeper understanding of real-world dynamics, and fall short of true multimodal intelligence. The authors argue for a shift from passive perception to actively building internal world models in order to advance spatial cognition. Method: A four-stage framework for spatial supersensing is proposed, together with the VSI-SUPER benchmark (two subtasks, VSR and VSC), whose tasks require arbitrarily long video inputs yet resist brute-force context expansion. The limits of data scaling are tested by curating the large-scale VSI-590K dataset and training Cambrian-S; a self-supervised next-latent-frame predictor is then proposed, which uses prediction error (surprise) to update memory and segment events. Result: Cambrian-S improves by 30% on VSI-Bench, but its performance on VSI-SUPER remains limited, indicating that scale alone is not enough. The proposed predictive sensing approach substantially outperforms leading proprietary baselines on VSI-SUPER, supporting the importance of prediction for spatial supersensing. Conclusion: True spatial supersensing requires models that not only see but also anticipate, select, and organize experience; future work should focus on active perception systems with built-in predictive mechanisms rather than on ever larger data or context. Abstract: We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These tasks require arbitrarily long video inputs yet are resistant to brute-force context expansion. We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving +30% absolute improvement on VSI-Bench without sacrificing general capabilities. Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation. On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.
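
The surprise-driven segmentation idea can be illustrated with a toy example. The sketch below assumes the predicted and observed latent frames are already available as plain vectors, uses mean squared error as the surprise signal, and marks a boundary wherever it exceeds a fixed threshold; the learned predictor itself, the function name `segment_by_surprise`, and the threshold value are all assumptions rather than details from the paper.

```python
def segment_by_surprise(predicted, observed, threshold):
    """Return the frame indices whose prediction error (surprise) exceeds the threshold."""
    boundaries = []
    for t, (pred, obs) in enumerate(zip(predicted, observed)):
        # Surprise = mean squared error between predicted and observed latent vectors.
        surprise = sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(pred)
        if surprise > threshold:
            boundaries.append(t)  # treat a highly surprising frame as an event boundary
    return boundaries


# Toy latents: the prediction is accurate except at frame 2, where the scene changes.
predicted = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
observed = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]]
boundaries = segment_by_surprise(predicted, observed, threshold=1.0)
```

In a streaming setting the same signal could also decide which frames are worth writing to memory, so that well-predicted frames are compressed or skipped rather than stored at full cost.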

[105] InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation

Jinlai Liu,Jian Han,Bin Yan,Hui Wu,Fengda Zhu,Xing Wang,Yi Jiang,Bingyue Peng,Zehuan Yuan

Main category: cs.CV

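TL;DR: InfinityStar is a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. It generates 720p video efficiently and surpasses existing autoregressive models and some diffusion models on several metrics.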

Details Motivation: Existing video generation models are limited in efficiency and resolution and lack a unified framework that supports multiple generation tasks. Method: InfinityStar takes a purely discrete autoregressive approach that jointly models spatial and temporal dependencies in a single architecture, supporting text-to-image, text-to-video, image-to-video, and other tasks. Result: It scores 83.74 on VBench, well ahead of other autoregressive models and even above some diffusion models (such as HunyuanVideo); without extra optimizations it generates a 5-second 720p video about 10x faster than leading diffusion methods. Conclusion: InfinityStar is the first discrete autoregressive video generator capable of producing industrial-level 720p video, combining efficiency with quality and advancing efficient video generation. Abstract: We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. Building on the recent success of autoregressive modeling in both vision and language, our purely discrete approach jointly captures spatial and temporal dependencies within a single architecture. This unified design naturally supports a variety of generation tasks such as text-to-image, text-to-video, image-to-video, and long interactive video synthesis via straightforward temporal autoregression. Extensive experiments demonstrate that InfinityStar scores 83.74 on VBench, outperforming all autoregressive models by large margins, even surpassing some diffusion competitors like HunyuanVideo. Without extra optimizations, our model generates a 5s, 720p video approximately 10x faster than leading diffusion-based methods. To our knowledge, InfinityStar is the first discrete autoregressive video generator capable of producing industrial-level 720p videos. We release all code and models to foster further research in efficient, high-quality video generation.

[106] Tracking and Understanding Object Transformations

Yihong Sun,Xinyu Yang,Jennifer J. Sun,Bharath Hariharan

Main category: cs.CV

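TL;DR: Introduces the Track Any State (TAS) task: tracking objects through state transformations while detecting and describing the changes. The authors build the VOST-TAS dataset and propose TubeletGraph, a zero-shot method that recovers transformed objects and builds a state evolution graph, performing well on temporal grounding and semantic reasoning for complex object transformations.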

Details Motivation: Existing trackers often lose the target when an object's appearance changes significantly (for example through cutting or metamorphosis) and struggle to capture how objects evolve, so a method that keeps tracking and understands state changes is needed. Method: TubeletGraph is a zero-shot system that first identifies track segments that may have been overlooked and decides whether to merge them based on semantic and spatial proximity priors; it then reasons about the added tracks and builds a graph describing the object's state evolution. A new benchmark dataset, VOST-TAS, is also released. Result: TubeletGraph achieves state-of-the-art tracking performance under object transformations; it recovers the tracks of transformed objects and generates a graph describing the state changes, showing strong temporal grounding and semantic reasoning. Conclusion: TubeletGraph handles the tracking challenges posed by object state transformations and provides a deeper understanding of how objects evolve, offering a new approach and tool for analysing dynamic objects in video understanding. Abstract: Real-world objects frequently undergo state transformations. From an apple being cut into pieces to a butterfly emerging from its cocoon, tracking through these changes is important for understanding real-world objects and dynamics. However, existing methods often lose track of the target object after transformation, due to significant changes in object appearance. To address this limitation, we introduce the task of Track Any State: tracking objects through transformations while detecting and describing state changes, accompanied by a new benchmark dataset, VOST-TAS. To tackle this problem, we present TubeletGraph, a zero-shot system that recovers missing objects after transformation and maps out how object states are evolving over time. TubeletGraph first identifies potentially overlooked tracks, and determines whether they should be integrated based on semantic and proximity priors. Then, it reasons about the added tracks and generates a state graph describing each observed transformation. TubeletGraph achieves state-of-the-art tracking performance under transformations, while demonstrating deeper understanding of object transformations and promising capabilities in temporal grounding and semantic reasoning for complex object transformations. Code, additional results, and the benchmark dataset are available at https://tubelet-graph.github.io.

Rafe Loya,Andrew Hamara,Benjamin Estell,Benjamin Kilpatrick,Andrew C. Freeman

Main category: cs.CV

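TL;DR: Examines the problem of automatically producing multiple distinct, aesthetically appealing crops of an image. It introduces a dataset of 277 images with human labels and evaluates several single-crop models combined with an image partitioning algorithm.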

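Details Motivation: Modern social media applications need to show several distinct, attractive crops of an image, but existing research focuses on single crops and largely ignores the multi-crop problem. Method: A new dataset is introduced, and several single-crop models are evaluated with image partitioning as a pre-processing step. Result: The dataset comprises 277 images with human labels; the evaluation indicates that single-crop models combined with image partitioning can produce multiple attractive crop regions. Conclusion: The new dataset and evaluation show the potential of combining image partitioning with single-crop models to generate diverse, attractive crops, offering a workable approach to multi-crop generation. Abstract: Automatic image cropping is a method for maximizing the human-perceived quality of cropped regions in photographs. Although several works have proposed techniques for producing singular crops, little work has addressed the problem of producing multiple, distinct crops with aesthetic appeal. In this paper, we motivate the problem with a discussion on modern social media applications, introduce a dataset of 277 relevant images and human labels, and evaluate the efficacy of several single-crop models with an image partitioning algorithm as a pre-processing step. The dataset is available at https://github.com/RafeLoya/carousel.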