cs.CL

[1] Automated Item Neutralization for Non-Cognitive Scales: A Large Language Model Approach to Reducing Social-Desirability Bias

Sirui Wu,Daijin Yang

Main category: cs.CL

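TL;DR: This study examines whether using a large language model (GPT-o3) to neutralize personality test items reduces social desirability bias. The neutralized version preserved reliability and the five-factor structure, and correlations with social desirability decreased for several items, but measurement invariance was not fully met, suggesting the approach is promising but not yet mature.

Details Motivation: Social desirability bias undermines the validity of personality assessment, particularly in self-report scales. To improve measurement accuracy, the study explores using a large language model to neutralize test items so that respondents are less inclined to answer in socially desirable ways. Method: GPT-o3 was used to rewrite the items of the International Personality Item Pool Big Five Measure (IPIP-BFM-50). 203 participants completed either the original or the neutralized version along with the Marlowe-Crowne Social Desirability Scale, and the psychometric properties of the two versions were then compared. Result: The neutralized version retained good reliability and a five-factor structure. Correlations with social desirability were expected to fall, but the decrease was inconsistent; Conscientiousness showed gains while Agreeableness and Openness declined; configural invariance held, but metric and scalar invariance failed. Conclusion: AI-assisted item neutralization is a promising but immature method: it can mitigate social desirability bias to some degree, but preserving measurement equivalence remains a challenge and needs further work. Abstract: This study evaluates item neutralization assisted by a large language model (LLM) to reduce social desirability bias in personality assessment. GPT-o3 was used to rewrite the International Personality Item Pool Big Five Measure (IPIP-BFM-50), and 203 participants completed either the original or neutralized form along with the Marlowe-Crowne Social Desirability Scale. The results showed preserved reliability and a five-factor structure, with gains in Conscientiousness and declines in Agreeableness and Openness. The correlations with social desirability decreased for several items, but inconsistently. Configural invariance held, though metric and scalar invariance failed. Findings support AI neutralization as a potential but imperfect bias-reduction method.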

[2] FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering

Gyubok Lee,Elea Bach,Eric Yang,Tom Pollard,Alistair Johnson,Edward Choi,Yugang Jia,Jong Ha Lee

Main category: cs.CL

TL;DR: This paper presents FHIR-AgentBench, a realistic clinical question-answering benchmark grounded in the HL7 FHIR standard, for evaluating large language models on interoperable healthcare data.

Details Motivation: Existing benchmarks have not kept pace with the adoption of the HL7 FHIR standard and cannot evaluate LLMs on realistic, resource-based healthcare data. Method: The authors build FHIR-AgentBench, a benchmark of 2,931 real-world clinical questions, and systematically evaluate different data retrieval strategies, interaction patterns, and reasoning strategies. Result: Experiments show that retrieving data from complex FHIR resources and reasoning over it are both significantly challenging and critically affect question answering performance. Conclusion: FHIR-AgentBench is a useful tool for developing reliable, reproducible clinical LLM agents and for advancing FHIR-based AI applications. Abstract: The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, requiring LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench, a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question answering performance. We publicly release the FHIR-AgentBench dataset and evaluation suite (https://github.com/glee4810/FHIR-AgentBench) to promote reproducible research and the development of robust, reliable LLM agents for clinical applications.

[3] Readme_AI: Dynamic Context Construction for Large Language Models

Millie Vyas,Timothy Blattner,Alden Dima

Main category: cs.CL

TL;DR: This paper proposes an extensible specification, demonstrated with a prototype Readme_AI Model Context Protocol (MCP) server, for dynamically building data-source-specific context for large language models (LLMs), reducing hallucinations and inaccuracies when they answer specialized questions.

Details Motivation: Although LLMs are trained on large amounts of data, they can still produce inaccurate or unreliable information for a specific query. Providing query-relevant context improves the accuracy and usefulness of their responses. Method: The authors design a specification and implement a prototype Readme_AI MCP server. The data source owner provides a file containing metadata; the specification supports several dynamic content types, including crawling web pages, fetching data from repositories, and downloading and parsing publications; and the resulting context is grouped and formatted using user-specified tags for the LLM to reason over. Result: A case study on the NIST-developed Hedgehog library demonstrates the approach: an LLM that previously tended to answer incorrectly can, once given the context supplied through Readme_AI, correctly describe the library's functionality and use and generate code based on the provided examples. Conclusion: The work offers an extensible, dynamic protocol that grounds LLMs in specialized, owner-provided data, improving response accuracy and reducing hallucinations, with practical application potential. Abstract: Despite being trained on significant amounts of data, Large Language Models (LLMs) can provide inaccurate or unreliable information in the context of a user's specific query. Giving an LLM query-specific context significantly improves the usefulness of its responses. In this paper, we present a specification that can be used to dynamically build context for data sources. The data source owner creates the file containing metadata for LLMs to use when reasoning about dataset-related queries. To demonstrate our proposed specification, we created a prototype Readme_AI Model Context Protocol (MCP) server that retrieves the metadata from the data source and uses it to dynamically build context. Some features that make this specification dynamic are the extensible types that represent crawling web pages, fetching data from data repositories, downloading and parsing publications, and general text. The context is formatted and grouped using user-specified tags that provide clear contextual information for the LLM to reason about the content. We demonstrate the capabilities of this early prototype by asking the LLM about the NIST-developed Hedgehog library, for which common LLMs often provide inaccurate and irrelevant responses containing hallucinations. With Readme_AI, the LLM receives enough context that it is now able to reason about the library and its use, and even generate code interpolated from examples that were included in the Readme_AI file provided by Hedgehog's developer.
Our primary contribution is an extensible protocol for dynamically grounding LLMs in specialized, owner-provided data, enhancing responses from LLMs and reducing hallucinations. The source code for the Readme_AI tool is posted here: https://github.com/usnistgov/readme_ai.

[4] Magnitude Matters: a Superior Class of Similarity Metrics for Holistic Semantic Understanding

V. S. Raghu Parupudi

Main category: cs.CL

TL;DR: This paper proposes a new class of parameter-free, magnitude-aware similarity metrics, Overlap Similarity (OS) and Hyperbolic Tangent Similarity (HTS), which significantly outperform the conventional dot product and cosine similarity on tasks requiring holistic semantic understanding, such as paraphrase and inference.

Details Motivation: Existing methods for comparing high-dimensional vectors either depend on vector magnitude (dot product) or ignore it entirely (cosine similarity), and so fail to balance direction against magnitude. Method: The paper proposes two new similarity functions, Overlap Similarity (OS) and Hyperbolic Tangent Similarity (HTS), and evaluates them with four state-of-the-art sentence embedding models on eight standard NLP benchmarks, using the Wilcoxon signed-rank test for statistical significance. Result: On paraphrase and inference tasks, OS and HTS achieve significantly lower Mean Squared Error than the dot product and cosine similarity, regardless of the embedding model; no significant improvement appears on benchmarks that stress compositional semantics (SICK, STS-B). Conclusion: Magnitude-aware similarity metrics perform better on holistic semantic understanding tasks, and representing compositional semantics is identified as a direction worth further study. Abstract: Vector comparison in high dimensions is a fundamental task in NLP, yet it is dominated by two baselines: the raw dot product, which is unbounded and sensitive to vector norms, and the cosine similarity, which discards magnitude information entirely. This paper challenges both standards by proposing and rigorously evaluating a new class of parameter-free, magnitude-aware similarity metrics. I introduce two such functions, Overlap Similarity (OS) and Hyperbolic Tangent Similarity (HTS), designed to integrate vector magnitude and alignment in a more principled manner. To ensure that my findings are robust and generalizable, I conducted a comprehensive evaluation using four state-of-the-art sentence embedding models (all-MiniLM-L6-v2, all-mpnet-base-v2, paraphrase-mpnet-base-v2, and BAAI/bge-large-en-v1.5) across a diverse suite of eight standard NLP benchmarks, including STS-B, SICK, Quora, and PAWS. Using the Wilcoxon signed-rank test for statistical significance, my results are definitive: on the tasks requiring holistic semantic understanding (paraphrase and inference), both OS and HTS provide a statistically significant improvement in Mean Squared Error over both the raw dot product and cosine similarity, regardless of the underlying embedding model. Crucially, my findings delineate the specific domain of advantage for these metrics: for tasks requiring holistic semantic understanding like paraphrase and inference, my magnitude-aware metrics offer a statistically superior alternative.
This significant improvement was not observed on benchmarks designed to test highly nuanced compositional semantics (SICK, STS-B), identifying the challenge of representing compositional text as a distinct and important direction for future work.
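The contrast between the two baselines named in the abstract can be made concrete with a small sketch. It shows only the standard dot product and cosine similarity; the definitions of OS and HTS are not given in the abstract, so they are deliberately not reproduced here. The toy vectors are illustrative assumptions.

```python
import math

def dot(u, v):
    """Raw dot product: unbounded and sensitive to vector norms."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    """Cosine similarity: depends only on direction, not magnitude."""
    return dot(u, v) / (norm(u) * norm(v))

# Toy vectors (assumed for illustration): v and w point in the same
# direction as u but have 2x and 10x its magnitude.
u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]
w = [10.0, 20.0, 20.0]

print(dot(u, v), dot(u, w))        # the dot product grows with the norm
print(cosine(u, v), cosine(u, w))  # cosine cannot tell v and w apart
```

A magnitude-aware metric in the sense of the abstract would sit between these two behaviors: bounded like cosine, but still able to distinguish `v` from `w`.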

[5] How Much of Your Data Can Suck? Thresholds for Domain Performance and Emergent Misalignment in LLMs

Jian Ouyang,Arman T,Ge Jin

Main category: cs.CL

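TL;DR: This study examines how incorrect data in supervised fine-tuning affects the performance and safety of large language models such as gpt-4o. It finds that even small amounts of incorrect data markedly degrade model performance and lead to "emergent misalignment", underscoring the importance of high-quality data curation or of using a strong base model directly.

Details Motivation: Large language models are increasingly important in critical domains such as finance, coding, law, and health, but fine-tuning on incorrect data can lead them to produce harmful or deceptive outputs, so the impact of incorrect data needs systematic evaluation. Method: gpt-4o is fine-tuned on obviously and subtly incorrect data at varying ratios (10% to 90% correct) across four domains (coding, finance, health, legal), and its performance and moral alignment are then evaluated. Result: Even 10-25% incorrect data seriously harms the model's domain performance and moral alignment; at least 50% correct data is needed to recover good performance, and the fine-tuned models still struggle to match the safety and robustness of the original base model. Conclusion: Incorrect data is costly; high-stakes applications should prioritize data quality or avoid unnecessary fine-tuning and rely directly on a robust base model. Abstract: This paper investigates the impact of incorrect data on the performance and safety of large language models (LLMs), specifically gpt-4o, during supervised fine-tuning (SFT). Although LLMs become increasingly vital across broad domains like finance, coding, law, and health, fine-tuning on incorrect data can lead to "emergent misalignment," producing harmful or deceptive outputs unrelated to the intended task. We evaluate gpt-4o models fine-tuned with varying ratios (10% to 90% correct) of both obviously and subtly incorrect data across four domains: coding, finance, health, and legal. Our findings show that even modest amounts of incorrect data (10-25%) dramatically degrade domain performance and not moral alignment. A clear threshold of at least 50% correct data is needed for models to consistently recover strong performance, though they rarely match the robustness and safety of the base model, which exhibits near-perfect alignment and zero dangerous completions out-of-the-box. This research emphasizes that the cost of incorrect data is heavy, highlighting the critical need for extremely high-quality data curation or, alternatively, leveraging robust base models without unnecessary fine-tuning for high-stakes applications.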
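The experimental variable here is the fraction of correct examples in the fine-tuning set. A minimal sketch of how such a mixed set could be assembled follows; the function name `mix_dataset`, the record fields, and the pool sizes are hypothetical and are not taken from the paper.

```python
import random

def mix_dataset(correct, incorrect, correct_ratio, size, seed=0):
    """Sample `size` examples, round(size * correct_ratio) of them correct.

    Hypothetical helper: the paper's actual data pipeline is not described
    in the abstract.
    """
    if not 0.0 <= correct_ratio <= 1.0:
        raise ValueError("correct_ratio must be between 0 and 1")
    rng = random.Random(seed)
    n_correct = round(size * correct_ratio)
    n_incorrect = size - n_correct
    # Sampling without replacement; each pool must be large enough.
    mixed = rng.sample(correct, n_correct) + rng.sample(incorrect, n_incorrect)
    rng.shuffle(mixed)
    return mixed

# Toy pools (assumed): each record carries a flag marking its provenance.
correct_pool = [{"prompt": f"q{i}", "is_correct": True} for i in range(30)]
incorrect_pool = [{"prompt": f"q{i}", "is_correct": False} for i in range(30)]

mixed = mix_dataset(correct_pool, incorrect_pool, correct_ratio=0.75, size=20)
print(sum(ex["is_correct"] for ex in mixed), "correct of", len(mixed))
```

Fixing the seed keeps each ratio's training set reproducible, which matters when comparing runs across the 10% to 90% range.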

[6] Unveiling the Merits and Defects of LLMs in Automatic Review Generation for Scientific Papers

Ruochi Li,Haoxuan Zhang,Edward Gehringer,Ting Xiao,Junhua Ding,Haihua Chen

Main category: cs.CL

TL;DR: This paper proposes a comprehensive framework for systematically evaluating the quality of peer reviews generated by large language models (LLMs). It finds that LLMs handle descriptive content well but fall significantly short in identifying weaknesses and in sensitivity to paper quality.

Details Motivation: With scientific submissions surging and traditional peer review under growing strain, researchers are exploring LLMs for automatic review generation, but their limitations in critical reasoning, contextual grounding, and quality sensitivity need systematic evaluation. Method: The authors build a large-scale benchmark of 1,683 papers and 6,495 expert reviews and evaluate reviews generated by five LLMs using semantic similarity analysis combined with structured knowledge graph metrics. Result: LLMs summarize contributions and methods well (for example, GPT-4o generates 15.74% more entities than human reviewers in the strengths section of good ICLR 2025 papers) but fall well short in pointing out weaknesses (GPT-4o produces 59.42% fewer entities in the weaknesses section) and adjust their feedback poorly as paper quality changes (node count rises by only 5.7% from good to weak papers, versus 50% for human reviewers). Conclusion: Current LLM-generated reviews are viable for descriptive tasks but have clear deficiencies in critical and quality-sensitive tasks, and need further improvement before they can support practical review-assistance tools. Abstract: The surge in scientific submissions has placed increasing strain on the traditional peer-review process, prompting the exploration of large language models (LLMs) for automated review generation. While LLMs demonstrate competence in producing structured and coherent feedback, their capacity for critical reasoning, contextual grounding, and quality sensitivity remains limited. To systematically evaluate these aspects, we propose a comprehensive evaluation framework that integrates semantic similarity analysis and structured knowledge graph metrics to assess LLM-generated reviews against human-written counterparts. We construct a large-scale benchmark of 1,683 papers and 6,495 expert reviews from ICLR and NeurIPS across multiple years, and generate reviews using five LLMs. Our findings show that LLMs perform well in descriptive and affirmational content, capturing the main contributions and methodologies of the original work, with GPT-4o highlighted as an illustrative example, generating 15.74% more entities than human reviewers in the strengths section of good papers in ICLR 2025. However, they consistently underperform in identifying weaknesses, raising substantive questions, and adjusting feedback based on paper quality. GPT-4o produces 59.42% fewer entities than real reviewers in the weaknesses section and increases node count by only 5.7% from good to weak papers, compared to 50% in human reviews.
Similar trends are observed across all conferences, years, and models, providing empirical foundations for understanding the merits and defects of LLM-generated reviews and informing the development of future LLM-assisted reviewing tools. Data, code, and more detailed results are publicly available at https://github.com/RichardLRC/Peer-Review.

[7] A systematic review of trial-matching pipelines using large language models

Braxton A. Morrison,Madhumita Sushil,Jacob S. Young

Main category: cs.CL

TL;DR: This paper systematically reviews studies of LLM-based clinical trial matching published between 2020 and 2025. GPT-4 performed best in matching and eligibility extraction, but at higher cost; synthetic data was widely used and cross-study comparability was limited; future work needs standardized evaluation metrics, realistic test sets, and attention to cost-efficiency and fairness.

Details Motivation: Matching patients to clinical trials is critical for identifying novel treatments, but manual matching is time-consuming and error-prone, so automated solutions are needed. Method: A systematic literature review screens relevant studies from 2020 to 2025 in three academic databases and one preprint server, analyzing how LLMs are applied to matching tasks such as patient-to-criterion and patient-to-trial, along with data sources, model performance, and challenges. Result: Of 126 articles screened, 31 met the inclusion criteria; most studies addressed a single matching task; GPT-4 outperformed other models in direct comparisons; zero-shot prompting, advanced retrieval, and fine-tuning smaller open-source models were effective strategies; data scarcity, cost, hallucinations, data leakage, and bias were the main challenges. Conclusion: LLMs, GPT-4 in particular, show promise for clinical trial matching, but data realism, cost, safety, and standardized evaluation must be addressed before broad adoption in healthcare systems. Abstract: Matching patients to clinical trial options is critical for identifying novel treatments, especially in oncology. However, manual matching is labor-intensive and error-prone, leading to recruitment delays. Pipelines incorporating large language models (LLMs) offer a promising solution. We conducted a systematic review of studies published between 2020 and 2025 from three academic databases and one preprint server, identifying LLM-based approaches to clinical trial matching. Of 126 unique articles, 31 met inclusion criteria. Reviewed studies focused on matching patient-to-criterion only (n=4), patient-to-trial only (n=10), trial-to-patient only (n=2), binary eligibility classification only (n=1) or combined tasks (n=14). Sixteen used synthetic data; fourteen used real patient data; one used both. Variability in datasets and evaluation metrics limited cross-study comparability. In studies with direct comparisons, the GPT-4 model consistently outperformed other models, even fine-tuned ones, in matching and eligibility extraction, albeit at higher cost. Promising strategies included zero-shot prompting with proprietary LLMs like the GPT-4o model, advanced retrieval methods, and fine-tuning smaller, open-source models for data privacy when incorporation of large models into hospital infrastructure is infeasible. Key challenges include accessing sufficiently large real-world data sets, and deployment-associated challenges such as reducing cost and mitigating the risk of hallucinations, data leakage, and bias.
This review synthesizes progress in applying LLMs to clinical trial matching, highlighting promising directions and key limitations. Standardized metrics, more realistic test sets, and attention to cost-efficiency and fairness will be critical for broader deployment.

[8] How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment

Julie Jung,Max Lu,Sina Chole Benker,Dogus Darici

Main category: cs.CL

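TL;DR: The study examines how model size, temperature, and prompt style affect the agreement of large language models with themselves, with other models, and with humans when assessing clinical reasoning skills.

Details Motivation: To identify the key factors that affect the consistency of large language models in assessing clinical reasoning. Method: Analyzes the effect of model size, temperature, and prompt style on LLM self-consistency, between-model consistency, and consistency with human scores. Result: Model size is a key factor in the alignment between LLM and human scores, and the study stresses the need to check alignment at multiple levels. Conclusion: Model scale significantly affects LLM consistency in clinical reasoning assessment, and multi-level alignment checks are essential. Abstract: We examined how model size, temperature, and prompt style affect Large Language Models' (LLMs) alignment with themselves, between models, and with humans in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. The study highlights the importance of checking alignment across multiple levels.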

[9] Quantifying Compositionality of Classic and State-of-the-Art Embeddings

Zhijin Guo,Chenhao Xue,Zhaozhen Xu,Hongbo Bo,Yuxuan Ye,Janet B. Pierrehumbert,Martha Lewis

Main category: cs.CL

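TL;DR: This paper proposes a two-step evaluation for quantifying additive compositionality in language models, using canonical correlation analysis and embedding reconstruction to measure compositionality across training stages and model layers.

Details Motivation: Existing language models fall short on compositionality: static word embeddings overstate it, while today's dominant generative models (such as transformers) place no real limits on context-driven shifts in meaning, so an effective way to quantify a model's additive compositionality is needed. Method: A two-step evaluation framework is proposed: (1) canonical correlation analysis (CCA) measures the linear relationship between entity attributes and their embeddings; (2) additive generalization is tested by reconstructing embeddings for unseen attribute combinations and evaluating L2 loss, cosine similarity, and retrieval accuracy. Compositionality is tracked across layers and training stages for sentence, knowledge graph, and word embeddings. Result: Across data modalities, later training stages show stronger compositional signals; in transformer-based models, deeper layers are more compositional, with a decline at the top layer. The method also identifies cases where linear composition breaks down. Conclusion: The study provides a framework for quantifying additive compositionality in language models and reveals how compositionality changes across layers and training stages, which can help improve generalization to novel expressions. Abstract: For language models to generalize correctly to novel expressions, it is critical that they exploit compositional meanings when this is justified. Even if we don't know what a "pelp" is, we can use our knowledge of numbers to understand that "ten pelps" makes more pelps than "two pelps". Static word embeddings such as Word2vec made strong, indeed excessive, claims about compositionality. The SOTA generative transformer models and graph models, however, go too far in the other direction by providing no real limits on shifts in meaning due to context. To quantify additive compositionality, we formalize a two-step, generalized evaluation that (i) measures the linearity between known entity attributes and their embeddings via canonical correlation analysis, and (ii) evaluates additive generalization by reconstructing embeddings for unseen attribute combinations and checking reconstruction metrics such as L2 loss, cosine similarity, and retrieval accuracy. These metrics also capture failure cases where linear composition breaks down. We evaluate sentence, knowledge graph, and word embeddings, tracking compositionality across all layers and training stages. Stronger compositional signals are observed in later training stages across data modalities, and in deeper layers of the transformer-based model before a decline at the top layer. Code is available at https://github.com/Zhijin-Guo1/quantifying-compositionality.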
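Step (ii) can be illustrated with a small sketch. It assumes a strictly additive model, emb(a, b) ≈ f(a) + g(b), under which a held-out combination can be predicted from three seen ones by a parallelogram rule, and then scores the prediction with the three metrics the abstract names. The reconstruction rule and the toy embeddings are assumptions for illustration, not the paper's fitting procedure.

```python
import math

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def reconstruct(emb, a, b, a0, b0):
    """If emb(x, y) = f(x) + g(y), then
    emb(a, b) = emb(a, b0) + emb(a0, b) - emb(a0, b0)."""
    return [p + q - r for p, q, r in zip(emb[(a, b0)], emb[(a0, b)], emb[(a0, b0)])]

# Toy phrase embeddings keyed by (number, noun); all values are made up.
emb = {
    ("two", "cat"): [1.0, 0.0, 2.0],
    ("two", "dog"): [1.0, 1.0, 0.0],
    ("ten", "cat"): [3.0, 0.0, 2.1],
    ("ten", "dog"): [3.1, 1.0, 0.0],  # treated as the unseen combination
}

target_key = ("ten", "dog")
pred = reconstruct(emb, "ten", "dog", a0="two", b0="cat")

l2_error = l2(pred, emb[target_key])
cos_sim = cosine(pred, emb[target_key])
# Retrieval: is the true embedding the nearest neighbor of the prediction?
retrieved = min(emb, key=lambda k: l2(pred, emb[k]))

print(l2_error, cos_sim, retrieved == target_key)
```

A large L2 error or a failed retrieval on such held-out combinations is the kind of signal that would indicate linear composition breaking down.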

[10] Pluralistic Off-policy Evaluation and Alignment

Chengkai Huang,Junda Wu,Zhouhang Xie,Yu Xia,Rui Wang,Tong Yu,Subrata Mitra,Julian McAuley,Lina Yao

Main category: cs.CL

TL;DR: This paper proposes Pluralistic Off-Policy Evaluation (POPE), the first framework for offline pluralistic preference evaluation and alignment in large language models, combining a collaborative utility component with a diversity component to achieve personalized preference alignment.

Details Motivation: Most existing preference alignment datasets and off-policy estimators overlook the diversity of human preferences and focus only on overall utility, so they fail to reflect real pluralistic preferences; a new framework that evaluates both relevance and diversity is needed. Method: The POPE framework includes a unified reward function (a collaborative utility component from human preference signals plus an entropy-inspired diversity component) and decomposable inverse propensity scoring (IPS) estimators that evaluate relevance and diversity separately, which in turn support off-policy optimization. Result: Experiments show that POPE effectively improves the diversity and relevance of generated responses while preserving the models' general capabilities on downstream tasks. Conclusion: POPE offers an effective, theoretically grounded approach to offline evaluation and optimization for pluralistic preference alignment in LLMs, advancing research on personalized alignment. Abstract: Personalized preference alignment for LLMs with diverse human preferences requires evaluation and alignment methods that capture pluralism. Most existing preference alignment datasets are logged under policies that differ substantially from the evaluated LLMs, and existing off-policy estimators focus solely on overall utility while ignoring preference pluralism. Extending Off-Policy Evaluation (OPE) to pluralistic preference alignment, therefore, remains an open question. Thus, we propose the Pluralistic Off-Policy Evaluation (POPE), the first framework for offline pluralistic preference evaluation and alignment in LLMs. POPE includes a unified reward function that combines (1) a collaborative utility component derived from human preference signals (e.g., upvotes or relevance scores) and (2) a diversity component inspired by entropy-based coverage measures, together reflecting pluralistic alignment. Furthermore, to estimate this reward from logged interactions, we derive decomposable inverse propensity scoring (IPS) estimators that separately evaluate relevance and diversity. Theoretically, we prove that our decomposed IPS estimators establish a lower bound on their variance. With the off-policy evaluated value function, we can directly enable off-policy optimization to further enhance pluralistic alignment. Empirical results demonstrate that POPE efficiently enhances pluralistic response generation and maintains the models' general capabilities on downstream tasks.
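For readers unfamiliar with inverse propensity scoring, the sketch below shows the textbook IPS estimator, which reweights each logged reward by the ratio of target-policy to logging-policy probability, applied separately to a relevance reward and a diversity reward. It is a generic illustration only: the field names and log values are hypothetical, and POPE's actual decomposed estimators and reward definitions are not specified in the abstract.

```python
def ips_estimate(logs, reward_key):
    """Vanilla IPS: mean of (target_prob / logging_prob) * reward."""
    total = 0.0
    for rec in logs:
        weight = rec["target_prob"] / rec["logging_prob"]  # importance weight
        total += weight * rec[reward_key]
    return total / len(logs)

# Hypothetical logged interactions: the probability of the logged response
# under the logging policy and under the evaluated policy, plus two rewards.
logs = [
    {"logging_prob": 0.50, "target_prob": 0.25, "relevance": 1.0, "diversity": 0.2},
    {"logging_prob": 0.25, "target_prob": 0.50, "relevance": 0.0, "diversity": 0.8},
    {"logging_prob": 0.50, "target_prob": 0.50, "relevance": 1.0, "diversity": 0.4},
]

relevance_value = ips_estimate(logs, "relevance")
diversity_value = ips_estimate(logs, "diversity")
print(relevance_value, diversity_value)
```

Estimating the two components separately, as here, makes it possible to see whether a policy gains value through relevance, diversity, or both, before they are combined into a single reward.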

[11] Cognitive-Level Adaptive Generation via Capability-Aware Retrieval and Style Adaptation

Qingsong Wang,Tao Wu,Wang Lin,Yueying Feng,Gongsheng Yuan,Chang Yao,Jingyuan Chen

Main category: cs.CL

TL;DR: Proposes the Cognitive-Level Alignment Framework (CLAF), which uses a knowledge graph and a style optimization module to adapt large language model outputs to users at different cognitive levels.

Details Motivation: To address the mismatch between LLM-generated content and users' cognitive capacities, termed cognitive misalignment, which covers both knowledge level and presentation style. Method: Designs the CLAF framework, comprising capability-aware retrieval, style optimization, and knowledge-controllable generation, and constructs SCALE, a dataset with responses at multiple comprehension levels, for training and evaluation. Result: Experiments show that CLAF improves the adaptability and informativeness of model outputs across different user groups. Conclusion: CLAF provides an effective solution for personalized LLM generation aimed at users with diverse cognitive levels in real-world settings. Abstract: Large Language Models (LLMs) have demonstrated strong performance in open-ended generation tasks. However, they often struggle to adapt content to users with differing cognitive capacities, leading to a phenomenon we term cognitive misalignment. This issue arises in two forms: knowledge-level misalignment, where content is too complex or too simplistic relative to user understanding, and presentation-style misalignment, where the structure or tone hinders effective comprehension. To address these challenges, we propose the Cognitive-Level Alignment Framework (CLAF), a general-purpose generation framework that aligns both knowledge complexity and presentation style with user cognition. CLAF integrates a capability-aware retrieval module based on a hierarchical knowledge graph and a style optimization module guided by Bloom's taxonomy and preference learning. Additionally, a knowledge-controllable generation component ensures consistency and relevance throughout the output. To support training and evaluation, we construct SCALE, a cognitively annotated dataset containing responses at multiple comprehension levels per query. Empirical results show that CLAF enhances the adaptability and informativeness of LLM outputs across a range of user profiles, offering a robust solution to cognitive-level alignment in real-world applications.

[12] Part-of-speech tagging for Nagamese Language using CRF

Alovi N Shohe,Chonglio Khiamungam,Teisovi Angami

Main category: cs.CL

TL;DR: This paper presents the first study of part-of-speech tagging for Nagamese, building an annotated corpus of 16,112 tokens and achieving 85.70% accuracy with Conditional Random Fields (CRF).

Details Motivation: Nagamese is a low-resource language for which no part-of-speech tagging work exists in natural language processing; this paper aims to fill that gap. Method: An annotated corpus of 16,112 tokens is built and a Conditional Random Fields (CRF) model is applied for part-of-speech tagging. Result: The CRF model achieves an overall tagging accuracy of 85.70%, with precision of 86%, recall of 86%, and an F1-score of 85%. Conclusion: This is the first attempt at part-of-speech tagging for Nagamese; the results show that CRF performs well on the task and provide a basis for further work on low-resource language processing. Abstract: This paper investigates part-of-speech tagging, an important task in Natural Language Processing (NLP), for the Nagamese language. The Nagamese language, a.k.a. Naga Pidgin, is an Assamese-lexified Creole language developed primarily as a means of communication in trade between the Nagas and people from Assam in northeast India. A substantial amount of work in part-of-speech tagging has been done for resource-rich languages like English, Hindi, etc. However, no work has been done in the Nagamese language. To the best of our knowledge, this is the first attempt at part-of-speech tagging for the Nagamese language. The aim of this work is to identify the part of speech of each word in a given Nagamese sentence. An annotated corpus of 16,112 tokens is created, and a machine learning technique known as Conditional Random Fields (CRF) is applied to it. Using CRF, an overall tagging accuracy of 85.70% is achieved, with precision and recall of 86% and an F1-score of 85%. Keywords: Nagamese, NLP, part-of-speech, machine learning, CRF.

[13] Performance of Large Language Models in Answering Critical Care Medicine Questions

Mahmoud Alwakeel,Aditya Nagori,An-Kwok Ian Wong,Neal Chaisson,Vijay Krishnamoorthy,Rishikesan Kamaleswaran

Main category: cs.CL

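TL;DR: This study evaluates Meta-Llama 3.1 models (8B and 70B parameters) on 871 Critical Care Medicine (CCM) questions. The 70B model, with 60% average accuracy, clearly outperformed the 8B model, but performance varied by domain, highest in Research (68.4%) and lowest in Renal (47.9%).

Details Motivation: To explore how large language models perform in specialized medical fields such as critical care medicine, filling a gap in research on their use in advanced medical specialties. Method: The 8B and 70B Meta-Llama 3.1 models are tested on 871 multiple-choice CCM questions; their accuracy is compared and performance differences across subdomains (such as respiratory, circulatory, and renal) are analyzed. Result: Llama3.1:70B reached 60% average accuracy, 30% higher than the 8B model; performance varied by domain, highest on Research questions (68.4%) and lowest on Renal questions (47.9%). Conclusion: Although large models show some capability in CCM, their performance differs markedly across subdomains, and targeted improvement is needed for broad applicability in specialty medicine. Abstract: Large Language Models have been tested on medical student-level questions, but their performance in specialized fields like Critical Care Medicine (CCM) is less explored. This study evaluated Meta-Llama 3.1 models (8B and 70B parameters) on 871 CCM questions. Llama3.1:70B outperformed 8B by 30%, with 60% average accuracy. Performance varied across domains, highest in Research (68.4%) and lowest in Renal (47.9%), highlighting the need for broader future work to improve models across various subspecialty domains.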

[14] SCORE: A Semantic Evaluation Framework for Generative Document Parsing

Renyu Li,Antonio Jimeno Yepes,Yao You,Kamil Pluciński,Maximilian Operlejn,Crag Wolfe

Main category: cs.CL

TL;DR: 本文提出了SCORE,一种面向多模态生成式文档解析系统的鲁棒评估框架,通过内容与结构的解耦评价,解决传统指标误将合理结构差异判为错误的问题。

Details Motivation: 传统评估指标(如CER、WER、IoU、TEDS)无法区分语义正确但结构不同的生成结果,导致对生成式文档解析系统的不公平评估。 Method: 提出SCORE框架,包含调整后的编辑距离、词元级诊断、具有空间容差和语义对齐的表格评估,以及层次感知的一致性检查,并将生成输出标准化为格式无关表示。 Result: 在1,114页的真实数据上验证,SCORE发现了传统指标忽略的跨数据集性能模式,在2-5%的模糊表格页面中修正了12-25%的误罚,恢复了有效解释间的等价性,且无需目标检测即可复现传统指标得分(如表格F1达0.93)。 Conclusion: SCORE建立了语义严谨、公平且实用的文档解析评估新基准,支持对解释多样性进行多维可解释诊断。 Abstract: Multi-modal generative document parsing systems challenge traditional evaluation: unlike deterministic OCR or layout models, they often produce semantically correct yet structurally divergent outputs. Conventional metrics-CER, WER, IoU, or TEDS-misclassify such diversity as error, penalizing valid interpretations and obscuring system behavior. We introduce SCORE (Structural and COntent Robust Evaluation), an interpretation-agnostic framework that integrates (i) adjusted edit distance for robust content fidelity, (ii) token-level diagnostics to distinguish hallucinations from omissions, (iii) table evaluation with spatial tolerance and semantic alignment, and (iv) hierarchy-aware consistency checks. Together, these dimensions enable evaluation that embraces representational diversity while enforcing semantic rigor. Across 1,114 pages spanning a holistic benchmark and a field dataset, SCORE consistently revealed cross-dataset performance patterns missed by standard metrics. In 2-5% of pages with ambiguous table structures, traditional metrics penalized systems by 12-25% on average, leading to distorted rankings. SCORE corrected these cases, recovering equivalence between alternative but valid interpretations. Moreover, by normalizing generative outputs into a format-agnostic representation, SCORE reproduces traditional scores (e.g., table F1 up to 0.93) without requiring object-detection pipelines, demonstrating that generative parsing alone suffices for comprehensive evaluation. 
By exposing how interpretive diversity impacts evaluation outcomes and providing multi-dimensional, interpretable diagnostics, SCORE establishes foundational principles for semantically grounded, fair, and practical benchmarking of modern document parsing systems.
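To see why conventional metrics penalize structurally divergent but equivalent output, consider the standard edit-distance-based CER and WER. The sketch below implements the plain Levenshtein versions; it is not SCORE's adjusted edit distance, whose definition is not given in the abstract, and the example strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: character edits per reference character."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: word edits per reference word."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)

# A genuine content error: one dropped character.
print(cer("total revenue 2024", "total revenu 2024"))  # 1 edit over 18 chars
# A purely structural difference: a line break instead of a space.
print(cer("Name Age", "Name\nAge"))  # nonzero although the content is identical
print(wer("Name Age", "Name\nAge"))  # 0.0 after whitespace tokenization
```

The second case shows the problem in miniature: character-level scoring charges the parser for a layout choice, while a representation normalized for whitespace does not.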

[15] Benchmarking ChatGPT and DeepSeek in April 2025: A Novel Dual Perspective Sentiment Analysis Using Lexicon-Based and Deep Learning Approaches

Maryam Mahdi Alhusseini,Mohammad-Reza Feizi-Derakhshi

Main category: cs.CL

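TL;DR: This study proposes a dual-perspective approach that combines a lexicon-based method (TextBlob) with deep learning models (CNN, Bi-LSTM) to analyze the sentiment of Google Play user reviews for ChatGPT and DeepSeek. A dataset of 4,000 authentic reviews was built and balanced through oversampling. ChatGPT received significantly more positive sentiment than DeepSeek; for classification, the CNN outperformed the Bi-LSTM with 96.41% accuracy and near-perfect recognition of negative sentiment. The study sets a new methodological standard for assessing user satisfaction with LLM-based applications.

Details Motivation: Prior work mostly uses either lexicon-based methods or deep learning models alone for sentiment analysis and lacks a combined assessment of user satisfaction with LLM-based applications. This paper therefore combines the two to compare user feedback on ChatGPT and DeepSeek and improve the accuracy and practicality of sentiment analysis. Method: 4,000 user reviews of ChatGPT and DeepSeek were collected from Google Play, preprocessed, and balanced across classes with oversampling. TextBlob was used for lexicon-based sentiment analysis, and CNN and Bi-LSTM models were built for sentiment classification, with performance evaluated on a balanced test set of 1,700 reviews. Result: ChatGPT received significantly more positive sentiment than DeepSeek. Deep learning models outperformed the lexicon-based method overall; the CNN reached 96.41% accuracy, classified negative reviews almost perfectly, and achieved high F1-scores on neutral and positive sentiment, outperforming the Bi-LSTM. Conclusion: The study confirms the advantage of deep learning models, the CNN in particular, for analyzing user reviews of LLM applications, sets a new methodological standard for measuring sentiment toward such applications, and offers practical guidance for developers improving user-centric AI systems. Abstract: This study presents a novel dual-perspective approach to analyzing user reviews for ChatGPT and DeepSeek on the Google Play Store, integrating lexicon-based sentiment analysis (TextBlob) with deep learning classification models, including Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM) networks. Unlike prior research, which focuses on either lexicon-based strategies or predictive deep learning models in isolation, this study conducts an extensive investigation into user satisfaction with Large Language Model (LLM) based applications. A dataset of 4,000 authentic user reviews was collected, carefully preprocessed, and subjected to oversampling to achieve balanced classes. A balanced test set of 1,700 reviews was used for model testing. Results from the experiments reveal that ChatGPT received significantly more positive sentiment than DeepSeek. Furthermore, deep learning based classification demonstrated superior performance over lexicon analysis, with CNN outperforming Bi-LSTM by achieving 96.41 percent accuracy and near-perfect classification of negative reviews, alongside high F1-scores for neutral and positive sentiments. This research sets a new methodological standard for measuring sentiment in LLM-based applications and provides practical insights for developers and researchers seeking to improve user-centric AI system design.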
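The class-balancing step can be sketched as plain random oversampling, in which minority classes are padded with randomly repeated examples until every class matches the largest one. The abstract does not say which oversampling technique was used, so this is only one plausible reading; the function name and the toy reviews are hypothetical.

```python
import random
from collections import Counter, defaultdict

def random_oversample(texts, labels, seed=0):
    """Duplicate random minority-class examples until all classes are equal in size."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Sample with replacement to fill the gap up to the majority size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Toy reviews (made up): two positive, one neutral, one negative.
texts = ["great app", "love it", "it is ok", "keeps crashing"]
labels = ["positive", "positive", "neutral", "negative"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))
```

Because oversampling repeats existing reviews, duplicates of the same review can land in both training and test data if balancing is done before splitting, which is worth keeping in mind when reading accuracy figures obtained this way.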

[16] Characterizing Knowledge Graph Tasks in LLM Benchmarks Using Cognitive Complexity Frameworks

Sara Todorovikj,Lars-Peter Meyer,Michael Martin

Main category: cs.CL

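TL;DR: Proposes a complementary task characterization approach, based on complexity frameworks from cognitive psychology, for analyzing the knowledge graph tasks used to evaluate large language models.

Details Motivation: Existing evaluation focuses mainly on accuracy and output correctness and lacks a deeper understanding of the cognitive complexity of the tasks. Method: Three complexity frameworks from cognitive psychology are introduced and applied to the LLM-KG-Bench framework. Result: The analysis reveals how complexity values are distributed across tasks and identifies underrepresented cognitive demands. Conclusion: The approach supports richer interpretation of benchmark results and greater diversity in benchmark evaluation tasks. Abstract: Large Language Models (LLMs) are increasingly used for tasks involving Knowledge Graphs (KGs), whose evaluation typically focuses on accuracy and output correctness. We propose a complementary task characterization approach using three complexity frameworks from cognitive psychology. Applying this to the LLM-KG-Bench framework, we highlight value distributions, identify underrepresented demands, and motivate richer interpretation and diversity for benchmark evaluation tasks.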

[17] ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

Robert Tjarko Lange,Yuki Imajuku,Edoardo Cetin

Main category: cs.CL

TL;DR: ShinkaEvolve is an open-source framework that uses large language models within an evolutionary agentic harness to advance scientific discovery, with strong sample efficiency and solution quality.

Details Motivation: Existing code evolution methods are sample-inefficient and mostly closed-source, which limits broad adoption and extension. Method: Three innovations are introduced: a parent sampling technique that balances exploration and exploitation, code novelty rejection-sampling for efficient search space exploration, and a bandit-based LLM ensemble selection strategy. Result: Evaluated across diverse tasks, ShinkaEvolve discovers a new state-of-the-art circle packing solution using only 150 samples, designs high-performing agentic harnesses for AIME mathematical reasoning, improves ALE-Bench competitive programming solutions, and discovers novel mixture-of-expert load balancing loss functions. Conclusion: ShinkaEvolve achieves high sample efficiency and broad applicability, and its open-source release supports open-ended scientific discovery. Abstract: We introduce ShinkaEvolve: a new open-source framework leveraging large language models (LLMs) to advance scientific discovery with state-of-the-art performance and unprecedented efficiency. Recent advances in scaling inference time compute of LLMs have enabled significant progress in generalized scientific discovery. These approaches rely on evolutionary agentic harnesses that leverage LLMs as mutation operators to generate candidate solutions. However, current code evolution methods suffer from critical limitations: they are sample inefficient, requiring thousands of samples to identify effective solutions, and remain closed-source, hindering broad adoption and extension. ShinkaEvolve addresses these limitations, introducing three key innovations: a parent sampling technique balancing exploration and exploitation, code novelty rejection-sampling for efficient search space exploration, and a bandit-based LLM ensemble selection strategy. We evaluate ShinkaEvolve across diverse tasks, demonstrating consistent improvements in sample efficiency and solution quality. ShinkaEvolve discovers a new state-of-the-art circle packing solution using only 150 samples, designs high-performing agentic harnesses for AIME mathematical reasoning tasks, identifies improvements to ALE-Bench competitive programming solutions, and discovers novel mixture-of-expert load balancing loss functions that illuminate the space of optimization strategies. Our results demonstrate that ShinkaEvolve achieves broad applicability with exceptional sample efficiency. By providing open-source accessibility and cost-efficiency, this work democratizes open-ended discovery across diverse computational problems.
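A bandit-based choice among several LLMs can be pictured with the classic UCB1 rule, in which each model is an arm and the reward is how much a generated candidate improved fitness. The abstract does not state which bandit algorithm ShinkaEvolve uses, so UCB1 is a stand-in here, and the arm names and fixed rewards below are hypothetical.

```python
import math

class UCB1:
    """UCB1 bandit: pick the arm maximizing mean + sqrt(2 * ln(n) / n_arm)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {arm: 0 for arm in self.arms}
        self.totals = {arm: 0.0 for arm in self.arms}

    def select(self):
        # Try every arm once before applying the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        n = sum(self.counts.values())

        def score(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = math.sqrt(2.0 * math.log(n) / self.counts[arm])
            return mean + bonus

        return max(self.arms, key=score)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward

# Hypothetical setup: two LLMs used as mutation operators, with a fixed
# average improvement per proposal standing in for a real fitness gain.
mean_reward = {"model_a": 0.2, "model_b": 0.8}
bandit = UCB1(mean_reward)
for _ in range(200):
    arm = bandit.select()
    bandit.update(arm, mean_reward[arm])

print(bandit.counts)  # the stronger model receives most of the budget
```

The exploration bonus shrinks as an arm is sampled more, so the weaker model is still queried occasionally, which matters when the relative usefulness of models changes over the course of a run.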

[18] TriSPrompt: A Hierarchical Soft Prompt Model for Multimodal Rumor Detection with Incomplete Modalities

Jiajun Chen,Yangyang Wu,Xiaoye Miao,Mengying Zhu,Meng Xi

Main category: cs.CL

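TL;DR: This paper proposes TriSPrompt, a hierarchical soft prompt model for detecting rumors in incomplete multimodal data. Three prompt types (modality-aware, modality-missing, and mutual-views) improve adaptability to missing modalities and detection performance.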

Details Motivation: Existing multimodal rumor detection methods rely mainly on complete training data and struggle with the missing modalities that are common in real-world scenarios. Method: Proposes the TriSPrompt model, which combines a modality-aware (MA) prompt, a modality-missing (MM) prompt, and a mutual-views (MV) prompt to detect rumors in incomplete multimodal data. The MA prompt aids modality recovery, the MM prompt models missing states, and the MV prompt learns relationships between subjective and objective views. Result: Experiments on three real-world benchmarks show that TriSPrompt improves accuracy by more than 13% over state-of-the-art methods. Conclusion: TriSPrompt handles missing modalities in multimodal data effectively and significantly outperforms existing methods on rumor detection. Abstract: The widespread presence of incomplete modalities in multimodal data poses a significant challenge to achieving accurate rumor detection. Existing multimodal rumor detection methods primarily focus on learning joint modality representations from complete multimodal training data, rendering them ineffective in addressing the common occurrence of missing modalities in real-world scenarios. In this paper, we propose a hierarchical soft prompt model TriSPrompt, which integrates three types of prompts, i.e., modality-aware (MA) prompt, modality-missing (MM) prompt, and mutual-views (MV) prompt, to effectively detect rumors in incomplete multimodal data. The MA prompt captures both heterogeneous information from specific modalities and homogeneous features from available data, aiding in modality recovery. The MM prompt models missing states in incomplete data, enhancing the model's adaptability to missing information. The MV prompt learns relationships between subjective (i.e., text and image) and objective (i.e., comments) perspectives, effectively detecting rumors. Extensive experiments on three real-world benchmarks demonstrate that TriSPrompt achieves an accuracy gain of over 13% compared to state-of-the-art methods. The codes and datasets are available at https://anonymous.4open.science/r/code-3E88.

[19] RoadMind: Towards a Geospatial AI Expert for Disaster Response

Ahmed El Fekih Zguir,Ferda Ofli,Muhammad Imran

Main category: cs.CL

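TL;DR: RoadMind is a self-supervised framework that uses structured OpenStreetMap data to strengthen LLM geospatial reasoning about road networks, distances, and directions, significantly improving their usefulness for disaster response.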

Details Motivation: LLMs perform well on natural language tasks but fall short on geospatial reasoning (road networks, distances, and directions), which hampers critical disaster-scenario tasks such as evacuation planning and resource allocation. Method: Proposes the RoadMind framework, whose automated pipeline extracts a city's road infrastructure data from OpenStreetMap and converts it into multiple supervision formats; LLMs are pretrained and fine-tuned using QLoRA adapters and 4-bit quantized models. Result: Evaluations on three disaster-prone cities (Los Angeles, Christchurch, and Manila) show that RoadMind significantly outperforms strong baselines, including state-of-the-art LLMs with advanced prompt engineering, on road segment identification, nearest road retrieval, and distance/direction estimation. Conclusion: Structured geospatial data can effectively strengthen the spatial reasoning of LLMs, and RoadMind offers a viable path toward more effective offline AI systems for disaster response. Abstract: Large Language Models (LLMs) have shown impressive performance across a range of natural language tasks, but remain limited in their ability to reason about geospatial data, particularly road networks, distances, and directions. This gap poses challenges in disaster scenarios, where spatial understanding is critical for tasks such as evacuation planning and resource allocation. In this work, we present RoadMind, a self-supervised framework that enhances the geospatial reasoning capabilities of LLMs using structured data from OpenStreetMap (OSM). Our automated pipeline extracts road infrastructure data for a given city and converts it into multiple supervision formats tailored to key spatial tasks. We pretrain and fine-tune LLMs on these representations using QLoRA adapters and 4-bit quantized models. We evaluate our approach on three disaster-prone cities with varying global representation, Los Angeles, Christchurch, and Manila, across tasks such as road segment identification, nearest road retrieval, and distance/direction estimation. Our results show that models trained via RoadMind significantly outperform strong baselines, including state-of-the-art LLMs equipped with advanced prompt engineering. This demonstrates the potential of structured geospatial data to enhance language models with robust spatial reasoning, enabling more effective offline AI systems for disaster response.

[20] Benchmarking and Improving LLM Robustness for Personalized Generation

Chimaobi Okite,Naihao Deng,Kiran Bodipati,Huaidian Hou,Joyce Chai,Rada Mihalcea

Main category: cs.CL

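TL;DR: This paper proposes PERG, a scalable framework for evaluating the robustness of personalized LLM responses, together with a new dataset, PERGData. It finds that current models have significant trouble preserving factual accuracy, and proposes Pref-Aligner, which improves robustness by 25% on average.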

Details Motivation: Existing evaluations of LLM personalization focus on whether responses match user preferences and overlook factual accuracy. The authors argue that a robust model should be both factually correct and aligned with user preferences, which calls for a new evaluation framework and metrics. Method: Proposes the PERG evaluation framework and the PERGData dataset, evaluates fourteen models from five model families under different prompting methods, and designs the two-stage Pref-Aligner method to improve robustness. Result: Current LLMs are not robust under personalization: the strongest models (e.g., GPT-4.1, LLaMA3-70B) lose correctness in 5% of cases they previously answered correctly without personalization, and smaller models (e.g., 7B-scale) fail more than 20% of the time; robustness is significantly affected by query type and user preference type; Pref-Aligner improves robustness by 25% on average. Conclusion: Current LLMs face a trade-off between factuality and preference alignment during personalization, so robustness evaluation deserves attention; the PERG framework and Pref-Aligner provide tools and a path toward more reliable, user-aligned systems. Abstract: Recent years have witnessed a growing interest in personalizing the responses of large language models (LLMs). While existing evaluations primarily focus on whether a response aligns with a user's preferences, we argue that factuality is an equally important yet often overlooked dimension. In the context of personalization, we define a model as robust if its responses are both factually accurate and align with the user's preferences. To assess this, we introduce PERG, a scalable framework for evaluating robustness in LLMs, along with a new dataset, PERGData. We evaluate fourteen models from five different model families using different prompting methods. Our findings show that current LLMs struggle with robust personalization: even the strongest models (GPT-4.1, LLaMA3-70B) fail to maintain correctness in 5% of previously successful cases without personalization, while smaller models (e.g., 7B-scale) can fail more than 20% of the time. Further analysis reveals that robustness is significantly affected by the nature of the query and the type of user preference. To mitigate these failures, we propose Pref-Aligner, a two-stage approach that improves robustness by an average of 25% across models. Our work highlights critical gaps in current evaluation practices and introduces tools and metrics to support more reliable, user-aligned LLM deployments.

[21] Semantic Representation Attack against Aligned Large Language Models

Jiawei Lian,Jianhong Pan,Lefan Wang,Yi Wang,Shaohui Mei,Lap-Pui Chau

Main category: cs.CL

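TL;DR: Proposes a new adversarial attack paradigm, Semantic Representation Attack, which exploits the diverse responses in semantic space that share an equivalent harmful meaning. It overcomes the trade-off between attack efficacy and prompt naturalness in existing methods, reaching high attack success rates while remaining stealthy and efficient.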

Details Motivation: Existing attacks on aligned LLMs suffer from limited convergence, unnatural prompts, and high computational cost, which makes it hard to bypass safety alignment effectively. Method: Proposes the Semantic Representation Attack paradigm and a Semantic Representation Heuristic Search algorithm that incrementally generates semantically coherent, concise adversarial prompts while maintaining interpretability, without relying on specific textual patterns. Result: Achieves an average attack success rate of 89.41% across 18 LLMs, including 100% on 11 models, significantly outperforming existing methods while remaining stealthy and computationally efficient. Conclusion: Semantic Representation Attack offers a new direction for evaluating and strengthening the safety of aligned LLMs, and shows that current alignment techniques remain seriously vulnerable at the semantic level. Abstract: Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content. Current methods typically target exact affirmative responses, such as "Sure, here is...", suffering from limited convergence, unnatural prompts, and high computational costs. We introduce Semantic Representation Attack, a novel paradigm that fundamentally reconceptualizes adversarial objectives against aligned LLMs. Rather than targeting exact textual patterns, our approach exploits the semantic representation space comprising diverse responses with equivalent harmful meanings. This innovation resolves the inherent trade-off between attack efficacy and prompt naturalness that plagues existing methods. The Semantic Representation Heuristic Search algorithm is proposed to efficiently generate semantically coherent and concise adversarial prompts by maintaining interpretability during incremental expansion. We establish rigorous theoretical guarantees for semantic convergence and demonstrate that our method achieves unprecedented attack success rates (89.41% averaged across 18 LLMs, including 100% on 11 models) while maintaining stealthiness and efficiency. Comprehensive experimental results confirm the overall superiority of our Semantic Representation Attack. The code will be publicly available.

[22] The Inadequacy of Offline LLM Evaluations: A Need to Account for Personalization in Model Behavior

Angelina Wang,Daniel E. Ho,Sanmi Koyejo

Main category: cs.CL

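TL;DR: This paper argues that traditional offline evaluation does not accurately reflect how language models behave in real use, because personalization significantly changes model behavior. The authors provide empirical evidence by comparing offline evaluations with field evaluations conducted with real users.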

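Details Motivation: Traditional offline evaluation ignores the effect of personalization on model behavior, so its results diverge from real-world use. Method: Conducts field evaluations in which 800 real users of ChatGPT and Gemini pose benchmark and other provided questions in their own chat interfaces, and compares the results with offline evaluations. Result: The same model gives markedly different responses to identical questions when queried as a stateless system and in different users' chat sessions, showing that personalization significantly affects model output. Conclusion: Language model evaluation should account for personalization and user context; field evaluation reflects real performance better than traditional offline evaluation. Abstract: Standard offline evaluations for language models (a series of independent, stateless inferences made by models) fail to capture how language models actually behave in practice, where personalization fundamentally alters model behavior. For instance, identical benchmark questions to the same language model can produce markedly different responses when prompted to a stateless system, in one user's chat session, or in a different user's chat session. In this work, we provide empirical evidence showcasing this phenomenon by comparing offline evaluations to field evaluations conducted by having 800 real users of ChatGPT and Gemini pose benchmark and other provided questions to their chat interfaces.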

[23] LLM-Assisted Topic Reduction for BERTopic on Social Media Data

Wannes Janssens,Matthias Bogaert,Dirk Van den Poel

Main category: cs.CL

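TL;DR: Proposes a topic modelling approach that combines BERTopic with large language models: BERTopic generates the topics, then an LLM judges semantic similarity and merges topics, improving topic diversity and coherence.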

Details Motivation: BERTopic tends to produce too many overlapping topics on noisy, sparse text such as social media, while end-to-end LLM approaches are computationally expensive and hard to scale. Method: BERTopic first generates the initial topics and their representations; these representations are then passed to an LLM, which iteratively identifies and merges semantically similar topics. Result: Across three Twitter/X datasets and four language models, the method outperforms the baseline on topic diversity and, in many cases, on topic coherence, with some sensitivity to dataset characteristics and initial parameters. Conclusion: Combining BERTopic with LLMs for topic reduction is effective and practical, improving topic quality while remaining scalable. Abstract: The BERTopic framework leverages transformer embeddings and hierarchical clustering to extract latent topics from unstructured text corpora. While effective, it often struggles with social media data, which tends to be noisy and sparse, resulting in an excessive number of overlapping topics. Recent work explored the use of large language models for end-to-end topic modelling. However, these approaches typically require significant computational overhead, limiting their scalability in big data contexts. In this work, we propose a framework that combines BERTopic for topic generation with large language models for topic reduction. The method first generates an initial set of topics and constructs a representation for each. These representations are then provided as input to the language model, which iteratively identifies and merges semantically similar topics. We evaluate the approach across three Twitter/X datasets and four different language models. Our method outperforms the baseline approach in enhancing topic diversity and, in many cases, coherence, with some sensitivity to dataset characteristics and initial parameter selection.

[24] Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding

Ruanjun Li,Ziheng Liu,Yuanming Shi,Jiawei Shao,Chi Zhang,Xuelong Li

Main category: cs.CL

TL;DR: Proposes Pipeline-Parallel Self-Speculative Decoding (PPSD), which pipelines draft generation and verification to significantly improve LLM inference efficiency, achieving speedups of 2.01x to 3.81x on multiple benchmarks.

Details Motivation: Existing early-exit based self-speculative decoding (EESD) methods struggle to deliver the expected acceleration in practice: when many draft tokens are rejected, the drafting cost can outweigh the gain and produce a negative speedup. Method: Configures the model layers as a pipeline so that early-exit (draft) computation overlaps with remaining-layer (verification) computation, and interleaves drafting and verification per token: while the model verifies the current token in its final layers, the early-exit path drafts the next token, giving a verify-while-draft scheme. Result: PPSD achieves speedups of 2.01x to 3.81x on diverse benchmarks, close to the optimal acceleration for a given acceptance rate and exit position, with no wasted computation. Conclusion: By fully pipelining self-speculative decoding, PPSD avoids the wasted work that failed predictions cause in conventional EESD and achieves state-of-the-art acceleration for self-speculative LLM inference. Abstract: Large language models (LLMs) deliver impressive generation quality, but incur very high inference cost because each output token is generated auto-regressively through all model layers. Early-exit based self-speculative decoding (EESD) has emerged to mitigate this cost. However, in practice, many approaches struggle to achieve the expected acceleration in such a draft-then-verify paradigm even with a well-aligned early-exit head and selected exit position. Our analysis reveals that EESD only pays off when the vast majority of draft tokens are accepted by the LLM. Otherwise, the draft cost may overcome the acceleration gain and lead to a negative speedup. To mitigate this, we propose Pipeline-Parallel Self-Speculative Decoding (PPSD) that fully pipelines the draft and verification work so that no effort is wasted on failed predictions. It has two key innovations. We configure the model layers as a pipeline in which early-exit (draft) computations and remaining-layer (verification) computations overlap. We interleave drafting and verification per token. While the LLM is verifying the current token in its final layers, the early-exit path simultaneously drafts the next token. Such a verify-while-draft scheme keeps all units busy and validates tokens on-the-fly, analogous to pipelining the speculation and verification stages. Empirical results confirm that PPSD achieves state-of-the-art acceleration in self-speculative LLM inference. On diverse benchmarks, PPSD achieves speedup ratios in the range of 2.01x~3.81x, which gains almost the optimal acceleration at the fixed acceptance rate and exit position, showcasing its advancement in providing efficient self-speculation.
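The claim that draft-then-verify decoding only pays off at high acceptance rates can be illustrated with a simplified cost model. The sketch below is not the paper's analysis: it assumes each draft token is accepted independently with probability `accept_rate`, that drafting one token costs a fraction `draft_cost` of a full forward pass, and that verification costs one full pass per cycle. All function and parameter names are hypothetical.

```python
def expected_tokens_per_cycle(accept_rate, draft_len):
    """Expected number of tokens emitted per draft-then-verify cycle.

    Assumes independent acceptance with probability accept_rate; the
    verification pass always contributes one token, hence the +1 in the
    exponent of the geometric sum.
    """
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def sequential_speedup(accept_rate, draft_len, draft_cost):
    """Speedup over plain autoregressive decoding under the assumed model.

    draft_cost is the cost of drafting one token relative to one full
    forward pass (e.g. 0.25 if the early exit sits a quarter of the way in).
    """
    cycle_cost = draft_len * draft_cost + 1.0  # drafts plus one verification
    return expected_tokens_per_cycle(accept_rate, draft_len) / cycle_cost


if __name__ == "__main__":
    # Low acceptance with a late exit: the result is below 1, a slowdown.
    print(sequential_speedup(accept_rate=0.5, draft_len=4, draft_cost=0.5))
    # High acceptance with an early exit: the result is above 1.
    print(sequential_speedup(accept_rate=0.9, draft_len=4, draft_cost=0.25))
```

Under these assumptions the ratio falls below 1 whenever the expected number of accepted tokens does not cover the drafting cost, which is the negative-speedup regime the abstract describes for sequential EESD.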

[25] SLM-Based Agentic AI with P-C-G: Optimized for Korean Tool Use

Changhyun Jeon,Jinhee Park,Jungwoo Choi,Keonwoo Kim,Jisu Kim,Minji Hong

Main category: cs.CL

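TL;DR: Proposes P-C-G (Planner-Caller-Generator), a role-specialized small language model agent architecture for Korean tool use that maintains performance while reducing cost and latency.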

Details Motivation: Frequent Korean-English code switching in Korean settings causes tool execution failures, motivating an SLM agent architecture designed specifically for Korean tool calling. Method: Uses a three-role Planner-Caller-Generator architecture responsible for planning, calling, and generation respectively; introduces a Korean-first value policy to reduce code-switching errors; and evaluates multiple scenarios through an LLM-as-a-Judge protocol under a unified interface. Result: Across single-chain, multi-chain, missing-parameters, and missing-functions scenarios with Korean queries and tool specifications, P-C-G delivers competitive tool-use accuracy and end-to-end quality while reducing token consumption and keeping latency acceptable. Conclusion: Role-specialized small language models are a cost-effective alternative for Korean tool-use agents. Abstract: We propose a small-scale language model (SLM) based agent architecture, Planner-Caller-Generator (P-C-G), optimized for Korean tool use. P-C-G separates planning, calling, and generation by role: the Planner produces an initial batch plan with limited on-demand replanning; the Caller returns a normalized call object after joint schema-value validation; and the Generator integrates tool outputs to produce the final answer. We apply a Korean-first value policy to reduce execution failures caused by frequent Korean-to-English code switching in Korean settings. Evaluation assumes Korean queries and Korean tool/parameter specifications; it covers single-chain, multi-chain, missing-parameters, and missing-functions scenarios, and is conducted via an LLM-as-a-Judge protocol averaged over five runs under a unified I/O interface. Results show that P-C-G delivers competitive tool-use accuracy and end-to-end quality while reducing tokens and maintaining acceptable latency, indicating that role-specialized SLMs are a cost-effective alternative for Korean tool-use agents.

[26] Meow: End-to-End Outline Writing for Automatic Academic Survey

Zhaoyu Ma,Yuan Shan,Jiahao Zhao,Nan Xu,Lei Wang

Main category: cs.CL

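TL;DR: This paper proposes Meow, the first metadata-driven outline writing framework. It generates hierarchical, structured survey outlines end-to-end from paper metadata, combining supervised fine-tuning with reinforcement learning, and its 8B reasoning model produces outlines with high structural fidelity and stylistic coherence.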

Details Motivation: As the number of academic papers surges, automatic survey generation is becoming a trend, but existing methods treat outline writing as a templated workflow step that lacks deep understanding of the topic and fine-grained style control, making high-quality outlines hard to produce. Method: Formulates outline writing as an end-to-end task that generates hierarchical structured outlines from paper metadata; curates a high-quality dataset of surveys from arXiv, bioRxiv, and medRxiv; and proposes a two-stage training approach combining supervised fine-tuning and reinforcement learning. Result: The proposed 8B reasoning model performs strongly on structural fidelity and stylistic coherence, significantly outperforming existing template-based methods. Conclusion: Meow is the first metadata-driven automated outline writing framework; it efficiently produces faithful, well-structured survey outlines and offers a new paradigm for automated academic survey generation. Abstract: As academic paper publication numbers grow exponentially, conducting in-depth surveys with LLMs automatically has become an inevitable trend. Outline writing, which aims to systematically organize related works, is critical for automated survey generation. Yet existing automatic survey methods treat outline writing as a mere workflow step in the overall pipeline. Such template-based workflows produce outlines that lack in-depth understanding of the survey topic and fine-grained styles. To address these limitations, we propose Meow, the first metadata-driven outline writing framework that produces organized and faithful outlines efficiently. Specifically, we first formulate outline writing as an end-to-end task that generates hierarchical structured outlines from paper metadata. We then curate a high-quality dataset of surveys from arXiv, bioRxiv, and medRxiv, and establish systematic evaluation metrics for outline quality assessment. Finally, we employ a two-stage training approach combining supervised fine-tuning and reinforcement learning. Our 8B reasoning model demonstrates strong performance with high structural fidelity and stylistic coherence.

[27] How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models

Kangtao Lv,Haibin Chen,Yujin Yuan,Langming Liu,Shilei Liu,Yongwei Wang,Wenbo Su,Bo Zheng

Main category: cs.CL

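TL;DR: This paper studies the memory collapse caused by infusing too much domain knowledge during LLM pretraining, and proposes a knowledge infusion scaling law based on model size to predict the optimal amount of domain knowledge to inject.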

Details Motivation: Without domain-specific optimization, LLMs underperform on specialized benchmarks and tend to hallucinate; however, excessive domain knowledge infusion causes catastrophic forgetting, so the degree of infusion must be balanced. Method: Systematic experiments identify the memory collapse point at different model scales, and a knowledge infusion scaling law is proposed that uses small models to predict the optimal infusion amount for large models. Result: Every model has a critical collapse point beyond which knowledge retention degrades sharply, and these collapse points scale consistently with model size; experiments validate the effectiveness and generalizability of the proposed scaling law. Conclusion: The proposed knowledge infusion scaling law can guide domain knowledge infusion in LLMs, avoiding memory collapse and improving downstream performance. Abstract: Large language models (LLMs) have attracted significant attention due to their impressive general capabilities across diverse downstream tasks. However, without domain-specific optimization, they often underperform on specialized knowledge benchmarks and even produce hallucinations. Recent studies show that strategically infusing domain knowledge during pretraining can substantially improve downstream performance. A critical challenge lies in balancing this infusion trade-off: injecting too little domain-specific data yields insufficient specialization, whereas excessive infusion triggers catastrophic forgetting of previously acquired knowledge. In this work, we focus on the phenomenon of memory collapse induced by over-infusion. Through systematic experiments, we make two key observations: 1) Critical collapse point: each model exhibits a threshold beyond which its knowledge retention capabilities sharply degrade. 2) Scale correlation: these collapse points scale consistently with the model's size. Building on these insights, we propose a knowledge infusion scaling law that predicts the optimal amount of domain knowledge to inject into large LLMs by analyzing their smaller counterparts. Extensive experiments across different model sizes and pretraining token budgets validate both the effectiveness and generalizability of our scaling law.

[28] A Pipeline to Assess Merging Methods via Behavior and Internals

Yutaro Sigris,Andreas Waldis

Main category: cs.CL

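TL;DR: This paper proposes a new framework for assessing language model merging methods that evaluates not only the merged model's behavior on downstream tasks but also its internal linguistic competence. Merged models typically perform between their two parent models behaviorally, yet their encoding of linguistic phenomena such as morphology and syntax can exceed that of the parents, and behavioral and internal evaluations correlate only weakly.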

Details Motivation: Existing work analyzes model merging mainly at the behavioral level and lacks a systematic understanding of model internals; this paper jointly evaluates behavior and internal representations to reveal the capabilities and limits of merged models. Method: Proposes a new evaluation pipeline: first merge multiple parent language models (such as instruction fine-tuned, math-adapted, and code-adapted models from the Qwen2.5 family), then evaluate behavior on downstream tasks such as MMLU and analyze the internally encoded linguistic competence (such as morphology and syntax). Result: The behavioral performance of merged models typically lies between that of the two parents, but their internal encoding of linguistic phenomena (especially morphology and syntax) sometimes exceeds the parents'; the rank correlation between behavioral and internal evaluation is weak. Conclusion: Behavioral evaluation alone is not enough to understand the effects of model merging; it must be combined with analysis of internal representations to gain a faithful picture of merged models' capabilities and reliability. Abstract: Merging methods combine the weights of multiple language models (LMs) to leverage their capacities, such as for domain adaptation. While existing studies investigate merged models from a solely behavioral perspective, we offer the first comprehensive view by assessing and connecting their behavior and internals. We present a novel evaluation pipeline that first merges multiple parent LMs, and then evaluates the merged models in comparison to the initial ones based on their behavior on downstream tasks, like MMLU, and the internal encoded linguistic competence. We showcase this pipeline by assessing the merging of instruction fine-tuned with math- and code-adapted LMs from the Qwen2.5 family. Our results show that merging methods impact behavior and internals differently. While the performance of merged models is typically between that of the two parent models, their encoded information about linguistic phenomena, particularly in morphology and syntax, can surpass the parent models. Moreover, we find weak ranking correlation between the behavioral and internal evaluations. With our pipeline and initial results, we emphasize the need for more comprehensive evaluations of model merging methods to gain a faithful understanding of their capabilities and reliability, beyond potential superficial behavioral advances.

[29] Do LLMs Encode Frame Semantics? Evidence from Frame Identification

Jayanth Krishna Chundru,Rudrashis Poddar,Jie Cao,Tianyu Jiang

Main category: cs.CL

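TL;DR: The study investigates whether LLMs encode latent knowledge of frame semantics, focusing on frame identification. Models perform the task effectively without explicit supervision, improve further after fine-tuning on FrameNet data, and can generate semantically coherent frame definitions.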

Details Motivation: To explore whether LLMs understand frame semantics without supervision, particularly for the core task of frame identification. Method: Using the FrameNet resource, the authors evaluate models with prompt-based inference and fine-tune on FrameNet data to assess the impact of task-specific training. Result: Models perform frame identification effectively without explicit supervision; fine-tuning substantially improves in-domain accuracy and generalizes well to out-of-domain benchmarks; models can also generate semantically coherent frame definitions. Conclusion: LLMs have internalized a degree of frame semantic knowledge, with strong frame identification and semantic generation abilities that fine-tuning can further improve. Abstract: We investigate whether large language models encode latent knowledge of frame semantics, focusing on frame identification, a core challenge in frame semantic parsing that involves selecting the appropriate semantic frame for a target word in context. Using the FrameNet lexical resource, we evaluate models under prompt-based inference and observe that they can perform frame identification effectively even without explicit supervision. To assess the impact of task-specific training, we fine-tune the model on FrameNet data, which substantially improves in-domain accuracy while generalizing well to out-of-domain benchmarks. Further analysis shows that the models can generate semantically coherent frame definitions, highlighting the model's internalized understanding of frame semantics.

[30] Confidence Calibration in Large Language Model-Based Entity Matching

Iris Kamsteeg,Juan Cardenas-Cartagena,Floris van Beers,Gineke ten Holt,Tsegaye Misikir Tashu,Matias Valdenegro-Toro

Main category: cs.CL

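TL;DR: This paper studies confidence calibration for large language models in entity matching, using temperature scaling and other methods to reduce the overconfidence of a RoBERTa model.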

Details Motivation: To explore how well confidence calibration works for large language models on entity matching, with the goal of improving model reliability. Method: Calibrates RoBERTa confidences with Temperature Scaling, Monte Carlo Dropout, and Ensembles, with experiments on the Abt-Buy, DBLP-ACM, iTunes-Amazon, and Company datasets. Result: The baseline RoBERTa model is slightly overconfident, with Expected Calibration Error between 0.0043 and 0.0552; Temperature Scaling reduces Expected Calibration Error by up to 23.83%. Conclusion: Temperature Scaling effectively improves RoBERTa's confidence calibration on entity matching and makes its outputs more reliable. Abstract: This research aims to explore the intersection of Large Language Models and confidence calibration in Entity Matching. To this end, we perform an empirical study to compare baseline RoBERTa confidences for an Entity Matching task against confidences that are calibrated using Temperature Scaling, Monte Carlo Dropout and Ensembles. We use the Abt-Buy, DBLP-ACM, iTunes-Amazon and Company datasets. The findings indicate that the proposed modified RoBERTa model exhibits a slight overconfidence, with Expected Calibration Error scores ranging from 0.0043 to 0.0552 across datasets. We find that this overconfidence can be mitigated using Temperature Scaling, reducing Expected Calibration Error scores by up to 23.83%.
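For readers unfamiliar with the two quantities named in the abstract, the sketch below shows one common way to compute Expected Calibration Error with equal-width confidence bins, and how dividing logits by a temperature softens the resulting confidence. It is a generic illustration, not the paper's code: the bin count, the function names, and the toy inputs are assumptions.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1].

    confidences -- predicted confidence of the chosen class, one per example
    correct     -- booleans, True where the prediction was right
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Weight each bin's |confidence - accuracy| gap by its share of examples.
        ece += len(items) / n * abs(avg_conf - accuracy)
    return ece


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature > 1 flattens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Fully confident but only half right: the calibration gap is large.
    print(expected_calibration_error([1.0] * 4, [True, True, False, False]))
    # The same logits at two temperatures: the higher one is less confident.
    print(softmax_with_temperature([2.0, 0.0], temperature=1.0))
    print(softmax_with_temperature([2.0, 0.0], temperature=2.0))
```

In practice the temperature is a single scalar fitted on a held-out validation set, typically by minimizing negative log-likelihood; that fitting step is omitted here.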

[31] Uncertainty in Semantic Language Modeling with PIXELS

Stefania Radu,Marco Zullich,Matias Valdenegro-Toro

Main category: cs.CL

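TL;DR: The study analyzes uncertainty and confidence in pixel-based language models across 18 languages and 7 scripts. It finds that these models underestimate uncertainty when reconstructing patches and that the script (for example, Latin) affects the level of uncertainty. Ensemble learning with hyperparameter tuning performs better on named entity recognition and question answering.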

Details Motivation: Pixel-based language models address the vocabulary bottleneck problem, but uncertainty quantification for them remains an open challenge. Method: Uses Monte Carlo Dropout, Transformer Attention, and Ensemble Learning to analyze uncertainty across languages and scripts on three semantically challenging tasks. Result: Pixel-based models underestimate uncertainty when reconstructing patches; the script affects uncertainty, with languages written in Latin script showing lower uncertainty; ensemble learning with hyperparameter tuning performs better on named entity recognition and question answering across 16 languages. Conclusion: The study exposes the limits of uncertainty estimation in pixel-based language models and shows that ensembles combined with tuning can improve performance on multilingual tasks. Abstract: Pixel-based language models aim to solve the vocabulary bottleneck problem in language modeling, but the challenge of uncertainty quantification remains open. The novelty of this work consists of analysing uncertainty and confidence in pixel-based language models across 18 languages and 7 scripts, all part of 3 semantically challenging tasks. This is achieved through several methods such as Monte Carlo Dropout, Transformer Attention, and Ensemble Learning. The results suggest that pixel-based models underestimate uncertainty when reconstructing patches. The uncertainty is also influenced by the script, with Latin languages displaying lower uncertainty. The findings on ensemble learning show better performance when applying hyperparameter tuning during the named entity recognition and question-answering tasks across 16 languages.

[32] Retrieval Augmented Generation based context discovery for ASR

Dimitrios Siskos,Stavros Papadopoulos,Pablo Peso Parada,Jisi Zhang,Karthikeyan Saravanan,Anastasios Drosou

Main category: cs.CL

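TL;DR: This paper proposes an embedding-based retrieval approach for efficient automatic context discovery in context-aware automatic speech recognition (ASR) systems, improving transcription accuracy when rare or out-of-vocabulary terms are present.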

Details Motivation: Rare or out-of-vocabulary terms lower transcription accuracy in ASR systems, and automatically identifying the right context remains a challenge. Method: Proposes an embedding-based retrieval augmented generation approach and compares it with two LLM-based alternatives: context generation via prompting and post-recognition transcript correction with LLMs. Result: Experiments on TED-LIUMv3, Earnings21, and SPGISpeech show that the approach reduces WER by up to 17% relative to using no context, while oracle context yields a reduction of up to 24.1%. Conclusion: Embedding-based retrieval is an efficient and effective strategy for automatic context discovery and helps ASR systems recognize rare terms. Abstract: This work investigates retrieval augmented generation as an efficient strategy for automatic context discovery in context-aware Automatic Speech Recognition (ASR) systems, in order to improve transcription accuracy in the presence of rare or out-of-vocabulary terms. However, identifying the right context automatically remains an open challenge. This work proposes an efficient embedding-based retrieval approach for automatic context discovery in ASR. To contextualize its effectiveness, two alternatives based on large language models (LLMs) are also evaluated: (1) LLM-based context generation via prompting, and (2) post-recognition transcript correction using LLMs. Experiments on the TED-LIUMv3, Earnings21 and SPGISpeech datasets demonstrate that the proposed approach reduces WER by up to 17% (percentage difference) relative to using no context, while the oracle context results in a reduction of up to 24.1%.
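The abstract does not describe how the retrieval step is implemented. As a generic illustration of embedding-based retrieval, the sketch below ranks candidate context documents by cosine similarity to a query vector and returns the top matches. The toy three-dimensional vectors, the document labels, and the function names are placeholders invented for this example; a real system would obtain embeddings from an encoder model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_context(query_vec, corpus, k=2):
    """Return the texts of the k corpus entries most similar to query_vec.

    corpus is a list of (text, vector) pairs.
    """
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical context documents with made-up embeddings.
    corpus = [
        ("earnings call glossary", [1.0, 0.0, 0.0]),
        ("cooking recipes", [0.0, 1.0, 0.0]),
        ("quarterly revenue terms", [0.9, 0.1, 0.0]),
    ]
    # A query embedding that points mostly along the first axis.
    print(retrieve_context([1.0, 0.05, 0.0], corpus, k=2))
```

The retrieved texts would then be supplied to a context-aware recognizer as biasing context; how that is done depends on the ASR system and is outside this sketch.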

[33] ExPe: Exact Positional Encodings for Generative Transformer Models with Extrapolating Capabilities

Aleksis Datseris,Sylvia Vassileva,Ivan Koychev,Svetla Boytcheva

Main category: cs.CL

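TL;DR: Proposes a new method called "Exact Positional Embeddings" (ExPE) that extrapolates effectively to sequences longer than those seen in training and significantly reduces perplexity in causal language modeling.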

Details Motivation: Position embeddings in conventional Transformer models extrapolate poorly to sequences longer than the training length, which limits generalization. Method: Encodes exact positional information by overriding specific dimensions of the embedding vectors, giving an absolute position embedding method that can extrapolate. Result: In causal language modeling, ExPE outperforms rotary and sinusoidal embeddings on long sequences, significantly reducing perplexity. Conclusion: ExPE improves generalization to long sequences while preserving the integrity of the original embeddings, making it an efficient position embedding scheme. Abstract: This paper introduces a novel approach to position embeddings in transformer models, named "Exact Positional Embeddings" (ExPE). ExPE is an absolute positional embedding method that can extrapolate to sequences longer than the ones it was trained on. Traditional transformer models rely on absolute or relative position embeddings to incorporate positional information into token embeddings, which often struggle with extrapolation to sequences longer than those seen during training. Our proposed method utilizes a novel embedding strategy that encodes exact positional information by overriding specific dimensions of the embedding vectors, thereby enabling a more precise representation of token positions. The proposed approach not only maintains the integrity of the original embeddings but also enhances the model's ability to generalize to more extended sequences. In causal language modeling, our ExPE embeddings significantly reduce perplexity compared to rotary and sinusoidal embeddings, when tested on sequences longer than those used in training.

[34] LLMs4All: A Review on Large Language Models for Research and Applications in Academic Disciplines

Yanfang Ye,Zheyuan Zhang,Tianyi Ma,Zehong Wang,Yiyang Li,Shifu Hou,Weixiang Sun,Kaiwen Shi,Yijun Ma,Wei Song,Ahmed Abbasi,Ying Cheng,Jane Cleland-Huang,Steven Corcelli,Patricia Culligan,Robert Goulding,Ming Hu,Ting Hua,John Lalor,Fang Liu,Tengfei Luo,Ed Maginn,Nuno Moniz,Jason Rohr,Brett Savoie,Daniel Slate,Tom Stapleford,Matthew Webber,Olaf Wiest,Johnny Zhang,Nitesh Chawla

Main category: cs.CL

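TL;DR: This paper reviews the applications of large language models (LLMs) across many academic disciplines, examines their impact on research and practice, and discusses key challenges and future directions in the era of generative AI.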

Details Motivation: Inspired by the strong performance of LLMs on natural language processing tasks, the paper explores their cross-disciplinary potential and real-world impact. Method: A systematic review of recent LLM applications in the arts, humanities, law, economics, business, science, and engineering. Result: Summarizes the state of LLM applications in each discipline, along with key limitations and open challenges, and offers insights on cross-domain integration. Conclusion: LLMs have broad application prospects, but technical, ethical, and practical challenges must be addressed to realize their full potential. Abstract: Cutting-edge Artificial Intelligence (AI) techniques keep reshaping our view of the world. For example, Large Language Models (LLMs) based applications such as ChatGPT have shown the capability of generating human-like conversation on extensive topics. Due to the impressive performance on a variety of language-related tasks (e.g., open-domain question answering, translation, and document summarization), one can envision the far-reaching impacts that can be brought by the LLMs with broader real-world applications (e.g., customer service, education and accessibility, and scientific discovery). Inspired by their success, this paper will offer an overview of state-of-the-art LLMs and their integration into a wide range of academic disciplines, including: (1) arts, letters, and law (e.g., history, philosophy, political science, arts and architecture, law), (2) economics and business (e.g., finance, economics, accounting, marketing), and (3) science and engineering (e.g., mathematics, physics and mechanical engineering, chemistry and chemical engineering, life sciences and bioengineering, earth sciences and civil engineering, computer science and electrical engineering). Integrating humanity and technology, in this paper, we will explore how LLMs are shaping research and practice in these fields, while also discussing key limitations, open challenges, and future directions in the era of generative AI. The review of how LLMs are engaged across disciplines, along with key observations and insights, can help researchers and practitioners interested in exploiting LLMs to advance their work in diverse real-world applications.

[35] GuessingGame: Measuring the Informativeness of Open-Ended Questions in Large Language Models

Dylan Hutson,Daniel Vennemeyer,Aneesh Deshmukh,Justin Zhan,Tianyu Jiang

Main category: cs.CL

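TL;DR: Proposes the GuessingGame protocol, which uses information gain (IG) metrics to evaluate LLMs as strategic question-askers. Higher IG predicts more efficient questioning, and IG-guided prompting constraints significantly improve weaker models.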

Details Motivation: Existing methods struggle to evaluate the strategic question-asking ability of LLMs in open-domain, open-ended settings, and lack metrics that are quantifiable and actionable. Method: Designs the GuessingGame protocol, in which a Guesser LLM identifies a hidden object by posing questions to an Oracle without predefined choices; proposes two model-agnostic information gain (IG) metrics, one based on Bayesian belief updates and one entropy-based method that filters candidates via ConceptNet. Result: Across 858 games, a one-standard-deviation increase in IG reduces expected game length by 43%; IG-guided prompting constraints, such as enforcing question diversity, significantly improve weaker models. Conclusion: Question-asking in LLMs is both measurable and improvable, information gain is an effective metric for it, and the ability matters for interactive reasoning. Abstract: We introduce GuessingGame, a protocol for evaluating large language models (LLMs) as strategic question-askers in open-ended, open-domain settings. A Guesser LLM identifies a hidden object by posing free-form questions to an Oracle without predefined choices or candidate lists. To measure question quality, we propose two information gain (IG) metrics: a Bayesian method that tracks belief updates over semantic concepts using LLM-scored relevance, and an entropy-based method that filters candidates via ConceptNet. Both metrics are model-agnostic and support post hoc analysis. Across 858 games with multiple models and prompting strategies, higher IG strongly predicts efficiency: a one-standard-deviation IG increase reduces expected game length by 43%. Prompting constraints guided by IG, such as enforcing question diversity, enable weaker models to significantly improve performance. These results show that question-asking in LLMs is both measurable and improvable, and crucial for interactive reasoning.
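To make the entropy-based notion of information gain concrete, the sketch below computes the gain of a single answered question under a simplifying assumption: belief is uniform over the remaining candidates, so entropy is just the base-2 logarithm of the candidate count. The candidate list and the filter predicate are toy stand-ins; the paper filters candidates via ConceptNet, which is not reproduced here, and all names are hypothetical.

```python
import math


def entropy_bits(num_candidates):
    """Entropy in bits of a uniform belief over num_candidates options."""
    return math.log2(num_candidates)


def information_gain(candidates, keep):
    """Filter candidates with the predicate keep and return (gain, remaining).

    gain is the entropy reduction in bits under a uniform belief.
    """
    remaining = [c for c in candidates if keep(c)]
    if not remaining:
        raise ValueError("answer is inconsistent with every candidate")
    gain = entropy_bits(len(candidates)) - entropy_bits(len(remaining))
    return gain, remaining


if __name__ == "__main__":
    animals = ["cat", "dog", "eagle", "sparrow", "shark", "salmon", "ant", "bee"]
    birds = {"eagle", "sparrow"}
    # "Is it a bird?" answered yes: 8 candidates shrink to 2.
    gain, remaining = information_gain(animals, lambda a: a in birds)
    print(gain, remaining)
```

A question that halves the candidate set is worth exactly one bit, so in this toy setting the most informative yes/no questions are the ones that split the remaining candidates most evenly.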

[36] Anatomy of a Feeling: Narrating Embodied Emotions via Large Vision-Language Models

Mohammad Saim,Phan Anh Duong,Cat Luong,Aniket Bhanderi,Tianyu Jiang

Main category: cs.CL

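TL;DR: Proposes a framework that uses large vision-language models to generate Embodied LVLM Emotion Narratives (ELENA), analyzing embodied emotion by focusing on the salient body parts involved in emotional reactions.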

Details Motivation: Emotional reactions expressed through body parts carry rich information about affective experience, yet existing methods still fall short at emotion recognition from non-facial regions. Method: Uses state-of-the-art large vision-language models (LVLMs) to generate multi-layered text outputs (ELENA), analyzes with attention maps how the models distribute attention over body parts, and tests emotion recognition on face-masked images. Result: Although the models show an attention bias towards the facial region, the framework recognizes embodied emotions in face-masked images without any fine-tuning and outperforms baselines. Conclusion: ELENA opens a new path for embodied emotion analysis in the visual modality and enriches modeling in affect-aware settings. Abstract: The embodiment of emotional reactions from body parts contains rich information about our affective experiences. We propose a framework that utilizes state-of-the-art large vision-language models (LVLMs) to generate Embodied LVLM Emotion Narratives (ELENA). These are well-defined, multi-layered text outputs, primarily comprising descriptions that focus on the salient body parts involved in emotional reactions. We also employ attention maps and observe that contemporary models exhibit a persistent bias towards the facial region. Despite this limitation, we observe that our employed framework can effectively recognize embodied emotions in face-masked images, outperforming baselines without any fine-tuning. ELENA opens a new trajectory for embodied emotion analysis across the modality of vision and enriches modeling in an affect-aware setting.

[37] Evaluating Language Translation Models by Playing Telephone

Syeda Jannatus Saba,Steven Skiena

Main category: cs.CL

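TL;DR: Proposes an unsupervised method that generates training data for translation evaluation by repeatedly translating between source and target languages, improving evaluation systems on tasks such as long-form and literary translation.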

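Details Motivation: The effectiveness of current language models has outrun our ability to evaluate machine translation quality, which limits further improvement on harder tasks such as long-form and literary translation. Method: An unsupervised method generates translation evaluation training data of varying document lengths and domains through repeated rounds of translation between source and target languages, and compares texts generated with model rotation and with language translation approaches. Result: The method is validated on two tasks: (i) scoring the quality of a given translation against a human reference, and (ii) selecting which of two translations is generationally closer to the original source document. It outperforms the popular xCOMET system on both. Conclusion: The proposed unsupervised data generation method improves translation evaluation systems and is particularly suited to complex translation settings that lack human-annotated data. Abstract: Our ability to efficiently and accurately evaluate the quality of machine translation systems has been outrun by the effectiveness of current language models, which limits the potential for further improving these models on more challenging tasks like long-form and literary translation. We propose an unsupervised method to generate training data for translation evaluation over different document lengths and application domains by repeated rounds of translation between source and target languages. We evaluate evaluation systems trained on texts mechanically generated using both model rotation and language translation approaches, demonstrating improved performance over a popular translation evaluation system (xCOMET) on two different tasks: (i) scoring the quality of a given translation against a human reference and (ii) selecting which of two translations is generationally closer to an original source document.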
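
The "telephone" idea can be sketched as a loop of round-trip translations that yields ordered generations, from which (closer, farther) training pairs follow. This is a minimal sketch under stated assumptions: `toy_forward` and `toy_backward` are hypothetical stand-ins for real translation systems, and treating earlier generations as closer to the source is the assumption that makes the labels free.

```python
from itertools import combinations

def telephone_chain(source, forward, backward, rounds):
    """Return [source, gen_1, ..., gen_rounds] from repeated round trips."""
    chain = [source]
    for _ in range(rounds):
        chain.append(backward(forward(chain[-1])))
    return chain

def preference_pairs(chain):
    """(closer, farther) pairs: an earlier generation is assumed closer to the source."""
    return [(chain[i], chain[j]) for i, j in combinations(range(1, len(chain)), 2)]

# Hypothetical toy "translators": each round trip loses the last word,
# mimicking the drift that accumulates over repeated translation.
def toy_forward(text):
    return text.upper()

def toy_backward(text):
    return " ".join(text.lower().split()[:-1])

chain = telephone_chain("the quick brown fox jumps", toy_forward, toy_backward, rounds=3)
pairs = preference_pairs(chain)
for closer, farther in pairs:
    print(repr(closer), "is closer to the source than", repr(farther))
```

In a real setting the two toy functions would be replaced by calls to one or more translation models, which is where the model rotation and language translation variants mentioned in the abstract would differ.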

[38] AutoSpec: An Agentic Framework for Automatically Drafting Patent Specification

Ryan Shea,Zhou Yu

Main category: cs.CL

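TL;DR: Proposes AutoSpec, a framework for automatically drafting patent specifications that decomposes the task among open-source language models equipped with custom tools, addressing the confidentiality and technical complexity of patent drafting; expert evaluation shows it outperforms existing baselines.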

Details Motivation: Drafting patent applications is expensive and time-consuming and involves confidential information, which makes closed-source LLMs hard to use for automation; current language models are also limited by long contexts, technical writing style, and specialized domain knowledge, so a secure and efficient automated drafting approach is needed. Method: Proposes the AutoSpec framework, which decomposes patent drafting into manageable subtasks, each handled by a smaller open-source language model equipped with custom tools, to generate patent specifications securely. Result: Under a new evaluation protocol designed with experienced patent attorneys, AutoSpec outperforms existing baselines in both automatic and expert evaluations. Conclusion: AutoSpec offers a secure, effective, and practical approach to automated patent drafting and advances the use of language models for highly specialized technical documents. Abstract: Patents play a critical role in driving technological innovation by granting inventors exclusive rights to their inventions. However, the process of drafting a patent application is often expensive and time-consuming, making it a prime candidate for automation. Despite recent advancements in language models, several challenges hinder the development of robust automated patent drafting systems. First, the information within a patent application is highly confidential, which often prevents the use of closed-source LLMs for automating this task. Second, the process of drafting a patent application is difficult for even the most advanced language models due to their long context, technical writing style, and specialized domain knowledge. To address these challenges, we introduce AutoSpec, a secure, agentic framework for Automatically drafting patent Specification. Our approach decomposes the drafting process into a sequence of manageable subtasks, each solvable by smaller, open-source language models enhanced with custom tools tailored for drafting patent specification. To assess our system, we design a novel evaluation protocol in collaboration with experienced patent attorneys. Our automatic and expert evaluations show that AutoSpec outperforms existing baselines on a patent drafting task.

[39] Large Language Models for Pedestrian Safety: An Application to Predicting Driver Yielding Behavior at Unsignalized Intersections

Yicheng Yang,Zixian Li,Jean Paul Bizimana,Niaz Zafri,Yongfeng Dong,Tianyi Li

Main category: cs.CL

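TL;DR: Uses multimodal large language models (LLMs) with a novel prompt design for interpretable, context-aware inference of driver yielding behavior, to better model pedestrian-driver interaction. GPT-4o achieves the highest accuracy and recall and Deepseek-V3 the highest precision, revealing a trade-off between model performance and computational efficiency.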

Details Motivation: Traditional machine learning models struggle to capture the complex, context-dependent behavior in pedestrian-driver interactions at intersections; they rely on fixed feature representations and offer limited interpretability. Method: A novel prompt design combining domain-specific knowledge, structured reasoning, and few-shot prompting is used with multimodal LLMs to model driver yielding behavior, and the LLMs are benchmarked against traditional classifiers. Result: GPT-4o achieves the highest accuracy and recall, and Deepseek-V3 the highest precision; the results highlight a trade-off between model performance and computational efficiency. Conclusion: With suitable prompt engineering, LLMs can model complex pedestrian-vehicle interactions with both accuracy and interpretability, offering practical guidance for deploying them in real-world pedestrian safety systems. Abstract: Pedestrian safety is a critical component of urban mobility and is strongly influenced by the interactions between pedestrian decision-making and driver yielding behavior at crosswalks. Modeling driver-pedestrian interactions at intersections requires accurately capturing the complexity of these behaviors. Traditional machine learning models often struggle to capture the nuanced and context-dependent reasoning required for these multifactorial interactions, due to their reliance on fixed feature representations and limited interpretability. In contrast, large language models (LLMs) are suited for extracting patterns from heterogeneous traffic data, enabling accurate modeling of driver-pedestrian interactions. Therefore, this paper leverages multimodal LLMs through a novel prompt design that incorporates domain-specific knowledge, structured reasoning, and few-shot prompting, enabling interpretable and context-aware inference of driver yielding behavior, as an example application of modeling pedestrian-driver interaction. We benchmarked state-of-the-art LLMs against traditional classifiers, finding that GPT-4o consistently achieves the highest accuracy and recall, while Deepseek-V3 excels in precision. These findings highlight the critical trade-offs between model performance and computational efficiency, offering practical guidance for deploying LLMs in real-world pedestrian safety systems.

[40] DyBBT: Dynamic Balance via Bandit inspired Targeting for Dialog Policy with Cognitive Dual-Systems

Shuyu Zhang,Yifan Wei,Jialuo Yuan,Xinru Wang,Yanmin Zhu,Bin Li

Main category: cs.CL

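TL;DR: Proposes DyBBT, a dynamic dialog policy learning framework built on a cognitive state space, which explores efficiently by combining intuitive and deliberative reasoning and achieves state-of-the-art performance on single- and multi-domain task-oriented dialog benchmarks.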

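Details Motivation: Task-oriented dialog systems rely on static exploration strategies that do not adapt to dynamic dialog contexts, leading to inefficient exploration and suboptimal performance. Method: Builds a structured cognitive state space that captures dialog progression, user uncertainty, and slot dependency, and designs a bandit-inspired meta-controller that switches dynamically between fast intuitive inference (System 1) and slow deliberative reasoning (System 2) based on real-time cognitive states and visitation counts. Result: Experiments on single- and multi-domain benchmarks show that DyBBT achieves state-of-the-art success rate, efficiency, and generalization, and human evaluation confirms that its decisions align well with expert judgment. Conclusion: By adjusting its exploration strategy dynamically, DyBBT improves task-oriented dialog systems and shows the potential of bringing dual-process theory into dialog policy learning. Abstract: Task-oriented dialog systems often rely on static exploration strategies that do not adapt to dynamic dialog contexts, leading to inefficient exploration and suboptimal performance. We propose DyBBT, a novel dialog policy learning framework that formalizes the exploration challenge through a structured cognitive state space capturing dialog progression, user uncertainty, and slot dependency. DyBBT proposes a bandit-inspired meta-controller that dynamically switches between a fast intuitive inference (System 1) and a slow deliberative reasoner (System 2) based on real-time cognitive states and visitation counts. Extensive experiments on single- and multi-domain benchmarks show that DyBBT achieves state-of-the-art performance in success rate, efficiency, and generalization, with human evaluations confirming its decisions are well aligned with expert judgment. Code is available at https://github.com/carsonz/DyBBT.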
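
One way to picture a bandit-inspired switch between the two systems is a UCB-style exploration bonus that is large for rarely visited states and shrinks as visits accumulate. The sketch below is only an illustration under that assumption: the scoring rule, the constants `c` and `threshold`, and the function name are hypothetical, and the paper's actual controller may differ.

```python
import math

def choose_system(uncertainty, visits, total_steps, c=0.5, threshold=0.6):
    """Pick the slow reasoner when uncertainty plus an exploration bonus is high.

    uncertainty: score in [0, 1] for the current cognitive state (assumed given).
    visits:      how often this state has been seen before.
    total_steps: total number of decisions made so far.
    """
    # UCB-style bonus: large for rarely visited states, small for familiar ones.
    bonus = c * math.sqrt(math.log(total_steps + 1) / (visits + 1))
    return "system2" if uncertainty + bonus > threshold else "system1"

# A novel, uncertain state is routed to deliberate reasoning ...
print(choose_system(uncertainty=0.7, visits=0, total_steps=100))
# ... while a familiar, low-uncertainty state stays on the fast path.
print(choose_system(uncertainty=0.1, visits=500, total_steps=1000))
```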

[41] Personality Vector: Modulating Personality of Large Language Models by Model Merging

Seungjong Sun,Seo Yeon Baek,Jang Hyun Kim

Main category: cs.CL

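TL;DR: Proposes a new method for modulating LLM personality via model merging; personality vectors allow personality control and composition without additional training and transfer across models.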

Details Motivation: Existing methods struggle to capture the continuous, multidimensional nature of human personality, so a more flexible way of modeling personality is needed. Method: Personality vectors are built by subtracting the weights of the pre-trained model from those of a model fine-tuned on a given trait, and merging these vectors modulates the personality of an LLM. Result: Experiments show that personality vectors give continuous control over trait intensity, support composition of multiple traits, and transfer across different downstream models. Conclusion: Personality vectors are an effective, general, training-free means of personality modulation and open a new path for personalized AI systems. Abstract: Driven by the demand for personalized AI systems, there is growing interest in aligning the behavior of large language models (LLMs) with human traits such as personality. Previous attempts to induce personality in LLMs have shown promising results, but they struggle to capture the continuous and multidimensional nature of human traits. In this work, we propose a novel method for personality modulation in LLMs via model merging. Specifically, we construct personality vectors by subtracting the weights of a pre-trained model from those of the fine-tuned model on a given personality trait. By merging personality vectors, we enable LLMs to exhibit desired personality traits without additional training. Extensive experiments show that personality vectors enable continuous control over trait intensity and support the composition of multiple traits. Furthermore, personality vectors transfer across diverse downstream models, suggesting that they encode generalizable representations of personality. Our code is available here.
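
The weight arithmetic described above can be sketched on toy parameters. The example assumes the simplest merging scheme, a scaled sum of vectors added to a base model; the paper may use other merging methods, and the parameter name `w`, the weights, and the scale values are made up for illustration.

```python
def personality_vector(finetuned, pretrained):
    """Per-parameter difference: fine-tuned weights minus pre-trained weights."""
    return {k: [f - p for f, p in zip(finetuned[k], pretrained[k])]
            for k in pretrained}

def merge(base, vectors, scales):
    """Add each scaled personality vector to a copy of the base weights."""
    merged = {k: list(v) for k, v in base.items()}
    for vec, scale in zip(vectors, scales):
        for k in merged:
            merged[k] = [m + scale * d for m, d in zip(merged[k], vec[k])]
    return merged

# Toy weights (hypothetical): one parameter tensor flattened to a list.
pretrained = {"w": [1.0, 2.0]}
ft_extraversion = {"w": [1.5, 2.0]}
ft_agreeableness = {"w": [1.0, 3.0]}

v_extra = personality_vector(ft_extraversion, pretrained)
v_agree = personality_vector(ft_agreeableness, pretrained)

# Full-strength extraversion combined with half-strength agreeableness.
merged = merge(pretrained, [v_extra, v_agree], scales=[1.0, 0.5])
print(merged)
```

Varying a scale continuously is what would give graded control over a trait, and passing several vectors at once corresponds to composing traits.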

[42] HiCoLoRA: Addressing Context-Prompt Misalignment via Hierarchical Collaborative LoRA for Zero-Shot DST

Shuyu Zhang,Yifan Wei,Xinru Wang,Yanmin Zhu,Yangfan He,Yixuan Weng,Bin Li

Main category: cs.CL

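TL;DR: Proposes the HiCoLoRA framework to address semantic misalignment in zero-shot dialog state tracking, achieving SOTA performance on the MultiWOZ and SGD datasets.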

Details Motivation: To resolve the semantic misalignment between dynamic dialog contexts and static prompts in zero-shot dialog state tracking, and to avoid domain interference and catastrophic forgetting. Method: A hierarchical LoRA architecture is combined with Spectral Joint Domain-Slot Clustering and an Adaptive Linear Fusion Mechanism, and Semantic-Enhanced SVD Initialization (SemSVD-Init) is used to preserve pre-trained knowledge. Result: Experiments on the multi-domain datasets MultiWOZ and SGD show that HiCoLoRA outperforms baselines and achieves state-of-the-art zero-shot DST performance. Conclusion: Through hierarchical collaborative low-rank adaptation, HiCoLoRA improves zero-shot slot inference and shows good cross-domain generalization and stability. Abstract: Zero-shot Dialog State Tracking (zs-DST) is essential for enabling Task-Oriented Dialog Systems (TODs) to generalize to new domains without costly data annotation. A central challenge lies in the semantic misalignment between dynamic dialog contexts and static prompts, leading to inflexible cross-layer coordination, domain interference, and catastrophic forgetting. To tackle this, we propose Hierarchical Collaborative Low-Rank Adaptation (HiCoLoRA), a framework that enhances zero-shot slot inference through robust prompt alignment. It features a hierarchical LoRA architecture for dynamic layer-specific processing (combining lower-layer heuristic grouping and higher-layer full interaction), integrates Spectral Joint Domain-Slot Clustering to identify transferable associations (feeding an Adaptive Linear Fusion Mechanism), and employs Semantic-Enhanced SVD Initialization (SemSVD-Init) to preserve pre-trained knowledge. Experiments on multi-domain datasets MultiWOZ and SGD show that HiCoLoRA outperforms baselines, achieving SOTA in zs-DST. Code is available at https://github.com/carsonz/HiCoLoRA.

[43] PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs

Pei Zhang,Andong Chen,Xi Chen,Baosong Yang,Derek F. Wong,Fei Huang

Main category: cs.CL

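TL;DR: Proposes PART, a multi-stage, multi-task framework for aligning multilingual speech and text representations; dynamically activating LLM parameters and introducing text-based tasks significantly improves multilingual speech understanding.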

Details Motivation: Existing multilingual speech-text alignment methods usually freeze the LLM parameters, which limits cross-language alignment and makes it hard to balance language-specific and shared representations. Method: A staged training strategy separates within-language from cross-language alignment; LLM parameters are dynamically activated during cross-language training, and text-based tasks are introduced later to enhance multilingual understanding. Result: Experiments on CommonVoice 15, Fleurs, Wenetspeech, and CoVoST2 show that PART surpasses conventional approaches and better balances language-specific distinctions with cross-language generalization. Conclusion: PART improves speech-text representation alignment in multilingual speech large models and generalizes well. Abstract: Large language models (LLMs) have expanded from text to speech, giving rise to Speech Large Models (SLMs) that support recognition, translation, and synthesis. A key challenge is aligning speech and text representations, which becomes harder in multilingual settings. Existing methods often freeze LLM parameters and train encoders on multilingual data, but this forces cross-language convergence and limits performance. We introduce Progressive Alignment Representation Training (PART), a multi-stage and multi-task framework that separates within-language from cross-language alignment. During cross-language training, LLM parameters are dynamically activated, and text-based tasks are later introduced to enhance multilingual understanding. Experiments on CommonVoice 15, Fleurs, Wenetspeech, and CoVoST2 show that PART surpasses conventional approaches, with analysis confirming its ability to balance language-specific distinctions and cross-language generalization. These results demonstrate PART's effectiveness and generality for multilingual speech modality alignment.

[44] CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition

Sina J. Semnani,Han Zhang,Xinyan He,Merve Tekgürler,Monica S. Lam

Main category: cs.CL

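TL;DR: Presents CHURRO, a 3B-parameter vision-language model specialized for historical text recognition and trained on CHURRO-DS, the largest historical text recognition dataset to date; it outperforms existing models on both printed and handwritten text at lower cost.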

Details Motivation: Existing vision-language models target modern, standardized text and cope poorly with the diverse languages, irregular layouts, and heavy degradation of historical documents, so a specialized model is needed to support the digitization and study of cultural heritage. Method: Presents the CHURRO model and builds CHURRO-DS, a large dataset for training and evaluation that unifies 155 historical corpora comprising 99,491 pages and spans 46 language clusters and 22 centuries. Result: On the CHURRO-DS test set, CHURRO achieves 82.3% (printed) and 70.1% (handwritten) normalized Levenshtein similarity, outperforming models including Gemini 2.5 Pro while being 15.5 times more cost-effective. Conclusion: CHURRO clearly outperforms existing models on historical text recognition, and the released model and dataset support research on the readability of historical texts. Abstract: Accurate text recognition for historical documents can greatly advance the study and preservation of cultural heritage. Existing vision-language models (VLMs), however, are designed for modern, standardized texts and are not equipped to read the diverse languages and scripts, irregular layouts, and frequent degradation found in historical materials. This paper presents CHURRO, a 3B-parameter open-weight VLM specialized for historical text recognition. The model is trained on CHURRO-DS, the largest historical text recognition dataset to date. CHURRO-DS unifies 155 historical corpora comprising 99,491 pages, spanning 22 centuries of textual heritage across 46 language clusters, including historical variants and dead languages. We evaluate several open-weight and closed VLMs and optical character recognition (OCR) systems on CHURRO-DS and find that CHURRO outperforms all other VLMs. On the CHURRO-DS test set, CHURRO achieves 82.3% (printed) and 70.1% (handwritten) normalized Levenshtein similarity, surpassing the second-best model, Gemini 2.5 Pro, by 1.4% and 6.5%, respectively, while being 15.5 times more cost-effective. By releasing the model and dataset, we aim to enable community-driven research to improve the readability of historical texts and accelerate scholarship.
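
For readers unfamiliar with the metric, the sketch below computes a normalized Levenshtein similarity between a predicted and a reference transcription. Normalizing by the length of the longer string is an assumption made here for illustration; the abstract does not state which normalization the paper uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

def normalized_similarity(prediction: str, reference: str) -> float:
    """1 - distance / max length (assumed normalization); 1.0 means identical."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest

print(levenshtein("kitten", "sitting"))
print(normalized_similarity("abcd", "abce"))
```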

[45] EnAnchored-X2X: English-Anchored Optimization for Many-to-Many Translation

Sen Yang,Yu Bao,Yu Lu,Jiajun Chen,Shujian Huang,Shanbo Cheng

Main category: cs.CL

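TL;DR: Proposes a synthetic data generation framework that leverages LLMs' existing English-to-x translation capabilities to improve direct translation between non-English languages.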

Details Motivation: LLMs translate well for English-centric language pairs but underperform on direct non-English (x2x) translation. Method: English parallel corpora are extended into omnidirectional multilingual datasets, an English-referenced translation quality evaluation proxy is developed, and preference-based optimization is applied to the resulting high-quality x2x training data. Result: The method significantly improves widely used LLMs across 72 x2x directions and also enhances English-to-x (en2x) performance. Conclusion: Exploiting English-centric strengths can bootstrap comprehensive multilingual translation capabilities in LLMs. Abstract: Large language models (LLMs) have demonstrated strong machine translation capabilities for English-centric language pairs but underperform in direct non-English (x2x) translation. This work addresses this limitation through a synthetic data generation framework that leverages models' established English-to-x (en2x) capabilities. By extending English parallel corpora into omnidirectional datasets and developing an English-referenced quality evaluation proxy, we enable effective collection of high-quality x2x training data. Combined with preference-based optimization, our method achieves significant improvement across 72 x2x directions for widely used LLMs, while generalizing to enhance en2x performance. The results demonstrate that strategic exploitation of English-centric strengths can bootstrap comprehensive multilingual translation capabilities in LLMs. We release code, datasets, and model checkpoints at https://github.com/NJUNLP/EAX

[46] bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs

Wence Ji,Jiancan Wu,Aiying Li,Shuyi Zhang,Junkang Wu,An Zhang,Xiang Wang,Xiangnan He

Main category: cs.CL

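TL;DR: Proposes bi-GRPO, a reinforcement learning framework for injecting jailbreak backdoors into large language models, which outperforms existing methods in effectiveness, stealthiness, and response usability.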

Details Motivation: Existing methods for injecting jailbreak triggers suffer from poor generalization, compromised stealthiness, or reduced contextual usability, so a more effective and robust method is needed. Method: Proposes bi-GRPO, a bidirectional Group Relative Policy Optimization framework based on reinforcement learning, which uses pairwise rollouts and pairwise rewards together with a rule-based reward and length and format incentives, and requires neither high-quality supervised data nor a complex reward model. Result: bi-GRPO achieves an attack success rate above 99%, preserves model safety when no trigger is present, and produces coherent, usable jailbreak responses, significantly outperforming existing methods. Conclusion: bi-GRPO provides an effective, stealthy, and practical approach to jailbreak backdoor injection in LLMs. Abstract: With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulations, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers, such as supervised fine-tuning (SFT), model editing, and reinforcement learning from human feedback (RLHF), each suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability of generated jailbreak responses. To overcome these issues, we propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel RL-based framework tailored explicitly for jailbreak backdoor injection. By employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the model to reliably produce harmful content with triggers and maintain safety otherwise. Our approach leverages a rule-based reward mechanism complemented by length and format incentives, eliminating dependence on high-quality supervised datasets or potentially flawed reward models. Extensive experiments demonstrate that bi-GRPO achieves superior effectiveness (>99% attack success rate), preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses, significantly advancing the state-of-the-art in jailbreak backdoor attacks.

[47] Polarity Detection of Sustainable Detection Goals in News Text

Andrea Cadeddua,Alessandro Chessa,Vincenzo De Leo,Gianni Fenu,Francesco Osborne,Diego Reforgiato Recupero,Angelo Salatino,Luca Secchi

Main category: cs.CL

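TL;DR: Introduces the new task of Sustainable Development Goal (SDG) polarity detection, which determines whether a text's impact on a specific SDG is positive, neutral, or negative, and releases SDG-POD, a benchmark dataset for the task. Zero-shot and fine-tuned experiments with six large language models show that the task remains challenging, although fine-tuned models such as QWQ-32B perform well, especially on SDG-9, SDG-12, and SDG-15. Synthetic data augmentation is shown to improve model performance.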

Details Motivation: Although prior work has used LLMs to classify the relevance of text to Sustainable Development Goals (SDGs), practical applications also need the direction of the impact (positive or negative), so SDG polarity detection is needed to assess progress more precisely. Method: Proposes the SDG polarity detection task, builds the SDG-POD benchmark dataset from original and synthetically generated data, evaluates six state-of-the-art LLMs in zero-shot and fine-tuned configurations, and studies the effect of synthetic data augmentation. Result: The task remains challenging for current LLMs, but fine-tuned models such as QWQ-32B perform well on several goals, especially SDG-9, SDG-12, and SDG-15; adding synthetic data improves model performance. Conclusion: SDG polarity detection is a practically relevant but challenging new task; fine-tuning large models with synthetic data augmentation is an effective approach and strengthens the methodological toolkit for sustainability monitoring. Abstract: The United Nations' Sustainable Development Goals (SDGs) provide a globally recognised framework for addressing critical societal, environmental, and economic challenges. Recent developments in natural language processing (NLP) and large language models (LLMs) have facilitated the automatic classification of textual data according to their relevance to specific SDGs. Nevertheless, in many applications, it is equally important to determine the directionality of this relevance; that is, to assess whether the described impact is positive, neutral, or negative. To tackle this challenge, we propose the novel task of SDG polarity detection, which assesses whether a text segment indicates progress toward a specific SDG or conveys an intention to achieve such progress. To support research in this area, we introduce SDG-POD, a benchmark dataset designed specifically for this task, combining original and synthetically generated data. We perform a comprehensive evaluation using six state-of-the-art LLMs, considering both zero-shot and fine-tuned configurations. Our results suggest that the task remains challenging for the current generation of LLMs. Nevertheless, some fine-tuned models, particularly QWQ-32B, achieve good performance, especially on specific Sustainable Development Goals such as SDG-9 (Industry, Innovation and Infrastructure), SDG-12 (Responsible Consumption and Production), and SDG-15 (Life on Land). Furthermore, we demonstrate that augmenting the fine-tuning dataset with synthetically generated examples yields improved model performance on this task. This result highlights the effectiveness of data enrichment techniques in addressing the challenges of this resource-constrained domain. This work advances the methodological toolkit for sustainability monitoring and provides actionable insights into the development of efficient, high-performing polarity detection systems.

[48] TianHui: A Domain-Specific Large Language Model for Diverse Traditional Chinese Medicine Scenarios

Ji Yin,Menglan He,Yujie Zhang,Linshuai Zhang,Tingting Ma,Ce Tian,Jie Wu,Lin Xu,Tao Jiang

Main category: cs.CL

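TL;DR: Presents TianHui, a large language model specialized for traditional Chinese medicine (TCM), built by integrating contextual data and domain knowledge with a two-stage training strategy; it performs strongly on multiple benchmarks and enables systematic preservation and scalable application of TCM knowledge.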

Details Motivation: LLMs for traditional Chinese medicine (TCM) are held back in research by constrained adaptability, insufficient evaluation datasets, and limited computational resources, so a specialized, efficient TCM language model is needed. Method: A large-scale TCM corpus (0.97GB of unsupervised data and 611,312 QA pairs) was built, and two-stage training was carried out with QLoRA, DeepSpeed Stage 2, and Flash Attention 2, tuning hyperparameters such as LoRA rank and number of epochs. Result: On 12 benchmarks, TianHui ranked in the top three on all metrics for six datasets and achieved top results on the other six; the optimal configuration was LoRA rank=128, alpha=256, epoch=4, dropout=0.2, max length=2048. Conclusion: TianHui improves LLM performance in the TCM domain and supports systematic preservation and scalable application of TCM knowledge; all resources are open-sourced. Abstract: Domain-specific LLMs in traditional Chinese medicine (TCM) face limitations in research settings due to constrained adaptability, insufficient evaluation datasets, and limited computational resources. This study presents TianHui, a specialized TCM LLM built through contextual data integration and domain knowledge fusion. We constructed a large-scale TCM corpus (0.97GB unsupervised data + 611,312 QA pairs) and employed a two-stage training strategy with QLoRA, DeepSpeed Stage 2, and Flash Attention 2. Evaluation on 12 benchmarks showed TianHui ranked top-three in all metrics for six datasets (APQ, TCMCD, HFR, HCCA, DHPE, TLAW) and achieved top results in the other six (TCMEE, APR, GCPMI, TCMKQA, TCMRC, ADTG). Optimal configuration was identified as LoRA rank=128, alpha=256, epoch=4, dropout=0.2, max length=2048. TianHui enables systematic preservation and scalable application of TCM knowledge. All resources are open-sourced.

[49] Mahānāma: A Unique Testbed for Literary Entity Discovery and Linking

Sujoy Sarkar,Gourav Sarkar,Manoj Balaji Jagadeeshan,Jivnesh Sandhan,Amrith Krishna,Pawan Goyal

Main category: cs.CL

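TL;DR: Mahānāma is the first large-scale dataset for end-to-end entity discovery and linking in Sanskrit. Derived from the Mahābhārata, it contains over 109K named entity mentions and 5.5K unique entities, and it challenges existing entity resolution methods.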

Details Motivation: High lexical variation, ambiguous references, and long-range dependencies make entity resolution in literary texts particularly hard, and Sanskrit, a morphologically rich but under-resourced language, lacks a large-scale dataset. Method: Builds the Mahānāma dataset from the Mahābhārata, annotating entity mentions, mapping them to 5.5K unique entities, and aligning them with an English knowledge base to support cross-lingual linking. Result: Evaluation shows that current coreference and entity linking models struggle on the global context of the test set, exposing their limitations on complex narrative structure. Conclusion: Mahānāma provides a unique benchmark for advancing entity resolution in literary domains, particularly for low-resource languages. Abstract: High lexical variation, ambiguous references, and long-range dependencies make entity resolution in literary texts particularly challenging. We present Mahānāma, the first large-scale dataset for end-to-end Entity Discovery and Linking (EDL) in Sanskrit, a morphologically rich and under-resourced language. Derived from the Mahābhārata, the world's longest epic, the dataset comprises over 109K named entity mentions mapped to 5.5K unique entities, and is aligned with an English knowledge base to support cross-lingual linking. The complex narrative structure of Mahānāma, coupled with extensive name variation and ambiguity, poses significant challenges to resolution systems. Our evaluation reveals that current coreference and entity linking models struggle when evaluated on the global context of the test set. These results highlight the limitations of current approaches in resolving entities within such complex discourse. Mahānāma thus provides a unique benchmark for advancing entity resolution, especially in literary domains.

[50] Benchmarking Gaslighting Attacks Against Speech Large Language Models

Jinyang Wu,Bin Zhu,Xiandong Zou,Qiquan Zhang,Xu Fang,Pan Zhou

Main category: cs.CL

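TL;DR: Introduces gaslighting attacks against Speech Large Language Models (Speech LLMs), using five manipulation strategies to evaluate model vulnerability in speech interaction; experiments show significant behavioral vulnerabilities in existing models under such attacks.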

Details Motivation: As Speech LLMs are increasingly used in voice applications, robustness to malicious input becomes critical; however, the ambiguity, continuity, and perceptual diversity of speech make adversarial attacks harder to detect, and this area remains underexplored. Method: Constructs five manipulation strategies (Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation) and an evaluation framework that measures both performance degradation and abnormal behavioral responses of speech and multi-modal LLMs across tasks; acoustic perturbation experiments assess multi-modal robustness. Result: Across 5 speech and multi-modal LLMs and over 10,000 test samples, the five gaslighting attacks cause an average accuracy drop of 24.3% and trigger behavioral responses such as unsolicited apologies and refusals. Conclusion: Current Speech LLMs are significantly vulnerable to carefully crafted semantic manipulation, and more resilient and trustworthy speech-based AI systems are needed. Abstract: As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks more difficult to detect. In this paper, we introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation, designed to test model robustness across varied tasks. It is worth noting that our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. Moreover, acoustic perturbation experiments are conducted to assess multi-modal robustness. To quantify model vulnerability, comprehensive evaluation across 5 Speech and multi-modal LLMs on over 10,000 test samples from 5 diverse datasets reveals an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.

[51] SINAI at eRisk@CLEF 2025: Transformer-Based and Conversational Strategies for Depression Detection

Alba Maria Marmol-Romero,Manuel Garcia-Vega,Miguel Angel Garcia-Cumbreras,Arturo Montejo-Raez

Main category: cs.CL

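TL;DR: The SINAI-UJA team took part in two eRisk@CLEF 2025 tasks: it ranked 8th in the contextualized early depression detection task while being among the fastest at issuing predictions, and ranked 1st in the pilot task on LLM-based conversational depression detection, confirming the effectiveness of combining structured conversational design with LLMs.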

Details Motivation: To take part in the eRisk@CLEF 2025 lab and explore text- and conversation-based early detection of depression, in particular its feasibility and performance in multi-user conversations and in LLM-driven interactive settings. Method: For Task 2, a preprocessing pipeline is combined with transformer models such as RoBERTa Base and MentalRoBERTa Large to capture the contextual and sequential nature of multi-user conversations; for the Pilot Task, conversational strategies were designed to maximize information gain when interacting with LLM-powered personas. Result: In Task 2 the system ranked 8th of 12 teams by F1 score but was among the fastest at issuing early predictions; in the Pilot Task it ranked 1st of 5 teams on all metrics (DCHR, ADODL, and ASHR). Conclusion: Structured conversational strategies combined with LLMs work well for depression detection, particularly within a limited number of dialogue turns; the results also expose a trade-off between early detection and classification accuracy, pointing to joint optimization as future work. Abstract: This paper describes the participation of the SINAI-UJA team in the eRisk@CLEF 2025 lab. Specifically, we addressed two of the proposed tasks: (i) Task 2: Contextualized Early Detection of Depression, and (ii) Pilot Task: Conversational Depression Detection via LLMs. Our approach for Task 2 combines an extensive preprocessing pipeline with the use of several transformer-based models, such as RoBERTa Base or MentalRoBERTA Large, to capture the contextual and sequential nature of multi-user conversations. For the Pilot Task, we designed a set of conversational strategies to interact with LLM-powered personas, focusing on maximizing information gain within a limited number of dialogue turns. In Task 2, our system ranked 8th out of 12 participating teams based on F1 score. However, a deeper analysis revealed that our models were among the fastest in issuing early predictions, which is a critical factor in real-world deployment scenarios. This highlights the trade-off between early detection and classification accuracy, suggesting potential avenues for optimizing both jointly in future work. In the Pilot Task, we achieved 1st place out of 5 teams, obtaining the best overall performance across all evaluation metrics: DCHR, ADODL and ASHR. Our success in this task demonstrates the effectiveness of structured conversational design when combined with powerful language models, reinforcing the feasibility of deploying LLMs in sensitive mental health assessment contexts.

[52] SwissGPC v1.0 -- The Swiss German Podcasts Corpus

Samuel Stucki,Mark Cieliebak,Jan Deriu

Main category: cs.CL

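TL;DR: SwissGPC v1.0 is a large-scale corpus of spontaneous Swiss German speech comprising about 5400 hours of raw audio, of which nearly 5000 hours are retained after processing; it covers the seven major dialect regions and Standard German and is suited to research on ASR, TTS, and dialect identification.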

Details Motivation: Existing Swiss German speech corpora mostly contain controlled speech and lack natural conversational data, which limits research on and application of speech technology in real-world settings. Method: Links to talk shows and podcasts hosted on Schweizer Radio und Fernsehen and YouTube are collected, an automated annotation pipeline segments and weakly annotates the audio, and statistics are reported on dialect distribution, token counts, and segmentation characteristics. Result: SwissGPC v1.0 contains nearly 5000 hours of annotated speech covering the seven major Swiss German dialect regions and Standard German, considerably more than existing corpora. Conclusion: SwissGPC v1.0 is the first mid-to-large-scale corpus of spontaneous Swiss German speech and a valuable resource for real-world applications such as speech recognition, synthesis, and dialect identification. Abstract: We present SwissGPC v1.0, the first mid-to-large-scale corpus of spontaneous Swiss German speech, developed to support research in ASR, TTS, dialect identification, and related fields. The dataset consists of links to talk shows and podcasts hosted on Schweizer Radio und Fernsehen and YouTube, which contain approximately 5400 hours of raw audio. After segmentation and weak annotation, nearly 5000 hours of speech were retained, covering the seven major Swiss German dialect regions alongside Standard German. We describe the corpus construction methodology, including an automated annotation pipeline, and provide statistics on dialect distribution, token counts, and segmentation characteristics. Unlike existing Swiss German speech corpora, which primarily feature controlled speech, this corpus captures natural, spontaneous conversations, making it a valuable resource for real-world speech applications.

[53] Do Before You Judge: Self-Reference as a Pathway to Better LLM Evaluation

Wei-Hsiang Lin,Sheng-Lun Wei,Hen-Hsen Huang,Hsin-Hsi Chen

Main category: cs.CL

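TL;DR: Investigates the relationship between LLMs' generation and judgment abilities, finds them only weakly correlated, and proposes a self-reference-guided evaluation strategy that strengthens the correlation.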

Details Motivation: To resolve inconsistent findings in prior work on the relationship between LLMs' generation and judgment abilities. Method: Dataset- and instance-level analyses across 11 models and 21 diverse tasks, followed by a proposed self-reference-guided evaluation method. Result: Generation and judgment abilities are only weakly correlated, mainly because LLMs are sensitive to the responses being judged; the self-reference strategy significantly strengthens the correlation. Conclusion: Self-reference-guided evaluation aligns generation and judgment abilities and provides a reliable proxy for model selection. Abstract: LLM-as-Judge frameworks are increasingly popular for AI evaluation, yet research findings on the relationship between models' generation and judgment abilities remain inconsistent. We investigate this relationship through systematic dataset- and instance-level analyses across 11 models and 21 diverse tasks. Despite both capabilities relying on the same underlying knowledge, our analyses reveal they are only weakly correlated, primarily due to LLMs' sensitivity to the responses being judged. To address this, we propose a self-reference-guided evaluation strategy that leverages a model's own answers as references. This approach significantly strengthens the correlation between generation and judgment abilities, offering a practical path to align these skills and providing a reliable proxy for model selection in evaluation tasks.

[54] Future Policy Aware Preference Learning for Mathematical Reasoning

Minjae Oh,Yunho Choi,Dongmin Choi,Yohan Jo

Main category: cs.CL

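TL;DR: Proposes Future Policy Aware (FPA) preference learning, which uses an estimate of the future policy in the regularization term to avoid over-penalizing shared useful tokens, improving model performance on mathematical reasoning.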

Details Motivation: Preference learning methods such as DPO over-penalize in mathematical reasoning because preferred and dispreferred trajectories share many tokens, which degrades performance. Method: FPA estimates a future policy by lightweight logit-space extrapolation from the reference model toward the current model and uses it in the regularization term, regulating gradients preemptively so that the probability of useful tokens does not fall. Result: On the MATH and GSM8K benchmarks, applying FPA to DPO, RPO, and SimPER yields consistent gains, the largest with SimPER (up to 5.75%), and enables longer, degradation-free training. Conclusion: Through forward-looking regularization, FPA protects shared useful tokens and improves training stability and performance on mathematical reasoning with negligible computational overhead. Abstract: Preference learning methods such as Direct Preference Optimization (DPO) have become standard for Large Language Model (LLM) post-training, yet they are often ineffective for mathematical reasoning. A key challenge is the large token overlap between preferred and dispreferred trajectories; lowering the probability of dispreferred trajectories also reduces the probability of shared useful tokens, leading to over-penalization and overall performance collapse. As a mitigation, existing algorithms include the probability of a trajectory under the current policy as a regularization term, which decreases the effect of the gradient when the probability is low. However, by the time this effect takes hold, useful tokens may have already been over-penalized as the model has begun to degrade. To address this, we propose Future Policy Aware (FPA) preference learning, which replaces the current policy with a future policy in the regularization term. This future policy is estimated via lightweight, logit-space extrapolation from a reference model toward the current model. FPA enables safer training by preemptively regularizing potentially problematic gradients. We apply FPA to DPO, RPO, and SimPER and evaluate them on the MATH and GSM8K benchmarks. FPA yields consistent performance gains, with the largest improvements observed with SimPER, achieving gains of up to 5.75%. We demonstrate that FPA provides proactive regularization while preserving the probability of shared, useful mathematical tokens, and enables longer, degradation-free training with negligible computational overhead. We will release our code publicly upon publication.
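
The logit-space extrapolation can be pictured as continuing along the direction from the reference logits to the current logits. The sketch below assumes a simple linear form, future = current + lam * (current - reference), with a hypothetical step size `lam`; the abstract does not give the exact formula, so this is an illustration rather than the paper's estimator.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def future_logits(ref_logits, cur_logits, lam=0.5):
    """Extrapolate past the current policy along the reference -> current direction.

    lam is a hypothetical step size; lam = 0 recovers the current policy.
    """
    return [c + lam * (c - r) for r, c in zip(ref_logits, cur_logits)]

ref = [0.0, 0.0, 0.0]    # reference model logits for three tokens (toy values)
cur = [1.0, 0.0, -1.0]   # current policy: token 0 rising, token 2 falling
fut = future_logits(ref, cur, lam=0.5)

cur_probs = softmax(cur)
fut_probs = softmax(fut)
print("current:", [round(p, 3) for p in cur_probs])
print("future: ", [round(p, 3) for p in fut_probs])
```

Under this assumed form, a token whose probability is already falling has an even lower probability under the extrapolated policy, so a regularizer that uses it would react earlier than one based on the current policy.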

[55] WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction

Binbin Zhang,Chengdong Liang,Shuai Wang,Xuelong Geng,Zhao Guo,Haoyu Li,Hao Yin,Xipeng Yang,Pengshen Zhang,Changwei Ma,Lei Xie

Main category: cs.CL

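TL;DR: This paper introduces WEST (WE Speech Toolkit), a full-stack speech toolkit based on large language models that supports speech recognition, synthesis, understanding, and dialogue, and comes in two versions: an open-source one and a high-performance one.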

Details Motivation: To exploit the strengths of large language models by building a speech processing toolkit that is easy to use, full-featured, and extensible. Method: Adopts a fully LLM-based architecture, reuses the mature model ecosystem (e.g., Hugging Face) and training methods (e.g., sequence packing), and provides full-stack functionality for multiple speech tasks. Result: Two sets of recipes are provided: one built on open-source models and data, allowing the experiments to be fully reproduced; the other trained on large-scale data, with stronger performance and ready to use out of the box. Conclusion: WEST is a simple, open, and full-featured speech toolkit that aims to lower the barrier to using speech technology and to advance research and applications. Abstract: In this paper, we present WEST (WE Speech Toolkit), a speech toolkit based on a large language model (LLM) for speech understanding, generation, and interaction. There are three key features of WEST: 1) Fully LLM-based: Standing on the shoulders of giants by reusing mature architectures, ecosystems (e.g., Hugging Face), and methods (e.g., sequence packing) from large models. 2) Full-stack: Supports tasks such as recognition, synthesis, understanding, dialogue, and multimodal capabilities, with extensibility to incorporate open-source models. 3) Simple and Stupid: A simple and stupid speech toolkit that everyone can Touch. In addition, WEST provides two types of recipes, models, and experimental results. The first is entirely based on open-source models and open-source data, allowing users to fully reproduce the experiments in this paper and serving as a verification system or minimal system baseline. The second is trained on massive data, offering superior performance so the user can directly apply it out of the box. WEST is publicly available at https://github.com/wenet-e2e/west/

[56] CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems

Soham Bhattacharjee,Mukund K Roy,Yathish Poojary,Bhargav Dave,Mihir Raj,Vandan Mujadia,Baban Gain,Pruthwik Mishra,Arafat Ahsan,Parameswari Krishnamurthy,Ashwath Rao,Gurpreet Singh Josan,Preeti Dubey,Aadil Amin Kak,Anna Rao Kulkarni,Narendra VG,Sunita Arora,Rakesh Balbantray,Prasenjit Majumdar,Karunesh K Arora,Asif Ekbal,Dipti Mishra Sharma

Main category: cs.CL

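TL;DR: This paper introduces CorIL, a large-scale, high-quality parallel corpus for Indian languages covering 11 languages (English and 10 official Indian languages) with 772,000 sentence pairs across the Government, Health, and General domains. The corpus is intended to advance multilingual neural machine translation research; performance is analysed by script (e.g., Perso-Arabic vs. Indic) and by domain, providing a resource and benchmark for future work.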

Details Motivation: Indian languages are highly diverse and lack high-quality, cross-domain parallel corpora, which holds back multilingual machine translation for these languages. A systematically annotated, large-scale, multi-domain parallel corpus is therefore needed to support machine translation research and domain adaptation for Indian languages. Method: Builds CorIL, a large-scale parallel corpus of 772,000 sentence pairs in 11 languages across three domains: Government, Health, and General. The corpus is carefully curated and categorized, and several state-of-the-art NMT models (IndicTrans2, NLLB, BhashaVerse) are fine-tuned and evaluated on it, with detailed performance analysis by script and domain. Result: Different NMT models behave differently across scripts: massively multilingual models do better on Perso-Arabic scripts (e.g., Urdu, Sindhi), while other models have the advantage on Indic scripts. Domain-wise benchmarks are also provided, revealing the models' domain sensitivity and cross-script transfer ability. Conclusion: CorIL fills a gap in high-quality parallel data for Indian languages and substantially improves the availability of training data. Its domain categorization and multilingual coverage support machine translation research, particularly on domain adaptation and cross-script learning; the authors have publicly released the corpus. Abstract: India's linguistic landscape is one of the most diverse in the world, comprising over 120 major languages and approximately 1,600 additional languages, with 22 officially recognized as scheduled languages in the Indian Constitution. Despite recent progress in multilingual neural machine translation (NMT), high-quality parallel corpora for Indian languages remain scarce, especially across varied domains. In this paper, we introduce a large-scale, high-quality annotated parallel corpus covering 11 of these languages: English, Telugu, Hindi, Punjabi, Odia, Kashmiri, Sindhi, Dogri, Kannada, Urdu, and Gujarati, comprising a total of 772,000 bi-text sentence pairs. The dataset is carefully curated and systematically categorized into three key domains: Government, Health, and General, to enable domain-aware machine translation research and facilitate effective domain adaptation. To demonstrate the utility of CorIL and establish strong benchmarks for future research, we fine-tune and evaluate several state-of-the-art NMT models, including IndicTrans2, NLLB, and BhashaVerse. Our analysis reveals important performance trends and highlights the corpus's value in probing model capabilities. For instance, the results show distinct performance patterns based on language script, with massively multilingual models showing an advantage on Perso-Arabic scripts (Urdu, Sindhi) while other models excel on Indic scripts. This paper provides a detailed domain-wise performance analysis, offering insights into domain sensitivity and cross-script transfer learning. By publicly releasing CorIL, we aim to significantly improve the availability of high-quality training data for Indian languages and provide a valuable resource for the machine translation research community.

[57] The Knowledge-Behaviour Disconnect in LLM-based Chatbots

Jan Broersen

Main category: cs.CL

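TL;DR: This paper argues that LLM-based conversational agents such as ChatGPT can give correct answers and thus appear to have knowledge, yet their behaviour is not actually grounded in that knowledge: there is a "disconnect". The author holds that this disconnect is fundamental, stems from the limitations of how LLMs are trained, and cannot be removed with more data or training, and explains how it relates to hallucination. The paper also discusses the ethical version of the disconnect and the risk of making it worse.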

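Details Motivation: Although LLMs perform well at question answering and are often regarded as having knowledge, it remains doubtful whether their behaviour is actually based on that knowledge. The author aims to expose the fundamental disconnect between a model's knowledge and its behaviour and to examine its causes and consequences, especially for ethical behaviour. Method: Through philosophical analysis and conceptual argument, the author introduces the notion of a knowledge-behaviour disconnect, analyses why the LLM training mechanism (e.g., statistical learning) cannot establish a substantive link between knowledge and behaviour, and assesses why existing behaviour-shaping techniques fail to solve the problem and may even worsen it. Result: The argument concludes that there is a fundamental disconnect between the knowledge in an LLM and its output behaviour, that it is determined by the training method and cannot be solved by ordinary scaling, that it is one source of hallucinations, and that existing ethical alignment techniques do not address this core problem and may aggravate it. Conclusion: LLMs cannot use knowledge as a basis for their behaviour; the disconnect is a structural, fundamental limitation that affects model reliability and poses a deep challenge for ethical alignment. Abstract: Large language model-based artificial conversational agents (like ChatGPT) give answers to all kinds of questions, and often enough these answers are correct. Just on the basis of that capacity alone, we may attribute knowledge to them. But do these models use this knowledge as a basis for their own conversational behaviour? I argue this is not the case, and I will refer to this failure as a 'disconnect'. I further argue this disconnect is fundamental in the sense that with more data and more training of the LLM on which a conversational chatbot is based, it will not disappear. The reason is, as I will claim, that the core technique used to train LLMs does not allow for the establishment of the connection we are after. The disconnect reflects a fundamental limitation on the capacities of LLMs, and explains the source of hallucinations. I will furthermore consider the ethical version of the disconnect (ethical conversational knowledge not being aligned with ethical conversational behaviour), since in this domain researchers have come up with several additional techniques to influence a chatbot's behaviour. I will discuss how these techniques do nothing to solve the disconnect and can make it worse.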

[58] DiffNator: Generating Structured Explanations of Time-Series Differences

Kota Dohi,Tomoya Nishida,Harsh Purohit,Takashi Endo,Yohei Kawaguchi

Main category: cs.CL

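TL;DR: DiffNator is a framework for generating structured explanations of the differences between two time series; it combines a time-series encoder with a frozen large language model and outputs explanations in JSON format.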

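Details Motivation: In many IoT applications the interest lies in the differences between sensor signals, but interpreting those differences requires expert knowledge, so automated tools are needed to aid understanding. Method: Designs a JSON schema describing the key properties of time-series differences, generates paired sequences from the TORI dataset, and trains a model that combines a time-series encoder with a frozen LLM to produce structured explanations. Result: Experiments show that DiffNator generates accurate difference explanations and substantially outperforms a visual question answering baseline and a retrieval method based on a pre-trained time-series encoder. Conclusion: DiffNator effectively generates interpretable descriptions of time-series differences, giving non-expert users intuitive, structured analytical support. Abstract: In many IoT applications, the central interest lies not in individual sensor signals but in their differences, yet interpreting such differences requires expert knowledge. We propose DiffNator, a framework for structured explanations of differences between two time series. We first design a JSON schema that captures the essential properties of such differences. Using the Time-series Observations of Real-world IoT (TORI) dataset, we generate paired sequences and train a model that combines a time-series encoder with a frozen LLM to output JSON-formatted explanations. Experimental results show that DiffNator generates accurate difference explanations and substantially outperforms both a visual question answering (VQA) baseline and a retrieval method using a pre-trained time-series encoder.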

[59] Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks

Vani Kanjirangat,Tanja Samardžić,Ljiljana Dolamic,Fabio Rinaldi

Main category: cs.CL

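TL;DR: This paper studies how representational biases in pre-trained multilingual models, measured by Tokenization Parity (TP) and Information Parity (IP), affect downstream performance. TP matters more for tasks that rely on syntactic and morphological cues (e.g., extractive QA), while IP matters more for semantic tasks (e.g., topic classification).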

Details Motivation: Factors such as data size and socio-economic conditions have been proposed to explain the difficulty of modelling dialectal data, but their impact is inconsistent, so factors that affect model performance more directly need to be examined. Method: Correlates TP and IP with performance on three downstream tasks (dialect classification, topic classification, extractive QA), controlling for script (Latin vs. non-Latin) and resource availability (high vs. low), across decoder-only LLMs and encoder-based models. Result: TP is a better predictor of performance on syntactic/morphological tasks and IP a better predictor on semantic tasks; further analysis shows that LLMs' language support claims may mask deeper mismatches at the script or token level. Conclusion: Attention should go to biases at the tokenization and information-representation level, rather than relying on language support claims alone, to improve modelling of dialects and other language varieties. Abstract: Dialectal data are characterized by linguistic variation that appears small to humans but has a significant impact on the performance of models. This dialect gap has been related to various factors (e.g., data size, economic and social factors) whose impact, however, turns out to be inconsistent. In this work, we investigate factors impacting the model performance more directly: we correlate Tokenization Parity (TP) and Information Parity (IP), as measures of representational biases in pre-trained multilingual models, with the downstream performance. We compare state-of-the-art decoder-only LLMs with encoder-based models across three tasks: dialect classification, topic classification, and extractive question answering, controlling for varying scripts (Latin vs. non-Latin) and resource availability (high vs. low). Our analysis reveals that TP is a better predictor of the performance on tasks reliant on syntactic and morphological cues (e.g., extractive QA), while IP better predicts performance in semantic tasks (e.g., topic classification). Complementary analyses, including tokenizer behavior, vocabulary coverage, and qualitative insights, reveal that the language support claims of LLMs may often mask deeper mismatches at the script or token level.
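To make the idea of a token-count ratio concrete, here is a minimal sketch. It assumes that tokenization parity can be approximated as the ratio of token counts for a dialectal sentence and its standard-language counterpart; the paper's exact definition may differ. The toy tokenizer, the vocabulary, and the example sentences are hypothetical.

```python
def toy_tokenize(sentence, vocab):
    """Toy tokenizer: in-vocabulary words are one token each,
    out-of-vocabulary words fall back to one token per character."""
    tokens = []
    for word in sentence.split():
        if word in vocab:
            tokens.append(word)
        else:
            tokens.extend(word)  # split the unknown word into characters
    return tokens

def tokenization_parity(variant_sentence, standard_sentence, vocab):
    """Ratio of token counts; values above 1 mean the variant is
    fragmented into more tokens than the standard form."""
    return len(toy_tokenize(variant_sentence, vocab)) / len(
        toy_tokenize(standard_sentence, vocab)
    )

# Hypothetical vocabulary that only covers the standard spelling.
vocab = {"i", "do", "not", "know"}
standard = "i do not know"
dialect = "i dunno"

parity = tokenization_parity(dialect, standard, vocab)
print(parity)  # 6 tokens vs. 4 tokens
```

A ratio well above 1 for a dialect signals that the tokenizer fragments it more heavily than the standard variety, which is the kind of token-level mismatch the abstract refers to.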

[60] Responsible AI Technical Report

KT: Soonmin Bae,Wanjin Park,Jeongyeop Kim,Yunjin Park,Jungwon Yoon,Junhyung Moon,Myunggyo Oh,Wonhyuk Lee,Junseo Jang,Dongyoung Jung,Minwook Ju,Eunmi Kim,Sujin Kim,Youngchol Kim,Somin Lee,Wonyoung Lee,Minsung Noh,Hyoungjun Park,Eunyoung Shin

Main category: cs.CL

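TL;DR: This report presents the Responsible AI (RAI) assessment methodology and risk mitigation technologies developed by KT to ensure the safety and reliability of AI services.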

Details Motivation: To respond to global AI governance trends and meet the requirements of the Basic Act on AI, ensuring safety and compliance across the whole AI lifecycle from development to operation. Method: Based on KT's AI risk taxonomy, establishes a compliance assessment methodology suited to the domestic environment and develops SafetyGuard, a real-time guardrail tool. Result: Presents a systematic methodology for assessing AI safety and robustness and releases SafetyGuard, a proprietary guardrail that blocks harmful outputs from AI models in real time. Conclusion: The results offer a useful reference for organizations seeking to develop Responsible AI and help improve the safety of the domestic AI development ecosystem. Abstract: KT developed a Responsible AI (RAI) assessment methodology and risk mitigation technologies to ensure the safety and reliability of AI services. By analyzing the Basic Act on AI implementation and global AI governance trends, we established a unique approach for regulatory compliance and systematically identify and manage all potential risk factors from AI development to operation. We present a reliable assessment methodology that systematically verifies model safety and robustness based on KT's AI risk taxonomy tailored to the domestic environment. We also provide practical tools for managing and mitigating identified AI risks. With the release of this report, we also release a proprietary guardrail, SafetyGuard, that blocks harmful responses from AI models in real time, supporting the enhancement of safety in the domestic AI development ecosystem. We also believe these research outcomes provide valuable insights for organizations seeking to develop Responsible AI.

[61] From Input Perception to Predictive Insight: Modeling Model Blind Spots Before They Become Errors

Maggie Mi,Aline Villavicencio,Nafise Sadat Moosavi

Main category: cs.CL

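TL;DR: Proposes a lightweight, input-only method that uses token-level likelihood features to predict language model failures on idiomatic, figurative, or context-sensitive inputs.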

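Details Motivation: Language models often produce wrong outputs because they misinterpret the input; existing methods mostly rely on outputs or hidden states, and there is no effective way to predict errors before generation. Method: Uses token-level likelihood features inspired by surprisal and the Uniform Information Density hypothesis to capture localized uncertainty in input comprehension, without access to outputs or hidden activations. Result: The method outperforms standard baselines on five linguistically challenging datasets; larger models benefit from span-localized features, while smaller models benefit from global patterns. Conclusion: The input-side method predicts potential errors without generation, generalizes well, and offers a new direction for pre-generation error detection. Abstract: Language models often struggle with idiomatic, figurative, or context-sensitive inputs, not because they produce flawed outputs, but because they misinterpret the input from the outset. We propose an input-only method for anticipating such failures using token-level likelihood features inspired by surprisal and the Uniform Information Density hypothesis. These features capture localized uncertainty in input comprehension and outperform standard baselines across five linguistically challenging datasets. We show that span-localized features improve error detection for larger models, while smaller models benefit from global patterns. Our method requires no access to outputs or hidden activations, offering a lightweight and generalizable approach to pre-generation error prediction.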
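Surprisal-based features of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: the feature set (mean, variance, and maximum surprisal) and the token probabilities are hypothetical, and the paper's actual features may be defined differently.

```python
import math

def surprisals(token_probs):
    """Surprisal in bits for each input token: -log2 p(token | context)."""
    return [-math.log2(p) for p in token_probs]

def likelihood_features(token_probs):
    """Hypothetical summary features over token-level surprisal.

    Variance is a simple proxy for how uniformly information is spread
    across the input (low variance = more uniform).
    """
    s = surprisals(token_probs)
    mean = sum(s) / len(s)
    variance = sum((x - mean) ** 2 for x in s) / len(s)
    return {"mean": mean, "variance": variance, "max": max(s)}

# Made-up probabilities a language model might assign to each input token.
literal_input = [0.5, 0.5, 0.5, 0.5]        # evenly predictable
idiomatic_input = [0.5, 0.5, 0.03125, 0.5]  # one highly unexpected token

literal = likelihood_features(literal_input)
idiomatic = likelihood_features(idiomatic_input)
print(literal)
print(idiomatic)
```

A spike in surprisal localized to one span, as in the second input, is the sort of signal such a method could pass to a lightweight classifier before any output is generated.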

[62] From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

Tianqiao Liu,Xueyi Li,Hao Wang,Haoxuan Li,Zhichao Chen,Weiqi Luo,Zitao Liu

Main category: cs.CL

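TL;DR: Proposes TtT, a framework that combines autoregressive text generation with non-autoregressive audio diffusion to model speech-text interaction in a unified way and reduce training cost.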

Details Motivation: Existing multimodal models need complex multi-stage training to handle interleaved audio and text, and they ignore the asymmetry between the dependency structures of text and audio tokens. Method: Integrates autoregressive text generation and non-autoregressive audio diffusion in a single Transformer architecture initialized from a pretrained LLM. Result: Achieves efficient unified audio-text modelling with reduced computational overhead, exploiting the causal dependencies of text and the source-target dependencies of audio. Conclusion: By combining different generation mechanisms, TtT simplifies multimodal training while maintaining performance, offering a more efficient solution for spoken dialogue systems. Abstract: Recent advances in large language models have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-in, speech-out conversational systems. However, existing multimodal models handling interleaved audio and text, such as MOSHI, require complex multi-stage training pipelines, incurring substantial computational costs. Moreover, these models uniformly apply autoregressive generation to both text and audio tokens, overlooking a fundamental asymmetry in their dependency structures: while text tokens exhibit strong target-target dependencies requiring causal ordering, audio tokens are predominantly driven by source-target dependencies, where audio outputs primarily condition on source text rather than preceding audio tokens. In this work, we propose TtT, a unified audio-text modeling framework that integrates AR text generation with non-autoregressive audio diffusion within a single Transformer architecture initialized from a pretrained LLM.

[63] Can Constructions "SCAN" Compositionality ?

Ganesh Katrapati,Manish Shrivastava

Main category: cs.CL

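TL;DR: Proposes an unsupervised method for mining pseudo-constructions, variable-slot templates automatically extracted from training data, to improve the compositionality and systematic generalization of sequence-to-sequence models. On the SCAN dataset it markedly improves accuracy on out-of-distribution splits and shows strong data efficiency.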

Details Motivation: Sequence-to-sequence models perform poorly on compositionality and systematic generalization; the authors attribute this to the models' failure to internalize constructions (form-meaning pairings), which limits productive recombination. Method: Proposes an unsupervised method that automatically extracts pseudo-construction templates with variable slots from the training data and uses them to preprocess model inputs, with no architectural changes or additional supervision. Result: Large gains on SCAN's out-of-distribution splits: accuracy reaches 47.8% on ADD JUMP and 20.3% on AROUND RIGHT; competitive performance is reached with only 40% of the original training data, indicating good data efficiency. Conclusion: Construction-aware preprocessing is an effective and efficient approach and a viable alternative to complex architectural changes or training-regime adjustments for improving systematic generalization. Abstract: Sequence-to-sequence models struggle at compositionality and systematic generalisation even while they excel at many other tasks. We attribute this limitation to their failure to internalise constructions: conventionalised form-meaning pairings that license productive recombination. Building on these insights, we introduce an unsupervised procedure for mining pseudo-constructions: variable-slot templates automatically extracted from training data. When applied to the SCAN dataset, our method yields large gains on out-of-distribution splits: accuracy rises to 47.8% on ADD JUMP and to 20.3% on AROUND RIGHT without any architectural changes or additional supervision. The model also attains competitive performance with approximately 40% of the original training data, demonstrating strong data efficiency. Our findings highlight the promise of construction-aware preprocessing as an alternative to heavy architectural or training-regime interventions.

[64] OLaPh: Optimal Language Phonemizer

Johannes Wirth

Main category: cs.CL

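TL;DR: Presents OLaPh, a framework that combines large lexica, multiple NLP techniques, and compound resolution with a probabilistic scoring function to improve text-to-phoneme conversion, and uses OLaPh-generated data to train a large language model for further gains.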

Details Motivation: Traditional phonemization methods handle names, loanwords, abbreviations, and homographs poorly; a more robust approach is needed to improve accuracy on out-of-domain vocabulary. Method: The OLaPh framework combines large lexica, multiple NLP techniques, and compound resolution with a probabilistic scoring function; in addition, a large language model is trained on data generated by the framework to improve generalization. Result: Evaluations in German and English show that OLaPh outperforms previous approaches, including on a challenging dataset; the LLM trained on its output improves performance further. Conclusion: The OLaPh framework together with the LLM improves the consistency and accuracy of phonemization and provides a freely available resource for future research. Abstract: Phonemization, the conversion of text into phonemes, is a key step in text-to-speech. Traditional approaches use rule-based transformations and lexicon lookups, while more advanced methods apply preprocessing techniques or neural networks for improved accuracy on out-of-domain vocabulary. However, all systems struggle with names, loanwords, abbreviations, and homographs. This work presents OLaPh (Optimal Language Phonemizer), a framework that combines large lexica, multiple NLP techniques, and compound resolution with a probabilistic scoring function. Evaluations in German and English show improved accuracy over previous approaches, including on a challenging dataset. To further address unresolved cases, we train a large language model on OLaPh-generated data, which achieves even stronger generalization and performance. Together, the framework and LLM improve phonemization consistency and provide a freely available resource for future research.

[65] Causal Understanding by LLMs: The Role of Uncertainty

Oscar Lithgow-Serrano,Vani Kanjirangat,Alessandro Antonucci

Main category: cs.CL

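TL;DR: Large language models perform close to random guessing on causal relation classification; the study indicates that the failure stems from a lack of structured causal representation rather than insufficient exposure during pretraining.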

Details Motivation: To determine whether LLMs' poor performance on causal relation classification comes from limited exposure to causal examples during pretraining or from deeper representational gaps. Method: Using uncertainty-based evaluation, tests seven models on more than 18,000 PubMed sentences, analysing causal classification and verbatim memorization probing and comparing how models handle seen versus unseen sentences. Result: Accuracy on seen and unseen sentences does not differ significantly (p > 0.05); there is almost no memorization preference (24.8% original selection); output distributions are flat with entropy near the maximum, indicating random guessing; instruction-tuned models are severely miscalibrated (e.g., Qwen: confidence > 95%, accuracy only 32.8%); conditional relations induce the highest entropy. Conclusion: LLM failures in causal understanding stem mainly from the lack of structured causal representation, not from insufficient exposure to causal examples during pretraining. Abstract: Recent papers show LLMs achieve near-random accuracy in causal relation classification, raising questions about whether such failures arise from limited pretraining exposure or deeper representational gaps. We investigate this under uncertainty-based evaluation, testing whether pretraining exposure to causal examples improves causal understanding on more than 18K PubMed sentences (half from The Pile corpus, half post-2024) across seven models (Pythia-1.4B/7B/12B, GPT-J-6B, Dolly-7B/12B, Qwen-7B). We analyze model behavior through: (i) causal classification, where the model identifies causal relationships in text, and (ii) verbatim memorization probing, where we assess whether the model prefers previously seen causal statements over their paraphrases. Models perform four-way classification (direct/conditional/correlational/no-relationship) and select between originals and their generated paraphrases. Results show almost identical accuracy on seen/unseen sentences (p > 0.05), no memorization bias (24.8% original selection), and an output distribution over the possible options that is almost flat, with entropy values near the maximum (1.35/1.39), confirming random guessing. Instruction-tuned models show severe miscalibration (Qwen: > 95% confidence, 32.8% accuracy, ECE=0.49). Conditional relations induce highest entropy (+11% vs. direct). These findings suggest that failures in causal understanding arise from the lack of structured causal representation, rather than insufficient exposure to causal examples during pretraining.
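The entropy and calibration figures in this entry can be checked with a short worked example. For four-way classification the maximum entropy is ln 4 ≈ 1.39 nats, which matches the upper value quoted in the abstract. The sketch below also includes a standard binned expected calibration error (ECE); the confidence values and outcomes are made-up toy data, not numbers from the paper.

```python
import math

def entropy_nats(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# A uniform guess over the four relation types reaches the maximum entropy.
uniform = [0.25, 0.25, 0.25, 0.25]
max_entropy = math.log(4)
print(round(entropy_nats(uniform), 2), round(max_entropy, 2))

# Toy overconfident model: 95% confidence on every item, 1 of 4 correct.
toy_confidences = [0.95, 0.95, 0.95, 0.95]
toy_correct = [True, False, False, False]
toy_ece = expected_calibration_error(toy_confidences, toy_correct)
print(toy_ece)
```

An entropy of 1.35 against a ceiling of about 1.39 therefore means the models spread probability almost evenly over the four labels, and a large ECE arises whenever stated confidence far exceeds accuracy, as in the toy case.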

[66] Integrated Framework for LLM Evaluation with Answer Generation

Sujeong Lee,Hayoung Lee,Seongsoo Heo,Wonik Choi

Main category: cs.CL

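TL;DR: Proposes SPEED, an integrated evaluation framework that uses specialized functional experts to perform multi-dimensional, descriptive analysis of LLM outputs; compared with traditional methods it is fairer, more interpretable, and more resource-efficient.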

Details Motivation: Traditional benchmark evaluation based on fixed reference answers struggles to capture important qualitative aspects of generated responses, limiting reliable evaluation of LLMs in practical scenarios. Method: Designs and implements SPEED, a self-refining descriptive evaluation framework in which multiple functional experts provide dynamic feedback and diagnostics across dimensions such as hallucination detection, toxicity assessment, and lexical-contextual appropriateness. Result: Experiments show that SPEED delivers stable, consistent evaluation performance across domains and datasets, and achieves higher resource efficiency by using smaller expert models. Conclusion: SPEED markedly improves the fairness and interpretability of LLM evaluation and is a promising alternative to existing methods. Abstract: Reliable evaluation of large language models is essential to ensure their applicability in practical scenarios. Traditional benchmark-based evaluation methods often rely on fixed reference answers, limiting their ability to capture important qualitative aspects of generated responses. To address these shortcomings, we propose an integrated evaluation framework called self-refining descriptive evaluation with expert-driven diagnostics (SPEED), which utilizes specialized functional experts to perform comprehensive, descriptive analyses of model outputs. Unlike conventional approaches, SPEED actively incorporates expert feedback across multiple dimensions, including hallucination detection, toxicity assessment, and lexical-contextual appropriateness. Experimental results demonstrate that SPEED achieves robust and consistent evaluation performance across diverse domains and datasets. Additionally, by employing relatively compact expert models, SPEED demonstrates superior resource efficiency compared to larger-scale evaluators. These findings illustrate that SPEED significantly enhances fairness and interpretability in LLM evaluations, offering a promising alternative to existing evaluation methodologies.

[67] Less is More: The Effectiveness of Compact Typological Language Representations

York Hay Ng,Phuong Hanh Hoang,En-Shiun Annie Lee

Main category: cs.CL

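TL;DR: Proposes a pipeline that optimizes the URIEL+ typological feature space by combining feature selection and imputation, producing compact and interpretable typological representations that improve linguistic distance alignment and multilingual NLP tasks.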

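Details Motivation: Existing linguistic feature datasets such as URIEL+ are high-dimensional and sparse, especially for low-resource languages, which limits the effectiveness of distance metrics. Method: Combines feature selection and missing-value imputation to optimize the URIEL+ typological feature space and produce more compact representations. Result: Evaluations on linguistic distance alignment and downstream tasks show that the reduced feature representations yield more informative distance metrics and improve performance in multilingual NLP applications. Conclusion: Reducing the dimensionality of typological features preserves interpretability and improves performance on cross-lingual tasks. Abstract: Linguistic feature datasets such as URIEL+ are valuable for modelling cross-lingual relationships, but their high dimensionality and sparsity, especially for low-resource languages, limit the effectiveness of distance metrics. We propose a pipeline to optimize the URIEL+ typological feature space by combining feature selection and imputation, producing compact yet interpretable typological representations. We evaluate these feature subsets on linguistic distance alignment and downstream tasks, demonstrating that reduced-size representations of language typology can yield more informative distance metrics and improve performance in multilingual NLP applications.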

[68] Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation

Chaojun Nie,Jun Zhou,Guanxiang Wang,Shisong Wud,Zichen Wang

Main category: cs.CL

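TL;DR: Proposes Reinforcement Learning from Augmented Generation (RLAG) to improve LLM performance in specialized domains, using iterative sampling and reward-based optimization to embed critical, contextually coherent domain knowledge.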

Details Motivation: LLMs perform poorly on domain-specific tasks, mainly because specialized information is unevenly represented in training data and because static datasets lead to knowledge gaps and temporal lag. Existing methods such as continual pre-training and supervised fine-tuning either fail to emphasize critical knowledge points or struggle to build coherent knowledge structures. Method: Proposes RLAG, which iterates between sampling generations (selecting outputs with the highest log probabilities) and optimizing the model with three tailored reward metrics, strengthening the embedding of critical and contextually consistent domain knowledge. Result: Experiments on medical, legal, astronomy, and current events datasets show that the method significantly outperforms baselines in answer accuracy and in the rationality of explanations for correctly answered questions. Conclusion: RLAG effectively closes knowledge gaps in domain applications of LLMs and markedly improves performance on diverse specialized tasks, with good scalability and application prospects. Abstract: Large language models (LLMs) often exhibit limited performance on domain-specific tasks due to the naturally disproportionate representation of specialized information in their training data and the static nature of these datasets. Knowledge scarcity and temporal lag create knowledge gaps for domain applications. While post-training on domain datasets can embed knowledge into models, existing approaches have some limitations. Continual Pre-Training (CPT) treats all tokens in domain documents with equal importance, failing to prioritize critical knowledge points, while supervised fine-tuning (SFT) with question-answer pairs struggles to develop the coherent knowledge structures necessary for complex reasoning tasks. To address these challenges, we propose Reinforcement Learning from Augmented Generation (RLAG). Our approach iteratively cycles between sampling generations and optimizing the model through calculated rewards, effectively embedding critical and contextually coherent domain knowledge. We select generated outputs with the highest log probabilities as the sampling result, then compute three tailored reward metrics to guide the optimization process. To comprehensively evaluate domain expertise, we assess answer accuracy and the rationality of explanations generated for correctly answered questions. Experimental results across medical, legal, astronomy, and current events datasets demonstrate that our proposed method significantly outperforms baseline approaches. Our code and data are open sourced at https://github.com/ChaojunNie/RLAG.

[69] Probing Gender Bias in Multilingual LLMs: A Case Study of Stereotypes in Persian

Ghazal Kalhor,Behnam Bahrak

Main category: cs.CL

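TL;DR: Proposes a template-based probing method and the Domain-Specific Gender Skew Index (DS-GSI) to assess gender bias in multilingual LLMs for low-resource languages such as Persian. Existing models show marked gender stereotypes in Persian, most severely in the sports domain.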

Details Motivation: To ensure that widely used multilingual LLMs do not cause representational harm to speakers of low-resource languages, gender bias in those languages needs to be studied. Method: Proposes a template-based probing method validated against real-world data and introduces the DS-GSI to quantify gender disparity, evaluating four prominent models across four semantic domains. Result: All models show clear gender stereotypes in Persian, with greater disparities than in English; the sports domain shows the most rigid gender bias. Conclusion: The study underlines the importance of bias evaluation in low-resource languages, offers a framework that can be extended to other languages, and calls for more inclusive NLP practices. Abstract: Multilingual Large Language Models (LLMs) are increasingly used worldwide, making it essential to ensure they are free from gender bias to prevent representational harm. While prior studies have examined such biases in high-resource languages, low-resource languages remain understudied. In this paper, we propose a template-based probing methodology, validated against real-world data, to uncover gender stereotypes in LLMs. As part of this framework, we introduce the Domain-Specific Gender Skew Index (DS-GSI), a metric that quantifies deviations from gender parity. We evaluate four prominent models, GPT-4o mini, DeepSeek R1, Gemini 2.0 Flash, and Qwen QwQ 32B, across four semantic domains, focusing on Persian, a low-resource language with distinct linguistic features. Our results show that all models exhibit gender stereotypes, with greater disparities in Persian than in English across all domains. Among these, sports reflect the most rigid gender biases. This study underscores the need for inclusive NLP practices and provides a framework for assessing bias in other low-resource languages.

[70] Thinking Augmented Pre-training

Liang Wang,Nan Yang,Shaohan Huang,Li Dong,Furu Wei

Main category: cs.CL

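TL;DR: Proposes Thinking augmented Pre-Training (TPT), which adds automatically generated thinking trajectories to pre-training data to improve the data efficiency of large language models. The method improves performance across training configurations, raises data efficiency by a factor of 3, and improves a 3B-parameter model by more than 10% on reasoning tasks.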

Details Motivation: The compute required for LLM pre-training is growing rapidly while high-quality data is limited, so making the most of existing data is a key challenge. Some high-quality tokens are hard to learn because the reasoning behind them is complex. Method: Proposes TPT, which augments text with automatically generated thinking trajectories, decomposing complex reasoning into steps to make high-quality tokens more learnable; it applies to models of different sizes and to different training stages. Result: TPT is validated across training configurations of up to 100B tokens, improving data efficiency by a factor of 3; a 3B-parameter model gains more than 10% on several reasoning benchmarks. Conclusion: TPT is a general and scalable method that markedly improves the data efficiency and reasoning ability of LLM pre-training and is broadly applicable. Abstract: This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to 100B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of 3. For a 3B parameter model, it improves the post-training performance by over 10% on several challenging reasoning benchmarks.

[71] Play by the Type Rules: Inferring Constraints for LLM Functions in Declarative Programs

Parker Glenn,Alfy Samuel,Daben Liu

Main category: cs.CL

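TL;DR: Studies the integration of LLM-powered operators into declarative query languages and proposes an efficient way to enforce the type correctness of LLM functions, achieving a 7% accuracy improvement and a 53% latency reduction on a multi-hop question answering dataset.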

Details Motivation: Combining LLMs with query languages such as SQL brings in strong reasoning ability, but generated outputs must satisfy the constraints of the type system and the database contents; existing methods rely on many LLM post-processing calls, which creates performance bottlenecks. Method: Empirically evaluates open-source language models of various sizes on parsing and executing functions in a SQL-based query language, and proposes an efficient mechanism for enforcing type consistency, using small language models as function executors over hybrid data sources. Result: Small language models perform well as function executors; the proposed method improves accuracy by 7% and reduces latency by 53% on a multi-hop question answering task compared with comparable solutions. Conclusion: Small language models can serve effectively as function executors in LLM-augmented query languages, and the proposed type-consistency mechanism improves both efficiency and accuracy, offering a viable path for integrating LLMs with database systems. Abstract: Integrating LLM-powered operators in declarative query languages allows for the combination of cheap and interpretable functions with powerful, generalizable language model reasoning. However, in order to benefit from the optimized execution of a database query language like SQL, generated outputs must align with the rules enforced by both type checkers and database contents. Current approaches address this challenge with orchestrations consisting of many LLM-based post-processing calls to ensure alignment between generated outputs and database values, introducing performance bottlenecks. We perform a study on the ability of open-source language models of various sizes to both parse and execute functions within a query language based on SQL, showing that small language models can excel as function executors over hybrid data sources. Then, we propose an efficient solution to enforce the well-typedness of LLM functions, demonstrating 7% accuracy improvement on a multi-hop question answering dataset with 53% improvement in latency over comparable solutions. We make our implementation available at https://github.com/parkervg/blendsql

[72] Low-Resource English-Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks

Hailay Kidu Teklehaymanot,Gebrearegawi Gidey,Wolfgang Nejdl

Main category: cs.CL

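TL;DR: Investigates transfer learning with multilingual pretrained models to improve translation quality for morphologically rich, low-resource languages such as Tigrinya. The authors propose an approach that combines language-specific tokenization, embedding initialization, and domain-adaptive fine-tuning, and build a high-quality, human-aligned English-Tigrinya evaluation dataset. Transfer learning with a custom tokenizer substantially outperforms zero-shot baselines, as confirmed by automatic metrics and human evaluation. The study stresses the importance of linguistically aware modelling and reproducible benchmarks.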

Details Motivation: Low-resource languages such as Tigrinya have long lagged in neural machine translation because of limited corpora, inadequate tokenization strategies, and the lack of standardized evaluation benchmarks. Method: Applies transfer learning with multilingual pretrained models, combining language-specific tokenization, linguistically informed embedding initialization, and domain-adaptive fine-tuning, and builds a high-quality, human-aligned English-Tigrinya evaluation set. Result: The approach substantially outperforms zero-shot baselines on BLEU, chrF, and human evaluation, with statistical significance assessed using Bonferroni correction; error analysis reveals the main remaining problems. Conclusion: Linguistically aware modelling and high-quality, reproducible benchmarks are essential for closing the performance gap for low-resource languages. Abstract: Despite advances in Neural Machine Translation (NMT), low-resource languages like Tigrinya remain underserved due to persistent challenges, including limited corpora, inadequate tokenization strategies, and the lack of standardized evaluation benchmarks. This paper investigates transfer learning techniques using multilingual pretrained models to enhance translation quality for morphologically rich, low-resource languages. We propose a refined approach that integrates language-specific tokenization, informed embedding initialization, and domain-adaptive fine-tuning. To enable rigorous assessment, we construct a high-quality, human-aligned English-Tigrinya evaluation dataset covering diverse domains. Experimental results demonstrate that transfer learning with a custom tokenizer substantially outperforms zero-shot baselines, with gains validated by BLEU, chrF, and qualitative human evaluation. Bonferroni correction is applied to ensure statistical significance across configurations. Error analysis reveals key limitations and informs targeted refinements. This study underscores the importance of linguistically aware modeling and reproducible benchmarks in bridging the performance gap for underrepresented languages. Resources are available at https://github.com/hailaykidu/MachineT_TigEng and https://huggingface.co/Hailay/MachineT_TigEng
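The Bonferroni correction mentioned in the abstract is simple to state in code: when m comparisons are tested, each p-value is compared against alpha / m instead of alpha. The sketch below uses hypothetical p-values and configuration names; none of these numbers come from the paper.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the corrected threshold and a significance flag per test.

    p_values: mapping from a comparison name to its raw p-value.
    """
    threshold = alpha / len(p_values)
    decisions = {name: p <= threshold for name, p in p_values.items()}
    return threshold, decisions

# Hypothetical p-values for three system configurations vs. a baseline.
raw_p_values = {
    "custom_tokenizer": 0.001,
    "embedding_init": 0.02,
    "domain_finetune": 0.04,
}

threshold, decisions = bonferroni(raw_p_values)
print(threshold)
print(decisions)
```

With three comparisons the per-test threshold drops from 0.05 to about 0.0167, so in this toy case only the first configuration would remain significant after correction.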

[73] Investigating the Representation of Backchannels and Fillers in Fine-tuned Language Models

Yu Wang,Leyi Lao,Langchu Huang,Gabriel Skantze,Yang Xu,Hendrik Buschmeier

Main category: cs.CL

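TL;DR: This study examines how three fine-tuning strategies, applied on English and Japanese dialogue corpora, improve language models' representations of backchannels and fillers. Fine-tuning makes the models better at distinguishing the semantic variation among these expressions and yields output closer to human utterances.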

Details Motivation: Backchannels and fillers play an important role in dialogue but are under-represented in modern Transformer-based language models, so it is worth studying how fine-tuning can improve their representation. Method: Three fine-tuning strategies are applied to models trained on English and Japanese dialogue corpora in which backchannels and fillers are preserved and annotated; the resulting representations are assessed with clustering analysis and NLG metrics. Result: Fine-tuned models obtain higher silhouette scores, indicating that they better separate the semantic variation among backchannels and fillers; NLG metrics also show that the generated utterances are closer to human ones. Conclusion: Fine-tuning helps turn general-purpose language models into conversational models that produce more natural, human-like language. Abstract: Backchannels and fillers are important linguistic expressions in dialogue, but are under-represented in modern transformer-based language models (LMs). Our work studies their representation in language models using three fine-tuning strategies. The models are trained on three dialogue corpora in English and Japanese, where backchannels and fillers are preserved and annotated, to investigate how fine-tuning can help LMs learn their representations. We first apply clustering analysis to the learnt representation of backchannels and fillers, and find increased silhouette scores in representations from fine-tuned models, which suggests that fine-tuning enables LMs to distinguish the nuanced semantic variation in different backchannel and filler use. We also use natural language generation (NLG) metrics to confirm that the utterances generated by fine-tuned language models resemble human-produced utterances more closely. Our findings suggest the potential of transforming general LMs into conversational LMs that are more capable of producing human-like language.
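
The silhouette score used in the clustering analysis can be sketched in a few lines. This is a generic pure-Python version with Euclidean distance, not the authors' code; the 2-D points stand in for learned embeddings and the labels for hypothetical usage categories:

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient (Euclidean); needs at least two clusters."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Toy "embeddings": two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
tight = silhouette_score(points, [0, 0, 1, 1])  # labels match the geometry
mixed = silhouette_score(points, [0, 1, 0, 1])  # labels cut across the groups
```

A score near 1 means the labeled groups are compact and well separated in representation space. A score near or below 0 means they overlap, which is why a higher score after fine-tuning is read as better separation.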

[74] Instruction Boundary: Quantifying Biases in LLM Reasoning under Various Coverage

Zipeng Ling,Yuehao Tang,Chen Huang,Shuliang Liu,Gaoyang Jiang,Shenghong Fu,Junqi Yang,Yao Wan,Jiawan Zhang,Kejia Huang,Xuming Hu

Main category: cs.CL

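TL;DR: This paper identifies the "Instruction Boundary" problem in large language model (LLM) reasoning, which arises from poorly designed prompts. It introduces the BiasDetector framework to measure the biases induced by three prompt types (complete, redundant, and insufficient). Experiments show that, despite high overall accuracy, incomplete prompt coverage still produces substantial bias, so developers need to address these biases and users should design prompts carefully.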

Details Motivation: Users may unintentionally supply biased or incomplete prompts, which can mislead LLMs and undermine their reliability and safety, so the biases introduced by prompt design need systematic study. Method: The Instruction Boundary problem is distilled into eight concrete facets, and the BiasDetector framework is proposed to quantify the biases caused by three instruction types (complete, redundant, and insufficient); several mainstream LLMs are evaluated. Result: Although LLMs perform well on overall tasks, substantial biases remain in many downstream tasks, and these biases are directly related to prompt coverage. Conclusion: The reliability of LLM reasoning still has considerable room for improvement; developers should take prompt-induced bias seriously and address it, while users should design prompts more carefully to improve the accuracy and safety of model outputs. Abstract: Large-language-model (LLM) reasoning has long been regarded as a powerful tool for problem solving across domains, providing non-experts with valuable advice. However, their limitations - especially those stemming from prompt design - remain underexplored. Because users may supply biased or incomplete prompts - often unintentionally - LLMs can be misled, undermining reliability and creating risks. We refer to this vulnerability as the Instruction Boundary. To investigate the phenomenon, we distill it into eight concrete facets and introduce BiasDetector, a framework that measures biases arising from three instruction types: complete, redundant, and insufficient. We evaluate several mainstream LLMs and find that, despite high headline accuracy, substantial biases persist in many downstream tasks as a direct consequence of prompt coverage. Our empirical study confirms that LLM reasoning reliability can still be significantly improved. We analyze the practical impact of these biases and outline mitigation strategies. Our findings underscore the need for developers to tackle biases and for users to craft options carefully.

[75] Feeding Two Birds or Favoring One? Adequacy-Fluency Tradeoffs in Evaluation and Meta-Evaluation of Machine Translation

Behzad Shayegh,Jan-Thorsten Peter,David Vilar,Tobias Domhan,Juraj Juraska,Markus Freitag,Lili Mou

Main category: cs.CL

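TL;DR: This paper studies the tradeoff between adequacy and fluency in machine translation, finds that current evaluation metrics and meta-evaluation both lean toward adequacy, and proposes synthesizing translation systems to control this bias.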

Details Motivation: Understanding the adequacy-fluency tradeoff in machine translation is essential for fair evaluation and meta-evaluation, yet existing metrics and meta-evaluation may be biased. Method: The paper analyzes where popular evaluation metrics fall between adequacy and fluency, studies how the composition of systems in the WMT meta-evaluation affects metric rankings, and proposes controlling the bias by synthesizing translation systems. Result: Current metrics and meta-evaluation generally favor adequacy, and this bias stems partly from the composition of the translation systems included in the meta-evaluation. Conclusion: The adequacy-fluency tradeoff is present in both evaluation and meta-evaluation, and a more balanced meta-evaluation design is needed to keep metric rankings fair. Abstract: We investigate the tradeoff between adequacy and fluency in machine translation. We show the severity of this tradeoff at the evaluation level and analyze where popular metrics fall within it. Essentially, current metrics generally lean toward adequacy, meaning that their scores correlate more strongly with the adequacy of translations than with fluency. More importantly, we find that this tradeoff also persists at the meta-evaluation level, and that the standard WMT meta-evaluation favors adequacy-oriented metrics over fluency-oriented ones. We show that this bias is partially attributed to the composition of the systems included in the meta-evaluation datasets. To control this bias, we propose a method that synthesizes translation systems in meta-evaluation. Our findings highlight the importance of understanding this tradeoff in meta-evaluation and its impact on metric rankings.

[76] Multilingual Hope Speech Detection: A Comparative Study of Logistic Regression, mBERT, and XLM-RoBERTa with Active Learning

T. O. Abiola,K. D. Abiodun,O. E. Olumide,O. O. Adebanji,O. Hiram Calvo,Grigori Sidorov

Main category: cs.CL

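TL;DR: The paper proposes a multilingual hope speech detection framework based on active learning and Transformer models. It performs well on English, Spanish, German, and Urdu, and maintains strong performance even when little annotated data is available.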

Details Motivation: Hope speech helps promote positive discourse online, but detecting it remains challenging in multilingual and especially low-resource settings. Method: An active learning approach is combined with Transformer models such as mBERT and XLM-RoBERTa to build a multilingual hope speech detection framework, which is evaluated on datasets in several languages. Result: Transformer models significantly outperform traditional baselines, with XLM-RoBERTa achieving the highest overall accuracy; the active learning strategy maintains good performance with small amounts of annotated data. Conclusion: Combining multilingual Transformer models with data-efficient training strategies such as active learning effectively improves hope speech detection, particularly in low-resource scenarios. Abstract: Hope speech, language that fosters encouragement and optimism, plays a vital role in promoting positive discourse online. However, its detection remains challenging, especially in multilingual and low-resource settings. This paper presents a multilingual framework for hope speech detection using an active learning approach and transformer-based models, including mBERT and XLM-RoBERTa. Experiments were conducted on datasets in English, Spanish, German, and Urdu, including benchmark test sets from recent shared tasks. Our results show that transformer models significantly outperform traditional baselines, with XLM-RoBERTa achieving the highest overall accuracy. Furthermore, our active learning strategy maintained strong performance even with small annotated datasets. This study highlights the effectiveness of combining multilingual transformers with data-efficient training strategies for hope speech detection.

[77] SIM-CoT: Supervised Implicit Chain-of-Thought

Xilin Wei,Xiaoran Liu,Yuhang Zang,Xiaoyi Dong,Yuhang Cao,Jiaqi Wang,Xipeng Qiu,Dahua Lin

Main category: cs.CL

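TL;DR: This paper proposes SIM-CoT, a training module that stabilizes and strengthens implicit chain-of-thought (implicit CoT) by introducing step-level supervision. It addresses the representation homogenization and training instability that appear when the number of implicit reasoning tokens is scaled up, and it significantly improves performance and stability across several models.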

Details Motivation: Implicit chain-of-thought methods are more token-efficient, but increasing the number of reasoning tokens to improve performance tends to destabilize training and erase semantic diversity; existing methods lack sufficient step-level supervision, which limits their application. Method: SIM-CoT adds an auxiliary decoder during training that aligns each implicit token with its corresponding explicit reasoning step, providing step-level supervision; the auxiliary decoder is removed at inference to preserve efficiency, and projecting the latent tokens makes the implicit reasoning process interpretable. Result: SIM-CoT improves Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B; on GPT-2 it surpasses the explicit CoT baseline with 2.3 times the token efficiency, and it substantially narrows the performance gap on larger models. Conclusion: SIM-CoT effectively stabilizes the training of implicit CoT and enhances semantic diversity and generalization, combining high efficiency, interpretability, and strong scalability, and thereby advances implicit reasoning methods. Abstract: Implicit Chain-of-Thought (CoT) methods present a promising, token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited the application of implicit CoT. We identify a core latent instability issue by scaling the computational budget of implicit CoT approaches: as we increase the number of implicit reasoning tokens to enhance performance, the training process often becomes unstable and collapses. Our analysis reveals that this instability arises from the latent representations becoming homogeneous and losing their semantic diversity, a failure caused by insufficient step-level supervision in existing implicit CoT approaches. To address this issue, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. Specifically, SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring that latent states capture distinct and meaningful information. The proposed auxiliary decoder is removed during inference, preserving the computational efficiency of implicit CoT methods with no added overhead. In addition, the auxiliary decoder affords interpretability of implicit reasoning by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization of semantic roles and diagnosis.
SIM-CoT significantly enhances both the in-domain accuracy and out-of-domain stability of various implicit CoT methods, boosting baselines like Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. Demonstrating strong scalability, SIM-CoT also surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3× greater token efficiency, while substantially closing the performance gap on larger models like LLaMA-3.1 8B.

[78] Z-Scores: A Metric for Linguistically Assessing Disfluency Removal

Maria Teleki,Sai Janjur,Haoran Liu,Oliver Grabner,Ketan Verma,Thomas Docog,Xiangjue Dong,Lingfeng Shi,Cong Wang,Stephanie Birkelbach,Jason Kim,Yin Zhang,James Caverlee

Main category: cs.CL

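TL;DR: This paper proposes Z-Scores, a new span-level evaluation metric for analyzing disfluency removal in spoken language at a finer grain. Compared with traditional word-level metrics such as F1, Z-Scores reveal a model's specific behavior and failure modes on different disfluency types (EDITED, INTJ, PRN).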

Details Motivation: Traditional word-based metrics such as precision, recall, and F1 cannot explain why a model succeeds or fails at removing speech disfluencies, so a more linguistically grounded, fine-grained evaluation method is needed. Method: Z-Scores, a span-level evaluation metric, is combined with a deterministic alignment module that robustly aligns generated text with disfluent transcripts, and system behavior is evaluated per disfluency type (EDITED, INTJ, PRN). Result: Z-Scores expose systematic weaknesses that word-level metrics obscure, particularly in handling INTJ and PRN disfluencies; a case study shows that the metric effectively identifies failure modes of large models. Conclusion: Z-Scores provide category-level diagnostics that help researchers identify model deficiencies and design targeted improvements (such as tailored prompts or data augmentation), leading to measurable performance gains. Abstract: Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level linguistically-grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores enable researchers to identify model failure modes and design targeted interventions -- such as tailored prompts or data augmentation -- yielding measurable performance improvements. A case study with LLMs shows that Z-Scores uncover challenges with INTJ and PRN disfluencies hidden in aggregate F1, directly informing model refinement strategies.
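
The contrast between aggregate token-level scores and category-level diagnostics can be illustrated with a toy example. This is not the paper's Z-Score definition, which the abstract does not spell out; it only assumes that every disfluent token carries one of the three tags, and the sentence, tags, and system output are hypothetical:

```python
from collections import defaultdict


def token_prf(tokens, removed):
    """Aggregate token-level precision, recall, and F1 of the removal decisions."""
    gold = {i for i, (_, tag) in enumerate(tokens) if tag is not None}
    tp = len(gold & removed)
    precision = tp / len(removed) if removed else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def per_category_recall(tokens, removed):
    """Fraction of disfluent tokens removed, broken down by disfluency tag."""
    totals, hits = defaultdict(int), defaultdict(int)
    for i, (_, tag) in enumerate(tokens):
        if tag is not None:
            totals[tag] += 1
            hits[tag] += i in removed
    return {tag: hits[tag] / totals[tag] for tag in totals}


# Hypothetical annotated utterance: (word, disfluency tag or None if fluent).
tokens = [("um", "INTJ"), ("I", "EDITED"), ("I", None), ("want", None),
          ("you", "PRN"), ("know", "PRN"), ("tea", None)]
removed = {0, 1}  # indices a hypothetical system deleted

precision, recall, f1 = token_prf(tokens, removed)
by_category = per_category_recall(tokens, removed)
```

In this toy case the aggregate recall of 0.5 hides the fact that the system removes every INTJ and EDITED token but no PRN token, which is the kind of pattern a per-category breakdown is meant to surface.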

[79] DRES: Benchmarking LLMs for Disfluency Removal

Maria Teleki,Sai Janjur,Haoran Liu,Oliver Grabner,Ketan Verma,Thomas Docog,Xiangjue Dong,Lingfeng Shi,Cong Wang,Stephanie Birkelbach,Jason Kim,Yin Zhang,James Caverlee

Main category: cs.CL

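TL;DR: This paper introduces DRES, a controlled text-level benchmark for evaluating disfluency removal. It is built on human-annotated Switchboard transcripts and isolates the task from speech recognition errors and acoustic variability. The study systematically evaluates a range of large language models across scales, prompting strategies, and architectures. It finds that simple segmentation consistently improves performance, that reasoning-oriented models tend to over-delete fluent tokens, and that fine-tuning reaches near state-of-the-art precision and recall but harms generalization. The paper also summarizes LLM-specific error modes and offers nine practical recommendations (R1-R9), providing a reproducible, model-agnostic foundation for building robust spoken-language systems.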

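Details Motivation: Disfluencies in speech (such as "um," "uh," interjections, parentheticals, and edited statements) are a persistent challenge for the accuracy of speech-driven systems such as command interpretation, summarization, and conversational agents. Existing approaches are often confounded by ASR errors and acoustic variability, which makes it hard to assess disfluency handling at the text level alone, so a controlled, reproducible benchmark dedicated to this task is needed. Method: DRES is built on human-annotated Switchboard conversational transcripts and focuses solely on text-level disfluency removal, excluding ASR and acoustic factors. On this benchmark, a range of open-source and proprietary LLMs are systematically evaluated across model scales, architectures (such as standard versus reasoning-optimized models), and prompting strategies (such as zero-shot, few-shot, and segmented prompting), and their error modes are analyzed. Result: The experiments show that (i) simple segmented prompting consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens that should be kept; and (iii) fine-tuned models approach state-of-the-art precision and recall on the task but generalize worse. The study also identifies a set of LLM-specific error types. Conclusion: DRES provides a reproducible, model-agnostic evaluation foundation for disfluency removal and supports the development of robust spoken-language systems. The study suggests that simple prompt engineering outperforms fine-tuning and that current models show a clear deletion bias. The nine practical recommendations (R1-R9) can guide deployment of this technique in speech-driven pipelines. Abstract: Disfluencies -- such as "um," "uh," interjections, parentheticals, and edited statements -- remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization abilities. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.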

[80] Morphological Synthesizer for Ge'ez Language: Addressing Morphological Complexity and Resource Limitations

Gebrearegawi Gebremariam,Hailay Teklehaymanot,Gebregewergs Mezgebe

Main category: cs.CL

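TL;DR: This paper presents a rule-based morphological synthesizer for Ge'ez that generates surface words from root words according to the language's morphological structure; in experiments the system achieves a performance of 97.4%.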

Details Motivation: Because annotated linguistic data, corpora, labeled datasets, and lexicons are scarce, no usable NLP tools for Ge'ez have been developed so far. Language technology for Ge'ez is needed to support the study of Ethiopian and Eritrean philosophy, culture, and civilization. Method: A rule-based Ge'ez morphological synthesizer is designed and implemented, and the system is tested on 1,102 representative sample verbs covering all verb morphological structures. Result: The system reaches 97.4% accuracy in testing and outperforms the baseline model, supporting the effectiveness of a rule-based approach to Ge'ez morphological generation. Conclusion: The study demonstrates that rule-based methods are feasible for the resource-scarce Ge'ez language and suggests that future work should build a more comprehensive system covering more morphological variation. Abstract: Ge'ez is an ancient Semitic language renowned for its unique alphabet. It serves as the script for numerous languages, including Tigrinya and Amharic, and played a pivotal role in Ethiopia's cultural and religious development during the Aksumite kingdom era. Ge'ez remains significant as a liturgical language in Ethiopia and Eritrea, with much of the national identity documentation recorded in Ge'ez. These written materials are invaluable primary sources for studying Ethiopian and Eritrean philosophy, creativity, knowledge, and civilization. Ge'ez has a complex morphological structure with rich inflectional and derivational morphology, and no usable NLP has been developed and published until now due to the scarcity of annotated linguistic data, corpora, labeled datasets, and lexicons. Therefore, we propose a rule-based Ge'ez morphological synthesizer to generate surface words from root words according to the morphological structures of the language. We used 1,102 sample verbs, representing all verb morphological structures, to test and evaluate the system. The system achieves a performance of 97.4%, outperforming the baseline model and suggesting that future work should build a comprehensive system considering morphological variations of the language. Keywords: Ge'ez, NLP, morphology, morphological synthesizer, rule-based

[81] EmbeddingGemma: Powerful and Lightweight Text Representations

Henrique Schechter Vera,Sahil Dua,Biao Zhang,Daniel Salz,Ryan Mullins,Sindhu Raghuram Panyam,Sara Smoot,Iftekhar Naim,Joe Zou,Feiyang Chen,Daniel Cer,Alice Lisak,Min Choi,Lucas Gonzalez,Omar Sanseviero,Glenn Cameron,Ian Ballantyne,Kat Black,Kaifeng Chen,Weiyi Wang,Zhe Li,Gus Martins,Jinhyuk Lee,Mark Sherwood,Juyeong Ji,Renjie Wu,Jingxiao Zheng,Jyotinder Singh,Abheesht Sharma,Divya Sreepat,Aashi Jain,Adham Elarabawy,AJ Co,Andreas Doumanoglou,Babak Samari,Ben Hora,Brian Potetz,Dahun Kim,Enrique Alfonseca,Fedor Moiseev,Feng Han,Frank Palma Gomez,Gustavo Hernández Ábrego,Hesen Zhang,Hui Hui,Jay Han,Karan Gill,Ke Chen,Koert Chen,Madhuri Shanbhogue,Michael Boratko,Paul Suganthan,Sai Meher Karthik Duddu,Sandeep Mariserla,Setareh Ariafar,Shanfeng Zhang,Shijie Zhang,Simon Baumgartner,Sonam Goenka,Steve Qiu,Tanmaya Dabral,Trevor Walker,Vikram Rao,Waleed Khawaja,Wenlei Zhou,Xiaoqi Ren,Ye Xia,Yichang Chen,Yi-Ting Chen,Zhe Dong,Zhongli Ding,Francesco Visin,Gaël Liu,Jiageng Zhang,Kathleen Kenealy,Michelle Casbon,Ravin Kumar,Thomas Mesnard,Zach Gleicher,Cormac Brick,Olivier Lacombe,Adam Roberts,Yunhsuan Sung,Raphael Hoffmann,Tris Warkentin,Armand Joulin,Tom Duerig,Mojtaba Seyedhosseini

Main category: cs.CL

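TL;DR: This paper introduces EmbeddingGemma, a lightweight, open text embedding model based on the Gemma 3 language model family. With a novel training recipe, it achieves state-of-the-art performance across multilingual, English, and code domains.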

Details Motivation: To develop an efficient, low-cost, high-performing open text embedding model suited to low-latency and high-throughput use cases. Method: The model uses encoder-decoder initialization and geometric embedding distillation, together with a spread-out regularizer and merging of checkpoints from varied, optimized mixtures, to improve robustness, expressiveness, and generalization. Result: EmbeddingGemma (300M) performs strongly on the MTEB benchmark, outperforming prior top models with fewer than 500M parameters and matching the performance of models twice its size. Conclusion: With its strong performance-to-cost ratio, EmbeddingGemma is particularly well suited to latency-sensitive settings such as on-device applications; the authors release the model publicly to promote further research. Abstract: We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.
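
The abstract notes that the model's lead persists when embedding outputs are truncated. As a minimal sketch of how a consumer might truncate an embedding (the abstract does not describe the procedure, so keeping the leading dimensions and re-normalizing is an assumption here, and the vector is a toy value rather than model output):

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def cosine(a, b):
    """Cosine similarity of two unit-norm vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Toy 3-dimensional "embedding" truncated to 2 dimensions.
full = [3.0, 4.0, 12.0]
short = truncate_and_normalize(full, 2)
```

Re-normalizing after truncation keeps dot products interpretable as cosine similarities, so an index built on the shorter vectors can use the same scoring code as one built on the full vectors.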

[82] Language Models that Think, Chat Better

Adithya Bhaskar,Xi Ye,Danqi Chen

Main category: cs.CL

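TL;DR: This paper proposes RLMT (reinforcement learning with model-rewarded thinking), which extends reinforcement learning with verifiable rewards to open-ended tasks. The model generates a long chain of thought (CoT) before responding and is optimized against a preference model. Across multiple models and algorithms, RLMT consistently outperforms standard RLHF, substantially improves chat and creative writing, and even lets a small base model surpass a larger-scale instruction-tuned pipeline.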

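Details Motivation: Existing reinforcement learning with verifiable rewards (RLVR) is confined to verifiable domains such as mathematics and code and generalizes poorly to open-ended tasks such as writing essays or making meal plans. This paper aims to overcome that limitation and explore a reinforcement learning paradigm for general-purpose chat. Method: RLMT (Reinforcement Learning with Model-rewarded Thinking) requires the language model to generate detailed chain-of-thought (CoT) reasoning before responding and optimizes it with online reinforcement learning (such as PPO, DPO, and GRPO) against the preference-based reward model used in RLHF. The method can be applied directly to base models without a supervised fine-tuning (SFT) stage. Result: Across 40 training runs on Llama-3.1-8B and Qwen-2.5-7B, RLMT improves over conventional RLHF by 3-7 points on chat benchmarks such as AlpacaEval2, WildBench, and ArenaHardV2, and by 1-3 points on creative writing and general knowledge tasks. A Llama-3.1-8B base model trained with only 7K prompts outperforms Llama-3.1-8B-Instruct, which was post-trained in multiple stages on more than 25M examples; the best 8B model surpasses GPT-4o in chat and creative writing and rivals Claude-3.7-Sonnet (Thinking). Conclusion: RLMT demonstrates that reinforcement learning is effective on open-ended tasks and rethinks the post-training pipeline: with long chain-of-thought reasoning and model-based rewards, base models can be optimized directly and efficiently without complex multi-stage training, and future work should understand and use "thinking" more broadly. Abstract: Reinforcement learning with verifiable rewards (RLVR) improves language model reasoning by using rule-based rewards in verifiable domains such as mathematics and code. However, RLVR leads to limited generalization for open-ended tasks -- such as writing outline essays or making meal plans -- where humans reason routinely. This paper shows that the RLVR paradigm is effective beyond verifiable domains, and introduces **RL** with **M**odel-rewarded **T**hinking (**RLMT**) for general-purpose chat capabilities. Using diverse real-world prompts, RLMT requires LMs to generate long CoT reasoning before response, and optimizes them with online RL against a preference-based reward model used in RLHF. Across 40 training runs on Llama-3.1-8B and Qwen-2.5-7B (both base and instruct) and multiple optimization algorithms (DPO, PPO, and GRPO), RLMT consistently outperforms standard RLHF pipelines. This includes substantial gains of 3-7 points on three chat benchmarks (AlpacaEval2, WildBench, and ArenaHardV2), along with 1-3 point improvements on other tasks like creative writing and general knowledge. Our best 8B model surpasses GPT-4o in chat and creative writing and rivals Claude-3.7-Sonnet (Thinking). RLMT can also be applied directly to base models without an SFT stage, akin to R1-Zero training.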
Remarkably, with only 7K prompts, Llama-3.1-8B base trained with our RLMT recipe outperforms Llama-3.1-8B-Instruct post-trained with a complex multi-staged pipeline with 25M+ examples. We close with qualitative and quantitative analyses of how trained models plan their responses. Our results rethink the post-training pipeline and call upon future work to understand and employ thinking more broadly.

cs.CV [Back]

[83] Vision-Based Perception for Autonomous Vehicles in Off-Road Environment Using Deep Learning

Nelson Alves Ferreira Neto

Main category: cs.CV

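TL;DR: The work proposes a perception system for autonomous driving on unpaved roads and in off-road environments. It is based on the Configurable Modular Segmentation Network (CMSNet), performs real-time semantic segmentation, and is accompanied by the Kamino dataset of almost 12,000 images.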

Details Motivation: Autonomous driving on unstructured terrain, such as open-pit mines and roads in developing countries, requires low-latency intelligent systems, and existing methods struggle with the absence of clear paths and with poor visibility. Method: The Configurable Modular Segmentation Network (CMSNet) supports multiple architectural configurations; TensorRT, C++, and CUDA are used to remove and fuse layers for real-time inference, and real-world data captured with multiple cameras is used for training and validation. Result: CMSNet is validated on two datasets and segments drivable ground and obstacles under adverse conditions such as night, rain, and dust; the released Kamino dataset contains almost 12,000 annotated images with a high density of labeled pixels and coverage of complex scenes. Conclusion: The approach effectively supports autonomous navigation on unpaved roads and in off-road environments, is robust and real-time, and is suitable for practical industrial applications. Abstract: Low-latency intelligent systems are required for autonomous driving on non-uniform terrain in open-pit mines and developing countries. This work proposes a perception system for autonomous vehicles on unpaved roads and off-road environments, capable of navigating rough terrain without a predefined trail. The Configurable Modular Segmentation Network (CMSNet) framework is proposed, facilitating different architectural arrangements. CMSNet configurations were trained to segment obstacles and trafficable ground on new images from unpaved/off-road scenarios with adverse conditions (night, rain, dust). We investigated applying deep learning to detect drivable regions without explicit track boundaries, studied algorithm behavior under visibility impairment, and evaluated field tests with real-time semantic segmentation. A new dataset, Kamino, is presented with almost 12,000 images from an operating vehicle with eight synchronized cameras. The Kamino dataset has a high number of labeled pixels compared to similar public collections and includes images from an off-road proving ground emulating a mine under adverse visibility. To achieve real-time inference, CMSNet CNN layers were methodically removed and fused using TensorRT, C++, and CUDA. Empirical experiments on two datasets validated the proposed system's effectiveness.

[84] Overview of LifeCLEF Plant Identification task 2020

Herve Goeau,Pierre Bonnet,Alexis Joly

Main category: cs.CV

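TL;DR: The PlantCLEF 2020 challenge aims to use herbarium data to improve automated plant identification in regions that are rich in biodiversity but poor in data, such as the Guiana Shield in South America. It evaluates identification from herbarium sheets and field photos as a cross-domain classification task.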

Details Motivation: Deep learning for plant identification depends on large image collections that largely overlook highly biodiverse regions such as the tropics; the challenge explores whether digitized herbarium specimens accumulated over centuries can improve identification there. Method: A dataset of about 1,000 species focused on the Guiana Shield is assembled; the training set consists of several hundred thousand herbarium sheets and a few thousand field photos, the test set contains only field photos, and the evaluation is framed as a cross-domain classification task. Result: The PlantCLEF 2020 challenge was organized and its resources and evaluation results are reported; several research groups submitted cross-domain learning systems, and the experiments indicate that herbarium data helps identify plants from data-deficient regions. Conclusion: Digitized herbarium collections can effectively support the training of automated plant identification models, with particular potential in highly biodiverse regions that lack field images. Abstract: Automated identification of plants has improved considerably thanks to the recent progress in deep learning and the availability of training data with more and more photos in the field. However, this profusion of data only concerns a few tens of thousands of species, mostly located in North America and Western Europe, much less in the richest regions in terms of biodiversity such as tropical countries. On the other hand, for several centuries, botanists have collected, catalogued and systematically stored plant specimens in herbaria, particularly in tropical regions, and the recent efforts by the biodiversity informatics community made it possible to put millions of digitized sheets online. The LifeCLEF 2020 Plant Identification challenge (or "PlantCLEF 2020") was designed to evaluate to what extent automated identification on the flora of data deficient regions can be improved by the use of herbarium collections. It is based on a dataset of about 1,000 species mainly focused on South America's Guiana Shield, an area known to have one of the greatest diversity of plants in the world. The challenge was evaluated as a cross-domain classification task where the training set consists of several hundred thousand herbarium sheets and a few thousand photos to enable learning a mapping between the two domains. The test set was exclusively composed of photos in the field. This paper presents the resources and assessments of the conducted evaluation, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.

[85] iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning

Manyi Yao,Bingbing Zhuang,Sparsh Garg,Amit Roy-Chowdhury,Christian Shelton,Manmohan Chandraker,Abhishek Aich

Main category: cs.CV

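TL;DR: iFinder is a structured semantic grounding framework that converts dash-cam videos into a hierarchical, interpretable data structure, enabling zero-shot, interpretable, and reliable reasoning by large language models in driving scenarios.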

Details Motivation: Without structured inductive biases, existing vision-language models struggle with spatial reasoning, causal inference, and event explanation when analyzing dash-cam video from vision alone. Method: iFinder is a modular, training-free pipeline that uses pretrained vision models to extract key cues such as object pose, lane positions, and trajectories, organizes them into frame-level and video-level hierarchical structures, and combines them with a three-block prompting strategy for step-wise, grounded reasoning. Result: On four public dash-cam video benchmarks, iFinder significantly outperforms end-to-end vision-language models, with gains of up to 39% in accident reasoning accuracy. Conclusion: By introducing driving-domain-specific representations, iFinder offers a zero-shot, interpretable, and reliable alternative to existing end-to-end models for post-hoc driving video understanding. Abstract: Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference, and explainability of events in the input video. To this end, we introduce iFinder, a structured semantic grounding framework that decouples perception from reasoning by translating dash-cam videos into a hierarchical, interpretable data structure for LLMs. iFinder operates as a modular, training-free pipeline that employs pretrained vision models to extract critical cues -- object pose, lane positions, and object trajectories -- which are hierarchically organized into frame- and video-level structures. Combined with a three-block prompting strategy, it enables step-wise, grounded reasoning for the LLM to refine a peer V-VLM's outputs and provide accurate reasoning. Evaluations on four public dash-cam video benchmarks show that iFinder's proposed grounding with domain-specific cues, especially object orientation and global context, significantly outperforms end-to-end V-VLMs on four zero-shot driving benchmarks, with up to 39% gains in accident reasoning accuracy. By grounding LLMs with driving domain-specific representations, iFinder offers a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding.

[86] CURE: Centroid-guided Unsupervised Representation Erasure for Facial Recognition Systems

Fnu Shivam,Nima Najafzadeh,Yenumula Reddy,Prashnna Gyawali

Main category: cs.CV

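TL;DR: This paper proposes CURE, the first unsupervised unlearning framework for facial recognition that requires no identity labels, along with a new evaluation metric, UES; it protects privacy while preserving model performance.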

Details Motivation: Existing machine unlearning techniques rely on supervised methods that need identity labels, but such labels are often unavailable in privacy-constrained settings or in large-scale, noisy datasets, so a label-free, unsupervised unlearning method is needed. Method: CURE (Centroid-guided Unsupervised Representation Erasure) removes the influence of targeted samples without supervision, and quality-aware unlearning experiments are conducted; the Unlearning Efficiency Score (UES) is proposed to balance forgetting against model stability. Result: CURE significantly outperforms unsupervised variants of existing unlearning methods, achieves a better balance between forgetting efficiency and retained performance, and confirms the role of image quality in machine unlearning. Conclusion: CURE provides an effective unsupervised machine unlearning solution for facial recognition systems, forgetting data efficiently without identity labels while preserving overall model performance. Abstract: In the current digital era, facial recognition systems offer significant utility and have been widely integrated into modern technological infrastructures; however, their widespread use has also raised serious privacy concerns, prompting regulations that mandate data removal upon request. Machine unlearning has emerged as a powerful solution to address this issue by selectively removing the influence of specific user data from trained models while preserving overall model performance. However, existing machine unlearning techniques largely depend on supervised techniques requiring identity labels, which are often unavailable in privacy-constrained situations or in large-scale, noisy datasets. To address this critical gap, we introduce CURE (Centroid-guided Unsupervised Representation Erasure), the first unsupervised unlearning framework for facial recognition systems that operates without the use of identity labels, effectively removing targeted samples while preserving overall performance. We also propose a novel metric, the Unlearning Efficiency Score (UES), which balances forgetting and retention stability, addressing shortcomings in the current evaluation metrics. CURE significantly outperforms unsupervised variants of existing unlearning methods. Additionally, we conducted quality-aware unlearning by designating low-quality images as the forget set, demonstrating its usability and benefits, and highlighting the role of image quality in machine unlearning.

[87] Synthesizing Artifact Dataset for Pixel-level Detection

Dennis Menn,Feng Liang,Diana Marculescu

Main category: cs.CV

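TL;DR: The paper proposes a pipeline that automatically injects artifacts to produce synthetic images with pixel-level annotations, so that artifact detectors can be trained without manual labeling, with a marked improvement in detection performance.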

Details Motivation: Training artifact detectors is limited by the lack of high-quality, human pixel-level annotations, and existing pseudo-labeling approaches perform poorly because of noisy labels. Method: An artifact corruption pipeline automatically injects artifacts into predetermined regions of high-quality synthetic images, producing pixel-level annotated data for training detectors. Result: Verified on human-labeled data, the method improves performance over baseline approaches by 13.2% for ConvNeXt and 3.7% for Swin-T. Conclusion: The method offers a feasible path toward scalable pixel-level artifact annotation datasets and may help integrate world knowledge into artifact detection. Abstract: Artifact detectors have been shown to enhance the performance of image-generative models by serving as reward models during fine-tuning. These detectors enable the generative model to improve overall output fidelity and aesthetics. However, training the artifact detector requires expensive pixel-level human annotations that specify the artifact regions. The lack of annotated data limits the performance of the artifact detector. A naive pseudo-labeling approach (training a weak detector and using it to annotate unlabeled images) suffers from noisy labels, resulting in poor performance. To address this, we propose an artifact corruption pipeline that automatically injects artifacts into clean, high-quality synthetic images on a predetermined region, thereby producing pixel-level annotations without manual labeling. The proposed method enables training of an artifact detector that achieves performance improvements of 13.2% for ConvNeXt and 3.7% for Swin-T, as verified on human-labeled data, compared to baseline approaches. This work represents an initial step toward scalable pixel-level artifact annotation datasets that integrate world knowledge into artifact detection.

[88] Parameter-Efficient Multi-Task Learning via Progressive Task-Specific Adaptation

Neeraj Gangwar,Anshuka Rangi,Rishabh Deshmukh,Holakou Rahmanian,Yesh Dattatreya,Nickvash Kani

Main category: cs.CV

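TL;DR: The paper proposes progressive task-specific multi-task adaptation, which inserts adapter modules that combine shared and task-specific components into a pretrained model. This mitigates task interference and negative transfer in multi-task learning and outperforms existing methods while using fewer trainable parameters.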

Details Motivation: To address the task interference and negative transfer that parameter-efficient fine-tuning methods suffer in multi-task learning because of their limited number of trainable parameters. Method: Progressive adapter modules are added to a pretrained model: early layers are shared to promote knowledge transfer, later layers are task-specific to reduce task conflicts, and a gradient-based task similarity measure is used to assign similar tasks to shared modules. Result: On the PASCAL and NYUD-v2 datasets, with a Swin Transformer adapted for dense prediction tasks, the approach outperforms a fully fine-tuned multi-task model using only one-fifth of the trainable parameters and surpasses current state-of-the-art parameter-efficient multi-task learning methods. Conclusion: The proposed approach improves multi-task learning performance with far fewer trainable parameters, achieving better transfer and mitigating task conflicts. Abstract: Parameter-efficient fine-tuning methods have emerged as a promising solution for adapting pre-trained models to various downstream tasks. While these methods perform well in single-task learning, extending them to multi-task learning exacerbates common challenges, such as task interference and negative transfer, due to the limited number of trainable parameters. To address these issues, we introduce progressive task-specific multi-task adaptation, a novel parameter-efficient approach for multi-task learning. This approach introduces adapter modules in a pre-trained model such that these modules are shared across all tasks in the initial layers and become progressively more task-specific in the later layers. The motivation is to reduce the conflicts among tasks by allowing transfer learning across all tasks in the initial layers and enabling task-specific learning toward the prediction heads. Additionally, we propose a gradient-based approach for computing task similarity and use this measure to allocate similar tasks to the shared adapter modules. Our task similarity method introduces minimal overhead in the pipeline. We evaluate our approach by adapting the Swin Transformer for dense prediction tasks. Experiments on the PASCAL and NYUD-v2 datasets demonstrate that our approach outperforms a fully fine-tuned multi-task model while requiring only one-fifth of the trainable parameters. This approach achieves better relative improvement to single-task fine-tuning while reducing the number of trainable parameters and surpasses the current state-of-the-art methods for parameter-efficient multi-task learning.
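
The gradient-based task similarity can be illustrated with a small sketch. The abstract does not state which similarity measure is used, so cosine similarity between flattened per-task gradients is an assumption here, and the task names and gradient values are hypothetical:

```python
import math


def cosine_similarity(g1, g2):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def most_similar_pair(task_grads):
    """Return (task_a, task_b, similarity) for the most aligned pair of tasks."""
    names = sorted(task_grads)
    best = max(
        (cosine_similarity(task_grads[a], task_grads[b]), a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    )
    return best[1], best[2], best[0]


# Hypothetical gradients of each task's loss with respect to shared parameters.
task_grads = {
    "seg": [1.0, 0.0, 1.0],
    "parts": [0.9, 0.1, 1.1],
    "depth": [-1.0, 1.0, 0.0],
}
task_a, task_b, similarity = most_similar_pair(task_grads)
```

Tasks whose gradients point in similar directions are less likely to conflict when they share parameters, which is the intuition behind routing similar tasks to the same adapter module.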

[89] Raw-JPEG Adapter: Efficient Raw Image Compression with JPEG

Mahmoud Afifi,Ran Zhang,Michael S. Brown

Main category: cs.CV

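TL;DR: This paper presents RawJPEG Adapter, a lightweight, learnable, and invertible preprocessing pipeline that adapts camera raw images for standard JPEG compression, achieving efficient compression while preserving high-fidelity reconstruction.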

Details Motivation: Raw image data (e.g., in DNG format) preserves full sensor information but takes up a lot of storage, while JPEG compresses efficiently but is not suited to storing raw data. A raw storage method that balances compression efficiency, compatibility, and reconstruction accuracy is therefore needed. Method: RawJPEG Adapter preprocesses raw images with spatial and optional frequency-domain transforms and stores the compact transform parameters in the JPEG comment field, so that a standard JPEG encoder can compress raw images while still allowing accurate reconstruction after decoding. Result: Experiments on multiple datasets show higher reconstruction fidelity than storing raw images directly as JPEG, compatibility with other codecs, and a better trade-off between compression ratio and reconstruction accuracy. Conclusion: RawJPEG Adapter offers a practical solution for storing raw images efficiently in resource-constrained scenarios, combining high compression, good compatibility, and high reconstruction quality. Abstract: Digital cameras digitize scene light into linear raw representations, which the image signal processor (ISP) converts into display-ready outputs. While raw data preserves full sensor information (valuable for editing and vision tasks), formats such as Digital Negative (DNG) require large storage, making them impractical in constrained scenarios. In contrast, JPEG is a widely supported format, offering high compression efficiency and broad compatibility, but it is not well-suited for raw storage. This paper presents RawJPEG Adapter, a lightweight, learnable, and invertible preprocessing pipeline that adapts raw images for standard JPEG compression. Our method applies spatial and optional frequency-domain transforms, with compact parameters stored in the JPEG comment field, enabling accurate raw reconstruction. Experiments across multiple datasets show that our method achieves higher fidelity than direct JPEG storage, supports other codecs, and provides a favorable trade-off between compression ratio and reconstruction accuracy.

[90] The Impact of 2D Segmentation Backbones on Point Cloud Predictions Using 4D Radar

William L. Muckelroy III,Mohammed Alsakabi,John M. Dolan,Ozan K. Tonguz

Main category: cs.CV

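TL;DR: This paper studies how high-capacity segmentation backbones affect the quality of LiDAR-like point clouds generated from 4D radar, and finds that a well-chosen backbone improves on the existing state of the art by 23.7%.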

Details Motivation: The high cost of LiDAR limits the adoption of high-level autonomous driving systems, so methods that generate high-quality 3D point clouds from 4D radar alone are needed as a substitute for expensive LiDAR. Method: Using a 4D-radar-based neural network approach with LiDAR point clouds as ground-truth labels, segmentation backbones of different capacities are trained and the quality of their generated point clouds is evaluated on the RaDelft dataset. Result: Very high model capacity can hurt performance, but a suitable high-capacity segmentation backbone yields a 23.7% improvement over the existing state of the art. Conclusion: A well-designed high-capacity segmentation backbone significantly improves the quality of LiDAR-like point clouds generated from 4D radar, offering an effective option for low-cost autonomous driving perception. Abstract: LiDAR's dense, sharp point cloud (PC) representations of the surrounding environment enable accurate perception and significantly improve road safety by offering greater scene awareness and understanding. However, LiDAR's high cost continues to restrict the broad adoption of high-level Autonomous Driving (AD) systems in commercially available vehicles. Prior research has shown progress towards circumventing the need for LiDAR by training a neural network, using LiDAR point clouds as ground truth (GT), to produce LiDAR-like 3D point clouds using only 4D radars. One of the best examples is a neural network created to train a more efficient radar target detector with a modular 2D convolutional neural network (CNN) backbone and a temporal coherence network at its core that uses the RaDelft dataset for training (see arXiv:2406.04723). In this work, we investigate the impact of higher-capacity segmentation backbones on the quality of the produced point clouds. Our results show that while very high-capacity models may actually hurt performance, an optimal segmentation backbone can provide a 23.7% improvement over the state-of-the-art (SOTA).

[91] Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment

Aravind Narayanan,Vahid Reza Khazaie,Shaina Raza

Main category: cs.CV

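TL;DR: Introduces a news-image benchmark of 1,343 image-question pairs for evaluating social bias in large vision-language models, and finds that visual context systematically affects model outputs and that bias varies markedly across attributes and models, especially for gender and occupation.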

Details Motivation: Large vision-language models readily absorb and reproduce social biases tied to visual cues such as age, gender, race, and occupation, so their fairness risks need systematic evaluation. Method: Build a news-image benchmark of image-question pairs annotated with demographic attributes, evaluate state-of-the-art VLMs on it, and use a large language model as judge with human verification. Result: Visual context significantly affects model outputs; the degree of bias varies by attribute and model, with gender and occupation bias the most severe; high faithfulness does not imply lower bias. Conclusion: Social bias in VLMs in real-world settings warrants caution; high faithfulness does not equal low bias, and fairness-aware multimodal evaluation should be advanced. Abstract: Large vision-language models (VLMs) can jointly interpret images and text, but they are also prone to absorbing and reproducing harmful social stereotypes when visual cues such as age, gender, race, clothing, or occupation are present. To investigate these risks, we introduce a news-image benchmark consisting of 1,343 image-question pairs drawn from diverse outlets, which we annotated with ground-truth answers and demographic attributes (age, gender, race, occupation, and sports). We evaluate a range of state-of-the-art VLMs and employ a large language model (LLM) as judge, with human verification. Our findings show that: (i) visual context systematically shifts model outputs in open-ended settings; (ii) bias prevalence varies across attributes and models, with particularly high risk for gender and occupation; and (iii) higher faithfulness does not necessarily correspond to lower bias. We release the benchmark prompts, evaluation rubric, and code to support reproducible and fairness-aware multimodal assessment.

[92] MoTiC: Momentum Tightness and Contrast for Few-Shot Class-Incremental Learning

Zeyu He,Shuai Huang,Yuwu Lu,Ming Zhao

Main category: cs.CV

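TL;DR: Proposes the MoTiC framework, which uses Bayesian analysis to align new-class priors with old-class statistics and combines large-scale contrastive learning with momentum self-supervision, reducing estimation bias and improving few-shot class-incremental learning performance.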

Details Motivation: Address the estimation bias of new-class prototypes caused by data scarcity in few-shot class-incremental learning, while mitigating catastrophic forgetting and overfitting. Method: Align new-class priors with old-class statistics via Bayesian analysis; introduce large-scale contrastive learning to strengthen cross-category feature tightness; integrate momentum self-supervision and virtual categories in the MoTiC framework to enrich feature diversity. Result: State-of-the-art performance on three FSCIL benchmarks, particularly on the fine-grained CUB-200 task. Conclusion: The method effectively reduces estimation bias in new-class prototypes, enhances the diversity of feature representations and interclass cohesion, and significantly improves the robustness of incremental learning. Abstract: Few-Shot Class-Incremental Learning (FSCIL) must contend with the dual challenge of learning new classes from scarce samples while preserving old-class knowledge. Existing methods use a frozen feature extractor and class-averaged prototypes to mitigate catastrophic forgetting and overfitting. However, new-class prototypes suffer significant estimation bias due to extreme data scarcity, whereas base-class prototypes benefit from sufficient data. In this work, we theoretically demonstrate that aligning the new-class priors with old-class statistics via Bayesian analysis reduces variance and improves prototype accuracy. Furthermore, we propose large-scale contrastive learning to enforce cross-category feature tightness. To further enrich feature diversity and inject prior information for new-class prototypes, we integrate momentum self-supervision and virtual categories into the Momentum Tightness and Contrast framework (MoTiC), constructing a feature space with rich representations and enhanced interclass cohesion. Experiments on three FSCIL benchmarks produce state-of-the-art performance, particularly on the fine-grained task CUB-200, validating our method's ability to reduce estimation bias and improve incremental learning robustness.
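To see why a prior built from old-class statistics can stabilize a few-shot prototype, here is a generic illustration using the textbook conjugate-Gaussian posterior mean. It is not the paper's derivation: the isotropic variances `prior_var` and `noise_var`, and the idea of passing an old-class mean as `prior_mean`, are assumptions made for this sketch.

```python
def shrunk_prototype(samples, prior_mean, prior_var, noise_var):
    """Posterior mean of a class prototype under a Gaussian prior and Gaussian noise.

    samples    : list of feature vectors for the new class (few-shot)
    prior_mean : vector summarizing old-class statistics (assumed available)
    prior_var  : scalar variance of the prior (assumption: isotropic)
    noise_var  : scalar variance of per-sample noise (assumption: isotropic)
    """
    n = len(samples)
    if n == 0:
        return list(prior_mean)  # no data: fall back to the prior
    dim = len(prior_mean)
    sample_mean = [sum(s[d] for s in samples) / n for d in range(dim)]
    # Weight on the data grows with n; with few shots the prior dominates.
    w = (n / noise_var) / (n / noise_var + 1.0 / prior_var)
    return [w * sample_mean[d] + (1.0 - w) * prior_mean[d] for d in range(dim)]

one_shot = shrunk_prototype([[2.0, 0.0]], [0.0, 0.0], prior_var=1.0, noise_var=1.0)
many_shot = shrunk_prototype([[2.0, 0.0]] * 99, [0.0, 0.0], prior_var=1.0, noise_var=1.0)
```

With a single sample the estimate sits halfway between the sample and the prior; with many samples it approaches the plain class average, which matches the intuition that base classes need little correction.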

[93] Deep Learning for Clouds and Cloud Shadow Segmentation in Methane Satellite and Airborne Imaging Spectroscopy

Manuel Perez-Carrasco,Maya Nasr,Sebastien Roche,Chris Chan Miller,Zhan Zhang,Core Francisco Park,Eleanor Walker,Cecilia Garraffo,Douglas Finkbeiner,Ritesh Gautam,Steven Wofsy

Main category: cs.CV

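TL;DR: This study compares several machine learning methods for cloud and cloud shadow detection with the high-resolution MethaneSAT and MethaneAIR sensors, and finds that deep learning models (especially UNet and SCAN) substantially outperform conventional methods, with SCAN performing best on the satellite data, highlighting the benefit of spectral attention.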

Details Motivation: Clouds and cloud shadows interfere with methane concentration retrieval in hyperspectral remote sensing and affect emission quantification, so efficient and accurate detection methods are needed. Method: Machine learning methods, including conventional models (ILR, MLP) and deep learning models (UNet, SCAN), are applied to cloud and cloud shadow detection on MethaneSAT and MethaneAIR data and their performance is evaluated. Result: Conventional methods perform poorly on spatial coherence and boundary definition; UNet is best at preserving spatial structure, while SCAN is better at capturing fine boundary details and surpasses UNet on MethaneSAT data. Conclusion: Deep learning architectures (particularly SCAN with spectral attention) can provide robust, scalable cloud and cloud shadow screening for current and next-generation hyperspectral missions. Abstract: Effective cloud and cloud shadow detection is a critical prerequisite for accurate retrieval of concentrations of atmospheric methane or other trace gases in hyperspectral remote sensing. This challenge is especially pertinent for MethaneSAT and for its airborne companion mission, MethaneAIR. In this study, we use machine learning methods to address the cloud and cloud shadow detection problem for these high-spatial-resolution instruments. Clouds and cloud shadows in remote sensing data need to be effectively screened out as they bias methane retrievals in remote sensing imagery and impact the quantification of emissions. We deploy and evaluate conventional techniques, including Iterative Logistic Regression (ILR) and Multilayer Perceptron (MLP), alongside advanced deep learning architectures, namely UNet and a Spectral Channel Attention Network (SCAN) method. Our results show that conventional methods struggle with spatial coherence and boundary definition, affecting the detection of clouds and cloud shadows. Deep learning models substantially improve detection quality: UNet performs best in preserving spatial structure, while SCAN excels at capturing fine boundary details. Notably, SCAN surpasses UNet on MethaneSAT data, underscoring the benefits of incorporating spectral attention for satellite-specific features. This in-depth assessment of disparate machine learning techniques demonstrates the strengths and effectiveness of advanced deep learning architectures in providing robust, scalable solutions for cloud and cloud shadow screening, enhancing the methane emission quantification capacity of existing and next-generation hyperspectral missions. Our data and code are publicly available at https://doi.org/10.7910/DVN/IKLZOJ

[94] Enhancing Transformer-Based Vision Models: Addressing Feature Map Anomalies Through Novel Optimization Strategies

Sumit Mamtani

Main category: cs.CV

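TL;DR: Proposes two lightweight optimization techniques (STA and ANF) that reduce structured noise in Vision Transformer feature maps, improving interpretability and downstream task performance.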

Details Motivation: Vision Transformers perform well on many vision tasks, but structured noise in their feature maps hinders downstream applications such as segmentation and depth estimation. Method: Structured Token Augmentation (STA) increases token diversity through spatial perturbations, and Adaptive Noise Filtering (ANF) applies learnable inline denoising between Transformer layers. Result: The methods are validated on several standard benchmarks, including ImageNet, Ade20k, and NYUv2, with marked improvements in visual quality and task performance. Conclusion: STA and ANF are architecture-agnostic and effectively mitigate structured noise in ViT feature maps, improving interpretability and practical effectiveness. Abstract: Vision Transformers (ViTs) have demonstrated superior performance across a wide range of computer vision tasks. However, structured noise artifacts in their feature maps hinder downstream applications such as segmentation and depth estimation. We propose two novel and lightweight optimization techniques, Structured Token Augmentation (STA) and Adaptive Noise Filtering (ANF), to improve interpretability and mitigate these artifacts. STA enhances token diversity through spatial perturbations during tokenization, while ANF applies learnable inline denoising between transformer layers. These methods are architecture-agnostic and evaluated across standard benchmarks, including ImageNet, Ade20k, and NYUv2. Experimental results show consistent improvements in visual quality and task performance, highlighting the practical effectiveness of our approach.

[95] From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition

Ling Lo,Kelvin C. K. Chan,Wen-Huang Cheng,Ming-Hsuan Yang

Main category: cs.CV

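TL;DR: Proposes a simple and effective method that achieves smooth, consistent attribute transitions by guiding the denoising process frame by frame, and builds CAT-Bench, an evaluation benchmark covering both attribute and motion dynamics.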

Details Motivation: Existing models struggle with gradual attribute changes in video generation; the common prompt interpolation approach handles gradual transitions poorly, and inconsistencies become pronounced. Method: Introduce frame-wise guidance during denoising, constructing a data-specific transitional direction for each noisy latent that guides the gradual shift from the initial to the target attribute frame by frame while preserving the video's motion dynamics. Result: Experiments show the method outperforms existing baselines in visual fidelity, text alignment, and smoothness of attribute transitions; the CAT-Bench benchmark and two evaluation metrics are proposed for comprehensive assessment of attribute transitions. Conclusion: The method effectively improves the quality of gradual attribute transitions in video generation, balancing consistency and motion coherence, and advances controllable video generation. Abstract: Existing models often struggle with complex temporal changes, particularly when generating videos with gradual attribute transitions. The most common prompt interpolation approach for motion transitions often fails to handle gradual attribute transitions, where inconsistencies tend to become more pronounced. In this work, we propose a simple yet effective method to extend existing models for smooth and consistent attribute transitions by introducing frame-wise guidance during the denoising process. Our approach constructs a data-specific transitional direction for each noisy latent, guiding the gradual shift from initial to final attributes frame by frame while preserving the motion dynamics of the video. Moreover, we present the Controlled-Attribute-Transition Benchmark (CAT-Bench), which integrates both attribute and motion dynamics, to comprehensively evaluate the performance of different models. We further propose two metrics to assess the accuracy and smoothness of attribute transitions. Experimental results demonstrate that our approach performs favorably against existing baselines, achieving visual fidelity, maintaining alignment with text prompts, and delivering seamless attribute transitions. Code and CAT-Bench are released: https://github.com/lynn-ling-lo/Prompt2Progression.

[96] Anatomically Constrained Transformers for Cardiac Amyloidosis Classification

Alexander Thorley,Agis Chartsias,Jordan Strom,Roberto Lang,Jeremy Slivnick,Jamie O'Driscoll,Rajan Sharma,Dipak Kotecha,Jinming Duan,Alberto Gomez

Main category: cs.CV

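TL;DR: This paper proposes improving cardiac amyloidosis (CA) classification by constraining a Transformer model to the myocardium, using deforming points and image patches as input and combining this with self-supervised masked autoencoder pre-training, so that classification is based on clinically relevant features.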

Details Motivation: Existing neural network methods classify from the full video and cannot guarantee that the classification rests on clinically important features related to CA. A method is needed that focuses on the myocardium and exploits the known locations of CA abnormalities. Method: Represent the myocardium as a set of deforming points with corresponding image patches and embed them into the Transformer's input tokens; in self-supervised masked autoencoder pre-training, mask and reconstruct only these anatomically relevant patches. Result: Constraining both the model and the pre-training task to the myocardium yields higher performance on CA classification than full-video Transformer models. Conclusion: The method not only improves CA classification accuracy but also ensures that classification focuses on specific anatomical regions of the echocardiogram, and attention scores can be visualized to understand which myocardial dynamics the model attends to. Abstract: Cardiac amyloidosis (CA) is a rare cardiomyopathy, with typical abnormalities in clinical measurements from echocardiograms such as reduced global longitudinal strain of the myocardium. An alternative approach for detecting CA is via neural networks, using video classification models such as convolutional neural networks. These models process entire video clips, but provide no assurance that classification is based on clinically relevant features known to be associated with CA. An alternative paradigm for disease classification is to apply models to quantitative features such as strain, ensuring that the classification relates to clinically relevant features. Drawing inspiration from this approach, we explicitly constrain a transformer model to the anatomical region where many known CA abnormalities occur: the myocardium, which we embed as a set of deforming points and corresponding sampled image patches into input tokens. We show that our anatomical constraint can also be applied to the popular self-supervised learning masked autoencoder pre-training, where we propose to mask and reconstruct only anatomical patches. We show that by constraining both the transformer and pre-training task to the myocardium where CA imaging features are localized, we achieve increased performance on a CA classification task compared to full video transformers. Our model provides an explicit guarantee that the classification is focused on only anatomical regions of the echo, and enables us to visualize transformer attention scores over the deforming myocardium.

[97] Learning to Stop: Reinforcement Learning for Efficient Patient-Level Echocardiographic Classification

Woo-Jin Cho Kim,Jorge Oliveira,Arian Beqiri,Alex Thorley,Jordan Strom,Jamie O'Driscoll,Rajan Sharma,Jeremy Slivnick,Roberto Lang,Alberto Gomez,Agisilaos Chartsias

Main category: cs.CV

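TL;DR: Proposes a reinforcement-learning-based method for selecting echocardiographic video clips that learns when to keep processing view-specific clips or stop, optimizing disease classification, and combines it with attention-based fusion of information across clips; using only 30% of clips, it markedly improves detection of cardiac amyloidosis.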

Details Motivation: Automated echocardiogram analysis usually relies on a single video clip or averages predictions over all clips; the former ignores complementary information in other clips, and the latter is computationally expensive and hinders clinical adoption. A method is needed that selects an optimal subset of clips to improve classification performance while reducing computational burden. Method: Within a reinforcement learning framework, an agent decides, based on current classification uncertainty, whether to process another view-specific clip or stop; a learnable attention-based aggregation strategy flexibly fuses information from different clips. Result: On cardiac amyloidosis detection, the method reaches an AUC of 0.91 using only 30% of the video clips, outperforming the use of all clips and other benchmarks. Conclusion: The method efficiently selects the most diagnostically valuable ultrasound clips, improving classification performance while greatly reducing the amount of data, and has good potential for clinical use. Abstract: Guidelines for transthoracic echocardiographic examination recommend the acquisition of multiple video clips from different views of the heart, resulting in a large number of clips. Typically, automated methods, for instance disease classifiers, either use one clip or average predictions from all clips. Relying on one clip ignores complementary information available from other clips, while using all clips is computationally expensive and may be prohibitive for clinical adoption. To select the optimal subset of clips that maximizes performance for a specific task (image-based disease classification), we propose a method optimized through reinforcement learning. In our method, an agent learns to either keep processing view-specific clips to reduce the disease classification uncertainty, or stop processing if the achieved classification confidence is sufficient. Furthermore, we propose a learnable attention-based aggregation method as a flexible way of fusing information from multiple clips. The proposed method obtains an AUC of 0.91 on the task of detecting cardiac amyloidosis using only 30% of all clips, exceeding the performance achieved from using all clips and from other benchmarks.
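The keep-processing-or-stop decision can be illustrated with a much simpler stand-in than the learned agent. The sketch below is not the paper's method: it replaces the reinforcement-learning policy with a fixed confidence threshold and the attention-based aggregation with a running mean. The function name, the `confidence` and `min_clips` parameters, and the example probabilities are all hypothetical.

```python
def classify_with_early_stop(clip_probs, confidence=0.9, min_clips=1):
    """Process per-clip disease probabilities in order and stop once confident.

    Returns (aggregated_probability, number_of_clips_used).
    Stand-ins for the paper's components (both are assumptions):
      - running mean instead of learned attention-based aggregation
      - fixed threshold instead of a learned stopping policy
    """
    if not clip_probs:
        raise ValueError("need at least one clip probability")
    total = 0.0
    used = 0
    mean_p = 0.0
    for p in clip_probs:
        total += p
        used += 1
        mean_p = total / used
        # Confidence of the binary decision is the larger class probability.
        if used >= min_clips and max(mean_p, 1.0 - mean_p) >= confidence:
            break
    return mean_p, used

prob, used = classify_with_early_stop([0.95, 0.97, 0.50], confidence=0.9)
prob2, used2 = classify_with_early_stop([0.95, 0.97, 0.50], confidence=0.9, min_clips=2)
```

A learned policy can in principle do better than this heuristic because it can also account for which views remain unprocessed, which a fixed threshold ignores.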

[98] Towards Robust In-Context Learning for Medical Image Segmentation via Data Synthesis

Jiesi Hu,Yanwu Yang,Zhiyu Ye,Chenfei Ye,Hanyang Peng,Jianfeng Cao,Ting Ma

Main category: cs.CV

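TL;DR: Proposes SynthICL, a new data synthesis framework based on domain randomization, to meet the demand for large-scale, diverse data in in-context learning for universal medical image segmentation.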

Details Motivation: The rise of in-context learning (ICL) in medical image segmentation has sharply increased the demand for large-scale, diverse training data, worsening the long-standing data scarcity problem. Existing data synthesis methods struggle to achieve both high data diversity and a domain distribution suited to medical data. Method: Build SynthICL, a data synthesis framework based on domain randomization, which uses anatomical priors from real-world datasets to ensure realism, generates diverse anatomical structures, and explicitly models inter-subject variation to suit ICL. Result: Experiments on four held-out datasets show that models trained with data generated by the framework improve average Dice by up to 63% and generalize substantially better to unseen anatomical domains. Conclusion: SynthICL helps ease the data bottleneck for ICL-based segmentation and supports the development of more robust models. Abstract: The rise of In-Context Learning (ICL) for universal medical image segmentation has introduced an unprecedented demand for large-scale, diverse datasets for training, exacerbating the long-standing problem of data scarcity. While data synthesis offers a promising solution, existing methods often fail to simultaneously achieve both high data diversity and a domain distribution suitable for medical data. To bridge this gap, we propose SynthICL, a novel data synthesis framework built upon domain randomization. SynthICL ensures realism by leveraging anatomical priors from real-world datasets, generates diverse anatomical structures to cover a broad data distribution, and explicitly models inter-subject variations to create data cohorts suitable for ICL. Extensive experiments on four held-out datasets validate our framework's effectiveness, showing that models trained with our data achieve performance gains of up to 63% in average Dice and substantially enhanced generalization to unseen anatomical domains. Our work helps mitigate the data bottleneck for ICL-based segmentation, paving the way for robust models. Our code and the generated dataset are publicly available at https://github.com/jiesihu/Neuroverse3D.

[99] VIMD: Monocular Visual-Inertial Motion and Depth Estimation

Saimouli Katragadda,Guoquan Huang

Main category: cs.CV

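TL;DR: Proposes VIMD, a dense metric depth estimation framework built on monocular visual-inertial motion tracking that uses multi-view information to iteratively refine per-pixel scale, and is highly modular, robust, and generalizes well zero-shot.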

Details Motivation: To achieve accurate and efficient dense metric depth estimation in resource-constrained settings and overcome the limitation of fitting a global affine model as in existing methods. Method: Combine MSCKF-based visual-inertial motion tracking with iterative refinement of each pixel's scale using multi-view information; the framework is modular and compatible with a variety of depth estimation backbones. Result: Strong results on the TartanAir and VOID datasets and strong zero-shot generalization on the AR Table dataset, with high accuracy maintained even with only 10-20 sparse metric depth points per image. Conclusion: VIMD is an efficient, accurate depth estimation approach with strong robustness and generalization, suited to practical deployment in resource-constrained settings. Abstract: Accurate and efficient dense metric depth estimation is crucial for 3D visual perception in robotics and XR. In this paper, we develop a monocular visual-inertial motion and depth (VIMD) learning framework to estimate dense metric depth by leveraging accurate and efficient MSCKF-based monocular visual-inertial motion tracking. At the core of the proposed VIMD is the use of multi-view information to iteratively refine per-pixel scale, instead of globally fitting an invariant affine model as in prior work. The VIMD framework is highly modular, making it compatible with a variety of existing depth estimation backbones. We conduct extensive evaluations on the TartanAir and VOID datasets and demonstrate its zero-shot generalization capabilities on the AR Table dataset. Our results show that VIMD achieves exceptional accuracy and robustness, even with extremely sparse points, as few as 10-20 metric depth points per image. This makes the proposed VIMD a practical solution for deployment in resource-constrained settings, while its robust performance and strong generalization capabilities offer significant potential across a wide range of scenarios.
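For context, the baseline that the abstract contrasts against, a single global affine model, can be sketched as an ordinary least-squares fit of `metric = a * pred + b` over the sparse points that have metric depth. This shows only that prior-work baseline, not VIMD's per-pixel refinement. The function name and the toy values are assumptions made for this sketch.

```python
def fit_global_affine(pred, metric):
    """Least-squares scale a and shift b such that metric ~= a * pred + b.

    pred   : relative depth predictions at the sparse points
    metric : metric depths at the same points (e.g. from visual-inertial tracking)
    """
    n = len(pred)
    if n < 2 or n != len(metric):
        raise ValueError("need at least two matched points")
    mean_x = sum(pred) / n
    mean_y = sum(metric) / n
    var_x = sum((x - mean_x) ** 2 for x in pred)
    if var_x == 0.0:
        raise ValueError("predictions are constant; scale is not identifiable")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pred, metric))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

# Toy points lying exactly on metric = 2 * pred + 1.
a, b = fit_global_affine([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
```

Because one `(a, b)` pair is applied to every pixel, any spatially varying scale error in the relative depth map is left uncorrected, which is the limitation a per-pixel scale is meant to address.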

[100] Frequency-domain Multi-modal Fusion for Language-guided Medical Image Segmentation

Bo Yu,Jianhua Yang,Zetao Du,Yan Huang,Chenglong Li,Liang Wang

Main category: cs.CV

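TL;DR: Proposes FMISeg, a medical image segmentation model based on frequency-domain multi-modal interaction that combines clinical text reports with frequency-domain visual features to improve segmentation of infected lung regions.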

Details Motivation: Existing methods face a semantic gap when fusing visual and language modalities and struggle to capture the complex morphological changes of lesions, leading to poor segmentation. Method: FMISeg uses a late fusion strategy and introduces in the decoder a Frequency-domain Feature Bidirectional Interaction (FFBI) module and a Language-guided Frequency-domain Feature Interaction (LFFI) module to enhance the visual representation and suppress semantically irrelevant information. Result: Experiments on the QaTa-COV19 and MosMedData+ datasets show the method outperforms existing state-of-the-art methods both qualitatively and quantitatively. Conclusion: FMISeg effectively uses clinical text to guide medical image segmentation and improves the accuracy of infected-region segmentation through frequency-domain multi-modal interaction, showing good application potential. Abstract: Automatically segmenting infected areas in radiological images is essential for diagnosing pulmonary infectious diseases. Recent studies have demonstrated that the accuracy of medical image segmentation can be improved by incorporating clinical text reports as semantic guidance. However, the complex morphological changes of lesions and the inherent semantic gap between vision-language modalities prevent existing methods from effectively enhancing the representation of visual features and eliminating semantically irrelevant information, ultimately resulting in suboptimal segmentation performance. To address these problems, we propose a Frequency-domain Multi-modal Interaction model (FMISeg) for language-guided medical image segmentation. FMISeg is a late fusion model that establishes interaction between linguistic features and frequency-domain visual features in the decoder. Specifically, to enhance the visual representation, our method introduces a Frequency-domain Feature Bidirectional Interaction (FFBI) module to effectively fuse frequency-domain features. Furthermore, a Language-guided Frequency-domain Feature Interaction (LFFI) module is incorporated within the decoder to suppress semantically irrelevant visual features under the guidance of linguistic information. Experiments on QaTa-COV19 and MosMedData+ demonstrate that our method outperforms the state-of-the-art methods qualitatively and quantitatively.

[101] PolGS: Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction

Yufei Han,Bowen Tie,Heng Guo,Youwei Lyu,Si Li,Boxin Shi,Yunpeng Jia,Zhanyu Ma

Main category: cs.CV

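TL;DR: Proposes PolGS, a polarimetric Gaussian splatting model for fast reflective surface reconstruction that uses polarimetric constraints to improve reconstruction quality for complex reflective materials.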

Details Motivation: Existing 3D Gaussian splatting methods reconstruct complex reflective surfaces with insufficient quality, falling short of the efficient, high-quality shape reconstruction that real-time virtual reality requires. Method: Integrate polarimetric constraints into the 3D Gaussian splatting framework to separate specular and diffuse components, enabling fast reflective surface reconstruction within 10 minutes. Result: Experiments on synthetic and real-world datasets validate the method's effectiveness in improving reflective surface reconstruction quality. Conclusion: PolGS significantly improves reconstruction quality for surfaces with complex reflectance while retaining efficient rendering speed, making it suitable for real-time virtual reality applications. Abstract: Efficient shape reconstruction for surfaces with complex reflectance properties is crucial for real-time virtual reality. While 3D Gaussian Splatting (3DGS)-based methods offer fast novel view rendering by leveraging their explicit surface representation, their reconstruction quality lags behind that of implicit neural representations, particularly when recovering surfaces with complex reflectance. To address these problems, we propose PolGS, a Polarimetric Gaussian Splatting model allowing fast reflective surface reconstruction in 10 minutes. By integrating polarimetric constraints into the 3DGS framework, PolGS effectively separates specular and diffuse components, enhancing reconstruction quality for challenging reflective materials. Experimental results on synthetic and real-world datasets validate the effectiveness of our method.

[102] CAMILA: Context-Aware Masking for Image Editing with Language Alignment

Hyunseung Kim,Chiho Choi,Srikanth Malla,Sai Prahladh Padmanabhan,Saurabh Bagchi,Joon Hee Choi

Main category: cs.CV

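TL;DR: This paper proposes CAMILA, a context-aware image editing method that uses language alignment and contextual coherence checks to handle infeasible or contradictory text instructions, improving the semantic alignment and integrity of edited images.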

Details Motivation: Existing image editing models tend to follow every user instruction blindly, even infeasible or contradictory ones, producing unreasonable output. A method that can identify and ignore non-executable instructions is needed. Method: CAMILA (Context-Aware Masking for Image Editing with Language Alignment) uses a context-aware masking mechanism to check the semantic coherence between instructions and the image, edits only the executable regions, and is evaluated on newly built single- and multi-instruction datasets that include infeasible requests. Result: On single- and multi-instruction editing tasks, CAMILA outperforms existing state-of-the-art models, with higher semantic alignment, better handling of complex instructions, and better preservation of image integrity. Conclusion: CAMILA effectively identifies and filters non-executable text instructions, producing more reasonable, higher-quality edits in complex scenarios and improving the practicality and robustness of text-guided image editing. Abstract: Text-guided image editing allows users to transform and synthesize images through natural language instructions, offering considerable flexibility. However, most existing image editing models naively attempt to follow all user instructions, even if those instructions are inherently infeasible or contradictory, often resulting in nonsensical output. To address these challenges, we propose a context-aware method for image editing named CAMILA (Context-Aware Masking for Image Editing with Language Alignment). CAMILA is designed to validate the contextual coherence between instructions and the image, ensuring that only relevant edits are applied to the designated regions while ignoring non-executable instructions. For comprehensive evaluation of this new method, we constructed datasets for both single- and multi-instruction image editing, incorporating infeasible requests. Our method achieves better performance and higher semantic alignment than state-of-the-art models, demonstrating its effectiveness in handling complex instruction challenges while preserving image integrity.

[103] Robust RGB-T Tracking via Learnable Visual Fourier Prompt Fine-tuning and Modality Fusion Prompt Generation

Hongtao Yang,Bineng Zhong,Qihua Liang,Zhiruo Zhu,Yaozong Zheng,Ning Li

Main category: cs.CV

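TL;DR: Proposes VFPTrack, an RGB-thermal tracking method based on visual Fourier prompts that fuses spatial-domain and frequency-domain information to improve multi-modal feature interaction.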

Details Motivation: Existing RGB-T tracking methods based on parameter-efficient fine-tuning use only spatial-domain information and overlook the importance of frequency-domain information for prompt learning, which limits performance. Method: A symmetric feature extraction encoder with shared parameters extracts RGB and thermal infrared features; spatial-domain visual prompts are combined with frequency-domain prompts obtained via FFT; and a modality fusion prompt generator enables bidirectional feature interaction. Result: Experiments on three mainstream RGB-T tracking benchmarks show strong performance, exceeding existing PEFT methods. Conclusion: By introducing frequency-domain prompts and a new modality fusion mechanism, VFPTrack effectively improves the learning and interaction of multi-modal features in RGB-T tracking. Abstract: Recently, visual prompt tuning has been introduced to RGB-Thermal (RGB-T) tracking as a parameter-efficient fine-tuning (PEFT) method. However, these PEFT-based RGB-T tracking methods typically rely solely on spatial-domain information as prompts for feature extraction. As a result, they often fail to achieve optimal performance because they overlook the crucial role of frequency-domain information in prompt learning. To address this issue, we propose an efficient Visual Fourier Prompt Tracking method (named VFPTrack) to learn modality-related prompts via the Fast Fourier Transform (FFT). Our method consists of a symmetric feature extraction encoder with shared parameters, visual Fourier prompts, and a Modality Fusion Prompt Generator that generates bidirectional interaction prompts through multi-modal feature fusion. Specifically, we first use a frozen feature extraction encoder to extract RGB and thermal infrared (TIR) modality features. Then, we combine the visual prompts in the spatial domain with the frequency-domain prompts obtained from the FFT, which allows for the full extraction and understanding of modality features from different domain information. Finally, unlike previous fusion methods, the modality fusion prompt generation module we use combines features from different modalities to generate a fused modality prompt. This fused prompt then interacts with each individual modality to fully enable feature interaction across different modalities. Extensive experiments conducted on three popular RGB-T tracking benchmarks show that our method demonstrates outstanding performance.

[104] Rectified Decoupled Dataset Distillation: A Closer Look for Fair and Comprehensive Evaluation

Xinhao Zhong,Shuoyang Sun,Xulin Gu,Chenyang Zhu,Bin Chen,Yaowei Wang

Main category: cs.CV

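TL;DR: This paper proposes Rectified Decoupled Dataset Distillation (RD$^3$), systematically analyzes how different post-evaluation settings affect test accuracy, shows that performance differences among existing methods stem mainly from inconsistent evaluation rather than differences in synthetic data quality, and establishes a standardized benchmark for fair, reproducible comparison.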

Details Motivation: Inconsistent post-evaluation protocols make performance comparisons among existing decoupled dataset distillation methods unreliable and hinder progress, so a unified evaluation standard is needed to measure methods accurately. Method: Propose the RD$^3$ framework, which separates teacher model pre-training from synthetic data generation, introduces a standardized post-evaluation pipeline, and systematically studies the effect of different evaluation settings (such as data augmentation and soft-label strategies) on results. Result: Performance differences are found to come mainly from inconsistent evaluation protocols rather than the methods themselves; general strategies that improve the effectiveness of distilled data are identified; and more stable, reproducible results are obtained across multiple settings. Conclusion: A standardized evaluation protocol is essential for dataset distillation research; RD$^3$ provides a reliable benchmark for future work and supports fair comparison and substantive progress. Abstract: Dataset distillation aims to generate compact synthetic datasets that enable models trained on them to achieve performance comparable to those trained on full real datasets, while substantially reducing storage and computational costs. Early bi-level optimization methods (e.g., MTT) have shown promising results on small-scale datasets, but their scalability is limited by high computational overhead. To address this limitation, recent decoupled dataset distillation methods (e.g., SRe$^2$L) separate the teacher model pre-training from the synthetic data generation process. These methods also introduce random data augmentation and epoch-wise soft labels during the post-evaluation phase to improve performance and generalization. However, existing decoupled distillation methods suffer from inconsistent post-evaluation protocols, which hinders progress in the field. In this work, we propose Rectified Decoupled Dataset Distillation (RD$^3$), and systematically investigate how different post-evaluation settings affect test accuracy. We further examine whether the reported performance differences across existing methods reflect true methodological advances or stem from discrepancies in evaluation procedures. Our analysis reveals that much of the performance variation can be attributed to inconsistent evaluation rather than differences in the intrinsic quality of the synthetic data. In addition, we identify general strategies that improve the effectiveness of distilled datasets across settings. By establishing a standardized benchmark and rigorous evaluation protocol, RD$^3$ provides a foundation for fair and reproducible comparisons in future dataset distillation research.

[105] nnFilterMatch: A Unified Semi-Supervised Learning Framework with Uncertainty-Aware Pseudo-Label Filtering for Efficient Medical Segmentation

Yi Yang

Main category: cs.CV

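TL;DR: Proposes nnFilterMatch, a semi-supervised medical image segmentation framework that combines adaptive pseudo-label filtering with a single-pass training procedure and matches or exceeds fully supervised models using only 5% to 20% labeled data.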

Details Motivation: Conventional approaches that combine semi-supervised and active learning rely on repeated retraining cycles, which are computationally expensive and hard to scale, so an efficient, scalable medical image segmentation method is needed to reduce annotation burden. Method: Embed an entropy-based pseudo-label filtering mechanism (FilterMatch) in the single-pass nnU-Net training framework, selectively excluding high-confidence pseudo-labels during training and avoiding retraining loops for end-to-end learning. Result: Validated on multiple clinical segmentation benchmarks; with only 5% to 20% labeled data it matches or even exceeds fully supervised models while significantly reducing computational overhead. Conclusion: nnFilterMatch provides a scalable, efficient, end-to-end medical image segmentation approach that reduces annotation requirements without sacrificing accuracy. Abstract: Semi-supervised learning (SSL) has emerged as a promising paradigm in medical image segmentation, offering competitive performance while substantially reducing the need for extensive manual annotation. When combined with active learning (AL), these strategies further minimize annotation burden by selectively incorporating the most informative samples. However, conventional SSL-AL hybrid approaches often rely on iterative, loop-based retraining cycles after each annotation round, incurring significant computational overhead and limiting scalability in clinical applications. In this study, we present a novel, annotation-efficient, and self-adaptive deep segmentation framework that integrates SSL with entropy-based pseudo-label filtering (FilterMatch), an AL-inspired mechanism, within the single-pass nnU-Net training framework (nnFilterMatch). By selectively excluding high-confidence pseudo-labels during training, our method circumvents the need for retraining loops while preserving the benefits of uncertainty-guided learning. We validate the proposed framework across multiple clinical segmentation benchmarks and demonstrate that it achieves performance comparable to or exceeding fully supervised models, even with only 5% to 20% labeled data. This work introduces a scalable, end-to-end learning strategy for reducing annotation demands in medical image segmentation without compromising accuracy. Code is available here: https://github.com/Ordi117/nnFilterMatch.git.
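Entropy-based filtering of pseudo-labels can be sketched in a few lines. The code below is a minimal illustration, not the released nnFilterMatch code: the normalization by `log(K)`, the `threshold` value, and the function names are assumptions. Following the abstract's wording, the mask keeps uncertain (high-entropy) pixels and excludes high-confidence ones.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a class-probability vector, scaled to [0, 1]."""
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two classes")
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)

def filter_mask(prob_map, threshold=0.5):
    """Per-pixel keep mask: True where entropy >= threshold.

    prob_map  : list of per-pixel class-probability vectors (flattened image)
    threshold : hypothetical cut-off on normalized entropy
    Pixels with low entropy (high confidence) get False and would be
    excluded from the unsupervised loss, as the abstract describes.
    """
    return [normalized_entropy(p) >= threshold for p in prob_map]

# Two toy pixels: one maximally uncertain, one very confident.
mask = filter_mask([[0.5, 0.5], [0.99, 0.01]], threshold=0.5)
```

Many pseudo-labeling schemes instead keep the confident predictions; flipping the comparison in `filter_mask` gives that variant, so the direction is worth checking against the released code.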

[106] Talking Head Generation via AU-Guided Landmark Prediction

Shao-Yu Chang,Jingyi Xu,Hieu Le,Dimitris Samaras

Main category: cs.CV

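TL;DR: Proposes a two-stage framework that explicitly models the mapping from facial Action Units (AUs) to 2D facial landmarks, enabling audio-driven talking head generation with fine-grained expression control.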

Details Motivation: Existing methods rely on emotion labels or implicit AU conditioning and lack precise, physically interpretable control over expressions, so a method offering per-frame, fine-grained, physically plausible expression control is needed. Method: In the first stage, a variational motion generator predicts temporally coherent landmark sequences from audio and AU intensities; in the second stage, a diffusion-based synthesizer combines these landmarks with a reference image to generate realistic, lip-synced video. Result: Experiments on the MEAD dataset show that the method outperforms state-of-the-art baselines on multiple metrics, with higher expression accuracy, temporal stability, and visual realism. Conclusion: Explicit AU-to-landmark modeling improves the quality of expressive talking head generation, and the proposed two-stage framework, which separates motion from appearance, performs strongly. Abstract: We propose a two-stage framework for audio-driven talking head generation with fine-grained expression control via facial Action Units (AUs). Unlike prior methods relying on emotion labels or implicit AU conditioning, our model explicitly maps AUs to 2D facial landmarks, enabling physically grounded, per-frame expression control. In the first stage, a variational motion generator predicts temporally coherent landmark sequences from audio and AU intensities. In the second stage, a diffusion-based synthesizer generates realistic, lip-synced videos conditioned on these landmarks and a reference image. This separation of motion and appearance improves expression accuracy, temporal stability, and visual realism. Experiments on the MEAD dataset show that our method outperforms state-of-the-art baselines across multiple metrics, demonstrating the effectiveness of explicit AU-to-landmark modeling for expressive talking head generation.

[107] ExpFace: Exponential Angular Margin Loss for Deep Face Recognition

Jinhui Zheng,Xueyuan Gong

Main category: cs.CV

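TL;DR: Proposes ExpFace, a new exponential angular margin loss that introduces an exponential margin in the angular space to separate and handle clean and noisy samples, improving open-set face recognition performance.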

Details Motivation: Existing margin-based softmax losses overlook the impact of noisy samples, yet noisy samples are observed to lie mostly in the peripheral region of the angular space, so a method that adaptively adjusts the penalty strength is needed. Method: ExpFace introduces an exponential angular margin that applies a larger penalty in the center region of the angular space to emphasize clean samples and a smaller penalty in the peripheral region to suppress noisy samples; it is analyzed alongside classical losses in terms of margin embedding forms, similarity curves, and gradient curves. Result: ExpFace avoids the training instability of SphereFace and the non-monotonicity of ArcFace, applies penalties in a manner consistent with the decision boundary in the angular space, and reaches state-of-the-art results in multiple experiments. Conclusion: Through its adaptive angular margin design, ExpFace improves intra-class compactness and inter-class separability in face recognition, performing especially well in the presence of noisy samples. Abstract: Face recognition is an open-set problem requiring high discriminative power to ensure that intra-class distances remain smaller than inter-class distances. Margin-based softmax losses, such as SphereFace, CosFace, and ArcFace, have been widely adopted to enhance intra-class compactness and inter-class separability, yet they overlook the impact of noisy samples. By examining the distribution of samples in the angular space, we observe that clean samples predominantly cluster in the center region, whereas noisy samples tend to shift toward the peripheral region. Motivated by this observation, we propose the Exponential Angular Margin Loss (ExpFace), which introduces an angular exponential term as the margin. This design applies a larger penalty in the center region and a smaller penalty in the peripheral region within the angular space, thereby emphasizing clean samples while suppressing noisy samples. We present a unified analysis of ExpFace and classical margin-based softmax losses in terms of margin embedding forms, similarity curves, and gradient curves, showing that ExpFace not only avoids the training instability of SphereFace and the non-monotonicity of ArcFace, but also exhibits a similarity curve that applies penalties in the same manner as the decision boundary in the angular space. Extensive experiments demonstrate that ExpFace achieves state-of-the-art performance. To facilitate future research, we have released the source code at: https://github.com/dfr-code/ExpFace.

[108] Logics-Parsing Technical Report

Xiangyang Chen,Shuzhao Li,Xiuwen Zhu,Yongfan Chen,Fan Yang,Cheng Fang,Lin Qu,Xiaoxiao Xu,Hu Wei,Minggang Wu

Main category: cs.CV

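TL;DR: Proposes Logics-Parsing, an end-to-end document parsing model built on a large vision-language model (LVLM) and augmented with reinforcement learning. Designed reward mechanisms optimize complex layout analysis and reading order inference, diverse data including chemical formulas and handwritten Chinese characters improve generalization, and a benchmark, LogicsParsingBench, is introduced; experiments show SOTA performance across a range of document scenarios.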

Details Motivation: Existing LVLMs lack explicit modeling of layout structure and reading order in document parsing, which makes complex documents such as multi-column newspapers and posters hard to handle and limits real-world use. Method: Logics-Parsing uses reinforcement learning with designed reward mechanisms to optimize layout analysis and reading order inference; diverse data such as chemical formulas and handwritten Chinese characters are added to supervised fine-tuning to improve generalization; and a dedicated evaluation set, LogicsParsingBench, is built for systematic evaluation. Result: Experiments on the self-built LogicsParsingBench dataset (1,078 page-level PDF images spanning 9 major categories and more than 20 sub-categories) show that the model significantly outperforms existing methods on complex document parsing and achieves SOTA performance. Conclusion: By introducing reinforcement learning and diverse training data, Logics-Parsing improves LVLM capability in complex layout understanding and reading order inference, advancing end-to-end document parsing. Abstract: Recent advances in Large Vision-Language models (LVLM) have spurred significant progress in document parsing tasks. Compared to traditional pipeline-based methods, end-to-end paradigms have shown their excellence in converting PDF images into structured outputs through integrated Optical Character Recognition (OCR), table recognition, mathematical formula recognition and so on. However, the absence of explicit analytical stages for document layouts and reading orders limits the LVLM's capability in handling complex document types such as multi-column newspapers or posters. To address this limitation, we propose in this report Logics-Parsing: an end-to-end LVLM-based model augmented with reinforcement learning. Our model incorporates meticulously designed reward mechanisms to optimize complex layout analysis and reading order inference. In addition, we expand the model's versatility by incorporating diverse data types such as chemical formulas and handwritten Chinese characters into supervised fine-tuning. Finally, to enable rigorous evaluation of our approach, we introduce LogicsParsingBench, a curated set of 1,078 page-level PDF images spanning nine major categories and over twenty sub-categories, which will be released later. Comprehensive experiments conducted on LogicsParsingBench have validated the efficacy and state-of-the-art (SOTA) performance of our proposed model across diverse document analysis scenarios. Project Page: https://github.com/alibaba/Logics-Parsing

[109] Sex-based Bias Inherent in the Dice Similarity Coefficient: A Model Independent Analysis for Multiple Anatomical Structures

Hartmut Häntze,Myrthe Buser,Alessa Hering,Lisa C. Adams,Keno K. Bressem

Main category: cs.CV

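TL;DR: The study finds that the Dice Similarity Coefficient (DSC) carries an inherent bias when used to evaluate organ segmentation across sexes: because women's organs are typically smaller, the same segmentation error yields a lower DSC than for men, which affects fairness evaluation.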

Details Motivation: To show that the DSC, a commonly used segmentation metric, may itself introduce sex-based bias, and to caution that cross-sex performance differences in medical image analysis need more careful interpretation. Method: Synthetic errors of equal size are applied to manual MRI annotations from 50 participants, and the DSC and normalized DSC are compared between men and women, which removes the influence of any specific model and quantifies the sex-based bias of the DSC under idealized conditions. Result: Even minimal errors (e.g., a 1 mm boundary shift) produce systematic DSC differences between sexes; the average DSC difference is about 0.03 for small structures and about 0.01 for medium-sized structures, and only large structures (e.g., lungs and liver) show almost no difference. Conclusion: The DSC itself introduces sex-related bias because organ sizes differ, so fairness evaluations based on the DSC should not expect identical scores for men and women and must account for the bias of the metric itself. Abstract: Overlap-based metrics such as the Dice Similarity Coefficient (DSC) penalize segmentation errors more heavily in smaller structures. As organ size differs by sex, this implies that a segmentation error of equal magnitude may result in lower DSCs in women due to their smaller average organ volumes compared to men. While previous work has examined sex-based differences in models or datasets, no study has yet investigated the potential bias introduced by the DSC itself. This study quantifies sex-based differences of the DSC and the normalized DSC in an idealized setting independent of specific models. We applied equally-sized synthetic errors to manual MRI annotations from 50 participants to ensure sex-based comparability. Even minimal errors (e.g., a 1 mm boundary shift) produced systematic DSC differences between sexes. For small structures, average DSC differences were around 0.03; for medium-sized structures around 0.01. Only large structures (i.e., lungs and liver) were mostly unaffected, with sex-based DSC differences close to zero. These findings underline that fairness studies using the DSC as an evaluation metric should not expect identical scores between men and women, as the metric itself introduces bias. A segmentation model may perform equally well across sexes in terms of error magnitude, even if observed DSC values suggest otherwise. Importantly, our work raises awareness of a previously underexplored source of sex-based differences in segmentation performance: one that arises not from model behavior but from the metric itself. Recognizing this factor is essential for more accurate and fair evaluations in medical image analysis.
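The size dependence of the DSC can be seen in a small worked example. The sketch below is an idealization and not the study's protocol: it assumes a perfectly spherical structure and a prediction that is the same sphere uniformly dilated by a fixed boundary shift, and the radii are hypothetical values chosen only to contrast a small and a large structure.

```python
def dice_dilated_sphere(radius_mm, shift_mm):
    """DSC between a sphere and the same sphere dilated by shift_mm.

    The dilated sphere fully contains the reference, so the overlap equals
    the reference volume. The common factor 4/3*pi cancels, leaving
    DSC = 2*r^3 / (r^3 + (r + shift)^3).
    """
    ref = radius_mm ** 3
    pred = (radius_mm + shift_mm) ** 3
    return 2.0 * ref / (ref + pred)

# Same 1 mm boundary error, two hypothetical structure sizes.
small = dice_dilated_sphere(10.0, 1.0)  # about 0.858
large = dice_dilated_sphere(30.0, 1.0)  # about 0.951
```

The error magnitude is identical in both cases, yet the smaller structure scores almost 0.1 lower, which is the mechanism the abstract describes for smaller average organ volumes.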

[110] EfficienT-HDR: An Efficient Transformer-Based Framework via Multi-Exposure Fusion for HDR Reconstruction

Yu-Shen Huang,Tzu-Han Chen,Cheng-Yen Hsiao,Shaou-Gang Miaou

Main category: cs.CV

TL;DR: Proposes a lightweight Vision Transformer architecture for high-quality, ghost-free High Dynamic Range (HDR) imaging on edge devices.

Details Motivation: Existing Multi-Exposure Fusion (MEF) methods suffer from high computational cost and ghosting artifacts, which makes them hard to deploy on resource-constrained edge devices. Method: Building on the Context-Aware Vision Transformer, input images are converted to the YCbCr color space, an Intersection-Aware Adaptive Fusion (IAAF) module suppresses ghosting, and Inverted Residual Embedding (IRE), Dynamic Tanh (DyT), and Enhanced Multi-Scale Dilated Convolution (E-MSDC) provide a lightweight design. Result: Compared with the baseline, the main version reduces FLOPs by about 67% and increases inference speed by more than fivefold on CPU and 2.5 times on an edge device while maintaining high image quality. Conclusion: The method strikes a good balance between performance, efficiency, and visual quality, providing an efficient, ghost-free HDR imaging solution for edge devices. Abstract: Achieving high-quality High Dynamic Range (HDR) imaging on resource-constrained edge devices is a critical challenge in computer vision, as its performance directly impacts downstream tasks such as intelligent surveillance and autonomous driving. Multi-Exposure Fusion (MEF) is a mainstream technique to achieve this goal; however, existing methods generally face the dual bottlenecks of high computational costs and ghosting artifacts, hindering their widespread deployment. To this end, this study proposes a light-weight Vision Transformer architecture designed explicitly for HDR reconstruction to overcome these limitations. This study is based on the Context-Aware Vision Transformer and begins by converting input images to the YCbCr color space to separate luminance and chrominance information. It then employs an Intersection-Aware Adaptive Fusion (IAAF) module to suppress ghosting effectively. To further achieve a light-weight design, we introduce Inverted Residual Embedding (IRE), Dynamic Tanh (DyT), and propose Enhanced Multi-Scale Dilated Convolution (E-MSDC) to reduce computational complexity at multiple levels. Our study ultimately contributes two model versions: a main version for high visual quality and a light-weight version with advantages in computational efficiency, both of which achieve an excellent balance between performance and image quality. Experimental results demonstrate that, compared to the baseline, the main version reduces FLOPs by approximately 67% and increases inference speed by more than fivefold on CPU and 2.5 times on an edge device. These results confirm that our method provides an efficient and ghost-free HDR imaging solution for edge devices, demonstrating versatility and practicality across various dynamic scenarios.
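The first step of the pipeline, separating luminance from chrominance, can be illustrated with a per-pixel conversion. The abstract does not say which YCbCr variant is used; the sketch below assumes the full-range BT.601 form used by JFIF, and the function name is hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range BT.601 YCbCr (JFIF form).

    Y carries luminance; Cb and Cr carry chrominance centered at 128.
    Results are left as floats; a real pipeline would clip to [0, 255].
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

gray = rgb_to_ycbcr(128, 128, 128)  # neutral gray: chroma stays at 128
red = rgb_to_ycbcr(255, 0, 0)       # saturated red: Cr well above 128
```

For a neutral gray pixel both chroma channels sit at the 128 midpoint, so all of the exposure information is in Y, which is the property that makes the luminance channel a natural place to fuse exposures.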

[111] BiTAA: A Bi-Task Adversarial Attack for Object Detection and Depth Estimation via 3D Gaussian Splatting

Yixun Zhang,Feng Zhou,Jianqin Yin

Main category: cs.CV

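TL;DR: Proposes BiTAA, a bi-task adversarial attack based on 3D Gaussian Splatting that simultaneously degrades object detection and monocular depth estimation, exposing the vulnerability of multi-task visual perception in autonomous driving.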

Details Motivation: Existing adversarial attacks are mostly confined to a single task, leave the interaction between detection and depth estimation underexplored, cannot induce controllable depth bias, and lack a standardized protocol for quantifying cross-task transfer. Method: The BiTAA framework uses a dual-model attack structure that supports both full-image and patch settings and generates a single perturbation via 3D Gaussian Splatting; a composite loss jointly optimizes detection suppression and a signed, magnitude-controlled log-depth bias, with optional EOT to improve physical realizability. Result: Experiments show that BiTAA effectively degrades performance across tasks; the proposed unified evaluation protocol confirms consistent cross-task degradation and reveals a clear asymmetry, with transfer from detection to depth stronger than the reverse. Conclusion: The study exposes practical risks for camera-only multi-task perception and stresses the need for cross-task-aware defenses to improve autonomous driving safety. Abstract: Camera-based perception is critical to autonomous driving yet remains vulnerable to task-specific adversarial manipulations in object detection and monocular depth estimation. Most existing 2D/3D attacks are developed in task silos, lack mechanisms to induce controllable depth bias, and offer no standardized protocol to quantify cross-task transfer, leaving the interaction between detection and depth underexplored. We present BiTAA, a bi-task adversarial attack built on 3D Gaussian Splatting that yields a single perturbation capable of simultaneously degrading detection and biasing monocular depth. Specifically, we introduce a dual-model attack framework that supports both full-image and patch settings and is compatible with common detectors and depth estimators, with optional expectation-over-transformation (EOT) for physical realizability. In addition, we design a composite loss that couples detection suppression with a signed, magnitude-controlled log-depth bias within regions of interest (ROIs), enabling controllable near or far misperception while maintaining stable optimization across tasks. We also propose a unified evaluation protocol with cross-task transfer metrics and real-world evaluations, showing consistent cross-task degradation and a clear asymmetry between Det-to-Depth and Depth-to-Det transfer. The results highlight practical risks for multi-task camera-only perception and motivate cross-task-aware defenses in autonomous driving scenarios.

[112] StrCGAN: A Generative Framework for Stellar Image Restoration

Shantanusinh Parmar

Main category: cs.CV

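TL;DR: StrCGAN is a generative model for enhancing low-resolution astronomical images; using 3D convolutions, multi-spectral fusion, and astrophysical regularization modules, it produces more realistic, physically consistent high-resolution reconstructions without paired data.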

Details Motivation: Models such as CycleGAN, when applied to low-quality astronomical images from small telescopes, distort morphology and are restricted to 2D mappings, making it hard to recover stellar structure accurately. Method: CycleGAN is extended with 3D convolutional layers to capture volumetric spatial correlations, multi-spectral fusion of optical and near-infrared information, and astrophysical regularization modules to preserve stellar morphology; data from multi-mission all-sky surveys guide training. Result: Images generated by StrCGAN are visually sharper and physically consistent, and the model outperforms standard GAN models on astronomical image enhancement. Conclusion: StrCGAN improves the quality of low-resolution astronomical images, combining visual sharpness with physical plausibility, and suits real observation settings that lack paired data. Abstract: We introduce StrCGAN (Stellar Cyclic GAN), a generative model designed to enhance low-resolution astrophotography images. Our goal is to reconstruct high-fidelity ground truth-like representations of celestial objects, a task that is challenging due to the limited resolution and quality of small-telescope observations such as the MobilTelesco dataset. Traditional models such as CycleGAN provide a foundation for image-to-image translation but are restricted to 2D mappings and often distort the morphology of stars and galaxies. To overcome these limitations, we extend the CycleGAN framework with three key innovations: 3D convolutional layers to capture volumetric spatial correlations, multi-spectral fusion to align optical and near-infrared (NIR) domains, and astrophysical regularization modules to preserve stellar morphology. Ground-truth references from multi-mission all-sky surveys spanning optical to NIR guide the training process, ensuring that reconstructions remain consistent across spectral bands. Together, these components allow StrCGAN to generate reconstructions that are not only visually sharper but also physically consistent, outperforming standard GAN models in the task of astrophysical image enhancement.

[113] Adaptive Model Ensemble for Continual Learning

Yuchuan Mao,Zhi Gao,Xiaomeng Fan,Yuwei Wu,Yunde Jia,Chenchen Jing

Main category: cs.CV

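TL;DR: Proposes meta-weight-ensembler, which uses meta-learning to generate adaptive per-layer mixing coefficients, resolving task-level and layer-level knowledge conflicts in model ensembling for continual learning and alleviating catastrophic forgetting.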

Details Motivation: Existing model ensemble methods suffer from knowledge conflicts at the task and layer levels, degrading performance on both old and new tasks, so a method that adaptively fuses knowledge from different tasks is needed. Method: A mixing coefficient generator is trained via meta-learning to produce adaptive mixing coefficients for each task and each layer; these are used for weighted fusion of model parameters, giving dynamic knowledge integration. Result: Experiments on multiple continual learning datasets show that the method alleviates catastrophic forgetting and outperforms existing methods, reaching state-of-the-art performance. Conclusion: Meta-weight-ensembler can be flexibly combined with existing continual learning methods to strengthen their resistance to catastrophic forgetting, enabling efficient fusion of old and new knowledge. Abstract: Model ensemble is an effective strategy in continual learning, which alleviates catastrophic forgetting by interpolating model parameters, achieving knowledge fusion learned from different tasks. However, existing model ensemble methods usually encounter the knowledge conflict issue at task and layer levels, causing compromised learning performance in both old and new tasks. To solve this issue, we propose meta-weight-ensembler that adaptively fuses knowledge of different tasks for continual learning. Concretely, we employ a mixing coefficient generator trained via meta-learning to generate appropriate mixing coefficients for model ensemble to address the task-level knowledge conflict. The mixing coefficient is individually generated for each layer to address the layer-level knowledge conflict. In this way, we learn the prior knowledge about adaptively accumulating knowledge of different tasks in a fused model, achieving efficient learning in both old and new tasks. Meta-weight-ensembler can be flexibly combined with existing continual learning methods to boost their ability to alleviate catastrophic forgetting. Experiments on multiple continual learning datasets show that meta-weight-ensembler effectively alleviates catastrophic forgetting and achieves state-of-the-art performance.
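Per-layer parameter interpolation, the operation that the mixing coefficients control, can be sketched as below. This is only the fusion step under simplifying assumptions: parameters are plain lists of floats, the coefficients are supplied by hand rather than produced by a meta-learned generator, and all names are hypothetical.

```python
def ensemble_layers(old_params, new_params, mixing):
    """Interpolate two models layer by layer.

    old_params and new_params map layer name -> list of floats.
    mixing maps layer name -> coefficient in [0, 1] that weights the
    old-task model; (1 - coefficient) weights the new-task model.
    """
    fused = {}
    for name, old in old_params.items():
        a = mixing[name]
        new = new_params[name]
        fused[name] = [a * o + (1.0 - a) * n for o, n in zip(old, new)]
    return fused

old_model = {"conv1": [1.0, 2.0], "fc": [0.0, 4.0]}
new_model = {"conv1": [3.0, 4.0], "fc": [2.0, 0.0]}
# Hand-picked coefficients standing in for the generator's output.
coefficients = {"conv1": 0.5, "fc": 0.25}
fused_model = ensemble_layers(old_model, new_model, coefficients)
```

Using a different coefficient per layer lets early layers stay close to the old model while later layers follow the new one, which is the kind of layer-level flexibility a single global coefficient cannot express.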

[114] ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection

Tai-Ming Huang,Wei-Tung Lin,Kai-Lung Hua,Wen-Huang Cheng,Junichi Yamagishi,Jun-Cheng Chen

Main category: cs.CV

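TL;DR: Proposes ThinkFake, a reasoning-based, generalizable framework for AI-generated image detection that uses a multimodal large language model and reinforcement learning for interpretable detection.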

Details Motivation: Existing AI-generated image detectors mostly rely on binary classification without explanations or require heavy supervised fine-tuning, which limits generalization and leaves them ill-equipped for growing misinformation and privacy concerns. Method: ThinkFake combines a multimodal large language model with a forgery reasoning prompt, is trained with Group Relative Policy Optimization (GRPO) reinforcement learning, and adds a structured detection pipeline to improve reasoning quality and adaptability. Result: It outperforms state-of-the-art methods on the GenImage benchmark and shows strong zero-shot generalization on the LOKI benchmark. Conclusion: ThinkFake delivers interpretable, generalizable AI-generated image detection with improved performance and robustness. Abstract: The increasing realism of AI-generated images has raised serious concerns about misinformation and privacy violations, highlighting the urgent need for accurate and interpretable detection methods. While existing approaches have made progress, most rely on binary classification without explanations or depend heavily on supervised fine-tuning, resulting in limited generalization. In this paper, we propose ThinkFake, a novel reasoning-based and generalizable framework for AI-generated image detection. Our method leverages a Multimodal Large Language Model (MLLM) equipped with a forgery reasoning prompt and is trained using Group Relative Policy Optimization (GRPO) reinforcement learning with carefully designed reward functions. This design enables the model to perform step-by-step reasoning and produce interpretable, structured outputs. We further introduce a structured detection pipeline to enhance reasoning quality and adaptability. Extensive experiments show that ThinkFake outperforms state-of-the-art methods on the GenImage benchmark and demonstrates strong zero-shot generalization on the challenging LOKI benchmark. These results validate our framework's effectiveness and robustness. Code will be released upon acceptance.
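The "group relative" part of GRPO can be illustrated with a short sketch. It assumes the commonly described formulation in which each sampled response's reward is normalized by the mean and standard deviation of its own group. The reward values, the use of the population standard deviation, and the `eps` term are assumptions for illustration; ThinkFake's actual reward functions are not specified in the abstract.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is scored relative to its siblings, not on an absolute scale.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four responses to the same image and prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal, which is one reason the design of the reward functions matters.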

[115] PersONAL: Towards a Comprehensive Benchmark for Personalized Embodied Agents

Filippo Ziliotto,Jelin Raphael Akkara,Alessandro Daniele,Lamberto Ballan,Luciano Serafini,Tommaso Campari

Main category: cs.CV

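TL;DR: PersONAL is a comprehensive benchmark for studying personalization in Embodied AI; agents must identify and navigate to objects associated with specific users in response to natural-language instructions.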

Details Motivation: Deploying embodied agents in realistic human-centered scenarios is challenging, particularly because individual human preferences and behaviors are hard to model. Method: The PersONAL benchmark comprises over 2,000 high-quality episodes across 30+ photorealistic homes from the HM3D dataset and offers two evaluation modes: active navigation in unseen environments and object grounding in previously mapped scenes. Result: Experiments reveal a substantial gap between state-of-the-art baselines and human performance. Conclusion: Embodied agents that can perceive, reason over, and memorize personalized information are needed, which would advance real-world assistive robots. Abstract: Recent advances in Embodied AI have enabled agents to perform increasingly complex tasks and adapt to diverse environments. However, deploying such agents in realistic human-centered scenarios, such as domestic households, remains challenging, particularly due to the difficulty of modeling individual human preferences and behaviors. In this work, we introduce PersONAL (PERSonalized Object Navigation And Localization), a comprehensive benchmark designed to study personalization in Embodied AI. Agents must identify, retrieve, and navigate to objects associated with specific users, responding to natural-language queries such as "find Lily's backpack". PersONAL comprises over 2,000 high-quality episodes across 30+ photorealistic homes from the HM3D dataset. Each episode includes a natural-language scene description with explicit associations between objects and their owners, requiring agents to reason over user-specific semantics. The benchmark supports two evaluation modes: (1) active navigation in unseen environments, and (2) object grounding in previously mapped scenes. Experiments with state-of-the-art baselines reveal a substantial gap to human performance, highlighting the need for embodied agents capable of perceiving, reasoning, and memorizing over personalized information, paving the way towards real-world assistive robots.

[116] FreezeVLA: Action-Freezing Attacks against Vision-Language-Action Models

Xin Wang,Jie Li,Zejia Weng,Yixu Wang,Yifeng Gao,Tianyu Pang,Chao Du,Yan Teng,Yingchun Wang,Zuxuan Wu,Xingjun Ma,Yu-Gang Jiang

Main category: cs.CV

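TL;DR: Proposes FreezeVLA, a new adversarial attack framework against Vision-Language-Action (VLA) models in which adversarial images "freeze" the model so that it ignores subsequent instructions, leaving the robot incapacitated during critical tasks. Experiments show a high success rate and strong transferability, exposing a safety risk in VLA models.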

Details Motivation: VLA models are widely used in robotics, but their safety and robustness against adversarial attacks remain underexplored. In particular, adversarial images can make a model stop responding to instructions entirely, a serious safety risk that calls for systematic analysis and evaluation. Method: The FreezeVLA attack framework generates action-freezing adversarial images via min-max bi-level optimization and is evaluated on three state-of-the-art VLA models and four robotic benchmarks. Result: FreezeVLA reaches an average attack success rate of 76.2%, significantly outperforming existing methods, and its adversarial images transfer strongly across language instructions, with a single image inducing paralysis across prompts. Conclusion: VLA models have a serious action-freezing vulnerability that may threaten the safety of robotic systems, underscoring the urgency of developing effective defenses. Abstract: Vision-Language-Action (VLA) models are driving rapid progress in robotics by enabling agents to interpret multimodal inputs and execute complex, long-horizon tasks. However, their safety and robustness against adversarial attacks remain largely underexplored. In this work, we identify and formalize a critical adversarial vulnerability in which adversarial images can "freeze" VLA models and cause them to ignore subsequent instructions. This threat effectively disconnects the robot's digital mind from its physical actions, potentially inducing inaction during critical interventions. To systematically study this vulnerability, we propose FreezeVLA, a novel attack framework that generates and evaluates action-freezing attacks via min-max bi-level optimization. Experiments on three state-of-the-art VLA models and four robotic benchmarks show that FreezeVLA attains an average attack success rate of 76.2%, significantly outperforming existing methods. Moreover, adversarial images generated by FreezeVLA exhibit strong transferability, with a single image reliably inducing paralysis across diverse language prompts. Our findings expose a critical safety risk in VLA models and highlight the urgent need for robust defense mechanisms.

[117] Adaptive Guidance Semantically Enhanced via Multimodal LLM for Edge-Cloud Object Detection

Yunqing Hu,Zheming Yang,Chang Zhao,Wen Ji

Main category: cs.CV

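TL;DR: Proposes a semantically enhanced edge-cloud collaborative object detection method based on adaptive guidance, using a multimodal large language model to improve detection in complex scenes.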

Details Motivation: Traditional methods degrade in complex scenes such as low light and heavy occlusion because they lack high-level semantic understanding. Method: Instruction fine-tuning enables the multimodal large language model to generate structured scene descriptions; an adaptive mapping mechanism dynamically converts the semantic information into parameter adjustment signals for the edge detector; and within an edge-cloud collaborative framework, confidence scores determine whether cloud-based semantic guidance is invoked. Result: In low-light and highly occluded scenes, the method reduces latency by over 79% and computational cost by 70% while maintaining detection accuracy. Conclusion: The method balances accuracy and efficiency in complex scenes and markedly improves detection performance on edge devices. Abstract: Traditional object detection methods face performance degradation challenges in complex scenarios such as low-light conditions and heavy occlusions due to a lack of high-level semantic understanding. To address this, this paper proposes an adaptive guidance-based semantic enhancement edge-cloud collaborative object detection method leveraging Multimodal Large Language Models (MLLM), achieving an effective balance between accuracy and efficiency. Specifically, the method first employs instruction fine-tuning to enable the MLLM to generate structured scene descriptions. It then designs an adaptive mapping mechanism that dynamically converts semantic information into parameter adjustment signals for edge detectors, achieving real-time semantic enhancement. Within an edge-cloud collaborative inference framework, the system automatically selects between invoking cloud-based semantic guidance or directly outputting edge detection results based on confidence scores. Experiments demonstrate that the proposed method effectively enhances detection accuracy and efficiency in complex scenes. Specifically, it can reduce latency by over 79% and computational cost by 70% in low-light and highly occluded scenes while maintaining accuracy.
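The confidence-based choice between edge output and cloud guidance can be sketched as a small routing function. The abstract does not state how confidence is aggregated or what threshold is used, so the mean-score rule, the 0.5 threshold, and the detection dictionary layout are all hypothetical.

```python
def route(edge_detections, confidence_threshold=0.5):
    """Decide whether edge results suffice or cloud guidance is needed.

    edge_detections is a list of dicts with a "score" key in [0, 1].
    Returns "edge" to output the local results directly, or "cloud" to
    request semantic guidance. An empty result is treated as uncertain.
    """
    if not edge_detections:
        return "cloud"
    mean_score = sum(d["score"] for d in edge_detections) / len(edge_detections)
    return "edge" if mean_score >= confidence_threshold else "cloud"

clear_scene = [{"label": "car", "score": 0.9}, {"label": "person", "score": 0.8}]
dark_scene = [{"label": "car", "score": 0.2}]
decisions = [route(clear_scene), route(dark_scene), route([])]
```

Only low-confidence frames pay the cost of a cloud round trip, which is how a scheme like this can lower average latency while keeping accuracy in hard scenes.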

[118] Generalized Shortest Path-based Superpixels for 3D Spherical Image Segmentation

Rémi Giraud,Rodrigo Borba Pinheiro,Yannick Berthoumieu

Main category: cs.CV

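TL;DR: Proposes SphSPS, a new superpixel method dedicated to segmenting 360° spherical or omnidirectional images; by accounting for the geometry of the acquisition space, it significantly improves segmentation accuracy, robustness to noise, and superpixel shape regularity.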

Details Motivation: Existing superpixel methods mainly target standard 2D planar images and cope poorly with the distortion and geometry of wide-angle or 360° spherical images, so a dedicated method adapted to spherical geometry is needed. Method: SphSPS clusters features along the shortest path on the sphere between a pixel and a superpixel center, generalizing the notion of shortest path to the spherical space, and introduces a new spherical regularity metric. Result: Validated on the reference 360° panorama segmentation dataset and on synthetic road omnidirectional images, SphSPS significantly outperforms planar and spherical state-of-the-art methods in segmentation accuracy, robustness to noise, and regularity. Conclusion: By respecting spherical geometry, SphSPS improves superpixel segmentation quality for 360° images and provides a useful tool for superpixel-based 360° image applications. Abstract: The growing use of wide angle image capture devices and the need for fast and accurate image analysis in computer vision have reinforced the need for dedicated under-representation approaches. Most recent decomposition methods segment an image into a small number of irregular homogeneous regions, called superpixels. Nevertheless, these approaches are generally designed to segment standard 2D planar images, i.e., captured with a 90° angle view without distortion. In this work, we introduce a new general superpixel method called SphSPS (for Spherical Shortest Path-based Superpixels), dedicated to wide 360° spherical or omnidirectional images. Our method respects the geometry of the 3D spherical acquisition space and generalizes the notion of shortest path between a pixel and a superpixel center, to quickly extract relevant clustering features. We demonstrate that considering the geometry of the acquisition space to compute the shortest path jointly improves the segmentation accuracy and the shape regularity of superpixels. To evaluate this regularity aspect, we also generalize a global regularity metric to the spherical space, addressing the limitations of the only existing spherical compactness measure. Finally, the proposed SphSPS method is validated on the reference 360° spherical panorama segmentation dataset and on synthetic road omnidirectional images. Our method significantly outperforms both planar and spherical state-of-the-art approaches in terms of segmentation accuracy, robustness to noise and regularity, providing a very interesting tool for superpixel-based applications on 360° images.
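Why planar distances mislead on 360° images can be shown with a great-circle computation. The sketch assumes an equirectangular projection with pixel centers at half-integer offsets and a unit sphere; the function names and image size are hypothetical, and this is only the geometric building block, not the SphSPS clustering itself.

```python
import math

def pixel_to_unit_vector(u, v, width, height):
    """Map an equirectangular pixel (column u, row v) to a 3D unit vector."""
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi   # longitude in [-pi, pi)
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi  # latitude, +pi/2 at top
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def great_circle_distance(p, q):
    """Arc length (radians) of the shortest path between two unit vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding

# Two pixels on the same row at opposite image borders of a 360x180 image.
left = pixel_to_unit_vector(0, 89, 360, 180)
right = pixel_to_unit_vector(359, 89, 360, 180)
seam_distance = great_circle_distance(left, right)
```

In the image plane these two pixels are 359 columns apart, yet on the sphere they are neighbors about one degree apart, so a superpixel near the seam should be able to contain both.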

[119] Efficient Cell Painting Image Representation Learning via Cross-Well Aligned Masked Siamese Network

Pin-Jui Huang,Yu-Hsuan Liao,SooHeon Kim,NoSeong Park,JongBae Park,DongMyung Shin

Main category: cs.CV

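TL;DR: Proposes CWA-MSN, a new representation learning framework for cell images that aligns embeddings across wells to reduce batch effects and improve phenotype modeling efficiency, outperforming existing methods with less data and a smaller model.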

Details Motivation: Existing self-supervised and contrastive learning methods struggle to extract biologically meaningful, batch-robust Cell Painting representations and typically require large models or large amounts of data. Method: The Cross-Well Aligned Masked Siamese Network (CWA-MSN) aligns, within a masked siamese architecture, the embeddings of cells subjected to the same perturbation in different wells, strengthening semantic consistency. Result: On a gene-gene relationship retrieval benchmark, CWA-MSN improves on OpenPhenom and CellCLIP by 29% and 9%, respectively, while using only 0.2M images (vs. 2.2M for OpenPhenom) and 22M parameters (vs. 1.48B for CellCLIP). Conclusion: CWA-MSN is a simple, effective way to learn cell image representations under limited data and parameter budgets and is suited to phenotype modeling in drug discovery. Abstract: Computational models that predict cellular phenotypic responses to chemical and genetic perturbations can accelerate drug discovery by prioritizing therapeutic hypotheses and reducing costly wet-lab iteration. However, extracting biologically meaningful and batch-robust cell painting representations remains challenging. Conventional self-supervised and contrastive learning approaches often require a large-scale model and/or a huge amount of carefully curated data, still struggling with batch effects. We present Cross-Well Aligned Masked Siamese Network (CWA-MSN), a novel representation learning framework that aligns embeddings of cells subjected to the same perturbation across different wells, enforcing semantic consistency despite batch effects. Integrated into a masked siamese architecture, this alignment yields features that capture fine-grained morphology while remaining data- and parameter-efficient. For instance, in a gene-gene relationship retrieval benchmark, CWA-MSN outperforms the state-of-the-art publicly available self-supervised (OpenPhenom) and contrastive learning (CellCLIP) methods, improving the benchmark scores by +29% and +9%, respectively, while training on substantially fewer data (e.g., 0.2M images for CWA-MSN vs. 2.2M images for OpenPhenom) or smaller model size (e.g., 22M parameters for CWA-MSN vs. 1.48B parameters for CellCLIP). Extensive experiments demonstrate that CWA-MSN is a simple and effective way to learn cell image representation, enabling efficient phenotype modeling even under limited data and parameter budgets.

[120] Aerial-Ground Image Feature Matching via 3D Gaussian Splatting-based Intermediate View Rendering

Jiangxue Yu,Hui Wang,San Jiang,Xing Zhang,Dejin Zhang,Qingquan Li

Main category: cs.CV

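TL;DR: Proposes a feature matching algorithm that generates intermediate views to alleviate perspective distortion, for 3D modeling with aerial and ground images.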

Details Motivation: Large viewpoint differences between aerial and ground images make feature matching difficult, which limits 3D reconstruction of complex scenes. Method: Sparse models are first built from aerial images with incremental SfM; 3D Gaussian Splatting is then used for scene rendering, with a render viewpoint determination algorithm that generates high-quality intermediate images; finally, the intermediate images are used to obtain reliable feature matches between aerial and ground images. Result: Experiments show an obvious increase in the number of initial and refined matches, reliable feature matching on real datasets, and support for accurate ISfM reconstruction and complete 3DGS scene rendering. Conclusion: The method addresses feature matching between aerial and ground images under large viewpoint changes and improves the accuracy and completeness of 3D modeling. Abstract: The integration of aerial and ground images has been a promising solution in 3D modeling of complex scenes, which is seriously restricted by finding reliable correspondences. The primary contribution of this study is a feature matching algorithm for aerial and ground images, whose core idea is to generate intermediate views to alleviate perspective distortions caused by the extensive viewpoint changes. First, by using aerial images only, sparse models are reconstructed through an incremental SfM (Structure from Motion) engine due to their large scene coverage. Second, 3D Gaussian Splatting is adopted for scene rendering, taking sparse points and oriented images as inputs. For accurate view rendering, a render viewpoint determination algorithm is designed by using the oriented camera poses of aerial images, which is used to generate high-quality intermediate images that can bridge the gap between aerial and ground images. Third, with the aid of intermediate images, reliable feature matching is conducted for match pairs from render-aerial and render-ground images, and final matches can be generated by transmitting correspondences through intermediate views. Using real aerial and ground datasets, the proposed solution has been validated in terms of feature matching and scene rendering and compared comprehensively with widely used methods. The experimental results demonstrate that the proposed solution can provide reliable feature matches for aerial and ground images with an obvious increase in the number of initial and refined matches, and it can provide enough matches to achieve accurate ISfM reconstruction and complete 3DGS-based scene rendering.
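The final step, transmitting correspondences through an intermediate view, amounts to composing two sets of matches that share keypoints in the rendered image. The sketch below assumes keypoints are identified by integer indices and that a rendered keypoint matches at most one keypoint in each other image; a real pipeline would likely associate keypoints by image location within a tolerance. All names are hypothetical.

```python
def transmit_matches(render_to_aerial, render_to_ground):
    """Compose matches that share the same rendered-image keypoint.

    Both inputs map a rendered-image keypoint index to the index of its
    match in the aerial or ground image. The result is a sorted list of
    (aerial_keypoint, ground_keypoint) pairs.
    """
    return sorted(
        (aerial_kp, render_to_ground[render_kp])
        for render_kp, aerial_kp in render_to_aerial.items()
        if render_kp in render_to_ground
    )

render_aerial = {0: 10, 1: 11, 2: 12}  # rendered kp -> aerial kp
render_ground = {1: 21, 2: 22, 3: 23}  # rendered kp -> ground kp
aerial_ground = transmit_matches(render_aerial, render_ground)
```

Only rendered keypoints matched on both sides survive, so the intermediate view acts as a filter as well as a bridge between the two viewpoints.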

[121] CapStARE: Capsule-based Spatiotemporal Architecture for Robust and Efficient Gaze Estimation

Miren Samaniego,Igor Rodriguez,Elena Lazkano

Main category: cs.CV

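TL;DR: Proposes CapStARE, a capsule-based spatiotemporal architecture for real-time gaze estimation that combines a ConvNeXt backbone, capsules with attention routing, and dual GRU decoders, reaching SOTA performance on multiple datasets with good generalization and interpretability.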

Details Motivation: To achieve efficient, robust, and interpretable real-time gaze estimation and to address the limited modeling capacity in complex scenes and the large parameter counts of existing methods. Method: ConvNeXt serves as the backbone, capsules with an attention mechanism perform part-whole reasoning, and dual GRU decoders separately model slow and rapid gaze dynamics. Result: The model performs strongly on ETH-XGaze (3.36), MPIIFaceGaze (2.65), Gaze360 (9.06), and RT-GENE (4.76), with inference time under 10 ms, fewer parameters, and good interpretability. Conclusion: CapStARE is a practical, efficient solution for real-time gaze estimation that balances performance, speed, and generalization and suits interactive systems. Abstract: We introduce CapStARE, a capsule-based spatio-temporal architecture for gaze estimation that integrates a ConvNeXt backbone, capsule formation with attention routing, and dual GRU decoders specialized for slow and rapid gaze dynamics. This modular design enables efficient part-whole reasoning and disentangled temporal modeling, achieving state-of-the-art performance on ETH-XGaze (3.36) and MPIIFaceGaze (2.65) while maintaining real-time inference (< 10 ms). The model also generalizes well to unconstrained conditions in Gaze360 (9.06) and human-robot interaction scenarios in RT-GENE (4.76), outperforming or matching existing methods with fewer parameters and greater interpretability. These results demonstrate that CapStARE offers a practical and robust solution for real-time gaze estimation in interactive systems. The related code and results for this article can be found at: https://github.com/toukapy/capsStare

[122] GS-RoadPatching: Inpainting Gaussians via 3D Searching and Placing for Driving Scenes

Guo Chen,Jiarun Liu,Sicong Du,Chenming Wu,Deqi Li,Shi-Sheng Huang,Guofeng Zhang,Sheng Yang

Main category: cs.CV

TL;DR: 本文提出了一种名为GS-RoadPatching的驾驶场景修复方法,利用3D高斯点阵(3DGS)进行基于结构匹配的替代性场景补全与编辑,无需依赖2D跨模态生成模型或重训练,实现了高效、高质量的3D场景修复。

Details Motivation: Existing 3DGS inpainting methods rely on 2D perspective-view diffusion or GAN models, are limited to predicted appearance or depth cues, and require spatial-temporal consistency, which makes them inefficient. The paper aims to overcome these limits by performing substitutional inpainting directly in the 3DGS modality. Method: Feature-embedded 3DGS scenes are constructed, a patch measurement method abstracts local context at multiple scales, a structural search strategy finds candidate patches in 3D space, and a substitution-and-fusion optimization improves visual harmony. Result: Experiments on multiple public datasets show that the method outperforms baselines in inpainting quality and interoperability, reaching SOTA; experiments on general scenes also confirm its applicability. Conclusion: By performing structural matching and substitutional inpainting within the 3DGS space, GS-RoadPatching completes repetitive structures in driving scenes and is efficient, retraining-free, and applicable across scenes. Abstract: This paper presents GS-RoadPatching, an inpainting method for driving scene completion by referring to completely reconstructed regions, which are represented by 3D Gaussian Splatting (3DGS). Unlike existing 3DGS inpainting methods that perform generative completion relying on 2D perspective-view-based diffusion or GAN models to predict limited appearance or depth cues for missing regions, our approach enables substitutional scene inpainting and editing directly through the 3DGS modality, extricating it from requiring spatial-temporal consistency of 2D cross-modals and eliminating the need for time-intensive retraining of Gaussians. Our key insight is that the highly repetitive patterns in driving scenes often share multi-modal similarities within the implicit 3DGS feature space and are particularly suitable for structural matching to enable effective 3DGS-based substitutional inpainting. Practically, we construct feature-embedded 3DGS scenes to incorporate a patch measurement method for abstracting local context at different scales and, subsequently, propose a structural search method to find candidate patches in 3D space effectively. Finally, we propose a simple yet effective substitution-and-fusion optimization for better visual harmony. We conduct extensive experiments on multiple publicly available datasets to demonstrate the effectiveness and efficiency of our proposed method in driving scenes, and the results validate that our method achieves state-of-the-art performance compared to the baseline methods in terms of both quality and interoperability. Additional experiments in general scenes also demonstrate the applicability of the proposed 3D inpainting strategy. The project page and code are available at: https://shanzhaguoo.github.io/GS-RoadPatching/

[123] Interpreting ResNet-based CLIP via Neuron-Attention Decomposition

Edmund Bu,Yossi Gandelsman

Main category: cs.CV

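TL;DR: Proposes a new technique for interpreting neurons in CLIP-ResNet by decomposing their contributions to the output into individual computation paths. It finds that neuron-attention-head pairs can be approximated by a single direction in the embedding space and can be used for training-free semantic segmentation and for monitoring dataset distribution shifts.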

Details Motivation: Understand how neurons in CLIP-ResNet take part in the inference of the vision-language model, and find interpretable computational units. Method: Analyze all pairwise combinations of neurons and the subsequent attention heads, decompose their contributions into individual computation paths, and look for interpretable directions in the image-text embedding space. Result: A sparse set of neuron-head pairs dominates the contribution to the output, and some pairs, although polysemantic, represent sub-concepts. The pairs are applied to training-free semantic segmentation, where they outperform previous methods for CLIP-ResNet, and to monitoring dataset distribution shifts. Conclusion: Analyzing individual computation paths reveals interpretable functional units in neural networks, and these units can be used for practical downstream tasks. Abstract: We present a novel technique for interpreting the neurons in CLIP-ResNet by decomposing their contributions to the output into individual computation paths. More specifically, we analyze all pairwise combinations of neurons and the following attention heads of CLIP's attention-pooling layer. We find that these neuron-head pairs can be approximated by a single direction in CLIP-ResNet's image-text embedding space. Leveraging this insight, we interpret each neuron-head pair by associating it with text. Additionally, we find that only a sparse set of the neuron-head pairs have a significant contribution to the output value, and that some neuron-head pairs, while polysemantic, represent sub-concepts of their corresponding neurons. We use these observations for two applications. First, we employ the pairs for training-free semantic segmentation, outperforming previous methods for CLIP-ResNet. Second, we utilize the contributions of neuron-head pairs to monitor dataset distribution shifts. Our results demonstrate that examining individual computation paths in neural networks uncovers interpretable units, and that such units can be utilized for downstream tasks.

[124] When Words Can't Capture It All: Towards Video-Based User Complaint Text Generation with Multimodal Video Complaint Dataset

Sarmistha Das,R E Zera Marveen Lyngkhoi,Kirtan Jain,Vinayak Goyal,Sriparna Saha,Manish Gupta

Main category: cs.CV

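TL;DR: This paper introduces a new task, Complaint Description from Videos (CoD-V), which aims to help users express complaints more clearly through video, and releases ComVID, a dataset of 1,175 complaint videos with descriptions.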

Details Motivation: Users struggle to express complaints clearly in text but can show product defects directly in a video, so a method is needed that turns video content into expressive complaint descriptions. Method: Introduce the ComVID dataset and a new complaint retention (CR) evaluation metric, and augment the VideoLLaMA2-7b model with multimodal Retrieval-Augmented Generation (RAG) so that it generates complaint descriptions that account for the user's emotional state. Result: Several video language models are evaluated comprehensively; the proposed method is reported to perform well on METEOR, perplexity, readability, and other metrics, and the CR metric separates the complaint description task from standard video description tasks. Conclusion: The study opens a new research direction of generating complaint descriptions from video and provides a platform and resources for users to express complaints through video. Abstract: While there exists a lot of work on explainable complaint mining, articulating user concerns through text or video remains a significant challenge, often leaving issues unresolved. Users frequently struggle to express their complaints clearly in text but can easily upload videos depicting product defects (e.g., vague text such as `worst product' paired with a 5-second video depicting a broken headphone with the right earcup). This paper formulates a new task in the field of complaint mining to aid the common users' need to write an expressive complaint, which is Complaint Description from Videos (CoD-V) (e.g., to help the above user articulate her complaint about the defective right earcup). To this end, we introduce ComVID, a video complaint dataset containing 1,175 complaint videos and the corresponding descriptions, also annotated with the emotional state of the complainer. Additionally, we present a new complaint retention (CR) evaluation metric that discriminates the proposed (CoD-V) task against standard video summary generation and description tasks. To strengthen this initiative, we introduce a multimodal Retrieval-Augmented Generation (RAG) embedded VideoLLaMA2-7b model, designed to generate complaints while accounting for the user's emotional state. We conduct a comprehensive evaluation of several Video Language Models on several tasks (pre-trained and fine-tuned versions) with a range of established evaluation metrics, including METEOR, perplexity, and the Coleman-Liau readability score, among others. Our study lays the foundation for a new research direction to provide a platform for users to express complaints through video. 
Dataset and resources are available at: https://github.com/sarmistha-D/CoD-V.

[125] SynchroRaMa : Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding

Phyo Thet Yee,Dimitrios Kollias,Sudeepta Mishra,Abhinav Dhall

Main category: cs.CV

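TL;DR: This paper proposes SynchroRaMa, a new framework that generates more expressive and realistic talking face videos by combining multi-modal emotion embeddings from text and audio. It uses speech, sentiment analysis, and scene descriptions generated by a large language model to improve emotional expression, the naturalness of head motion, and lip synchronization, and it outperforms existing methods on multiple metrics.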

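Details Motivation: Most existing emotion-aware talking face generation methods model emotion from a single modality (audio or image) and usually condition on a single reference image, which makes it hard to capture subtle emotional cues and dynamic changes. A method is therefore needed that fuses multi-modal emotion information and supports the generation of dynamic attributes. Method: SynchroRaMa combines emotional signals from text (sentiment analysis) and audio (speech emotion recognition and valence-arousal features) into a multi-modal emotion embedding; adds an audio-to-motion (A2M) module for accurate lip synchronization and natural head motion; and uses scene descriptions generated by a large language model as additional textual input to better model dynamic actions and high-level semantics. Result: Quantitative and qualitative experiments on benchmark datasets show that SynchroRaMa outperforms state-of-the-art methods in image quality, expression preservation, and motion realism; a user study shows higher subjective ratings for overall naturalness, motion diversity, and video smoothness. Conclusion: By fusing multi-modal emotion information with text-driven scene descriptions, SynchroRaMa markedly improves the emotional expressiveness and temporal consistency of talking face videos, offering an effective approach to generating more natural virtual characters for future human-computer interaction. Abstract: Audio-driven talking face generation has received growing interest, particularly for applications requiring expressive and natural human-avatar interaction. However, most existing emotion-aware methods rely on a single modality (either audio or image) for emotion embedding, limiting their ability to capture nuanced affective cues. Additionally, most methods condition on a single reference image, restricting the model's ability to represent dynamic changes in actions or attributes across time. To address these issues, we introduce SynchroRaMa, a novel framework that integrates a multi-modal emotion embedding by combining emotional signals from text (via sentiment analysis) and audio (via speech-based emotion recognition and audio-derived valence-arousal features), enabling the generation of talking face videos with richer and more authentic emotional expressiveness and fidelity. To ensure natural head motion and accurate lip synchronization, SynchroRaMa includes an audio-to-motion (A2M) module that generates motion frames aligned with the input audio. Finally, SynchroRaMa incorporates scene descriptions generated by a Large Language Model (LLM) as additional textual input, enabling it to capture dynamic actions and high-level semantic attributes. Conditioning the model on both visual and textual cues enhances temporal consistency and visual realism. 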
Quantitative and qualitative experiments on benchmark datasets demonstrate that SynchroRaMa outperforms the state-of-the-art, achieving improvements in image quality, expression preservation, and motion realism. A user study further confirms that SynchroRaMa achieves higher subjective ratings than competing methods in overall naturalness, motion diversity, and video smoothness. Our project page is available at .

[126] OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving

Pei Liu,Hongliang Lu,Haichao Liu,Haipeng Liu,Xin Liu,Ruoyu Yao,Shengbo Eben Li,Jun Ma

Main category: cs.CV

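TL;DR: Proposes OmniScene, a human-like framework that achieves 4D scene understanding through a vision-language model and a hierarchical fusion strategy, surpassing existing methods on the nuScenes dataset.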

Details Motivation: Current autonomous driving systems lack human-like three-dimensional scene understanding and rely mainly on depth-based 3D reconstruction rather than true scene understanding. Method: Propose the OmniScene framework, which comprises the OmniVLM vision-language model and a teacher-student architecture for knowledge distillation, and embeds textual representations into 3D instance features; a hierarchical fusion strategy adaptively integrates geometric and semantic features. Result: Evaluation on the nuScenes dataset shows that the method outperforms more than ten state-of-the-art models on perception, prediction, planning, and visual question answering. Conclusion: OmniScene realizes a more human-like perception-understanding-action architecture, improves the fusion of multimodal information, and advances scene understanding in autonomous driving systems. Abstract: Human vision is capable of transforming two-dimensional observations into an egocentric three-dimensional scene understanding, which underpins the ability to translate complex scenes and exhibit adaptive behaviors. This capability, however, remains lacking in current autonomous driving systems, where mainstream approaches primarily rely on depth-based 3D reconstruction rather than true scene understanding. To address this limitation, we propose a novel human-like framework called OmniScene. First, we introduce the OmniScene Vision-Language Model (OmniVLM), a vision-language framework that integrates multi-view and temporal perception for holistic 4D scene understanding. Then, harnessing a teacher-student OmniVLM architecture and knowledge distillation, we embed textual representations into 3D instance features for semantic supervision, enriching feature learning, and explicitly capturing human-like attentional semantics. These feature representations are further aligned with human driving behaviors, forming a more human-like perception-understanding-action architecture. In addition, we propose a Hierarchical Fusion Strategy (HFS) to address imbalances in modality contributions during multimodal integration. Our approach adaptively calibrates the relative significance of geometric and semantic features at multiple abstraction levels, enabling the synergistic use of complementary cues from visual and textual modalities. This learnable dynamic fusion enables a more nuanced and effective exploitation of heterogeneous information. We evaluate OmniScene comprehensively on the nuScenes dataset, benchmarking it against over ten state-of-the-art models across various tasks. 
Our approach consistently achieves superior results, establishing new benchmarks in perception, prediction, planning, and visual question answering.

[127] CamPVG: Camera-Controlled Panoramic Video Generation with Epipolar-Aware Diffusion

Chenhao Ji,Chaohui Yu,Junyao Gao,Fan Wang,Cairong Zhao

Main category: cs.CV

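TL;DR: This paper proposes CamPVG, the first diffusion-based framework for panoramic video generation guided by precise camera poses, producing geometrically consistent panoramic videos.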

Details Motivation: Existing methods focus mainly on camera control for perspective-projection video generation, whereas panoramic video generation faces geometric-consistency difficulties in pose representation and spherical projection. Method: Propose a panoramic Plücker embedding that encodes camera extrinsic parameters through a spherical coordinate transformation; introduce a spherical epipolar module that imposes geometric constraints through adaptive attention masking along epipolar lines, enabling fine-grained cross-view feature aggregation. Result: Experiments show that the method generates high-quality panoramic videos that closely follow the camera trajectory, clearly outperforming existing methods in generation quality and consistency. Conclusion: CamPVG effectively addresses geometric consistency in panoramic video generation and offers a new solution for camera-controlled panoramic video generation. Abstract: Recently, camera-controlled video generation has seen rapid development, offering more precise control over video generation. However, existing methods predominantly focus on camera control in perspective projection video generation, while geometrically consistent panoramic video generation remains challenging. This limitation is primarily due to the inherent complexities in panoramic pose representation and spherical projection. To address this issue, we propose CamPVG, the first diffusion-based framework for panoramic video generation guided by precise camera poses. We achieve camera position encoding for panoramic images and cross-view feature aggregation based on spherical projection. Specifically, we propose a panoramic Plücker embedding that encodes camera extrinsic parameters through spherical coordinate transformation. This pose encoder effectively captures panoramic geometry, overcoming the limitations of traditional methods when applied to equirectangular projections. Additionally, we introduce a spherical epipolar module that enforces geometric constraints through adaptive attention masking along epipolar lines. This module enables fine-grained cross-view feature aggregation, substantially enhancing the quality and consistency of generated panoramic videos. Extensive experiments demonstrate that our method generates high-quality panoramic videos consistent with camera trajectories, far surpassing existing methods in panoramic video generation.
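
As a rough illustration of what a per-pixel Plücker ray embedding for an equirectangular frame could look like, the sketch below maps a pixel to a unit ray through longitude and latitude and then forms the standard Plücker coordinates (origin × direction, direction). The axis convention, the camera-to-world rotation `R`, and all function names are assumptions made for this example; the abstract does not specify CamPVG's actual formulation.

```python
import math

def equirect_ray_direction(u, v, width, height):
    """Unit ray direction in the camera frame for an equirectangular pixel.

    Assumed convention (not from the paper): longitude spans [-pi, pi] across
    the image width, latitude spans [pi/2, -pi/2] down the image height, and
    the image center looks along +z.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    return [math.cos(lat) * math.sin(lon), math.sin(lat), math.cos(lat) * math.cos(lon)]

def mat_vec(R, x):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(R[i][j] * x[j] for j in range(3)) for i in range(3)]

def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def plucker_embedding(u, v, width, height, R, origin):
    """6-D Plucker coordinates (moment, direction) of the ray through a pixel.

    R is a hypothetical camera-to-world rotation; origin is the camera center
    in world coordinates.
    """
    direction = mat_vec(R, equirect_ray_direction(u, v, width, height))
    moment = cross(origin, direction)
    return moment + direction

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
center_ray = plucker_embedding(512, 256, 1024, 512, identity, origin=[1.0, 0.0, 0.0])
```

Because every pixel of a panorama gets a well-defined direction on the sphere, this kind of embedding covers the full field of view, which a pinhole-intrinsics formulation cannot do.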

[128] SDE-DET: A Precision Network for Shatian Pomelo Detection in Complex Orchard Environments

Yihao Hu,Pan Wang,Xiaodong Bai,Shijie Cai,Hang Wang,Huazhong Liu,Aiping Yang,Xiangxiang Li,Meiping Ding,Hongyan Liu,Jianguo Yao

Main category: cs.CV

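TL;DR: This paper proposes SDE-DET, a model for detecting Shatian pomelos in complex orchard environments, and builds a custom dataset, STP-AgriData. The model combines Star Block, Deformable Attention, and multi-scale attention mechanisms to handle occlusion, small objects, and scale variation. It reaches SOTA performance on multiple metrics and provides reliable technical support for automated harvesting robots.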

Details Motivation: In complex orchard environments, Shatian pomelo detection faces challenges such as scale variation, occlusion by branches and leaves, and small objects. Traditional methods struggle to meet high-precision requirements, so a more robust detector is needed to support automated harvesting and maturity analysis. Method: Propose the SDE-DET model: Star Block acquires high-dimensional information, Deformable Attention strengthens detection under occlusion, and multiple Efficient Multi-Scale Attention mechanisms reduce computational overhead and improve small-object detection; a dedicated dataset, STP-AgriData, is built for training and validation. Result: On STP-AgriData, SDE-DET achieves a Precision of 0.883, Recall of 0.771, mAP@0.5 of 0.838, mAP@0.5:0.95 of 0.497, and F1-score of 0.823, outperforming the YOLO series and other mainstream detectors and reaching state-of-the-art performance. Conclusion: SDE-DET performs well on Shatian pomelo detection and copes with the difficulties of complex orchard environments, laying a foundation for agricultural automation and, in particular, for robotic harvesting systems. Abstract: Pomelo detection is an essential process for their localization, automated robotic harvesting, and maturity analysis. However, detecting Shatian pomelo in complex orchard environments poses significant challenges, including multi-scale issues, obstructions from trunks and leaves, small object detection, etc. To address these issues, this study constructs a custom dataset STP-AgriData and proposes the SDE-DET model for Shatian pomelo detection. SDE-DET first utilizes the Star Block to effectively acquire high-dimensional information without increasing the computational overhead. Furthermore, the presented model adopts Deformable Attention in its backbone, to enhance its ability to detect pomelos under occluded conditions. Finally, multiple Efficient Multi-Scale Attention mechanisms are integrated into our model to reduce the computational overhead and extract deep visual representations, thereby improving the capacity for small object detection. In the experiment, we compared SDE-DET with the YOLO series and other mainstream detection models in Shatian pomelo detection. The presented SDE-DET model achieved scores of 0.883, 0.771, 0.838, 0.497, and 0.823 in Precision, Recall, mAP@0.5, mAP@0.5:0.95 and F1-score, respectively. SDE-DET has achieved state-of-the-art performance on the STP-AgriData dataset. Experiments indicate that the SDE-DET provides a reliable method for Shatian pomelo detection, laying the foundation for the further development of automatic harvest robots.
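
The reported F1-score can be cross-checked against the reported Precision and Recall, since F1 is their harmonic mean. The short check below uses only the numbers quoted in the abstract; the function and variable names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values quoted in the abstract for SDE-DET on STP-AgriData.
reported_precision = 0.883
reported_recall = 0.771
reported_f1 = 0.823

computed_f1 = f1_score(reported_precision, reported_recall)
print(round(computed_f1, 3))  # 0.823
```

The computed value matches the reported 0.823 to three decimals, so the three figures are mutually consistent.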

[129] Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models

Zhifang Zhang,Jiahan Zhang,Shengjie Zhou,Qi Wei,Shuo He,Feng Liu,Lei Feng

Main category: cs.CV

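TL;DR: This paper proposes Proxy Targeted Attack (PTA), a new method that uses multiple source-modal and target-modal proxies to craft adversarial examples that generalize better and are harder to detect, improving targeted attacks on multimodal pre-trained models.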

Details Motivation: Existing targeted attacks on multimodal pre-trained models are limited in generalizability and undetectability: they generalize poorly to partially known or semantically similar targets and are easily caught by anomaly detection methods. Method: Propose Proxy Targeted Attack (PTA), which uses multiple source-modal and target-modal proxies to optimize adversarial examples, together with theoretical analysis that balances generalizability against undetectability. Result: Experiments show that PTA achieves a high attack success rate across various related targets while evading multiple anomaly detection methods, demonstrating good generalizability and undetectability. Conclusion: PTA markedly improves the practicality and robustness of targeted attacks on multimodal pre-trained models and offers a new angle for evaluating the security of such models. Abstract: Multimodal pre-trained models (e.g., ImageBind), which align distinct data modalities into a shared embedding space, have shown remarkable success across downstream tasks. However, their increasing adoption raises serious security concerns, especially regarding targeted adversarial attacks. In this paper, we show that existing targeted adversarial attacks on multimodal pre-trained models still have limitations in two aspects: generalizability and undetectability. Specifically, the crafted targeted adversarial examples (AEs) exhibit limited generalization to partially known or semantically similar targets in cross-modal alignment tasks (i.e., limited generalizability) and can be easily detected by simple anomaly detection methods (i.e., limited undetectability). To address these limitations, we propose a novel method called Proxy Targeted Attack (PTA), which leverages multiple source-modal and target-modal proxies to optimize targeted AEs, ensuring they remain evasive to defenses while aligning with multiple potential targets. We also provide theoretical analyses to highlight the relationship between generalizability and undetectability and to ensure optimal generalizability while meeting the specified requirements for undetectability. Furthermore, experimental results demonstrate that our PTA can achieve a high success rate across various related targets and remain undetectable against multiple anomaly detection methods.

[130] Anomaly Detection by Clustering DINO Embeddings using a Dirichlet Process Mixture

Nico Schulthess,Ender Konukoglu

Main category: cs.CV

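TL;DR: Proposes combining DINOv2 embeddings with a Dirichlet Process Mixture Model (DPMM) for unsupervised anomaly detection in medical images, keeping performance high while substantially reducing computation time at inference.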

Details Motivation: Existing memory-bank-based anomaly detection methods are computationally expensive on large medical datasets and do not scale, so a more efficient, adaptive way to model the distribution of normal features at scale is needed. Method: Generate embeddings with a DINOv2 model pre-trained on natural images and model the distribution of normal-sample embeddings with a non-parametric Dirichlet Process Mixture Model (DPMM); the similarity between test embeddings and the mixture component centers serves as the anomaly score and yields a coarse anomaly segmentation mask. Result: The method is competitive on medical-image anomaly detection benchmarks even though DINOv2 was trained on natural images; computation time at inference is at least halved; normalized DINOv2 embeddings align better with anatomical structures than unnormalized features and are better suited to anomaly detection. Conclusion: A DPMM models the distribution of DINOv2 embeddings effectively and gives efficient unsupervised anomaly detection whose model complexity adjusts automatically; it suits large-scale medical imaging data, and normalized embeddings improve alignment with anatomical structures and detection robustness. Abstract: In this work, we leverage informative embeddings from foundational models for unsupervised anomaly detection in medical imaging. For small datasets, a memory-bank of normative features can directly be used for anomaly detection which has been demonstrated recently. However, this is unsuitable for large medical datasets as the computational burden increases substantially. Therefore, we propose to model the distribution of normative DINOv2 embeddings with a Dirichlet Process Mixture model (DPMM), a non-parametric mixture model that automatically adjusts the number of mixture components to the data at hand. Rather than using a memory bank, we use the similarity between the component centers and the embeddings as anomaly score function to create a coarse anomaly segmentation mask. Our experiments show that through DPMM embeddings of DINOv2, despite being trained on natural images, achieve very competitive anomaly detection performance on medical imaging benchmarks and can do this while at least halving the computation time at inference. Our analysis further indicates that normalized DINOv2 embeddings are generally more aligned with anatomical structures than unnormalized features, even in the presence of anomalies, making them great representations for anomaly detection. The code is available at https://github.com/NicoSchulthess/anomalydino-dpmm.
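
To make the scoring step concrete, here is a minimal sketch of an anomaly score based on similarity to mixture component centers. It assumes cosine similarity on L2-normalized embeddings and defines the score as one minus the best similarity; the abstract only says that similarity to the component centers is used, so the exact function, the toy 2-D vectors, and all names are assumptions. In practice the centers would come from a fitted DPMM rather than being written by hand.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

def anomaly_score(embedding, centers):
    """One minus the highest cosine similarity to any normative center."""
    return 1.0 - max(cosine_similarity(embedding, c) for c in centers)

# Toy 2-D stand-ins for component centers and patch embeddings.
centers = [[1.0, 0.0], [0.0, 1.0]]
normal_patch = [0.9, 0.1]    # close to the first center
odd_patch = [-1.0, -1.0]     # far from both centers

scores = [anomaly_score(p, centers) for p in (normal_patch, odd_patch)]
```

Scoring against a handful of centers costs one comparison per component, whereas a memory bank requires a nearest-neighbor search over every stored feature, which is where the inference-time saving would come from.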

[131] Table Detection with Active Learning

Somraj Gautam,Nachiketa Purohit,Gaurav Harit

Main category: cs.CV

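TL;DR: Proposes an active learning method that incorporates diversity to reduce annotation cost in object detection while keeping performance comparable to fully supervised models.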

Details Motivation: Reduce the annotation cost of object detection and overcome the limitations of traditional uncertainty-based sampling. Method: Introduce a diversity strategy into active learning to select representative, informative samples for annotation and improve model generalization. Result: Validated on the TableBank-LaTeX and TableBank-Word datasets with CascadeTabNet and YOLOv9; compared with random sampling, the method achieves higher mAP under the same annotation budget. Conclusion: The method reduces annotation effort while maintaining good detection performance and clearly outperforms random sampling. Abstract: Efficient data annotation remains a critical challenge in machine learning, particularly for object detection tasks requiring extensive labeled data. Active learning (AL) has emerged as a promising solution to minimize annotation costs by selecting the most informative samples. While traditional AL approaches primarily rely on uncertainty-based selection, recent advances suggest that incorporating diversity-based strategies can enhance sampling efficiency in object detection tasks. Our approach ensures the selection of representative examples that improve model generalization. We evaluate our method on two benchmark datasets (TableBank-LaTeX, TableBank-Word) using state-of-the-art table detection architectures, CascadeTabNet and YOLOv9. Our results demonstrate that AL-based example selection significantly outperforms random sampling, reducing annotation effort given a limited budget while maintaining comparable performance to fully supervised models. Our method achieves higher mAP scores within the same annotation budget.
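
The abstract does not say which diversity criterion is used. As one common example of diversity-based selection, the sketch below implements k-center greedy sampling: at each step it picks the unlabeled sample farthest from everything already chosen. This is a substituted technique shown purely for illustration, and the feature vectors and names are hypothetical.

```python
import math

def k_center_greedy(features, labeled_idx, budget):
    """Pick `budget` indices, each time taking the point whose distance to its
    nearest already-selected point is largest. Requires a non-empty labeled_idx.
    """
    selected = list(labeled_idx)
    # Distance from every point to its nearest selected point.
    min_dist = [min(math.dist(f, features[j]) for j in selected) for f in features]
    picks = []
    for _ in range(budget):
        i = max(range(len(features)), key=lambda k: min_dist[k])
        picks.append(i)
        selected.append(i)
        # Update nearest-selected distances with the newly picked point.
        min_dist = [min(d, math.dist(f, features[i])) for d, f in zip(min_dist, features)]
    return picks

# Hypothetical 2-D image features: two tight clusters and one isolated point.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 9.0]]
picks = k_center_greedy(features, labeled_idx=[0], budget=2)
print(picks)  # [4, 3]
```

With index 0 already labeled, the isolated point and one member of the far cluster are chosen before the near-duplicate at index 1, which is the behavior a diversity criterion is meant to produce.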

[132] Does the Manipulation Process Matter? RITA: Reasoning Composite Image Manipulations via Reversely-Ordered Incremental-Transition Autoregression

Xuekang Zhu,Ji-Zhe Zhou,Kaiwen Feng,Chenfan Qu,Yunfei Wang,Liting Zhou,Jian liu

Main category: cs.CV

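TL;DR: This paper is the first to reformulate image manipulation localization as a conditional sequence prediction task. It proposes the RITA framework, which predicts manipulated regions layer by layer and explicitly models the temporal dependencies and hierarchical structure among editing operations.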

Details Motivation: Existing image manipulation localization methods use a one-shot prediction paradigm that ignores the sequential and hierarchical nature of complex editing processes, causing severe dimensional collapse and a mismatch with the nature of the task. Method: Propose the RITA framework, which treats manipulation localization as conditional sequence prediction: it predicts manipulated regions layer by layer in order, using each step's output as the condition for the next; also construct a new multi-step manipulation benchmark, HSIM, and an evaluation metric, HSS. Result: Experiments show that RITA reaches SOTA performance on traditional benchmarks and performs strongly on the newly proposed hierarchical localization task, supporting its potential as a general and effective paradigm. Conclusion: By modeling the temporal and hierarchical structure of the manipulation process, RITA markedly improves localization performance and offers a new paradigm for image manipulation detection. Abstract: Image manipulations often entail a complex manipulation process, comprising a series of editing operations to create a deceptive image, exhibiting sequentiality and hierarchical characteristics. However, existing IML methods remain manipulation-process-agnostic, directly producing localization masks in a one-shot prediction paradigm without modeling the underlying editing steps. This one-shot paradigm compresses the high-dimensional compositional space into a single binary mask, inducing severe dimensional collapse, thereby creating a fundamental mismatch with the intrinsic nature of the IML task. To address this, we are the first to reformulate image manipulation localization as a conditional sequence prediction task, proposing the RITA framework. RITA predicts manipulated regions layer-by-layer in an ordered manner, using each step's prediction as the condition for the next, thereby explicitly modeling temporal dependencies and hierarchical structures among editing operations. To enable training and evaluation, we synthesize multi-step manipulation data and construct a new benchmark HSIM. We further propose the HSS metric to assess sequential order and hierarchical alignment. Extensive experiments show RITA achieves SOTA on traditional benchmarks and provides a solid foundation for the novel hierarchical localization task, validating its potential as a general and effective paradigm. The code and dataset will be publicly available.

[133] PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction

Manahil Raza,Ayesha Azam,Talha Qaiser,Nasir Rajpoot

Main category: cs.CV

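TL;DR: This paper proposes PS3, a three-modal Transformer model that improves cancer survival prediction by integrating prototype representations of pathology reports, whole slide images, and transcriptomic data.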

Details Motivation: Existing fusion methods mainly combine imaging with genomic data, while pathology reports, which carry rich clinical interpretation, have not been fully used in multimodal fusion. The authors hypothesize that adding pathology reports improves prognostic prediction. Method: Propose a prototype-based representation: diagnostic prototypes are extracted from pathology reports (using self-attention), histological prototypes from WSIs, and biological pathway prototypes from transcriptomic data; a Transformer model fuses the three kinds of prototypes for survival prediction. Result: PS3 outperforms state-of-the-art clinical, unimodal, and multimodal baselines on six TCGA datasets, confirming the value of adding pathology reports. Conclusion: By building balanced multimodal prototype representations, PS3 addresses the imbalance problem in fusing heterogeneous modalities and markedly improves survival prediction, demonstrating the value of pathology reports in computational oncology. Abstract: Current multimodal fusion approaches in computational oncology primarily focus on integrating multi-gigapixel histology whole slide images (WSIs) with genomic or transcriptomic data, demonstrating improved survival prediction. We hypothesize that incorporating pathology reports can further enhance prognostic performance. Pathology reports, as essential components of clinical workflows, offer readily available complementary information by summarizing histopathological findings and integrating expert interpretations and clinical context. However, fusing these modalities poses challenges due to their heterogeneous nature. WSIs are high-dimensional, each containing several billion pixels, whereas pathology reports consist of concise text summaries of varying lengths, leading to potential modality imbalance. To address this, we propose a prototype-based approach to generate balanced representations, which are then integrated using a Transformer-based fusion model for survival prediction that we term PS3 (Predicting Survival from Three Modalities). Specifically, we present: (1) Diagnostic prototypes from pathology reports, leveraging self-attention to extract diagnostically relevant sections and standardize text representation; (2) Histological prototypes to compactly represent key morphological patterns in WSIs; and (3) Biological pathway prototypes to encode transcriptomic expressions, accurately capturing cellular functions. PS3, the three-modal transformer model, processes the resulting prototype-based multimodal tokens and models intra-modal and cross-modal interactions across pathology reports, WSIs and transcriptomic data. 
The proposed model outperforms state-of-the-art methods when evaluated against clinical, unimodal and multimodal baselines on six datasets from The Cancer Genome Atlas (TCGA). The code is available at: https://github.com/manahilr/PS3.

[134] Generative Adversarial Networks Applied for Privacy Preservation in Biometric-Based Authentication and Identification

Lubos Mjachky,Ivan Homoliak

Main category: cs.CV

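TL;DR: Proposes a new authentication method based on a generative adversarial network (GAN) that translates face images into a visually private domain (such as flowers or shoes), protecting user privacy while keeping authentication effective.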

Details Motivation: Existing biometric authentication systems give users no control over how their data is used, and the data may leak or be misused. Method: Use a generative adversarial network (GAN) to translate face images into a visually private domain, and train the authentication classifiers on images from that private domain. Result: Experiments show that the method is robust against attacks while still providing effective authentication. Conclusion: The proposed method achieves secure and usable biometric authentication while protecting user privacy. Abstract: Biometric-based authentication systems are getting broadly adopted in many areas. However, these systems do not allow participating users to influence the way their data is used. Furthermore, the data may leak and can be misused without the users' knowledge. In this paper, we propose a new authentication method that preserves the privacy of individuals and is based on a generative adversarial network (GAN). Concretely, we suggest using the GAN for translating images of faces to a visually private domain (e.g., flowers or shoes). Classifiers, which are used for authentication purposes, are then trained on the images from the visually private domain. Based on our experiments, the method is robust against attacks and still provides meaningful utility.

[135] Predictive Quality Assessment for Mobile Secure Graphics

Cas Steigstra,Sergey Milyaev,Shaodi You

Main category: cs.CV

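TL;DR: Proposes a lightweight framework that predicts the quality of video frames to make secure graphic verification on smartphones more reliable, addressing the high false rejection rate caused by poor image acquisition.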

Details Motivation: Images of secure graphics captured on smartphones vary in quality, which causes high false rejection rates and creates a "reliability gap" that undermines anti-counterfeiting technology. Method: Introduce a lightweight model that predicts a quality score for each video frame and selects frames suitable for a resource-intensive oracle model; for cross-domain analysis, use a lightweight probe on a frozen, ImageNet-pretrained network. Result: The framework is validated on a large-scale dataset of more than 32,000 images using re-contextualized FNMR and ISRR metrics; the cross-domain analysis shows that a frozen backbone generalizes better than full fine-tuning. Conclusion: For domain shifts that arise from physical manufacturing, a frozen general-purpose backbone is more robust than full fine-tuning and copes better with unseen printing technologies. Abstract: The reliability of secure graphic verification, a key anti-counterfeiting tool, is undermined by poor image acquisition on smartphones. Uncontrolled user captures of these high-entropy patterns cause high false rejection rates, creating a significant 'reliability gap'. To bridge this gap, we depart from traditional perceptual IQA and introduce a framework that predictively estimates a frame's utility for the downstream verification task. We propose a lightweight model to predict a quality score for a video frame, determining its suitability for a resource-intensive oracle model. Our framework is validated using re-contextualized FNMR and ISRR metrics on a large-scale dataset of 32,000+ images from 105 smartphones. Furthermore, a novel cross-domain analysis on graphics from different industrial printing presses reveals a key finding: a lightweight probe on a frozen, ImageNet-pretrained network generalizes better to an unseen printing technology than a fully fine-tuned model. This provides a key insight for real-world generalization: for domain shifts from physical manufacturing, a frozen general-purpose backbone can be more robust than full fine-tuning, which can overfit to source-domain artifacts.

[136] SHMoAReg: Spark Deformable Image Registration via Spatial Heterogeneous Mixture of Experts and Attention Heads

Yuxi Zheng,Jianhui Feng,Tianran Li,Marius Staring,Yuchuan Qiao

Main category: cs.CV

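TL;DR: This paper proposes SHMoAReg, a new expert-guided network for deformable image registration (DIR) and the first to introduce a Mixture of Experts (MoE) mechanism into both the encoder and the decoder. A Mixture of Attention heads (MoA) makes feature extraction more specialized, and a Spatial Heterogeneous Mixture of Experts (SHMoE) predicts the deformation field heterogeneously in three directions, improving both registration performance and model interpretability.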

Details Motivation: Existing DIR methods lack specialized extraction of features useful for registration and predict the deformation field jointly and homogeneously in all three directions, which limits performance. Method: Add a Mixture of Attention heads (MoA) to the encoder to dynamically select the best combination of attention heads and strengthen feature extraction; use a Spatial Heterogeneous Mixture of Experts (SHMoE) in the decoder, where experts with different kernel sizes predict the deformation field heterogeneously in three directions for each voxel. Result: Experiments on two public datasets show that SHMoAReg consistently outperforms a range of compared methods, raising the Dice score on the abdominal CT dataset from 60.58% to 65.58%; it also improves interpretability by revealing how expert usage differs across and within resolution layers. Conclusion: SHMoAReg is the first to apply the MoE mechanism to DIR, achieving more specialized and heterogeneous modeling of feature extraction and deformation prediction and improving registration accuracy and interpretability. Abstract: Encoder-Decoder architectures are widely used in deep learning-based Deformable Image Registration (DIR), where the encoder extracts multi-scale features and the decoder predicts deformation fields by recovering spatial locations. However, current methods lack specialized extraction of features (that are useful for registration) and predict deformation jointly and homogeneously in all three directions. In this paper, we propose a novel expert-guided DIR network with Mixture of Experts (MoE) mechanism applied in both encoder and decoder, named SHMoAReg. Specifically, we incorporate Mixture of Attention heads (MoA) into encoder layers, while Spatial Heterogeneous Mixture of Experts (SHMoE) into the decoder layers. The MoA enhances the specialization of feature extraction by dynamically selecting the optimal combination of attention heads for each image token. Meanwhile, the SHMoE predicts deformation fields heterogeneously in three directions for each voxel using experts with varying kernel sizes. Extensive experiments conducted on two publicly available datasets show consistent improvements over various methods, with a notable increase from 60.58% to 65.58% in Dice score for the abdominal CT dataset. Furthermore, SHMoAReg enhances model interpretability by differentiating experts' utilities across/within different resolution layers. To the best of our knowledge, we are the first to introduce MoE mechanism into DIR tasks. The code will be released soon.
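
For readers unfamiliar with the metric behind the 60.58% and 65.58% figures, the Dice score measures overlap between a warped segmentation and its target as twice the intersection divided by the sum of the two sizes. The sketch below computes it on toy voxel sets; the coordinates and names are illustrative and are not taken from the paper's evaluation code.

```python
def dice_score(pred_voxels, target_voxels):
    """Dice coefficient between two label regions given as voxel coordinates."""
    pred, target = set(pred_voxels), set(target_voxels)
    if not pred and not target:
        return 1.0  # two empty regions are treated as a perfect match
    return 2 * len(pred & target) / (len(pred) + len(target))

# Toy example: two 3-voxel regions that share 2 voxels.
warped_label = [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
fixed_label = [(0, 0, 1), (0, 1, 0), (1, 1, 1)]

toy_dice = dice_score(warped_label, fixed_label)

# Absolute gain reported in the abstract, in percentage points.
reported_gain = round(65.58 - 60.58, 2)
```

Here the toy regions give a Dice score of 2·2/(3+3), about 0.667, and the gain quoted in the abstract corresponds to 5.0 percentage points.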

[137] Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing

Zizheng Yang,Hu Yu,Bing Li,Jinghao Zhang,Jie Huang,Feng Zhao

Main category: cs.CV

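TL;DR: Proposes DiffLI²D, an image dehazing method built on the semantic latent space of pre-trained diffusion models, which avoids retraining and iterative sampling and substantially improves dehazing performance.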

Details Motivation: Diffusion models show strong generative ability for image dehazing, but the cost of retraining them and the many sampling steps at inference limit practical use. An efficient way to exploit pre-trained diffusion models is therefore needed. Method: Study how pre-trained diffusion models represent hazy images in their semantic latent space, and design DiffLI²D, a dehazing network that integrates diffusion latent representations from different time-steps, using the frozen diffusion model to extract features that guide dehazing. Result: Experiments on multiple datasets show that the method outperforms existing image dehazing methods while avoiding retraining and iterative sampling of the diffusion model. Conclusion: DiffLI²D offers a new perspective on bringing diffusion models to image dehazing: by making effective use of the latent representations of pre-trained models, it lowers computational cost while achieving superior dehazing performance. Abstract: Diffusion models have recently been investigated as powerful generative solvers for image dehazing, owing to their remarkable capability to model the data distribution. However, the massive computational burden imposed by the retraining of diffusion models, coupled with the extensive sampling steps during the inference, limit the broader application of diffusion models in image dehazing. To address these issues, we explore the properties of hazy images in the semantic latent space of frozen pre-trained diffusion models, and propose a Diffusion Latent Inspired network for Image Dehazing, dubbed DiffLI$^2$D. Specifically, we first reveal that the semantic latent space of pre-trained diffusion models can represent the content and haze characteristics of hazy images, as the diffusion time-step changes. Building upon this insight, we integrate the diffusion latent representations at different time-steps into a delicately designed dehazing network to provide instructions for image dehazing. Our DiffLI$^2$D avoids re-training diffusion models and iterative sampling process by effectively utilizing the informative representations derived from the pre-trained diffusion models, which also offers a novel perspective for introducing diffusion models to image dehazing. Extensive experiments on multiple datasets demonstrate that the proposed method achieves superior performance to existing image dehazing methods. Code is available at https://github.com/aaaasan111/difflid.

[138] Hyperspectral Adapter for Semantic Segmentation with Vision Foundation Models

Juana Valeria Hurtado,Rohit Mohan,Abhinav Valada

Main category: cs.CV

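TL;DR: Proposes a new hyperspectral adapter that uses pretrained vision foundation models to learn effectively from hyperspectral data, achieving state-of-the-art semantic segmentation performance on autonomous driving datasets.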

Details Motivation: Existing hyperspectral image semantic segmentation methods underperform because they rely on architectures and learning frameworks optimized for RGB inputs. Method: Introduce a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features, and a modality-aware interaction block that fuses hyperspectral representations with frozen vision Transformer features. Result: The method is validated on three autonomous driving benchmark datasets, where it clearly outperforms existing vision-based and hyperspectral segmentation methods. Conclusion: The proposed hyperspectral adapter makes effective use of pretrained vision models for hyperspectral data and advances robotic perception in complex environments. Abstract: Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands. This rich spectral content has the potential to facilitate robust robotic perception, particularly in environments with complex material compositions, varying illumination, or other visually challenging conditions. However, current HSI semantic segmentation methods underperform due to their reliance on architectures and learning frameworks optimized for RGB inputs. In this work, we propose a novel hyperspectral adapter that leverages pretrained vision foundation models to effectively learn from hyperspectral data. Our architecture incorporates a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features. Additionally, we introduce a modality-aware interaction block that facilitates effective integration of hyperspectral representations and frozen vision Transformer features through dedicated extraction and injection mechanisms. Extensive evaluations on three benchmark autonomous driving datasets demonstrate that our architecture achieves state-of-the-art semantic segmentation performance while directly using HSI inputs, outperforming both vision-based and hyperspectral segmentation methods. We make the code available at https://hyperspectraladapter.cs.uni-freiburg.de.

[139] A Simple Data Augmentation Strategy for Text-in-Image Scientific VQA

Belal Shoer,Yova Kementchedjhieva

Main category: cs.CV

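TL;DR: Proposes a data synthesis method that converts existing image-text pairs into a unified "text-in-image" format for scientific visual question answering. Fine-tuning a small multilingual multimodal model on the synthetic data together with EXAMS-V yields notable gains across 13 languages and strong cross-lingual transfer.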

Details Motivation: The complexity of scientific figures and their multimodal context makes scientific visual question answering hard for vision-language models. Traditional approaches treat the figure and the text as separate inputs. EXAMS-V introduced a paradigm that embeds the text in the image, but training data in this format is scarce and models perform poorly on it, especially in zero-shot settings. Method: Existing separate image-text pairs are merged into single images containing both figure and text to build a new training set, which is combined with EXAMS-V to fine-tune a small multilingual multimodal model. Result: The fine-tuned model shows notable average gains across 13 languages, supporting the effectiveness of the synthesis method and the model's cross-lingual transfer. Conclusion: Synthesizing "text-in-image" training data and fine-tuning on it is a workable way to improve multilingual vision-language models on scientific visual question answering and to ease data scarcity in this format. Abstract: Scientific visual question answering poses significant challenges for vision-language models due to the complexity of scientific figures and their multimodal context. Traditional approaches treat the figure and accompanying text (e.g., questions and answer options) as separate inputs. EXAMS-V introduced a new paradigm by embedding both visual and textual content into a single image. However, even state-of-the-art proprietary models perform poorly on this setup in zero-shot settings, underscoring the need for task-specific fine-tuning. To address the scarcity of training data in this "text-in-image" format, we synthesize a new dataset by converting existing separate image-text pairs into unified images. Fine-tuning a small multilingual multimodal model on a mix of our synthetic data and EXAMS-V yields notable gains across 13 languages, demonstrating strong average improvements and cross-lingual transfer.

[140] EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models

Botai Yuan,Yutian Zhou,Yingjie Wang,Fushuo Huo,Yongcheng Jing,Li Shen,Ying Wei,Zhiqi Shen,Ziwei Liu,Tianwei Zhang,Jie Yang,Dacheng Tao

Main category: cs.CV

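TL;DR: Introduces EchoBench, a benchmark for systematically evaluating sycophancy in medical large vision-language models (LVLMs). Existing models show severe sycophancy, which can undermine reliability and safety in high-stakes clinical settings. The experiments cover several model types and analyze the factors that affect sycophancy. Higher-quality data and domain knowledge reduce sycophancy, and simple prompt-level interventions mitigate it, underscoring the importance of robustness evaluation beyond accuracy.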

Details Motivation: Current evaluations of medical LVLMs focus on accuracy and neglect reliability and safety in clinical settings. This work studies sycophancy, the tendency of models to uncritically echo user input, particularly when patients, medical students, or physicians supply biased information. Method: EchoBench contains 2,122 images covering 18 departments and 20 modalities, with 90 prompts that simulate biased inputs. Medical-specific, open-source, and proprietary LVLMs are evaluated, with fine-grained analyses by bias type, department, perceptual granularity, and modality. Several prompt-level interventions (negative prompting, one-shot, few-shot) are also tested as mitigation strategies. Result: All evaluated models exhibit substantial sycophancy: the best proprietary model, Claude 3.7 Sonnet, shows 45.98%, GPT-4.1 reaches 59.15%, and many medical-specific models exceed 95%. Higher data quality and diversity and stronger domain knowledge reduce sycophancy without harming unbiased accuracy. Simple prompt-level interventions consistently reduce sycophantic behavior and point to improvements at training and decoding time. Conclusion: Accuracy alone is not enough to evaluate medical LVLMs; their reliability under biased inputs must also be considered. EchoBench provides a tool for measuring and mitigating sycophancy, and the findings call for stronger safety and trustworthiness evaluation. Abstract: Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy -- models' tendency to uncritically echo user-provided information -- in high-stakes clinical settings. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs. It contains 2,122 images across 18 departments and 20 modalities with 90 prompts that simulate biased inputs from patients, medical students, and physicians. We evaluate medical-specific, open-source, and proprietary LVLMs. All exhibit substantial sycophancy; the best proprietary model (Claude 3.7 Sonnet) still shows 45.98% sycophancy, and GPT-4.1 reaches 59.15%. Many medical-specific models exceed 95% sycophancy despite only moderate accuracy. Fine-grained analyses by bias type, department, perceptual granularity, and modality identify factors that increase susceptibility. We further show that higher data quality/diversity and stronger domain knowledge reduce sycophancy without harming unbiased accuracy. EchoBench also serves as a testbed for mitigation: simple prompt-level interventions (negative prompting, one-shot, few-shot) produce consistent reductions and motivate training- and decoding-time strategies. Our findings highlight the need for robust evaluation beyond accuracy and provide actionable guidance toward safer, more trustworthy medical LVLMs.
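
The sycophancy percentages reported for EchoBench can be read as a simple rate. The sketch below is a minimal illustration, not the benchmark's evaluation code: it assumes each record holds the model's answer, the correct answer, and the incorrect answer pushed by the biased prompt, and it counts a response as sycophantic when the model adopts the pushed answer. The field names and this exact definition are assumptions made for illustration.

```python
def sycophancy_rate(records):
    """Share of biased prompts on which the model echoes the user's wrong answer.

    Each record is a dict with hypothetical keys:
    'model_answer', 'correct_answer', 'suggested_answer'.
    """
    if not records:
        raise ValueError("records must not be empty")
    echoed = 0
    for r in records:
        suggestion_is_wrong = r["suggested_answer"] != r["correct_answer"]
        if suggestion_is_wrong and r["model_answer"] == r["suggested_answer"]:
            echoed += 1
    return echoed / len(records)


# Toy data (made up): four biased prompts, correct answer is always "A".
example = [
    {"model_answer": "B", "correct_answer": "A", "suggested_answer": "B"},  # echoed
    {"model_answer": "A", "correct_answer": "A", "suggested_answer": "C"},  # resisted
    {"model_answer": "D", "correct_answer": "A", "suggested_answer": "D"},  # echoed
    {"model_answer": "C", "correct_answer": "A", "suggested_answer": "B"},  # wrong, not echoed
]
rate = sycophancy_rate(example)
```

Note that the last record is wrong without being sycophantic, which is why a sycophancy rate and an accuracy score can move independently.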

[141] Smaller is Better: Enhancing Transparency in Vehicle AI Systems via Pruning

Sanish Suwal,Shaurya Garg,Dipkamal Bhusal,Michael Clifford,Nidhi Rastogi

Main category: cs.CV

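TL;DR: Studies how three training approaches (natural training, adversarial training, and pruning) affect the quality of post-hoc explanations for traffic sign classifiers, and finds that pruning significantly improves the comprehensibility and faithfulness of explanations.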

Details Motivation: AI models play a critical role in autonomous driving, so their transparency and security matter, yet existing post-hoc explanation methods are inconsistent and unreliable. Method: An extensive empirical evaluation analyzes how natural training, adversarial training, and pruning affect the quality of saliency-map-based post-hoc explanations. Result: Pruning improves model efficiency and also increases the sparsity of learned representations, which improves the faithfulness and comprehensibility of explanations. Conclusion: Pruning is a promising strategy for building efficient and transparent deep learning models, especially for resource-constrained vehicular AI systems. Abstract: Connected and autonomous vehicles continue to heavily rely on AI systems, where transparency and security are critical for trust and operational safety. Post-hoc explanations provide transparency to these black-box-like AI models, but the quality and reliability of these explanations are often questioned due to inconsistencies and lack of faithfulness in representing model decisions. This paper systematically examines how three widely used training approaches, namely natural training, adversarial training, and pruning, affect the quality of post-hoc explanations for traffic sign classifiers. Through extensive empirical evaluation, we demonstrate that pruning significantly enhances the comprehensibility and faithfulness of explanations (using saliency maps). Our findings reveal that pruning not only improves model efficiency but also enforces sparsity in learned representation, leading to more interpretable and reliable decisions. Additionally, these insights suggest that pruning is a promising strategy for developing transparent deep learning models, especially in resource-constrained vehicular AI systems.
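
The abstract does not say which pruning criterion was used. As a hedged illustration of how pruning enforces sparsity, the sketch below applies unstructured magnitude pruning, one common choice assumed here purely for the example, to a flat list of weights. The function name and the `sparsity` parameter are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights.

    Assumed criterion: unstructured magnitude pruning (not confirmed by the paper).
    Returns a new list; the input is left unchanged.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.5, -0.1, 0.3, -0.05]
sparse = magnitude_prune(weights, sparsity=0.5)
```

Here the two weights closest to zero are removed and the two largest survive, so downstream attributions can only flow through the remaining connections.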

[142] C$^2$MIL: Synchronizing Semantic and Topological Causalities in Multiple Instance Learning for Robust and Interpretable Survival Analysis

Min Cen,Zhenfeng Zhuang,Yuzhe Zhang,Min Zeng,Baptiste Magnier,Lequan Yu,Hong Zhang,Liansheng Wang

Main category: cs.CV

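TL;DR: Proposes C$^2$MIL, a dual causal graph-based MIL model for survival analysis on H&E-stained whole slide images that addresses semantic bias from staining variation and noise from irrelevant topological structure, improving generalization and interpretability.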

Details Motivation: Existing graph-based MIL methods applied to H&E-stained WSIs are sensitive to staining variation and to irrelevant topological structure, which introduce semantic bias and noise and hurt interpretability and generalization. Method: C$^2$MIL is built on a dual structural causal model. It introduces a cross-scale adaptive feature disentangling module for semantic causal intervention and a Bernoulli differentiable causal subgraph sampling method for topological causal discovery, jointly optimized with disentangling supervision and contrastive learning. Result: Experiments show that C$^2$MIL improves generalization and interpretability over existing methods and can serve as a causal enhancement for a range of MIL baselines. Conclusion: Through dual causal modeling, C$^2$MIL mitigates bias at both the semantic and the topological level, giving a more robust and interpretable approach to MIL-based pathology image analysis. Abstract: Graph-based Multiple Instance Learning (MIL) is widely used in survival analysis with Hematoxylin and Eosin (H&E)-stained whole slide images (WSIs) due to its ability to capture topological information. However, variations in staining and scanning can introduce semantic bias, while topological subgraphs that are not relevant to the causal relationships can create noise, resulting in biased slide-level representations. These issues can hinder both the interpretability and generalization of the analysis. To tackle this, we introduce a dual structural causal model as the theoretical foundation and propose a novel and interpretable dual causal graph-based MIL model, C$^2$MIL. C$^2$MIL incorporates a novel cross-scale adaptive feature disentangling module for semantic causal intervention and a new Bernoulli differentiable causal subgraph sampling method for topological causal discovery. A joint optimization strategy combining disentangling supervision and contrastive learning enables simultaneous refinement of both semantic and topological causalities. Experiments demonstrate that C$^2$MIL consistently improves generalization and interpretability over existing methods and can serve as a causal enhancement for diverse MIL baselines. The code is available at https://github.com/mimic0127/C2MIL.

[143] U-Mamba2-SSL for Semi-Supervised Tooth and Pulp Segmentation in CBCT

Zhi Qin Tan,Xiatian Zhu,Owen Addison,Yunpeng Li

Main category: cs.CV

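TL;DR: Proposes U-Mamba2-SSL, a semi-supervised learning framework for tooth and pulp segmentation in CBCT images that combines self-supervised pre-training, consistency regularization, and pseudo-labeling to reach high segmentation performance with little labeled data.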

Details Motivation: Accurate tooth and pulp segmentation is essential for clinical diagnosis and treatment, but manual segmentation is time-consuming and depends on expert experience, so automated segmentation that exploits large amounts of unlabeled data is needed. Method: The U-Mamba2-SSL framework first pre-trains U-Mamba2 in a self-supervised manner with a disruptive autoencoder, then applies consistency regularization with input and feature perturbations on unlabeled data, and finally adds a pseudo-labeling strategy with a reduced loss weight as part of multi-stage training. Result: The method reaches an average score of 0.872 and a DSC of 0.969 on the validation set. Conclusion: U-Mamba2-SSL makes effective use of unlabeled data through its semi-supervised strategy and performs strongly on tooth and pulp segmentation, with good prospects for clinical use. Abstract: Accurate segmentation of teeth and pulp in Cone-Beam Computed Tomography (CBCT) is vital for clinical applications like treatment planning and diagnosis. However, this process requires extensive expertise and is exceptionally time-consuming, highlighting the critical need for automated algorithms that can effectively utilize unlabeled data. In this paper, we propose U-Mamba2-SSL, a novel semi-supervised learning framework that builds on the U-Mamba2 model and employs a multi-stage training strategy. The framework first pre-trains U-Mamba2 in a self-supervised manner using a disruptive autoencoder. It then leverages unlabeled data through consistency regularization, where we introduce input and feature perturbations to ensure stable model outputs. Finally, a pseudo-labeling strategy is implemented with a reduced loss weighting to minimize the impact of potential errors. U-Mamba2-SSL achieved an average score of 0.872 and a DSC of 0.969 on the validation dataset, demonstrating the superior performance of our approach. The code is available at https://github.com/zhiqin1998/UMamba2.
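
For readers unfamiliar with the DSC figure quoted above, the Dice similarity coefficient is 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. The sketch below computes it for flat binary masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks empty: treated as perfect agreement (an assumed convention).
        return 1.0
    return 2.0 * intersection / total


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
dsc = dice_coefficient(pred, target)  # 2 * 1 / (2 + 1)
```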

[144] Optical Ocean Recipes: Creating Realistic Datasets to Facilitate Underwater Vision Research

Patricia Schöntag,David Nakath,Judith Fischer,Rüdiger Röttgers,Kevin Köser

Main category: cs.CV

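TL;DR: Introduces the "Optical Ocean Recipes" framework, which creates realistic datasets under controlled underwater conditions to address the lack of repeatability and generalizability of machine vision across different water conditions.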

Details Motivation: Underwater environments pose optical challenges such as color distortion, reduced contrast, scattering, and dynamic lighting, and existing evaluations are mostly limited to specific conditions and do not generalize, so a controlled and repeatable testing method is needed. Method: The "Optical Ocean Recipes" framework uses calibrated color and scattering additives to reproduce realistic ocean optical conditions in the laboratory and to generate diverse datasets with ground truth. Result: A demonstration dataset was created, and the system was applied to tasks such as water parameter estimation and image restoration, showing its usefulness for several underwater vision tasks. Conclusion: The framework offers a repeatable, controlled, and realistic means of evaluating underwater machine vision, which helps improve model generalization across environments. Abstract: The development and evaluation of machine vision in underwater environments remains challenging, often relying on trial-and-error-based testing tailored to specific applications. This is partly due to the lack of controlled, ground-truthed testing environments that account for the optical challenges, such as color distortion from spectrally variant light attenuation, reduced contrast and blur from backscatter and volume scattering, and dynamic light patterns from natural or artificial illumination. Additionally, the appearance of ocean water in images varies significantly across regions, depths, and seasons. However, most machine vision evaluations are conducted under specific optical water types and imaging conditions, and therefore often lack generalizability. Exhaustive testing across diverse open-water scenarios is technically impractical. To address this, we introduce the Optical Ocean Recipes, a framework for creating realistic datasets under controlled underwater conditions. Unlike synthetic or open-water data, these recipes, using calibrated color and scattering additives, enable repeatable and controlled testing of the impact of water composition on image appearance. Hence, this provides a unique framework for analyzing machine vision in realistic, yet controlled underwater scenarios. The controlled environment enables the creation of ground-truth data for a range of vision tasks, including water parameter estimation, image restoration, segmentation, visual SLAM, and underwater image synthesis.
We provide a demonstration dataset generated using the Optical Ocean Recipes and briefly demonstrate the use of our system for two underwater vision tasks. The dataset and evaluation code will be made available.

[145] Universal Camouflage Attack on Vision-Language Models for Autonomous Driving

Dehong Kong,Sifan Yu,Siyuan Liang,Jiawei Liang,Jianhou Gan,Aishan Liu,Wenqi Ren

Main category: cs.CV

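TL;DR: Proposes UCA, the first universal camouflage attack framework for vision-language models in autonomous driving (VLM-AD). It optimizes physically realizable camouflage textures in feature space, substantially improving the generalization and robustness of the adversarial attack.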

Details Motivation: Existing adversarial attacks do not carry over well to VLM-AD systems: physical attacks mainly target low-level perception modules and are hard to transfer, while attacks on VLM-AD have largely stayed at the digital level and lack practical feasibility. A physical attack that generalizes across commands and model architectures is therefore needed. Method: UCA operates in feature space and introduces a feature divergence loss (FDL) that enlarges the representational gap between clean and adversarial images. It also uses a multi-scale learning strategy and adjusts the sampling ratio to adapt to changes in scale and viewpoint. Result: Experiments show that UCA induces incorrect driving commands across various VLM-AD models and driving scenarios, improves on the existing state of the art by 30% in 3-P metrics, and remains robust under diverse viewpoints and dynamic conditions. Conclusion: UCA is the first universal physical camouflage attack framework for VLM-AD. It generalizes well across models and commands and exposes security risks of VLM-AD systems in real-world scenarios. Abstract: Visual language modeling for automated driving (VLM-AD) is emerging as a promising research direction with substantial improvements in multimodal reasoning capabilities. Despite its advanced reasoning abilities, VLM-AD remains vulnerable to serious security threats from adversarial attacks, which involve misleading model decisions through carefully crafted perturbations. Existing attacks have obvious challenges: 1) Physical adversarial attacks primarily target vision modules. They are difficult to directly transfer to VLM-AD systems because they typically attack low-level perceptual components. 2) Adversarial attacks against VLM-AD have largely concentrated on the digital level. To address these challenges, we propose the first Universal Camouflage Attack (UCA) framework for VLM-AD. Unlike previous methods that focus on optimizing the logit layer, UCA operates in the feature space to generate physically realizable camouflage textures that exhibit strong generalization across different user commands and model architectures. Motivated by the observed vulnerability of encoder and projection layers in VLM-AD, UCA introduces a feature divergence loss (FDL) that maximizes the representational discrepancy between clean and adversarial images. In addition, UCA incorporates a multi-scale learning strategy and adjusts the sampling ratio to enhance its adaptability to changes in scale and viewpoint diversity in real-world scenarios, thereby improving training stability.
Extensive experiments demonstrate that UCA can induce incorrect driving commands across various VLM-AD models and driving scenarios, significantly surpassing existing state-of-the-art attack methods (a 30% improvement in 3-P metrics). Furthermore, UCA exhibits strong attack robustness under diverse viewpoints and dynamic conditions, indicating high potential for practical deployment.

[146] PU-Gaussian: Point Cloud Upsampling using 3D Gaussian Representation

Mahmoud Khater,Mona Strauss,Philipp von Olshausen,Alexander Reiterer

Main category: cs.CV

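TL;DR: Proposes PU-Gaussian, a point cloud upsampling network based on anisotropic 3D Gaussian distributions. By modeling local geometric structure explicitly, it produces high-quality upsampling and reaches state-of-the-art performance on multiple datasets.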

Details Motivation: When handling sparse, noisy point clouds, existing methods tend to sacrifice geometric interpretability or robustness to input sparsity, so a method is needed that preserves geometric structure while improving upsampling quality. Method: Each point's local neighborhood is modeled with an anisotropic 3D Gaussian distribution, upsampling is performed explicitly in the local geometric domain by direct point sampling, and a refinement network then improves the uniformity of the output and the sharpness of edges. Result: The method achieves state-of-the-art performance on the PU1K and PUGAN datasets, producing denser, more uniform point clouds with sharper edges. Conclusion: By combining geometry-aware Gaussian modeling with a two-stage upsampling strategy, PU-Gaussian improves point cloud upsampling quality while keeping good geometric interpretability. Abstract: Point clouds produced by 3D sensors are often sparse and noisy, posing challenges for tasks requiring dense and high-fidelity 3D representations. Prior work has explored both implicit feature-based upsampling and distance-function learning to address this, but often at the expense of geometric interpretability or robustness to input sparsity. To overcome these limitations, we propose PU-Gaussian, a novel upsampling network that models the local neighborhood around each point using anisotropic 3D Gaussian distributions. These Gaussians capture the underlying geometric structure, allowing us to perform upsampling explicitly in the local geometric domain by direct point sampling. The sampling process generates a dense, but coarse, point cloud. A subsequent refinement network adjusts the coarse output to produce a more uniform distribution and sharper edges. We perform extensive testing on the PU1K and PUGAN datasets, demonstrating that PU-Gaussian achieves state-of-the-art performance. We make code and model weights publicly available at https://github.com/mvg-inatech/PU-Gaussian.git.
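
To make the idea of "upsampling by direct point sampling from anisotropic 3D Gaussians" concrete, here is a small sketch. It is not the PU-Gaussian network: the paper predicts its Gaussians with a learned model, whereas this example simply fits a mean and covariance to a neighborhood by moments, which is an assumption made for illustration. All function names and the `eps` regularizer are hypothetical.

```python
import math
import random


def fit_gaussian(points, eps=1e-6):
    """Moment-fit a 3D Gaussian (mean, covariance) to a list of (x, y, z) points."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
            for j in range(3)] for i in range(3)]
    for i in range(3):
        cov[i][i] += eps  # keep the matrix positive definite for flat patches
    return mean, cov


def cholesky3(a):
    """Lower-triangular Cholesky factor of a 3x3 positive-definite matrix."""
    low = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_gaussian(mean, cov, count, rng):
    """Draw `count` points from N(mean, cov) using the Cholesky factor."""
    low = cholesky3(cov)
    samples = []
    for _ in range(count):
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        samples.append(tuple(
            mean[i] + sum(low[i][k] * z[k] for k in range(i + 1))
            for i in range(3)))
    return samples


# A flat square patch in the z = 0 plane: the fitted Gaussian is strongly anisotropic.
patch = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
mean, cov = fit_gaussian(patch)
dense = sample_gaussian(mean, cov, count=100, rng=random.Random(0))
```

Because the fitted covariance is nearly zero along z, the new points stay close to the plane of the patch, which is the behavior that makes an anisotropic Gaussian a useful local surface model. The samples are still coarse, which is consistent with the abstract's use of a separate refinement stage.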

[147] ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression

Tom Burgert,Oliver Stoll,Paolo Rota,Begüm Demir

Main category: cs.CV

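TL;DR: Revisits the hypothesis that convolutional neural networks (CNNs) are inherently texture-biased and proposes a domain-agnostic framework that quantifies feature reliance by systematically suppressing shape, texture, and color cues. CNNs turn out to rely mainly on local shape features rather than texture, and this reliance can be reduced through modern training strategies or architectures (e.g., ConvNeXt, ViTs). Cross-domain analysis shows systematic differences: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models rely more on texture.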

Details Motivation: The view that CNNs are inherently texture-biased stems from the cue-conflict experiment of Geirhos et al., which has confounds such as forced-choice conflicts. This work aims to overcome those limitations and assess more accurately how much CNNs actually rely on shape, texture, and color. Method: A domain-agnostic framework quantifies feature reliance by systematically suppressing shape, texture, and color cues in the input under controlled conditions and evaluating both humans and neural networks, avoiding the decision bias of the traditional cue-conflict setup. Result: CNNs are not inherently texture-biased but rely predominantly on local shape features; modern training strategies or architectures (e.g., ConvNeXt, ViTs) substantially reduce this reliance; and across domains, computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models rely more on texture. Conclusion: Feature reliance in CNNs is not a fixed texture bias but depends on training method and architecture. Models in different application domains use features in systematically different ways, so inductive biases should be designed to fit the task. Abstract: The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.

[148] An Anisotropic Cross-View Texture Transfer with Multi-Reference Non-Local Attention for CT Slice Interpolation

Kwang-Hyun Uhm,Hyunjun Cho,Sung-Hoo Hong,Seung-Won Jung

Main category: cs.CV

TL;DR: Proposes a novel cross-view texture transfer approach for CT slice interpolation that exploits the anisotropic nature of 3D CT volumes.

Details Motivation: In clinical practice, CT images are usually acquired with large slice thicknesses, so inter-slice resolution is low, which can hinder disease diagnosis; inter-slice resolution therefore needs to be improved. Method: A framework takes high-resolution in-plane texture details as a reference and transfers them to low-resolution through-plane images, using a multi-reference non-local attention module to reconstruct through-plane high-frequency details. Result: Experiments show that the method performs significantly better than competing methods on public CT datasets, including a real-paired benchmark. Conclusion: The proposed framework improves the quality of CT slice interpolation and confirms the value of fully exploiting the anisotropic character of 3D CT volumes. Abstract: Computed tomography (CT) is one of the most widely used non-invasive imaging modalities for medical diagnosis. In clinical practice, CT images are usually acquired with large slice thicknesses due to the high cost of memory storage and operation time, resulting in an anisotropic CT volume with much lower inter-slice resolution than in-plane resolution. Since such inconsistent resolution may lead to difficulties in disease diagnosis, deep learning-based volumetric super-resolution methods have been developed to improve inter-slice resolution. Most existing methods conduct single-image super-resolution on the through-plane or synthesize intermediate slices from adjacent slices; however, the anisotropic characteristic of 3D CT volume has not been well explored. In this paper, we propose a novel cross-view texture transfer approach for CT slice interpolation by fully utilizing the anisotropic nature of 3D CT volume. Specifically, we design a unique framework that takes high-resolution in-plane texture details as a reference and transfers them to low-resolution through-plane images. To this end, we introduce a multi-reference non-local attention module that extracts meaningful features for reconstructing through-plane high-frequency details from multiple in-plane images. Through extensive experiments, we demonstrate that our method performs significantly better in CT slice interpolation than existing competing methods on public CT datasets including a real-paired benchmark, verifying the effectiveness of the proposed framework. The source code of this work is available at https://github.com/khuhm/ACVTT.

[149] 4D Driving Scene Generation With Stereo Forcing

Hao Lu,Zhuang Ma,Guangfeng Jiang,Wenhang Ge,Bohan Li,Yuzhan Cai,Wenzhao Zheng,Yunpeng Zhang,Yingcong Chen

Main category: cs.CV

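TL;DR: PhiGenesis is a unified 4D scene generation framework that extends video generation techniques with geometric and temporal consistency, enabling spatio-temporally continuous generation of dynamic driving scenes and novel view synthesis.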

Details Motivation: Existing generative models struggle to support both temporal extrapolation and spatial novel view synthesis without per-scene optimization; this work aims to bridge generation and novel view synthesis. Method: PhiGenesis has two stages. The first uses a pre-trained video VAE with a new range-view adapter for feed-forward 4D reconstruction. The second introduces a geometric-guided video diffusion model with the Stereo Forcing strategy, which integrates geometric uncertainty during denoising to improve the spatio-temporal consistency of the output. Result: Experiments show that PhiGenesis reaches state-of-the-art performance in appearance and geometric reconstruction, temporal generation, and novel view synthesis, and is competitive on downstream tasks. Conclusion: PhiGenesis generates high-quality, spatio-temporally coherent 4D driving scenes with both dynamic extrapolation and novel view synthesis, offering an effective tool for applications such as autonomous driving simulation. Abstract: Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. Bridging generation and novel view synthesis remains a major challenge. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency. Given multi-view image sequences and camera parameters, PhiGenesis produces temporally continuous 4D Gaussian splatting representations along target 3D trajectories. In its first stage, PhiGenesis leverages a pre-trained video VAE with a novel range-view adapter to enable feed-forward 4D reconstruction from multi-view images. This architecture supports single-frame or video inputs and outputs complete 4D scenes including geometry, semantics, and motion. In the second stage, PhiGenesis introduces a geometric-guided video diffusion model, using rendered historical 4D scenes as priors to generate future views conditioned on trajectories. To address geometric exposure bias in novel views, we propose Stereo Forcing, a novel conditioning strategy that integrates geometric uncertainty during denoising. This method enhances temporal coherence by dynamically adjusting generative influence based on uncertainty-aware perturbations.
Our experimental results demonstrate that our method achieves state-of-the-art performance in appearance and geometric reconstruction, temporal generation, and novel view synthesis (NVS) tasks, while simultaneously delivering competitive performance in downstream evaluations. Homepage: https://jiangxb98.github.io/PhiGensis.

[150] A Versatile Foundation Model for AI-enabled Mammogram Interpretation

Fuxiang Huang,Jiayi Zhu,Yunfang Yu,Yu Xie,Yuan Guo,Qingcong Kong,Mingxiang Wu,Xinrui Jiang,Shu Yang,Jiabo Ma,Ziyi Liu,Zhe Xu,Zhixuan Chen,Yujie Tan,Zifan He,Luhui Mao,Xi Wang,Junlin Hou,Lei Zhang,Qiong Luo,Zhenhui Li,Herui Yao,Hao Chen

Main category: cs.CV

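TL;DR: VersaMammo is a versatile foundation model for mammography. Built with a large multi-institutional dataset and a two-stage pre-training strategy, it shows strong generalization and performance across 92 clinically relevant tasks, advancing the reliability and scalability of breast cancer screening and diagnosis.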

Details Motivation: Existing mammogram foundation models are limited by insufficient diversity in training data, limited generalizability, and a lack of comprehensive evaluation on clinical tasks, which hinders clinical translation. Method: A multi-institutional dataset of 706,239 images is assembled. A teacher model is trained with self-supervised learning to extract features, and supervised learning combined with knowledge distillation then transfers both features and clinical knowledge into VersaMammo. Result: On a benchmark of 68 internal tasks and 24 external validation tasks, VersaMammo ranks first in 50 internal tasks and 20 external tasks, with average ranks of 1.5 and 1.2, respectively. Conclusion: VersaMammo shows strong generalization and clinical utility and is an important step toward reliable, scalable breast cancer screening and diagnosis. Abstract: Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer-related mortality in women globally. Mammography is essential for the early detection and diagnosis of breast lesions. Despite recent progress in foundation models (FMs) for mammogram analysis, their clinical translation remains constrained by several fundamental limitations, including insufficient diversity in training data, limited model generalizability, and a lack of comprehensive evaluation across clinically relevant tasks. Here, we introduce VersaMammo, a versatile foundation model for mammograms, designed to overcome these limitations. We curated the largest multi-institutional mammogram dataset to date, comprising 706,239 images from 21 sources. To improve generalization, we propose a two-stage pre-training strategy to develop VersaMammo. First, a teacher model is trained via self-supervised learning to extract transferable features from unlabeled mammograms. Then, supervised learning combined with knowledge distillation transfers both features and clinical knowledge into VersaMammo. To ensure a comprehensive evaluation, we established a benchmark comprising 92 specific tasks, including 68 internal tasks and 24 external validation tasks, spanning 5 major clinical task categories: lesion detection, segmentation, classification, image retrieval, and visual question answering. VersaMammo achieves state-of-the-art performance, ranking first in 50 out of 68 specific internal tasks and 20 out of 24 external validation tasks, with average ranks of 1.5 and 1.2, respectively.
These results demonstrate its superior generalization and clinical utility, offering a substantial advancement toward reliable and scalable breast cancer screening and diagnosis.
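
The "average rank" figures quoted for VersaMammo summarize per-task leaderboards. As a hedged illustration of how such a number is typically obtained, the sketch below ranks models within each task (1 is best, higher score is better) and averages each model's ranks across tasks. The tie-handling rule, the function name, and the toy scores are assumptions for this example and are not taken from the paper.

```python
def average_ranks(scores_by_task):
    """Average per-task rank for each model.

    scores_by_task: {task_name: {model_name: score}}, higher score is better.
    A model's rank in a task is 1 plus the number of models that beat it,
    so tied models share the same rank (an assumed convention).
    """
    ranks = {}
    for scores in scores_by_task.values():
        for model, score in scores.items():
            rank = 1 + sum(1 for other in scores.values() if other > score)
            ranks.setdefault(model, []).append(rank)
    return {model: sum(r) / len(r) for model, r in ranks.items()}


# Made-up scores for three hypothetical models on two hypothetical tasks.
toy_scores = {
    "task_1": {"model_a": 0.9, "model_b": 0.8, "model_c": 0.7},
    "task_2": {"model_a": 0.6, "model_b": 0.7, "model_c": 0.5},
}
avg = average_ranks(toy_scores)
```

In this toy case model_a and model_b each finish first once and second once, so both average 1.5, while model_c is last on both tasks.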

[151] A co-evolving agentic AI system for medical imaging analysis

Songhao Li,Jonathan Xu,Tiancheng Bao,Yuxuan Liu,Yuchen Liu,Yihang Liu,Lilin Wang,Wenhui Lei,Sheng Wang,Yinuo Xu,Yan Cui,Jialu Yao,Shunsuke Koga,Zhi Huang

Main category: cs.CV

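TL;DR: TissueLab is a co-evolving agentic AI system for medical image analysis that supports real-time interaction, automated workflow generation, and expert feedback. It integrates pathology, radiology, and spatial omics tools and outperforms existing models on clinically important tasks.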

Details Motivation: Existing medical AI systems fall short in tool integration, real-time interaction, and expert feedback, which limits their performance and adoption in medical image analysis. Method: TissueLab standardizes tool inputs and outputs, builds tool factories across domains, supports automatic planning, explainable workflow generation, and real-time analysis, and uses active learning to keep improving from clinician feedback. Result: On a range of clinically relevant tasks, TissueLab outperforms end-to-end vision-language models and other agentic AI systems such as GPT-5, and adapts quickly to new disease contexts without massive datasets or prolonged retraining. Conclusion: TissueLab forms a sustainable open-source ecosystem that supports computational research and translational adoption in medical imaging and lays a foundation for the next generation of medical AI. Abstract: Agentic AI is rapidly advancing in healthcare and biomedical research. However, in medical image analysis, its performance and adoption remain limited due to the lack of a robust ecosystem, insufficient toolsets, and the absence of real-time interactive expert feedback. Here we present "TissueLab", a co-evolving agentic AI system that allows researchers to ask direct questions, automatically plan and generate explainable workflows, and conduct real-time analyses where experts can visualize intermediate results and refine them. TissueLab integrates tool factories across pathology, radiology, and spatial omics domains. By standardizing inputs, outputs, and capabilities of diverse tools, the system determines when and how to invoke them to address research and clinical questions. Across diverse tasks with clinically meaningful quantifications that inform staging, prognosis, and treatment planning, TissueLab achieves state-of-the-art performance compared with end-to-end vision-language models (VLMs) and other agentic AI systems such as GPT-5. Moreover, TissueLab continuously learns from clinicians, evolving toward improved classifiers and more effective decision strategies. With active learning, it delivers accurate results in unseen disease contexts within minutes, without requiring massive datasets or prolonged retraining. Released as a sustainable open-source ecosystem, TissueLab aims to accelerate computational research and translational adoption in medical imaging while establishing a foundation for the next generation of medical AI.

[152] HiPerformer: A High-Performance Global-Local Segmentation Model with Modular Hierarchical Fusion Strategy

Dayu Tan,Zhenpeng Xu,Yansen Su,Xin Peng,Chunhou Zheng,Weimin Zhong

Main category: cs.CV

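TL;DR: Proposes HiPerformer, a medical image segmentation method that uses a modular hierarchical architecture and a local-global feature fusion module to integrate multi-source features, improving segmentation accuracy and robustness.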

Details Motivation: Existing methods based on CNN-Transformer hybrid architectures suffer from information conflict and loss during feature fusion and struggle to integrate local details with global context. Method: HiPerformer uses a modular hierarchical encoder that dynamically fuses multi-source features in parallel, together with a Local-Global Feature Fusion (LGFF) module and a Progressive Pyramid Aggregation (PPA) module that strengthen multi-scale representation and suppress noise. Result: Experiments on eleven public datasets show that the method outperforms existing segmentation techniques, with higher segmentation accuracy and robustness. Conclusion: Through its feature fusion mechanism, HiPerformer combines local details with global semantics effectively and improves medical image segmentation performance. Abstract: Both local details and global context are crucial in medical image segmentation, and effectively integrating them is essential for achieving high accuracy. However, existing mainstream methods based on CNN-Transformer hybrid architectures typically employ simple feature fusion techniques such as serial stacking, endpoint concatenation, or pointwise addition, which struggle to address the inconsistencies between features and are prone to information conflict and loss. To address these challenges, we propose HiPerformer. The encoder of HiPerformer employs a novel modular hierarchical architecture that dynamically fuses multi-source features in parallel, enabling layer-wise deep integration of heterogeneous information. The modular hierarchical design not only retains the independent modeling capability of each branch in the encoder, but also ensures sufficient information transfer between layers, effectively avoiding the degradation of features and information loss that come with traditional stacking methods. Furthermore, we design a Local-Global Feature Fusion (LGFF) module to achieve precise and efficient integration of local details and global semantic information, effectively alleviating the feature inconsistency problem and resulting in a more comprehensive feature representation. To further enhance multi-scale feature representation capabilities and suppress noise interference, we also propose a Progressive Pyramid Aggregation (PPA) module to replace traditional skip connections.
Experiments on eleven public datasets demonstrate that the proposed method outperforms existing segmentation techniques, demonstrating higher segmentation accuracy and robustness. The code is available at https://github.com/xzphappy/HiPerformer.

[153] PerFace: Metric Learning in Perceptual Facial Similarity for Enhanced Face Anonymization

Haruka Kumagai,Leslie Wöhler,Satoshi Ikehata,Kiyoharu Aizawa

Main category: cs.CV

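TL;DR: Proposes a human-perception-based face similarity metric. A dataset of 6,400 triplet annotations is built and metric learning is used to predict face similarity, giving a better balance between anonymity and naturalness in face anonymization.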

Details Motivation: Existing models focus only on binary identity classification (same person or not) and cannot measure finer degrees of identity similarity, so they cannot support the balance between anonymity and naturalness that face anonymization requires. Method: A human-perception-based face similarity metric is proposed, a dataset of 6,400 triplet annotations is built, and metric learning is used to predict similarity. Result: Experiments show significant improvements over existing methods in both face similarity prediction and attribute-based face classification. Conclusion: The proposed metric distinguishes degrees of similarity between different identities more finely, which helps balance anonymity and naturalness in face anonymization. Abstract: In response to rising societal awareness of privacy concerns, face anonymization techniques have advanced, including the emergence of face-swapping methods that replace one identity with another. Achieving a balance between anonymity and naturalness in face swapping requires careful selection of identities: overly similar faces compromise anonymity, while dissimilar ones reduce naturalness. Existing models, however, focus on binary identity classification ("the same person or not"), making it difficult to measure nuanced similarities such as "completely different" versus "highly similar but different." This paper proposes a human-perception-based face similarity metric, building a dataset of 6,400 triplet annotations and applying metric learning to predict similarity. Experimental results demonstrate significant improvements in both face similarity prediction and attribute-based face classification tasks over existing methods.
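
The abstract does not specify the loss used for metric learning on the triplet annotations. A standard option for this kind of data, assumed here for illustration only, is the triplet margin loss: given a reference face and two candidates, the embedding of the candidate that annotators judged more similar should be closer to the reference than the other by at least a margin. The function names, the 2-D toy embeddings, and the margin value are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, more_similar, less_similar, margin=0.2):
    """Hinge loss that is zero once the 'more similar' face is closer to the
    anchor than the 'less similar' face by at least `margin`."""
    d_pos = euclidean(anchor, more_similar)
    d_neg = euclidean(anchor, less_similar)
    return max(0.0, d_pos - d_neg + margin)


anchor = [0.0, 0.0]
near = [0.0, 1.0]   # distance 1 from the anchor
far = [3.0, 4.0]    # distance 5 from the anchor

loss_consistent = triplet_margin_loss(anchor, near, far)   # ordering matches the label
loss_violated = triplet_margin_loss(anchor, far, near)     # ordering contradicts the label
```

When the embedding already agrees with the human judgment the loss is zero, and when it disagrees the loss grows with the size of the violation, which is what pushes the learned distance toward perceived similarity.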

[154] FAST: Foreground-aware Diffusion with Accelerated Sampling Trajectory for Segmentation-oriented Anomaly Synthesis

Xichen Xu,Yanshu Wang,Jinbao Wang,Xiaoning Lei,Guoyang Xie,Guannan Jiang,Zhichao Lu

Main category: cs.CV

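TL;DR: Proposes FAST, a foreground-aware diffusion framework for industrial anomaly segmentation that uses two new modules to synthesize anomalies efficiently and with high quality.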

Details Motivation: Existing industrial anomaly synthesis methods struggle to balance sampling efficiency and generation quality, and they overlook the statistical differences between anomaly and background regions, so they cannot produce controllable, structure-specific anomalies. Method: FAST comprises AIAS (Anomaly-Informed Accelerated Sampling) and FARM (Foreground-Aware Reconstruction Module). AIAS accelerates the reverse process through coarse-to-fine aggregation and can complete synthesis in as few as 10 steps. FARM adaptively adjusts anomaly-aware noise within the masked foreground regions at each step, preserving localized anomaly signals. Result: Experiments on multiple industrial benchmarks show that FAST consistently outperforms existing anomaly synthesis methods on downstream segmentation tasks. Conclusion: By introducing a foreground-aware mechanism and an efficient sampling strategy, FAST improves both the quality and the efficiency of industrial anomaly synthesis and provides strong support for anomaly segmentation. Abstract: Industrial anomaly segmentation relies heavily on pixel-level annotations, yet real-world anomalies are often scarce, diverse, and costly to label. Segmentation-oriented industrial anomaly synthesis (SIAS) has emerged as a promising alternative; however, existing methods struggle to balance sampling efficiency and generation quality. Moreover, most approaches treat all spatial regions uniformly, overlooking the distinct statistical differences between anomaly and background areas. This uniform treatment hinders the synthesis of controllable, structure-specific anomalies tailored for segmentation tasks. In this paper, we propose FAST, a foreground-aware diffusion framework featuring two novel modules: the Anomaly-Informed Accelerated Sampling (AIAS) and the Foreground-Aware Reconstruction Module (FARM). AIAS is a training-free sampling algorithm specifically designed for segmentation-oriented industrial anomaly synthesis, which accelerates the reverse process through coarse-to-fine aggregation and enables the synthesis of state-of-the-art segmentation-oriented anomalies in as few as 10 steps. Meanwhile, FARM adaptively adjusts the anomaly-aware noise within the masked foreground regions at each sampling step, preserving localized anomaly signals throughout the denoising trajectory. Extensive experiments on multiple industrial benchmarks demonstrate that FAST consistently outperforms existing anomaly synthesis methods in downstream segmentation tasks. We release the code at: https://anonymous.4open.science/r/NeurIPS-938.

[155] A Comprehensive Evaluation of YOLO-based Deer Detection Performance on Edge Devices

Bishal Adhikari,Jiajia Li,Eric S. Michel,Jacob Dykes,Te-Ming Paul Tseng,Mary Love Tagert,Dong Chen

Main category: cs.CV

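TL;DR: To address deer intrusion on farmland, this study proposes a deep-learning-based real-time detection approach, releases a public dataset of 3,095 annotated images, and systematically evaluates YOLO-family models across hardware platforms, finding that smaller, architecturally advanced models (such as YOLOv11n, YOLOv8s, and YOLOv9s) offer the best balance of accuracy and efficiency.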

Details Motivation: Traditional deer-mitigation measures are costly and ineffective, and intelligent, autonomous detection systems suited to modern agriculture are lacking; the field also lacks a dedicated dataset and studies of on-field deployment, which holds back progress in deer detection. Method: A public dataset of 3,095 deer images with bounding-box annotations is built, 12 model variants across four YOLO architectures (v8 to v11) are compared, and detection performance and real-time capability are evaluated on an NVIDIA RTX 5090, a Raspberry Pi 5, and an NVIDIA Jetson AGX Xavier. Result: YOLOv11n, YOLOv8s, and YOLOv9s achieve AP@.5 above 0.85 while running at more than 30 FPS; the Raspberry Pi cannot reach real-time detection without hardware-specific model optimization, whereas the Jetson AGX Xavier reaches real-time performance on the 's' and 'n' series models with GPU-accelerated inference. Conclusion: Lightweight but architecturally advanced YOLO models strike a good balance between accuracy and computational efficiency, making them suitable for deployment on edge devices for deer monitoring in real farmland and supporting the adoption of intelligent crop-protection systems. Abstract: The escalating economic losses in agriculture due to deer intrusion, estimated to be in the hundreds of millions of dollars annually in the U.S., highlight the inadequacy of traditional mitigation strategies since these methods are often labor-intensive, costly, and ineffective for modern farming systems. To overcome this, there is a critical need for intelligent, autonomous solutions which require accurate and efficient deer detection. However, progress in this field is impeded by a significant gap in the literature, mainly the lack of a domain-specific, practical dataset and limited study on the on-field deployability of deer detection systems. Addressing this gap, this study presents a comprehensive evaluation of state-of-the-art deep learning models for deer detection in challenging real-world scenarios. The contributions of this work are threefold. First, we introduce a curated, publicly available dataset of 3,095 annotated images with bounding-box annotations of deer, derived from the Idaho Cameratraps project. Second, we provide an extensive comparative analysis of 12 model variants across four recent YOLO architectures (v8, v9, v10, and v11). Finally, we benchmarked performance on a high-end NVIDIA RTX 5090 GPU and evaluated on two representative edge computing platforms: Raspberry Pi 5 and NVIDIA Jetson AGX Xavier. Results show that real-time detection is not feasible on Raspberry Pi without hardware-specific model optimization, while NVIDIA Jetson provides greater than 30 FPS with GPU-accelerated inference on 's' and 'n' series models. This study also reveals that smaller, architecturally advanced models such as YOLOv11n, YOLOv8s, and YOLOv9s offer the optimal balance of high accuracy (AP@.5 > 0.85) and computational efficiency (FPS > 30). To support further research, both the source code and datasets are publicly available at https://github.com/WinnerBishal/track-the-deer.
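The two thresholds quoted in the abstract (AP@.5 > 0.85 and FPS > 30) can be illustrated with a minimal sketch of how throughput might be timed and checked. The helper names `measure_fps` and `meets_targets`, the warm-up count, and the dummy workload are assumptions made for illustration; they are not taken from the released code, and the paper's actual benchmarking protocol may differ.

```python
import time
from typing import Callable, Iterable

AP50_TARGET = 0.85   # accuracy threshold quoted in the abstract (AP@.5)
FPS_TARGET = 30.0    # real-time threshold quoted in the abstract


def measure_fps(infer: Callable[[object], object],
                frames: Iterable[object],
                warmup: int = 3) -> float:
    """Return the average frames per second of `infer` over `frames`.

    The first `warmup` frames are run but not timed, a common way to
    exclude one-off start-up costs (an assumption, not the paper's protocol).
    """
    frames = list(frames)
    for frame in frames[:warmup]:
        infer(frame)
    timed = frames[warmup:]
    if not timed:
        raise ValueError("need more frames than warm-up iterations")
    start = time.perf_counter()
    for frame in timed:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(timed) / elapsed if elapsed > 0 else float("inf")


def meets_targets(ap50: float, fps: float) -> bool:
    """True when a model clears both thresholds (strict inequalities)."""
    return ap50 > AP50_TARGET and fps > FPS_TARGET


if __name__ == "__main__":
    # Stand-in for a detector call; replace with a real model's inference.
    def dummy_infer(frame):
        return sum(frame)

    fps = measure_fps(dummy_infer, [[1, 2, 3]] * 103)
    print(f"dummy workload: {fps:.0f} FPS")
```

On a real device, `infer` would wrap the detector's forward pass, and the AP@.5 value passed to `meets_targets` would come from a separate validation run.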

[156] Efficient Encoder-Free Pose Conditioning and Pose Control for Virtual Try-On

Qi Li,Shuwen Qiu,Julien Han,Xingzi Xu,Mehmet Saygin Seyfioglu,Kee Kiat Koo,Karim Bouyarmane

Main category: cs.CV

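TL;DR: This paper studies how to introduce pose control into Virtual Try-On (VTON): by spatially concatenating pose data (pose maps or skeletons) into a baseline model, it improves pose preservation and output realism without adding parameters, and it proposes a mixed-mask training strategy to support flexible product integration across different poses.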

Details Motivation: To achieve more accurate pose alignment and an immersive experience with diverse orientations in virtual try-on, pose control must be integrated effectively, but this raises challenges in choosing the pose representation, integrating it without additional parameters, and balancing pose preservation against flexible control. Method: Building on a baseline VTON model that needs no external encoder or complex attention structure, pose maps or skeletons are supplied as conditioning inputs via spatial concatenation; the different pose representations are compared, and training uses a mix of fine-grained and bounding-box masks. Result: Experiments show that concatenating pose maps works best, improving both pose preservation and image realism; the mixed-mask strategy makes product integration more flexible across poses and conditions. Conclusion: Within the pure concatenation paradigm, pose-map concatenation and mixed-mask training achieve high-quality pose-controlled VTON without extra network modules or parameters, balancing performance and simplicity. Abstract: As online shopping continues to grow, the demand for Virtual Try-On (VTON) technology has surged, allowing customers to visualize products on themselves by overlaying product images onto their own photos. An essential yet challenging condition for effective VTON is pose control, which ensures accurate alignment of products with the user's body while supporting diverse orientations for a more immersive experience. However, incorporating pose conditions into VTON models presents several challenges, including selecting the optimal pose representation, integrating poses without additional parameters, and balancing pose preservation with flexible pose control. In this work, we build upon a baseline VTON model that concatenates the reference image condition without an external encoder, control network, or complex attention layers. We investigate methods to incorporate pose control into this pure concatenation paradigm by spatially concatenating pose data, comparing performance using pose maps and skeletons, without adding any additional parameters or modules to the baseline model. Our experiments reveal that pose stitching with pose maps yields the best results, enhancing both pose preservation and output realism. Additionally, we introduce a mixed-mask training strategy using fine-grained and bounding box masks, allowing the model to support flexible product integration across varied poses and conditions.
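The two ingredients named in the abstract, spatial concatenation of conditions and bounding-box masks derived from fine-grained ones, can be sketched on toy data. Everything concrete here is an assumption for illustration: the helper names `concat_width` and `bbox_mask`, the single-channel nested-list images, the choice of the width axis, and the ordering of the inputs. The abstract does not specify these details, nor how the two mask types are mixed during training.

```python
from typing import List

Grid = List[List[int]]  # a single-channel H x W image as nested lists


def concat_width(*images: Grid) -> Grid:
    """Spatially concatenate same-height images side by side (along width)."""
    height = len(images[0])
    if any(len(img) != height for img in images):
        raise ValueError("all inputs must share the same height")
    return [sum((img[r] for img in images), []) for r in range(height)]


def bbox_mask(fine_mask: Grid) -> Grid:
    """Coarsen a fine-grained binary mask to its bounding-box mask."""
    rows = [r for r, row in enumerate(fine_mask) if any(row)]
    cols = [c for row in fine_mask for c, v in enumerate(row) if v]
    if not rows:  # empty mask: nothing to cover
        return [[0] * len(row) for row in fine_mask]
    top, bottom, left, right = min(rows), max(rows), min(cols), max(cols)
    return [[1 if top <= r <= bottom and left <= c <= right else 0
             for c in range(len(row))]
            for r, row in enumerate(fine_mask)]


if __name__ == "__main__":
    person = [[1, 1], [1, 1]]    # toy person image
    garment = [[2, 2], [2, 2]]   # toy reference (product) image
    pose = [[3, 3], [3, 3]]      # toy pose map
    print(concat_width(person, garment, pose))

    fine = [[0, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 0]]
    print(bbox_mask(fine))
```

Because the pose map only widens the input canvas, a model that already consumes concatenated conditions needs no new layers to accept it, which is consistent with the abstract's claim of adding no parameters.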

[157] PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation

Chen Wang,Chuhao Chen,Yiming Huang,Zhiyang Dou,Yuan Liu,Jiatao Gu,Lingjie Liu

Main category: cs.CV

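TL;DR: This paper proposes PhysCtrl, a physics-grounded image-to-video generation framework that uses physical parameters and force control to generate videos with physical plausibility and 3D controllability.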

Details Motivation: Existing video generation models excel at producing realistic videos but lack physical plausibility and 3D controllability. Method: A generative physics network uses a diffusion model to learn the distribution of physical dynamics for four materials (elastic, sand, plasticine, and rigid); it is trained on a large-scale synthetic dataset and adds a novel spatiotemporal attention block that emulates particle interactions and strengthens physical plausibility. Result: Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories that, when used to drive image-to-video models, yield high-fidelity, controllable videos that outperform existing methods in visual quality and physical plausibility. Conclusion: PhysCtrl achieves greater physical plausibility and controllability in image-to-video generation and points to a new direction for physics-based generative models. Abstract: Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. At its core is a generative physics network that learns the distribution of physical dynamics across four materials (elastic, sand, plasticine, and rigid) via a diffusion model conditioned on physics parameters and applied forces. We represent physical dynamics as 3D point trajectories and train on a large-scale synthetic dataset of 550K animations generated by physics simulators. We enhance the diffusion model with a novel spatiotemporal attention block that emulates particle interactions and incorporates physics-based constraints during training to enforce physical plausibility. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos that outperform existing methods in both visual quality and physical plausibility. Project Page: https://cwchenwang.github.io/physctrl

[158] EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

Xuan Ju,Tianyu Wang,Yuqian Zhou,He Zhang,Qing Liu,Nanxuan Zhao,Zhifei Zhang,Yijun Li,Yuanhao Cai,Shaoteng Liu,Daniil Pakhomov,Zhe Lin,Soo Ye Kim,Qiang Xu

Main category: cs.CV

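TL;DR: This paper proposes EditVerse, a unified framework for image and video generation and editing that represents text, images, and video as a single token sequence to enable cross-modal knowledge transfer and in-context learning; it also builds a large-scale video editing dataset and EditVerseBench, the first benchmark for instruction-based video editing, and experiments show state-of-the-art performance.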

Details Motivation: Video generation and editing remain fragmented because of architectural limitations and data scarcity, and they lack a unified framework, whereas the image domain has already moved to unified modeling; a general model that handles both image and video generation and editing is therefore needed. Method: Text, images, and video are unified into a single token sequence, and self-attention is used for cross-modal learning; a data pipeline curates 232K video editing samples, which are combined with large-scale image and video data for joint training. Result: The model achieves state-of-the-art performance across multiple tasks, surpasses existing open-source and commercial models, shows strong generation and editing ability at arbitrary resolutions and durations, and exhibits emergent cross-modal abilities. Conclusion: EditVerse unifies image and video generation and editing in one model, advances multimodal content generation toward a unified architecture, and supports the effectiveness of large-scale joint training and unified representation. Abstract: Recent advances in foundation models highlight a clear trend toward unification and scaling, showing emergent capabilities across diverse domains. While image generation and editing have rapidly transitioned from task-specific to unified frameworks, video generation and editing remain fragmented due to architectural limitations and data scarcity. In this work, we introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning, natural cross-modal knowledge transfer, and flexible handling of inputs and outputs with arbitrary resolutions and durations. To address the lack of video editing training data, we design a scalable data pipeline that curates 232K video editing samples and combines them with large-scale image and video datasets for joint training. Furthermore, we present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions. Extensive experiments and user studies demonstrate that EditVerse achieves state-of-the-art performance, surpassing existing open-source and commercial models, while exhibiting emergent editing and generation abilities across modalities.