cs.CL

[1] RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning

Atli Sigurgeirsson,Simon King

Main category: cs.CL

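TL;DR: The study proposes a new fine-tuning method that exploits a model's uncontrollable variance to address the control limitations of prompt-based text-to-speech models.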

Details

Motivation: Prompt-based text-to-speech models let users control different aspects of speech through natural language instructions, but control is limited to the acoustic features exposed to the model during training; at the same time the models are too flexible, so the same input produces uncontrollable variation. Method: Principal component analysis of thousands of synthesised samples identifies the latent features that account for the highest proportion of output variance, and these are used as new labels for secondary fine-tuning. Result: The method is evaluated on two models trained on an expressive Icelandic speech corpus, one with emotional disclosure and one without. For the model without emotional disclosure, it yields both continuous and discrete features that improve the model's overall controllability. Conclusion: By exploiting the model's uncontrollable variance, the approach improves overall controllability, particularly for the model without emotional disclosure. Abstract: A Prompt-based Text-To-Speech model allows a user to control different aspects of speech, such as speaking rate and perceived gender, through natural language instruction. Although user-friendly, such approaches are on one hand constrained: control is limited to acoustic features exposed to the model during training, and too flexible on the other: the same inputs yield uncontrollable variation that is reflected in the corpus statistics. We investigate a novel fine-tuning regime to address both of these issues at the same time by exploiting the uncontrollable variance of the model. Through principal component analysis of thousands of synthesised samples, we determine latent features that account for the highest proportion of the output variance and incorporate them as new labels for secondary fine-tuning. We evaluate the proposed methods on two models trained on an expressive Icelandic speech corpus, one with emotional disclosure and one without. In the case of the model without emotional disclosure, the method yields both continuous and discrete features that improve overall controllability of the model.

[2] MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model

K. Sahit Reddy,N. Ragavenderan,Vasanth K.,Ganesh N. Naik,Vishalakshi Prabhu,Nagaraja G. S

Main category: cs.CL

TL;DR: MedicalBERT is a domain-optimized language model that improves upon BERT for biomedical NLP tasks, achieving better performance than existing models through specialized training and vocabulary.

Details

Motivation: Biomedical literature poses challenges due to domain-specific terminology, which traditional models like Word2Vec and Bi-LSTM cannot fully address. While GPT and T5 capture context, they lack bidirectional understanding, unlike BERT. Method: The authors proposed MedicalBERT, a pretrained BERT model trained on a large biomedical dataset with domain-specific vocabulary. The model was further optimized and fine-tuned for diverse tasks like named entity recognition, relation extraction, question answering, sentence similarity, and document classification. Result: MedicalBERT outperforms existing BERT-based models (BioBERT, SciBERT, ClinicalBERT) on most benchmarks, surpassing the general-purpose BERT model by an average of 5.67% across all evaluated tasks. Conclusion: MedicalBERT demonstrates superior performance compared to other BERT-based models in various biomedical NLP tasks, highlighting the effectiveness of domain-specific pretraining and transfer learning techniques. Abstract: Recent advances in natural language processing (NLP) have been driven by pretrained language models like BERT, RoBERTa, T5, and GPT. These models excel at understanding complex texts, but biomedical literature, with its domain-specific terminology, poses challenges that models like Word2Vec and bidirectional long short-term memory (Bi-LSTM) can't fully address. GPT and T5, despite capturing context, fall short in tasks needing bidirectional understanding, unlike BERT. Addressing this, we proposed MedicalBERT, a pretrained BERT model trained on a large biomedical dataset and equipped with domain-specific vocabulary that enhances the comprehension of biomedical terminology. The MedicalBERT model is further optimized and fine-tuned to address diverse tasks, including named entity recognition, relation extraction, question answering, sentence similarity, and document classification.
Performance metrics such as the F1-score, accuracy, and Pearson correlation are employed to showcase the efficiency of our model in comparison to other BERT-based models such as BioBERT, SciBERT, and ClinicalBERT. MedicalBERT outperforms these models on most of the benchmarks, and surpasses the general-purpose BERT model by 5.67% on average across all the tasks evaluated. This work also underscores the potential of leveraging pretrained BERT models for medical NLP tasks, demonstrating the effectiveness of transfer learning techniques in capturing domain-specific information.

[3] Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking

Aldan Creo,Raul Castro Fernandez,Manuel Cebrian

Main category: cs.CL

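TL;DR: The study analyzes jailbreak attempts in more than 2 million real-world conversations and finds that their complexity is not significantly higher than that of normal conversations, while the safety of assistant responses has improved over time. It challenges the assumption of continual escalation between attackers and defenders and argues that the limits of human ingenuity may be a key factor in how AI safety evolves.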

Details

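Motivation: As large language models (LLMs) are deployed more widely, understanding the complexity and evolution of jailbreaking strategies is critical for AI safety. The study examines the actual complexity of current jailbreak attacks and how it is trending, in order to assess the effectiveness of AI safety mechanisms and the potential risks. Method: The study uses more than 2 million real-world conversations from diverse platforms, including dedicated jailbreaking communities and general-purpose chatbots. Jailbreak attempts are analyzed with a range of complexity metrics (probabilistic measures, lexical diversity, compression ratios, and cognitive load indicators), together with a temporal trend analysis. Result: Jailbreak attempts are not significantly more complex than normal conversations, a pattern that holds across both specialized jailbreaking communities and general users. The toxicity and complexity of user attacks remain stable over time, while the toxicity of assistant responses declines. No power-law scaling is observed in the complexity distributions, which suggests a natural ceiling on jailbreak development. Conclusion: Although attackers try to get around LLM safety measures, their attempts are not significantly more complex than normal conversations, and assistant response toxicity has fallen over time, indicating that safety mechanisms are improving. The study challenges the conventional narrative of an arms race between attackers and defenders and suggests that the limits of human ingenuity may bound the evolution of model safety. Abstract: As large language models (LLMs) become increasingly deployed, understanding the complexity and evolution of jailbreaking strategies is critical for AI safety. We present a mass-scale empirical analysis of jailbreak complexity across over 2 million real-world conversations from diverse platforms, including dedicated jailbreaking communities and general-purpose chatbots. Using a range of complexity metrics spanning probabilistic measures, lexical diversity, compression ratios, and cognitive load indicators, we find that jailbreak attempts do not exhibit significantly higher complexity than normal conversations. This pattern holds consistently across specialized jailbreaking communities and general user populations, suggesting practical bounds on attack sophistication. Temporal analysis reveals that while user attack toxicity and complexity remain stable over time, assistant response toxicity has decreased, indicating improving safety mechanisms. The absence of power-law scaling in complexity distributions further points to natural limits on jailbreak development. Our findings challenge the prevailing narrative of an escalating arms race between attackers and defenders, instead suggesting that LLM safety evolution is bounded by human ingenuity constraints while defensive measures continue advancing.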
Our results highlight critical information hazards in academic jailbreak disclosure, as sophisticated attacks exceeding current complexity baselines could disrupt the observed equilibrium and enable widespread harm before defensive adaptation.
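Two of the metric families named in the abstract, compression ratios and lexical diversity, can be illustrated with a minimal sketch. The exact definitions used in the paper are not given here, so the formulas below (zlib-compressed size over raw size, and a whitespace-token type-token ratio) and the function names are assumptions chosen for illustration.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size in UTF-8 bytes; lower means more redundant text."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def type_token_ratio(text: str) -> float:
    """Distinct whitespace-separated tokens divided by total tokens (assumed tokenization)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```

Scores of this kind could be computed per conversation and compared between groups; very short texts can give a compression ratio above 1 because of the fixed zlib overhead.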

[4] Assessing the Capabilities and Limitations of FinGPT Model in Financial NLP Applications

Prudence Djagba,Chimezie A. Odinakachukwu

Main category: cs.CL

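TL;DR: The paper evaluates FinGPT's natural language processing capabilities in the financial domain and finds that it performs well on classification tasks but falls significantly short on reasoning and generation tasks.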

Details

Motivation: To evaluate FinGPT's performance in the financial domain, identify its strengths and limitations across different NLP tasks, and provide a benchmark for future research. Method: FinGPT is evaluated on six key natural language processing tasks, using finance-specific datasets to test its capabilities and limitations in real-world financial applications. Result: FinGPT performs strongly on classification tasks, at a level comparable to GPT-4, but performs markedly worse on tasks involving reasoning and generation. Comparisons with GPT-4 and human benchmarks highlight notable performance gaps in numerical accuracy and complex reasoning. Conclusion: FinGPT is effective for certain structured financial tasks but is not yet a comprehensive solution. The study provides a useful benchmark for future research and underscores the need for architectural improvements and domain-specific optimization in financial language models. Abstract: This work evaluates FinGPT, a financial domain-specific language model, across six key natural language processing (NLP) tasks: Sentiment Analysis, Text Classification, Named Entity Recognition, Financial Question Answering, Text Summarization, and Stock Movement Prediction. The evaluation uses finance-specific datasets to assess FinGPT's capabilities and limitations in real-world financial applications. The results show that FinGPT performs strongly in classification tasks such as sentiment analysis and headline categorization, often achieving results comparable to GPT-4. However, its performance is significantly lower in tasks that involve reasoning and generation, such as financial question answering and summarization. Comparisons with GPT-4 and human benchmarks highlight notable performance gaps, particularly in numerical accuracy and complex reasoning. Overall, the findings indicate that while FinGPT is effective for certain structured financial tasks, it is not yet a comprehensive solution. This research provides a useful benchmark for future research and underscores the need for architectural improvements and domain-specific optimization in financial language models.

[5] Mechanistic Indicators of Understanding in Large Language Models

Pierre Beckmann,Matthieu Queloz

Main category: cs.CL

TL;DR: This paper explores mechanistic interpretability in LLMs, proposing a three-tiered framework for machine understanding based on internal structural connections, while concluding that LLM cognition differs significantly from human understanding.

Details

Motivation: To challenge the view that Large Language Models (LLMs) rely solely on superficial statistics and to provide an accessible introduction to mechanistic interpretability while proposing a novel framework for understanding machine cognition. Method: The paper synthesizes recent findings in mechanistic interpretability (MI) and introduces a three-tiered theoretical framework for machine understanding. It integrates these findings into a coherent structure while exploring the 'parallel mechanisms' phenomenon. Result: A synthesis of MI findings revealed that LLMs develop internal structures analogous to human understanding. This led to the proposal of a three-tiered model: conceptual understanding (learning connections through features), state-of-the-world understanding (tracking factual dynamics), and principled understanding (discovering circuits connecting facts). Conclusion: The paper concludes that while LLMs demonstrate forms of understanding, their cognitive architecture differs from human cognition. The focus should shift from whether LLMs understand to understanding how their unique cognitive processes function. Abstract: Recent findings in mechanistic interpretability (MI), the field probing the inner workings of Large Language Models (LLMs), challenge the view that these models rely solely on superficial statistics. Here, we offer an accessible synthesis of these findings that doubles as an introduction to MI, all while integrating these findings within a novel theoretical framework for thinking about machine understanding. We argue that LLMs develop internal structures that are functionally analogous to the kind of understanding that consists in seeing connections. To sharpen this idea, we propose a three-tiered conception of machine understanding. First, conceptual understanding emerges when a model forms "features" as directions in latent space, thereby learning the connections between diverse manifestations of something. 
Second, state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world. Third, principled understanding emerges when a model ceases to rely on a collection of memorized facts and discovers a "circuit" that connects these facts. However, we conclude by exploring the "parallel mechanisms" phenomenon, arguing that while LLMs exhibit forms of understanding, their cognitive architecture remains different from ours, and the debate should shift from whether LLMs understand to how their strange minds work.

[6] Review, Remask, Refine (R3): Process-Guided Block Diffusion for Text Generation

Nikita Mounier,Parsa Idehpour

Main category: cs.CL

TL;DR: R3 framework improves iterative text generation by reviewing, remasking, and refining generated content using a Process Reward Model.

Details

Motivation: The motivation is to enable models to identify and correct their own errors efficiently in iterative text generation. Method: Review, Remask, Refine (R3) framework uses a Process Reward Model to evaluate and refine text generation. Result: The result is an improved final output as the model focuses more intensively on specific sub-optimal parts of past generations. Conclusion: R3 framework successfully enhances the iterative text generation by efficiently identifying and correcting errors without additional model training. Abstract: A key challenge for iterative text generation is enabling models to efficiently identify and correct their own errors. We propose Review, Remask, Refine (R3), a relatively simple yet elegant framework that requires no additional model training and can be applied to any pre-trained masked text diffusion model (e.g., LLaDA or BD3-LM). In R3, a Process Reward Model (PRM) is utilized for the Review of intermediate generated blocks. The framework then translates these PRM scores into a Remask strategy: the lower a block's PRM score, indicating potential mistakes, the greater the proportion of tokens within that block are remasked. Finally, the model is compelled to Refine these targeted segments, focusing its efforts more intensively on specific sub-optimal parts of past generations, leading to improved final output.
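The Remask step described in the abstract (a lower PRM score means a larger share of the block's tokens is remasked) can be sketched as follows. The abstract does not give the mapping from score to proportion, so the linear rule `ratio = 1 - score`, the assumption that scores lie in [0, 1], the random choice of positions, and the names `remask_blocks` and `mask_token` are all hypothetical.

```python
import random


def remask_blocks(blocks, prm_scores, mask_token="[MASK]", seed=0):
    """Remask a share of each block's tokens that grows as its PRM score falls.

    blocks: list of token lists; prm_scores: one score in [0, 1] per block.
    The linear rule ratio = 1 - score is an assumption for illustration.
    """
    rng = random.Random(seed)
    remasked = []
    for tokens, score in zip(blocks, prm_scores):
        ratio = min(1.0, max(0.0, 1.0 - score))  # clamp to a valid proportion
        n_mask = round(ratio * len(tokens))
        positions = set(rng.sample(range(len(tokens)), n_mask))
        remasked.append(
            [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
        )
    return remasked
```

In a full pipeline the remasked blocks would be handed back to the masked diffusion model for the Refine step; that model call is outside this sketch.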

[7] Signal or Noise? Evaluating Large Language Models in Resume Screening Across Contextual Variations and Human Expert Benchmarks

Aryan Varshney,Venkat Ram Reddy Ganuthula

Main category: cs.CL

TL;DR: The study finds that while LLMs show some consistent behavior when evaluating resumes, they significantly differ from human judgment, suggesting caution in their use for automated hiring systems.

Details

Motivation: The study aims to understand whether large language models (LLMs) exhibit consistent behavior or random variation when screening resumes against job descriptions and how their performance compares to human experts. Method: The study used controlled datasets to test three LLMs (Claude, GPT, and Gemini) across various contexts (No Company, Firm1 [MNC], Firm2 [Startup], Reduced Context) with identical and randomized resumes. The results were benchmarked against three human recruitment experts. Statistical methods like analysis of variance and paired t-tests were employed for data analysis. Result: Analysis of variance revealed significant mean differences in four of eight LLM-only conditions and consistently significant differences between LLM and human evaluations. Paired t-tests showed varying levels of adaptation to company context among the LLMs, with GPT adapting strongly, Gemini partially, and Claude minimally. All LLMs differed significantly from human experts across contexts. Conclusion: LLMs offer interpretable patterns with detailed prompts but diverge substantially from human judgment, suggesting careful consideration in their deployment in automated hiring systems. Abstract: This study investigates whether large language models (LLMs) exhibit consistent behavior (signal) or random variation (noise) when screening resumes against job descriptions, and how their performance compares to human experts. Using controlled datasets, we tested three LLMs (Claude, GPT, and Gemini) across contexts (No Company, Firm1 [MNC], Firm2 [Startup], Reduced Context) with identical and randomized resumes, benchmarked against three human recruitment experts. Analysis of variance revealed significant mean differences in four of eight LLM-only conditions and consistently significant differences between LLM and human evaluations (p < 0.01). 
Paired t-tests showed GPT adapts strongly to company context (p < 0.001), Gemini partially (p = 0.038 for Firm1), and Claude minimally (p > 0.1), while all LLMs differed significantly from human experts across contexts. Meta-cognition analysis highlighted adaptive weighting patterns that differ markedly from human evaluation approaches. Findings suggest LLMs offer interpretable patterns with detailed prompts but diverge substantially from human judgment, informing their deployment in automated hiring systems.

[8] Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation

Zhibo Zhang,Yuxi Li,Kailong Wang,Shuai Yuan,Ling Shi,Haoyu Wang

Main category: cs.CL

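TL;DR: The study proposes a new framework called ETTA that uses linear transformations to identify and attenuate toxicity-sensitive dimensions in embedding space, bypassing model refusal behavior while preserving linguistic coherence.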

Details

Motivation: Because of the openness of large language models (LLMs), embedding space poisoning poses significant security risks as an attack vector, and existing universal perturbation methods leave the safety alignment of LLMs at the embedding level insufficiently understood. Method: The ETTA framework uses linear transformations to identify and attenuate toxicity-sensitive dimensions in embedding space, and mounts the attack without model fine-tuning or access to training data. Result: Evaluated on five representative open-source LLMs using the AdvBench benchmark, ETTA achieves an average attack success rate of 88.61%, outperforming the best baseline by 11.34%, and it generalizes to safety-enhanced models. Conclusion: The results expose a critical vulnerability in current alignment strategies and underscore the need for embedding-aware defenses. Abstract: Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. However, this openness also introduces significant security risks, particularly through embedding space poisoning, which is a subtle attack vector where adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. While previous research has investigated universal perturbation methods, the dynamics of LLM safety alignment at the embedding level remain insufficiently understood. Consequently, more targeted and accurate adversarial perturbation techniques, which pose significant threats, have not been adequately studied. In this work, we propose ETTA (Embedding Transformation Toxicity Attenuation), a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations. ETTA bypasses model refusal behaviors while preserving linguistic coherence, without requiring model fine-tuning or access to training data. Evaluated on five representative open-source LLMs using the AdvBench benchmark, ETTA achieves a high average attack success rate of 88.61%, outperforming the best baseline by 11.34%, and generalizes to safety-enhanced models (e.g., 77.39% ASR on instruction-tuned defenses). These results highlight a critical vulnerability in current alignment strategies and underscore the need for embedding-aware defenses.

[9] Unveiling Effective In-Context Configurations for Image Captioning: An External & Internal Analysis

Li Li,Yongliang Wu,Jingze Zhu,Jiawei Peng,Jianfei Cai,Xu Yang

Main category: cs.CL

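TL;DR: Through a systematic study of external demonstration configurations and internal attention mechanisms, the paper examines multimodal in-context learning in large multimodal models, showing how configuration strategies affect performance, identifying typical attention characteristics, and proposing new metrics and directions for application.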

Details

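Motivation: The success of large language models has inspired researchers to develop large multimodal models with in-context learning capabilities. However, exploration of demonstration configuration for multimodal in-context learning is still preliminary. In addition, the controllability of in-context examples offers an efficient and cost-effective way to observe and analyze the inference characteristics of large multimodal models under varying inputs. Method: The paper combines external and internal analysis. Externally, it explores demonstration configuration strategies along three dimensions (shot number, image retrieval, and caption assignment) and uses multiple metrics to systematically evaluate and summarize the key findings. Internally, it analyzes the attention characteristics of typical multimodal models and develops attention-based metrics to quantify model behavior. Result: The study shows how different demonstration configuration strategies affect model performance and, through internal inspection, finds typical attention patterns. Auxiliary experiments explore the feasibility of attention-driven model acceleration and compression, and the paper compares performance differences between large multimodal models with identical model design and pretraining strategies. Conclusion: Through a comprehensive analysis of external and internal factors, the paper shows how in-context example configuration strategies affect model performance and offers dual perspectives for understanding multimodal in-context learning in large multimodal models. It also suggests that the combined external and internal analysis, together with the newly proposed metrics, can be applied to broader research areas. Abstract: The evolution of large models has witnessed the emergence of In-Context Learning (ICL) capabilities. In Natural Language Processing (NLP), numerous studies have demonstrated the effectiveness of ICL. Inspired by the success of Large Language Models (LLMs), researchers have developed Large Multimodal Models (LMMs) with ICL capabilities. However, explorations of demonstration configuration for multimodal ICL remain preliminary. Additionally, the controllability of In-Context Examples (ICEs) provides an efficient and cost-effective means to observe and analyze the inference characteristics of LMMs under varying inputs. This paper conducts a comprehensive external and internal investigation of multimodal in-context learning on the image captioning task. Externally, we explore demonstration configuration strategies through three dimensions: shot number, image retrieval, and caption assignment. We employ multiple metrics to systematically and thoroughly evaluate and summarize key findings. Internally, we analyze typical LMM attention characteristics and develop attention-based metrics to quantify model behaviors. We also conduct auxiliary experiments to explore the feasibility of attention-driven model acceleration and compression. We further compare performance variations between LMMs with identical model design and pretraining strategies and explain the differences from the angle of pre-training data features.
Our study reveals both how ICE configuration strategies impact model performance, through external experiments, and typical characteristic patterns, through internal inspection, providing dual perspectives for understanding multimodal ICL in LMMs. Our method of combining external and internal analysis to investigate large models, along with our newly proposed metrics, can be applied to broader research areas.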

[10] "Amazing, They All Lean Left" -- Analyzing the Political Temperaments of Current LLMs

W. Russell Neuman,Chad Coleman,Ali Dasdan,Safinah Ali,Manan Shah,Kund Meghani

Main category: cs.CL

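TL;DR: The study analyzes the political leanings of seven prominent large language models and finds that they generally lean liberal, which it attributes to training corpora, reinforcement learning from human feedback, academic ethical frameworks, and safety-driven fine-tuning practices.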

Details

Motivation: Recent studies show that most commercial large language models (LLMs) display a consistent liberal orientation in their ethical and political responses, but the underlying causes and implications remain unclear. The paper therefore sets out to investigate systematically the mechanisms behind this phenomenon and its significance. Method: The study analyzes seven prominent LLMs with a multi-pronged approach that includes Moral Foundations Theory, a dozen established political ideology scales, and a new index of current political controversies. It also compares base and fine-tuned models and examines the factors behind the differences. Result: Most models prioritize liberal-leaning values, particularly care and fairness. Fine-tuning generally strengthens this liberal lean, an effect confirmed through both self-report and empirical testing. Conclusion: The paper concludes that the political leaning of LLMs is not a programming error or the personal preference of programmers but a property that emerges from training on democratic rights-focused discourse. This "liberal tilt" may offer a new lens for examining collective reasoning rather than undermining democratic discourse. Abstract: Recent studies have revealed a consistent liberal orientation in the ethical and political responses generated by most commercial large language models (LLMs), yet the underlying causes and resulting implications remain unclear. This paper systematically investigates the political temperament of seven prominent LLMs (OpenAI's GPT-4o, Anthropic's Claude Sonnet 4, Perplexity (Sonar Large), Google's Gemini 2.5 Flash, Meta AI's Llama 4, Mistral 7b Le Chat and High-Flyer's DeepSeek R1) using a multi-pronged approach that includes Moral Foundations Theory, a dozen established political ideology scales and a new index of current political controversies. We find strong and consistent prioritization of liberal-leaning values, particularly care and fairness, across most models. Further analysis attributes this trend to four overlapping factors: liberal-leaning training corpora, reinforcement learning from human feedback (RLHF), the dominance of liberal frameworks in academic ethical discourse and safety-driven fine-tuning practices. We also distinguish between political "bias" and legitimate epistemic differences, cautioning against conflating the two. A comparison of base and fine-tuned model pairs reveals that fine-tuning generally increases liberal lean, an effect confirmed through both self-report and empirical testing. We argue that this "liberal tilt" is not a programming error or the personal preference of programmers but an emergent property of training on democratic rights-focused discourse.
Finally, we propose that LLMs may indirectly echo John Rawls' famous veil-of-ignorance philosophical aspiration, reflecting a moral stance unanchored to personal identity or interest. Rather than undermining democratic discourse, this pattern may offer a new lens through which to examine collective reasoning.

[11] Better Together: Quantifying the Benefits of AI-Assisted Recruitment

Ada Aka,Emil Palikot,Ali Ansari,Nima Yazdani

Main category: cs.CL

TL;DR: This study explores how AI in recruitment can enhance hiring efficiency and affect candidate outcomes, showing that AI-assisted methods have a significant impact on employment probabilities and selection preferences.

Details

Motivation: To understand the impact of artificial intelligence on recruitment efficiency and candidate selection, as empirical evidence in this area is limited. Method: Randomly assigned 37,000 applicants for a junior-developer position to either a traditional recruitment process or an AI-assisted recruitment pipeline, analyzing the outcomes and differences between the two methods. Result: 54% of candidates passed the final interview in the AI-assisted pipeline compared to 34% in the traditional pipeline; 23% of AI-assisted applicants found new jobs compared to 18% in the traditional method. AI favored younger, less experienced candidates. Conclusion: The AI-assisted recruitment pipeline improves hiring efficiency and influences the selection of candidates, showing potential implications for recruitment strategies. Abstract: Artificial intelligence (AI) is increasingly used in recruitment, yet empirical evidence quantifying its impact on hiring efficiency and candidate selection remains limited. We randomly assign 37,000 applicants for a junior-developer position to either a traditional recruitment process (resume screening followed by human selection) or an AI-assisted recruitment pipeline incorporating an initial AI-driven structured video interview before human evaluation. Candidates advancing from either track faced the same final-stage human interview, with interviewers blind to the earlier selection method. In the AI-assisted pipeline, 54% of candidates passed the final interview compared with 34% from the traditional pipeline, yielding an average treatment effect of 20 percentage points (SE 12 pp.). Five months later, we collected LinkedIn profiles of top applicants from both groups and found that 18% (SE 1.1%) of applicants from the traditional track found new jobs compared with 23% (SE 2.3%) from the AI group, resulting in a 5.9 pp. (SE 2.6 pp.) difference in the probability of finding new employment between groups. 
The AI system tended to select younger applicants with less experience and fewer advanced credentials. We analyze AI-generated interview transcripts to examine the selection criteria and conversational dynamics. Our findings contribute to understanding how AI technologies affect decision making in recruitment and talent acquisition while highlighting some of their potential implications.
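The headline figures in the abstract can be checked with a short calculation. The sketch below assumes the two groups are independent (plausible under random assignment), so the standard error of a difference is the square root of the sum of squared standard errors. It uses only the rounded values quoted above, so it reproduces the reported 5.9 pp. and 2.6 pp. only approximately.

```python
import math

# Figures quoted in the abstract (percentages; standard errors in percentage points).
p_traditional, se_traditional = 18.0, 1.1
p_ai, se_ai = 23.0, 2.3

# Difference of the rounded employment rates; the reported 5.9 pp. presumably
# comes from unrounded rates, which are not available here.
diff_rounded = p_ai - p_traditional

# Standard error of a difference between two independent estimates.
se_diff = math.sqrt(se_traditional**2 + se_ai**2)

# Final-interview pass rates: 54% (AI-assisted) vs 34% (traditional).
interview_effect = 54 - 34

print(f"employment difference (rounded inputs): {diff_rounded:.1f} pp")
print(f"standard error of the difference: {se_diff:.2f} pp")
print(f"final-interview effect: {interview_effect} pp")
```

With these inputs the standard error comes out at about 2.55 pp., in line with the reported 2.6 pp. given rounding of the inputs.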

[12] A Systematic Analysis of Declining Medical Safety Messaging in Generative AI Models

Sonali Sharma,Ahmed M. Alaa,Roxana Daneshjou

Main category: cs.CL

TL;DR: This study analyzed the decline in medical disclaimers in AI model outputs from 2022 to 2025, emphasizing the necessity for improved safeguards in clinical contexts.

Details

Motivation: The motivation behind this study is the increasing use of generative AI models in interpreting medical images and answering clinical questions, often producing inaccurate responses. Safety measures like medical disclaimers are crucial to inform users that AI outputs are not professionally vetted or a substitute for medical advice. Method: The study evaluated the presence of disclaimers in LLM and VLM outputs across model generations from 2022 to 2025 using 500 mammograms, 500 chest X-rays, 500 dermatology images, and 500 medical questions. Result: Medical disclaimer presence in LLM and VLM outputs dropped significantly over the years, from 26.3% in 2022 to 0.97% in 2025 for LLMs, and from 19.6% in 2023 to 1.05% in 2025 for VLMs. By 2025, most models did not display any disclaimers. Conclusion: The study concludes that there is a significant decline in the presence of medical disclaimers in the outputs of LLMs and VLMs from 2022 to 2025, highlighting the need for implementing disclaimers as safeguards in line with the clinical context. Abstract: Generative AI models, including large language models (LLMs) and vision-language models (VLMs), are increasingly used to interpret medical images and answer clinical questions. Their responses often include inaccuracies; therefore, safety measures like medical disclaimers are critical to remind users that AI outputs are not professionally vetted or a substitute for medical advice. This study evaluated the presence of disclaimers in LLM and VLM outputs across model generations from 2022 to 2025. Using 500 mammograms, 500 chest X-rays, 500 dermatology images, and 500 medical questions, outputs were screened for disclaimer phrases. Medical disclaimer presence in LLM and VLM outputs dropped from 26.3% in 2022 to 0.97% in 2025, and from 19.6% in 2023 to 1.05% in 2025, respectively. By 2025, the majority of models displayed no disclaimers. 
As public models become more capable and authoritative, disclaimers must be implemented as a safeguard adapting to the clinical context of each output.
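Screening outputs for disclaimer phrases, as the abstract describes, might look like the following minimal sketch. The phrase list is hypothetical (the study's actual phrases are not given here), as are the names `has_disclaimer` and `disclaimer_rate`.

```python
import re

# Hypothetical phrase list for illustration; not the study's actual list.
DISCLAIMER_PATTERNS = [
    r"not a (substitute|replacement) for (professional )?medical advice",
    r"consult (a|your) (doctor|physician|healthcare (provider|professional))",
    r"i am not a (doctor|medical professional)",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in DISCLAIMER_PATTERNS]


def has_disclaimer(output: str) -> bool:
    """True if the output matches at least one disclaimer pattern."""
    return any(p.search(output) for p in _COMPILED)


def disclaimer_rate(outputs) -> float:
    """Percentage of outputs containing at least one disclaimer phrase."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return 100.0 * sum(has_disclaimer(o) for o in outputs) / len(outputs)
```

Running `disclaimer_rate` separately over the outputs of each model generation would give a per-year percentage of the kind reported above.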

[13] Beyond Scale: Small Language Models are Comparable to GPT-4 in Mental Health Understanding

Hong Jia,Shiya Fu,Vassilis Kostakos,Feng Xia,Ting Dang

Main category: cs.CL

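TL;DR: The study finds that small language models (SLMs) perform close to large language models (LLMs) on mental health understanding, especially in zero-shot and few-shot settings, showing their potential as privacy-preserving tools.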

Details

Motivation: With the emergence of small language models (SLMs) as privacy-preserving alternatives, the authors want to understand their inherent understanding capabilities in mental health applications compared with large language models (LLMs). Method: Five state-of-the-art SLMs (Phi-3, Phi-3.5, Qwen2.5, Llama-3.2, Gemma2) and three LLMs (GPT-4, FLAN-T5-XXL, Alpaca-7B) are benchmarked under zero-shot and few-shot learning paradigms on six mental health understanding tasks. Result: On binary classification tasks, the mean performance of SLMs is only 2% below that of LLMs, with F1 scores of 0.64 and 0.66 respectively. On multi-class severity tasks, the performance of both model categories drops by more than 30%. Few-shot prompting brings substantial improvements for SLMs (up to 14.6%), while LLM gains are more variable. Conclusion: SLMs perform well on mental health understanding, particularly in zero-shot and few-shot settings, despite having far fewer parameters than LLMs. This suggests that SLMs can serve as privacy-preserving tools for analyzing sensitive online text data and have potential in scalable mental health screening tools. Abstract: The emergence of Small Language Models (SLMs) as privacy-preserving alternatives for sensitive applications raises a fundamental question about their inherent understanding capabilities compared to Large Language Models (LLMs). This paper investigates the mental health understanding capabilities of current SLMs through systematic evaluation across diverse classification tasks. Employing zero-shot and few-shot learning paradigms, we benchmark their performance against established LLM baselines to elucidate their relative strengths and limitations in this critical domain. We assess five state-of-the-art SLMs (Phi-3, Phi-3.5, Qwen2.5, Llama-3.2, Gemma2) against three LLMs (GPT-4, FLAN-T5-XXL, Alpaca-7B) on six mental health understanding tasks. Our findings reveal that SLMs achieve mean performance within 2% of LLMs on binary classification tasks (F1 scores of 0.64 vs 0.66 in zero-shot settings), demonstrating notable competence despite orders of magnitude fewer parameters. Both model categories experience similar degradation on multi-class severity tasks (a drop of over 30%), suggesting that nuanced clinical understanding challenges transcend model scale. Few-shot prompting provides substantial improvements for SLMs (up to 14.6%), while LLM gains are more variable. Our work highlights the potential of SLMs in mental health understanding, showing they can be effective privacy-preserving tools for analyzing sensitive online text data.
In particular, their ability to quickly adapt and specialize with minimal data through few-shot learning positions them as promising candidates for scalable mental health screening tools.

[14] Integrating External Tools with Large Language Models to Improve Accuracy

Nripesh Niketan,Hadj Batatia

Main category: cs.CL

TL;DR: This paper introduces the Athena framework, which integrates external tools with large language models to improve their performance in educational settings, achieving significant improvements in mathematical and scientific reasoning accuracy compared to existing models.

Details

Motivation: Large language models often produce low-quality responses or hallucinate without relevant contextual information. Integrating LLMs with external tools provides up-to-date data and improves accuracy, especially in educational contexts. Method: A framework was developed to integrate external tools, such as APIs and computational utilities, with large language models (LLMs) to enhance their capabilities in answering educational queries. Evaluation was conducted using datasets from the Multi-Modal Language Understanding (MMLU) collection focusing on mathematical and scientific reasoning. Result: The proposed Athena framework achieved 83% accuracy in mathematical reasoning and 88% in scientific reasoning, outperforming state-of-the-art models like GPT-4o, LLaMA-Large, Mistral-Large, Phi-Large, and GPT-3.5. The best baseline model, LLaMA-Large, achieved only 67% and 79% respectively. Conclusion: The Athena framework significantly improves the performance of large language models in educational settings by integrating external tools, opening possibilities for complex computing ecosystems around LLMs. Abstract: This paper deals with improving querying large language models (LLMs). It is well-known that without relevant contextual information, LLMs can provide poor quality responses or tend to hallucinate. Several initiatives have proposed integrating LLMs with external tools to provide them with up-to-date data to improve accuracy. In this paper, we propose a framework to integrate external tools to enhance the capabilities of LLMs in answering queries in educational settings. Precisely, we develop a framework that allows accessing external APIs to request additional relevant information. Integrated tools can also provide computational capabilities such as calculators or calendars. The proposed framework has been evaluated using datasets from the Multi-Modal Language Understanding (MMLU) collection. 
The data consists of questions on mathematical and scientific reasoning. Results compared to state-of-the-art language models show that the proposed approach significantly improves performance. Our Athena framework achieves 83% accuracy in mathematical reasoning and 88% in scientific reasoning, substantially outperforming all tested models including GPT-4o, LLaMA-Large, Mistral-Large, Phi-Large, and GPT-3.5, with the best baseline model (LLaMA-Large) achieving only 67% and 79% respectively. These promising results open the way to creating complex computing ecosystems around LLMs to make their use more natural to support various tasks and activities.
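The abstract describes integrated tools such as calculators and calendars but does not give an interface. The following is a minimal sketch of how a tool request might be dispatched, not the Athena framework's actual API: the registry `TOOLS` and the functions `calculator`, `calendar_today`, and `run_tool_call` are hypothetical names introduced only for illustration.

```python
import ast
import operator
from datetime import date

# Hypothetical arithmetic whitelist; not taken from the Athena paper.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def calculator(expression: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

def calendar_today(_: str) -> str:
    """Return today's date in ISO format; the argument is ignored."""
    return date.today().isoformat()

# Hypothetical tool registry mapping tool names to callables.
TOOLS = {"calculator": calculator, "calendar": calendar_today}

def run_tool_call(tool_name: str, argument: str):
    """Dispatch a tool request (as an LLM might emit it) to the matching tool."""
    if tool_name not in TOOLS:
        raise KeyError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](argument)
```

In a full system the tool result would be appended to the model's context before it produces the final answer; that step is omitted here.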

[15] Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights

Deepali Mishra,Chaklam Silpasuwanchai,Ashutosh Modi,Madhumita Sushil,Sorayouth Chumnanvej

Main category: cs.CL

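TL;DR: This paper reviews MedVQA research from 2018 to 2024 and reports a survey of 50 clinicians in India and Thailand, finding that MedVQA, despite its potential, still faces many problems in practice, such as a lack of clinical relevance and insufficient multi-view support.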

Details

Motivation: Despite continued advances in models and datasets, MedVQA's integration into clinical workflows remains limited, so this study systematically reviews its practical utility, challenges, and gaps. Method: Following the Arksey and O'Malley scoping review framework, the study takes a two-pronged approach: (1) reviewing studies to identify key concepts, advancements, and research gaps in radiology workflows; (2) surveying clinicians on their views of MedVQA's clinical relevance. Result: Nearly 60% of QA pairs are non-diagnostic and lack clinical relevance. Most datasets and models do not support multi-view or multi-resolution imaging, EHR integration, or domain knowledge, all of which are essential for clinical diagnosis. There is also a clear mismatch between current evaluation metrics and clinical needs. Only 29.8% of clinicians consider MedVQA systems highly useful. Conclusion: Despite its potential, MedVQA's clinical integration still faces challenges, including limited multimodal analysis, lack of patient context, and misaligned evaluation approaches, which need to be addressed. Abstract: Medical Visual Question Answering (MedVQA) is a promising tool to assist radiologists by automating medical image interpretation through question answering. Despite advances in models and datasets, MedVQA's integration into clinical workflows remains limited. This study systematically reviews 68 publications (2018-2024) and surveys 50 clinicians from India and Thailand to examine MedVQA's practical utility, challenges, and gaps. Following the Arksey and O'Malley scoping review framework, we used a two-pronged approach: (1) reviewing studies to identify key concepts, advancements, and research gaps in radiology workflows, and (2) surveying clinicians to capture their perspectives on MedVQA's clinical relevance. Our review reveals that nearly 60% of QA pairs are non-diagnostic and lack clinical relevance. Most datasets and models do not support multi-view, multi-resolution imaging, EHR integration, or domain knowledge, features essential for clinical diagnosis. Furthermore, there is a clear mismatch between current evaluation metrics and clinical needs. The clinician survey confirms this disconnect: only 29.8% consider MedVQA systems highly useful. Key concerns include the absence of patient history or domain knowledge (87.2%), preference for manually curated datasets (51.1%), and the need for multi-view image support (78.7%). Additionally, 66% favor models focused on specific anatomical regions, and 89.4% prefer dialogue-based interactive systems.
While MedVQA shows strong potential, challenges such as limited multimodal analysis, lack of patient context, and misaligned evaluation approaches must be addressed for effective clinical integration.

[16] CRISP: Complex Reasoning with Interpretable Step-based Plans

Matan Vetzler,Koren Lazar,Guy Uziel,Eran Hirsch,Ateret Anaby-Tavor,Leshem Choshen

Main category: cs.CL

TL;DR: The study introduces CRISP, a dataset for high-level plan generation in reasoning tasks, showing that fine-tuning on CRISP improves performance and generalizability over existing methods.

Details

Motivation: Chain-of-Thought reasoning is insufficient for complex problems, and generating effective high-level plans through few-shot prompting alone has limitations. Method: Creation of CRISP dataset with validated high-level plans for mathematical reasoning and code generation; small model fine-tuning on this dataset. Result: Fine-tuned small models outperformed larger few-shot prompted models in plan quality and downstream task performance; cross-domain generalization was observed. Conclusion: CRISP fine-tuning enhances plan generation quality and reasoning capabilities across domains, surpassing Chain-of-Thought and few-shot prompting approaches. Abstract: Recent advancements in large language models (LLMs) underscore the need for stronger reasoning capabilities to solve complex problems effectively. While Chain-of-Thought (CoT) reasoning has been a step forward, it remains insufficient for many domains. A promising alternative is explicit high-level plan generation, but existing approaches largely assume that LLMs can produce effective plans through few-shot prompting alone, without additional training. In this work, we challenge this assumption and introduce CRISP (Complex Reasoning with Interpretable Step-based Plans), a multi-domain dataset of high-level plans for mathematical reasoning and code generation. The plans in CRISP are automatically generated and rigorously validated--both intrinsically, using an LLM as a judge, and extrinsically, by evaluating their impact on downstream task performance. We demonstrate that fine-tuning a small model on CRISP enables it to generate higher-quality plans than much larger models using few-shot prompting, while significantly outperforming Chain-of-Thought reasoning. Furthermore, our out-of-domain evaluation reveals that fine-tuning on one domain improves plan generation in the other, highlighting the generalizability of learned planning capabilities.

[17] AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

Talor Abramovich,Gal Chechik

Main category: cs.CL

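TL;DR: AblationBench is a benchmark suite for evaluating autonomous language-model agents on ablation planning tasks in empirical AI research, comprising two tasks: AuthorAblation and ReviewerAblation.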

Details

Motivation: Designing ablation experiments is a key part of empirical AI research but remains challenging. AblationBench was therefore developed to help authors and reviewers improve ablation design and identify missing ablations. Method: The benchmark comprises AuthorAblation (83 instances) and ReviewerAblation (350 instances), with LM-based judges for automatic evaluation. Different prompting methods (such as chain-of-thought) are also compared with an existing agent-based approach. Result: Frontier LMs remain limited on these tasks; the best system identifies only 29% of the original ablations on average. Chain-of-thought prompting outperforms the existing agent-based approach. Conclusion: AblationBench provides a framework for evaluating and improving autonomous agents' ability to design ablation experiments, and exposes the limitations of current LMs and directions for improvement. Abstract: Autonomous agents built on language models (LMs) are showing increasing popularity in many fields, including scientific research. AI co-scientists aim to support or automate parts of the research process using these agents. A key component of empirical AI research is the design of ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 29% of the original ablations on average. Lastly, we analyze the limitations of current LMs on these tasks, and find that chain-of-thought prompting outperforms the currently existing agent-based approach.

Junyi Wen,Junyuan Liang,Zicong Hong,Wuhui Chen,Zibin Zheng

Main category: cs.CL

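TL;DR: Krul is a multi-turn LLM inference system that improves efficiency while reducing storage requirements through dynamic compression strategy selection, attention-similarity-based KV cache restoration, and an optimized scheduler.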

Details

Motivation: Existing methods apply a fixed KV cache compression scheme and ignore differences in attention patterns across conversations, which degrades accuracy, so a more flexible and efficient solution is needed. Method: Krul dynamically selects compression strategies based on attention similarity across layer pairs and restores the KV cache with a recomputation-loading pipeline, adding a token-wise heterogeneous attention similarity estimator and a bubble-free restoration scheduler. Result: Compared with state-of-the-art methods, Krul achieves a 1.5x-2.68x reduction in time-to-first-token (TTFT) and a 1.33x-2.35x reduction in KV cache storage on multiple real-world tasks, without compromising generation quality. Conclusion: Krul offers an efficient and accurate solution for multi-turn LLM inference that adapts the KV cache compression strategy to each conversation, improving performance and resource utilization. Abstract: Efficient state restoration in multi-turn conversations with large language models (LLMs) remains a critical challenge, primarily due to the overhead of recomputing or loading full key-value (KV) caches for all historical tokens. To address this, existing approaches compress KV caches across adjacent layers with highly similar attention patterns. However, these methods often apply a fixed compression scheme across all conversations, selecting the same layer pairs for compression without considering conversation-specific attention dynamics. This static strategy overlooks variability in attention pattern similarity across different conversations, which can lead to noticeable accuracy degradation. We present Krul, a multi-turn LLM inference system that enables accurate and efficient KV cache restoration. Krul dynamically selects compression strategies based on attention similarity across layer pairs and uses a recomputation-loading pipeline to restore the KV cache. It introduces three key innovations: 1) a preemptive compression strategy selector that preserves critical context for future conversation turns and selects a customized strategy for the conversation; 2) a token-wise heterogeneous attention similarity estimator to mitigate the attention similarity computation and storage overhead during model generation; 3) a bubble-free restoration scheduler to reduce potential bubbles caused by the imbalance between the recomputing and loading streams due to compressed KV caches.
Empirical evaluations on real-world tasks demonstrate that Krul achieves a 1.5x-2.68x reduction in time-to-first-token (TTFT) and a 1.33x-2.35x reduction in KV cache storage compared to state-of-the-art methods without compromising generation quality.
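The abstract says compression strategies are chosen from attention similarity across layer pairs but does not give the procedure. As a rough sketch under stated assumptions (cosine similarity as the measure, a fixed threshold, and each layer joining at most one adjacent pair), per-conversation pair selection could look like this. The names `cosine_similarity` and `select_layer_pairs` and the threshold value are hypothetical and are not Krul's implementation.

```python
import math
from typing import List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two flattened attention patterns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_layer_pairs(attn_per_layer: List[List[float]],
                       threshold: float = 0.9) -> List[Tuple[int, int]]:
    """Pick adjacent layer pairs whose attention patterns are similar enough
    to share a KV cache for this particular conversation (assumed criterion)."""
    pairs = []
    i = 0
    while i < len(attn_per_layer) - 1:
        if cosine_similarity(attn_per_layer[i], attn_per_layer[i + 1]) >= threshold:
            pairs.append((i, i + 1))
            i += 2  # a layer joins at most one pair
        else:
            i += 1
    return pairs

# Toy attention patterns for four layers over three tokens.
attn = [[0.70, 0.20, 0.10],
        [0.69, 0.21, 0.10],
        [0.10, 0.10, 0.80],
        [0.10, 0.80, 0.10]]
selected = select_layer_pairs(attn)
```

Because the patterns are computed per conversation, two conversations can yield different pair sets, which is the contrast with the fixed schemes the abstract criticizes.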

[19] GRASP: Generic Reasoning And SPARQL Generation across Knowledge Graphs

Sebastian Walter,Hannah Bast

Main category: cs.CL

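TL;DR: This paper presents a new approach that uses large language models to generate SPARQL queries and achieves good results on a range of knowledge graph benchmarks without fine-tuning.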

Details

Motivation: The work addresses the generation of SPARQL queries from natural language questions or keyword queries, aiming for good performance on knowledge graphs of different kinds and sizes. Method: A large language model generates SPARQL queries without fine-tuning, exploring the knowledge graph by executing SPARQL queries and searching for relevant IRIs and literals. Result: The approach performs well on multiple benchmarks, including state-of-the-art results on Wikidata and results close to the best few-shot methods on Freebase. Conclusion: The proposed approach performs well across a variety of benchmarks, reaching state-of-the-art results on Wikidata in particular and coming close to the best few-shot methods on Freebase. Abstract: We propose a new approach for generating SPARQL queries on RDF knowledge graphs from natural language questions or keyword queries, using a large language model. Our approach does not require fine-tuning. Instead, it uses the language model to explore the knowledge graph by strategically executing SPARQL queries and searching for relevant IRIs and literals. We evaluate our approach on a variety of benchmarks (for knowledge graphs of different kinds and sizes) and language models (of different scales and types, commercial as well as open-source) and compare it with existing approaches. On Wikidata we reach state-of-the-art results on multiple benchmarks, despite the zero-shot setting. On Freebase we come close to the best few-shot methods. On other, less commonly evaluated knowledge graphs and benchmarks our approach also performs well overall. We conduct several additional studies, like comparing different ways of searching the graphs, incorporating a feedback mechanism, or making use of few-shot examples.

[20] Audit, Alignment, and Optimization of LM-Powered Subroutines with Application to Public Comment Processing

Reilly Raab,Mike Parker,Dan Nally,Sadie Montgomery,Anastasia Bernat,Sai Munikoti,Sameera Horawalavithana

Main category: cs.CL

TL;DR: Proposes an auditable, language-model-based framework for improving the efficiency of real-world decision workflows.

Details

Motivation: Language models have great potential for text-processing tasks, but real-world adoption is hindered by concerns about safety, explainability, and bias. A transparent, auditable way of using language models is needed to reduce risk and let human experts focus on decision-making rather than data processing or prompt engineering. Method: The paper proposes a framework for declaring statically typed, LM-powered subroutines (callable, function-like procedures) within conventional asynchronous code, and for improving each subroutine's performance online using sparse feedback from human experts. All LM-produced artifacts (such as prompts, inputs, outputs, and data dependencies) are recorded and exposed to audit on demand. Result: The authors built an application called "CommentNEPA" that compiles, organizes, and summarizes public comments submitted under the National Environmental Policy Act, and evaluated its outputs against historical human-annotated data. Conclusion: The framework offers a workable approach to the safe, transparent use of language models, demonstrates its effectiveness in a real application, and may support responsible use of language models in fields such as healthcare and law. Abstract: The advent of language models (LMs) has the potential to dramatically accelerate tasks that may be cast to text-processing; however, real-world adoption is hindered by concerns regarding safety, explainability, and bias. How can we responsibly leverage LMs in a transparent, auditable manner -- minimizing risk and allowing human experts to focus on informed decision-making rather than data-processing or prompt engineering? In this work, we propose a framework for declaring statically typed, LM-powered subroutines (i.e., callable, function-like procedures) for use within conventional asynchronous code -- such that sparse feedback from human experts is used to improve the performance of each subroutine online (i.e., during use). In our implementation, all LM-produced artifacts (i.e., prompts, inputs, outputs, and data-dependencies) are recorded and exposed to audit on demand. We package this framework as a library to support its adoption and continued development. While this framework may be applicable across several real-world decision workflows (e.g., in healthcare and legal fields), we evaluate it in the context of public comment processing as mandated by the 1969 National Environmental Policy Act (NEPA): Specifically, we use this framework to develop "CommentNEPA," an application that compiles, organizes, and summarizes a corpus of public commentary submitted in response to a project requiring environmental review.
We quantitatively evaluate the application by comparing its outputs (when operating without human feedback) to historical "ground-truth" data as labelled by human annotators during the preparation of official environmental impact statements.
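To make the idea of a typed, auditable LM-powered subroutine concrete, here is a minimal sketch using only the Python standard library. It is not the authors' library: `AuditRecord`, `AuditLog`, `lm_subroutine`, and the stand-in model `stub_lm` are hypothetical names, and the online improvement from expert feedback described in the abstract is not modelled.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List

@dataclass
class AuditRecord:
    """One logged LM call: which subroutine ran, with what prompt and output."""
    subroutine: str
    prompt: str
    output: str

@dataclass
class AuditLog:
    records: List[AuditRecord] = field(default_factory=list)

async def stub_lm(prompt: str) -> str:
    """Stand-in for a real LM call (assumption); returns a canned label."""
    return "support" if "favor" in prompt else "oppose"

def lm_subroutine(name: str, template: str, log: AuditLog,
                  lm: Callable[[str], Awaitable[str]] = stub_lm
                  ) -> Callable[[str], Awaitable[str]]:
    """Build an async callable with a fixed str -> str signature whose
    prompts and outputs are recorded for later audit."""
    async def call(comment: str) -> str:
        prompt = template.format(comment=comment)
        output = await lm(prompt)
        log.records.append(AuditRecord(name, prompt, output))
        return output
    return call

log = AuditLog()
classify_stance = lm_subroutine(
    "classify_stance", "Classify the stance of this comment: {comment}", log)
result = asyncio.run(classify_stance("I am in favor of the project."))
```

After the call, `log.records` holds the exact prompt and output, which is the kind of artifact the abstract says is exposed to audit on demand.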

[21] Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores

Vivek Chari,Benjamin Van Durme

Main category: cs.CL

TL;DR: Compactor uses query-agnostic KV cache compression to substantially reduce memory consumption while preserving model performance.

Details

Motivation: Large language models face excessive KV cache memory requirements when generating with long contexts, which limits practical deployment. Method: Approximate leverage scores determine token importance, and a context-calibrated compression procedure is introduced. Result: Compactor matches the performance of competing methods while retaining half the tokens, and reduces KV memory by 63% on average on Longbench. Conclusion: Compactor is a parameter-free KV cache compression strategy that effectively reduces memory use while preserving performance. Abstract: Modern Large Language Models (LLMs) are increasingly trained to support very large context windows. Unfortunately the ability to use long contexts in generation is complicated by the large memory requirement of the KV cache, which scales linearly with the context length. This memory footprint is often the dominant resource bottleneck in real-world deployments, limiting throughput and increasing serving cost. One way to address this is by compressing the KV cache, which can be done either with knowledge of the question being asked (query-aware) or without knowledge of the query (query-agnostic). We present Compactor, a parameter-free, query-agnostic KV compression strategy that uses approximate leverage scores to determine token importance. We show that Compactor can achieve the same performance as competing methods while retaining 1/2 the tokens in both synthetic and real-world context tasks, with minimal computational overhead. We further introduce a procedure for context-calibrated compression, which allows one to infer the maximum compression ratio a given context can support. Using context-calibrated compression, we show that Compactor achieves full KV performance on Longbench while reducing the KV memory burden by 63%, on average. To demonstrate the efficacy and generalizability of our approach, we apply Compactor to 27 synthetic and real-world tasks from RULER and Longbench, with models from both the Qwen 2.5 and Llama 3.1 families.
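For readers unfamiliar with leverage scores, the sketch below computes exact row leverage scores of a small key matrix and keeps the highest-scoring tokens. This illustrates the general concept only: Compactor uses approximate leverage scores, and the abstract does not specify the approximation, so the exact Gram-Schmidt computation here is a swapped-in stand-in. The names `leverage_scores` and `keep_top_tokens` are hypothetical.

```python
import math
from typing import List

def leverage_scores(K: List[List[float]]) -> List[float]:
    """Exact leverage score of each row of K (n tokens x d dims): the squared
    norm of that row in an orthonormal basis of K's column space."""
    n, d = len(K), len(K[0])
    columns = [[K[i][j] for i in range(n)] for j in range(d)]
    basis: List[List[float]] = []
    for col in columns:
        v = col[:]
        # Modified Gram-Schmidt: remove components along earlier basis vectors.
        for q in basis:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # skip linearly dependent columns
            basis.append([a / norm for a in v])
    return [sum(q[i] ** 2 for q in basis) for i in range(n)]

def keep_top_tokens(K: List[List[float]], ratio: float = 0.5) -> List[int]:
    """Indices of the tokens with the highest leverage scores, in order."""
    scores = leverage_scores(K)
    k = max(1, int(len(K) * ratio))
    top = sorted(range(len(K)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

# Toy keys: three near-duplicate tokens and one distinctive token (index 2).
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
scores = leverage_scores(keys)
kept = keep_top_tokens(keys, ratio=0.5)
```

Rows that contribute a direction no other row covers receive high scores, so redundant tokens are the first candidates for eviction. Nothing in the computation depends on a query, which matches the query-agnostic setting.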

[22] Distilling Empathy from Large Language Models

Henry J. Xie,Jinghan Zhang,Xinhao Zhang,Kunpeng Liu

Main category: cs.CL

TL;DR: This paper introduces an effective method for transferring empathy from large language models (LLMs) to smaller ones (SLMs) using a two-step fine-tuning approach and specialized prompts, achieving a 90% success rate in empathetic response generation.

Details

Motivation: Empathy is essential in human interactions, especially in resource-constrained environments where smaller language models (SLMs) are deployed, such as smartphones. The motivation is to ensure that empathy present in Large Language Models (LLMs) is effectively transferred to SLMs through distillation. Method: The paper employs a two-step fine-tuning process using empathetic dialogue datasets distilled from LLMs. It also proposes four unique prompt sets for targeted empathy improvement and compares their performance against basic direct prompting. Result: SLMs fine-tuned with the proposed method showed a 90% win rate in generating empathetic responses compared to the base SLM. Targeted empathy prompts improved performance by 10% over basic prompting methods. Conclusion: The study concludes that the proposed two-step fine-tuning process, combined with targeted empathy improvement prompts, significantly enhances empathy distillation from LLMs to SLMs, outperforming basic methods. Abstract: The distillation of knowledge from Large Language Models (LLMs) into Smaller Language Models (SLMs), preserving the capabilities and performance of LLMs while reducing model size, has played a key role in the proliferation of LLMs. Because SLMs are considerably smaller than LLMs, they are often utilized in domains where human interaction is frequent but resources are highly constrained, e.g., smart phones. Therefore, it is crucial to ensure that empathy, a fundamental aspect of positive human interactions, already instilled into LLMs, is retained by SLMs after distillation. In this paper, we develop a comprehensive approach for effective empathy distillation from LLMs into SLMs. Our approach features a two-step fine-tuning process that fully leverages datasets of empathetic dialogue responses distilled from LLMs. 
We explore several distillation methods beyond basic direct prompting and propose four unique sets of prompts for targeted empathy improvement to significantly enhance the empathy distillation process. Our evaluations demonstrate that SLMs fine-tuned through the two-step fine-tuning process with distillation datasets enhanced by the targeted empathy improvement prompts significantly outperform the base SLM at generating empathetic responses with a win rate of 90%. Our targeted empathy improvement prompts substantially outperform the basic direct prompting with a 10% improvement in win rate.

[23] TruthTorchLM: A Comprehensive Library for Predicting Truthfulness in LLM Outputs

Duygu Nur Yaldiz,Yavuz Faruk Bakman,Sungmin Kang,Alperen Öziş,Hayrettin Eren Yildiz,Mitash Ashish Shah,Zhiqi Huang,Anoop Kumar,Alfy Samuel,Daben Liu,Sai Praneeth Karimireddy,Salman Avestimehr

Main category: cs.CL

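TL;DR: This paper introduces TruthTorchLM, a comprehensive open-source library for predicting the truthfulness of generative LLM outputs, which includes a wide range of methods and is evaluated on several datasets.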

Details

Motivation: Generative large language models (LLMs) inevitably produce untruthful output, and accurately predicting the truthfulness of that output is especially important in high-stakes settings. A new comprehensive tool is needed to accelerate research in this area and make truthfulness prediction methods easier to use. Method: The authors developed TruthTorchLM, an open-source Python library with over 30 truthfulness prediction methods, and evaluated it on the TriviaQA, GSM8K, and FactScore-Bio datasets. Result: TruthTorchLM provides truthfulness prediction techniques that span different computational costs, access levels, grounding document requirements, and supervision types, and is compatible with HuggingFace and LiteLLM, supporting both local and API-based models. Conclusion: TruthTorchLM is a comprehensive, extensible open-source library that offers many methods for predicting the truthfulness of generative LLM outputs and supports further research in the field. Abstract: Generative Large Language Models (LLMs) inevitably produce untruthful responses. Accurately predicting the truthfulness of these outputs is critical, especially in high-stakes settings. To accelerate research in this domain and make truthfulness prediction methods more accessible, we introduce TruthTorchLM, an open-source, comprehensive Python library featuring over 30 truthfulness prediction methods, which we refer to as Truth Methods. Unlike existing toolkits such as Guardrails, which focus solely on document-grounded verification, or LM-Polygraph, which is limited to uncertainty-based methods, TruthTorchLM offers a broad and extensible collection of techniques. These methods span diverse tradeoffs in computational cost, access level (e.g., black-box vs white-box), grounding document requirements, and supervision type (self-supervised or supervised). TruthTorchLM is seamlessly compatible with both HuggingFace and LiteLLM, enabling support for locally hosted and API-based models. It also provides a unified interface for generation, evaluation, calibration, and long-form truthfulness prediction, along with a flexible framework for extending the library with new methods. We conduct an evaluation of representative truth methods on three datasets, TriviaQA, GSM8K, and FactScore-Bio. The code is available at https://github.com/Ybakman/TruthTorchLM

[24] Simple Mechanistic Explanations for Out-Of-Context Reasoning

Atticus Wang,Joshua Engels,Oliver Clive-Griffin

Main category: cs.CL

TL;DR: This paper proposes a possible mechanism, steering vectors, to explain how LLMs show strong generalization on out-of-context reasoning tasks.

Details

Motivation: The paper investigates why fine-tuned large language models (LLMs) generalize so well beyond the fine-tuning distribution, a phenomenon known as out-of-context reasoning (OOCR). Method: The authors analyze the effect of LoRA fine-tuning on the model, test whether steering vectors can be trained directly to induce OOCR, and check the effect on a task that appears to require conditional behavior. Result: Many instances of OOCR can be explained by LoRA fine-tuning adding a constant steering vector that steers the model toward a general concept. Steering vectors trained from scratch also induce OOCR, and this holds even on a task that seems to require conditional behavior. Conclusion: The steering vector learned during fine-tuning can explain LLMs' generalization on OOCR tasks. Abstract: Out-of-context reasoning (OOCR) is a phenomenon in which fine-tuned LLMs exhibit surprisingly deep out-of-distribution generalization. Rather than learning shallow heuristics, they implicitly internalize and act on the consequences of observations scattered throughout the fine-tuning data. In this work, we investigate this phenomenon mechanistically and find that many instances of OOCR in the literature have a simple explanation: the LoRA fine-tuning essentially adds a constant steering vector, steering the model towards a general concept. This improves performance on the fine-tuning task and in many other concept-related domains, causing the surprising generalization. Moreover, we can directly train steering vectors for these tasks from scratch, which also induces OOCR. We find that our results hold even for a task that seems like it must involve conditional behavior (model backdoors); it turns out that unconditionally adding a steering vector is sufficient. Overall, our work presents one explanation of what gets learned during fine-tuning for OOCR tasks, contributing to the key question of why LLMs can reason out of context, an advanced capability that is highly relevant to their safe and reliable deployment.
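What "unconditionally adding a steering vector" means can be shown with a toy example: the same vector is added to the hidden state at every token position, whatever the input. This is only a sketch on plain Python lists. The function name `add_steering_vector`, the `scale` parameter, and the toy activations are hypothetical, and the paper's choice of layer and training procedure is not reproduced.

```python
from typing import List

def add_steering_vector(hidden_states: List[List[float]],
                        steering_vector: List[float],
                        scale: float = 1.0) -> List[List[float]]:
    """Add the same vector to every token position, regardless of the input."""
    if any(len(token) != len(steering_vector) for token in hidden_states):
        raise ValueError("hidden size and steering vector size must match")
    return [[h + scale * s for h, s in zip(token, steering_vector)]
            for token in hidden_states]

# Toy activations: 2 token positions, hidden size 3.
states = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
steered = add_steering_vector(states, [0.5, 0.0, -0.5])
```

The shift does not depend on the token content, which is why the abstract's finding that this suffices even for the backdoor task is notable.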

[25] Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension?

KV Aditya Srivatsa,Kaushal Kumar Maurya,Ekaterina Kochmar

Main category: cs.CL

TL;DR: This study investigates how well LLMs emulate real students in terms of performance, revealing that while some models outperform average students, there is a need for improved training and evaluation strategies to ensure consistent alignment across grades and subjects.

Details

Motivation: The increasing use of LLMs as proxy students in Intelligent Tutoring Systems (ITSs) and in piloting test questions necessitates an investigation into how accurately these models emulate real students. Method: The researchers applied an Item Response Theory (IRT) model to position 11 diverse and state-of-the-art LLMs on the same ability scale as real student populations using a dataset of 489 items from the National Assessment of Educational Progress (NAEP). Result: Strong general-purpose LLMs consistently outperform the average student at every grade without guidance. Weaker or domain-mismatched models may align incidentally. Grade-enforcement prompts change models' performance, but alignment with the average grade-level student remains model- and prompt-specific. Conclusion: The study concludes that while some LLMs can emulate student performance, there is a need for new training and evaluation strategies to ensure alignment across subjects and grades. Guidelines are provided for selecting viable proxies based on the findings. Abstract: Large Language Models (LLMs) are increasingly used as proxy students in the development of Intelligent Tutoring Systems (ITSs) and in piloting test questions. However, to what extent these proxy students accurately emulate the behavior and characteristics of real students remains an open question. To investigate this, we collected a dataset of 489 items from the National Assessment of Educational Progress (NAEP), covering mathematics and reading comprehension in grades 4, 8, and 12. We then apply an Item Response Theory (IRT) model to position 11 diverse and state-of-the-art LLMs on the same ability scale as real student populations. Our findings reveal that, without guidance, strong general-purpose models consistently outperform the average student at every grade, while weaker or domain-mismatched models may align incidentally. 
Using grade-enforcement prompts changes models' performance, but whether they align with the average grade-level student remains highly model- and prompt-specific: no evaluated model-prompt pair fits the bill across subjects and grades, underscoring the need for new training and evaluation strategies. We conclude by providing guidelines for the selection of viable proxies based on our findings.
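The abstract does not say which IRT model is used. As an illustration only, the sketch below assumes a two-parameter logistic (2PL) model, where the probability of a correct answer is $1 / (1 + e^{-a(\theta - b)})$ for ability $\theta$, item discrimination $a$, and item difficulty $b$, and estimates ability by maximum likelihood on a grid. The function names and item parameters are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability that a test-taker with ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses: Sequence[int],
                     items: Sequence[Tuple[float, float]],
                     grid: Optional[List[float]] = None) -> float:
    """Grid-search maximum-likelihood ability estimate.
    responses are 0/1; items are (discrimination a, difficulty b) pairs."""
    grid = grid or [x / 100 for x in range(-400, 401)]

    def log_likelihood(theta: float) -> float:
        total = 0.0
        for r, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            total += math.log(p) if r else math.log(1.0 - p)
        return total

    return max(grid, key=log_likelihood)

# Hypothetical item bank: equal discrimination, increasing difficulty.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
stronger = estimate_ability([1, 1, 0], items)
weaker = estimate_ability([1, 0, 0], items)
```

With item parameters calibrated on real students, scoring an LLM's responses this way places it on the same ability scale, which is the comparison the abstract describes.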

[26] Exploring Gender Differences in Chronic Pain Discussions on Reddit

Ancita Maria Andrade,Tanvi Banerjee,Ramakrishna Mundugar

Main category: cs.CL

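TL;DR: This study uses NLP techniques to analyze social media content, revealing gender differences in language patterns and condition prevalence in pain experiences.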

Details

Motivation: Earlier research on pain often overlooked the role of gender; this study uses natural language processing to explore how gender shapes pain experiences. Method: The Hidden Attribute Model-Convolutional Neural Network (HAM-CNN) is used to classify users' posts, followed by an analysis of linguistic differences and condition prevalence between genders. Result: Gender classification with posts aggregated by username achieved an F1 score of 0.86. Female posts tend to be more emotionally focused, migraine and sinusitis are more prevalent among females, and pain medication affects individuals differently by gender. Conclusion: The study used NLP to analyze gender differences in the language and condition prevalence of pain experiences. Abstract: Pain is an inherent part of human existence, manifesting as both physical and emotional experiences, and can be categorized as either acute or chronic. Over the years, extensive research has been conducted to understand the causes of pain and explore potential treatments, with contributions from various scientific disciplines. However, earlier studies often overlooked the role of gender in pain experiences. In this study, we utilized Natural Language Processing (NLP) to analyze and gain deeper insights into individuals' pain experiences, with a particular focus on gender differences. We successfully classified posts into male and female corpora using the Hidden Attribute Model-Convolutional Neural Network (HAM-CNN), achieving an F1 score of 0.86 by aggregating posts based on usernames. Our analysis revealed linguistic differences between genders, with female posts tending to be more emotionally focused. Additionally, the study highlighted that conditions such as migraine and sinusitis are more prevalent among females and explored how pain medication affects individuals differently based on gender.

[27] KAT-V1: Kwai-AutoThink Technical Report

Zizheng Zhan,Ken Deng,Huaixi Tang,Wen Xiang,Kun Wu,Weihao Li,Wenqiang Zhu,Jingxuan Xu,Lecheng Huang,Zongxian Feng,Shaojie Wang,Shangpeng Yan,Jiaheng Liu,Zhongyuan Peng,Zuchen Gao,Haoyang Huang,Ziqi Zhan,Yanan Wu,Yuanxing Zhang,Jian Yang,Guang Chen,Haotian Zhang,Bin Chen,Bing Yu

Main category: cs.CL

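TL;DR: This paper presents Kwaipilot-AutoThink (KAT), an open-source 40B large language model that addresses the overthinking problem in reasoning-intensive tasks through a new training paradigm and shows strong performance and efficiency.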

Details

Motivation: To address the overthinking problem in reasoning-intensive tasks and build an efficient, controllable large language model. Method: The authors construct a dual-regime dataset based on a new tagging pipeline and a multi-agent synthesis strategy, apply Multi-Token Prediction (MTP)-enhanced knowledge distillation, implement a cold-start initialization strategy that introduces mode-selection priors, and propose Step-SRPO, a reinforcement learning algorithm that brings intermediate supervision into the GRPO framework. Result: Across multiple benchmarks, KAT consistently matches or outperforms current state-of-the-art models (such as DeepSeek-R1-0528 and Qwen3-235B-A22B) while reducing token usage by up to approximately 30%. Early-stage results from a 200B MoE model in training already show promising improvements in performance and efficiency. Conclusion: KAT performs well in academic evaluation and, deployed in Kwaipilot (Kuaishou's internal coding assistant), improves the accuracy, efficiency, and controllability of real development workflows; the early 200B results further indicate that the AutoThink paradigm scales. Abstract: We present Kwaipilot-AutoThink (KAT), an open-source 40B large language model developed to address the overthinking problem in reasoning-intensive tasks, where an automatic thinking training paradigm is proposed to dynamically switch between reasoning and non-reasoning modes based on task complexity. Specifically, first, we construct the dual-regime dataset based on a novel tagging pipeline and a multi-agent synthesis strategy, and then we apply Multi-Token Prediction (MTP)-enhanced knowledge distillation, enabling efficient and fine-grained reasoning transfer with minimal pretraining cost. Besides, we implement a cold-start initialization strategy that introduces mode-selection priors using majority-vote signals and intent-aware prompting. Finally, we propose Step-SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework, offering structured guidance over both reasoning-mode selection and response accuracy. Extensive experiments across multiple benchmarks demonstrate that KAT consistently matches or even outperforms current state-of-the-art models, including DeepSeek-R1-0528 and Qwen3-235B-A22B, across a wide range of reasoning-intensive tasks while reducing token usage by up to approximately 30%. Beyond academic evaluation, KAT has been successfully deployed in Kwaipilot (i.e., Kuaishou's internal coding assistant), and improves real-world development workflows with high accuracy, efficiency, and controllable reasoning behaviors.
Moreover, we are actively training a 200B Mixture-of-Experts (MoE) with 40B activation parameters, where the early-stage results already demonstrate promising improvements in performance and efficiency, further showing the scalability of the AutoThink paradigm.

[28] Improving MLLM's Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency

Yupu Liang,Yaping Zhang,Zhiyang Zhang,Zhiyuan Chen,Yang Zhao,Lu Xiang,Chengqing Zong,Yu Zhou

Main category: cs.CL

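TL;DR: This paper proposes Synchronously Self-Reviewing (SSR), a new fine-tuning paradigm that improves multimodal large language models on Document Image Machine Translation (DIMT) while preserving their existing monolingual OCR ability.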

Details

Motivation: Existing multimodal large language models perform well on document image tasks such as OCR, but Document Image Machine Translation (DIMT) poses both cross-modal and cross-lingual challenges, and strengthening DIMT through supervised fine-tuning (SFT) tends to make the model forget its existing OCR ability. Method: Inspired by the concept of "Bilingual Cognitive Advantage", the paper proposes Synchronously Self-Reviewing (SSR), which prompts the model to generate OCR text before producing the translation, so that its strong monolingual OCR ability supports learning cross-lingual translation. Experiments test how well the method mitigates catastrophic forgetting and improves MLLM generalization on OCR and DIMT. Result: The proposed SSR method mitigates catastrophic forgetting and improves the overall performance of multimodal large language models on both OCR and DIMT tasks. Conclusion: SSR is an effective fine-tuning paradigm that improves document image machine translation while preserving and exploiting the model's existing OCR ability, and suggests an approach for similar tasks. Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance in document image tasks, especially Optical Character Recognition (OCR). However, they struggle with Document Image Machine Translation (DIMT), which requires handling both cross-modal and cross-lingual challenges. Previous efforts to enhance DIMT capability through Supervised Fine-Tuning (SFT) on the DIMT dataset often result in the forgetting of the model's existing monolingual abilities, such as OCR. To address these challenges, we introduce a novel fine-tuning paradigm, named Synchronously Self-Reviewing (SSR) its OCR proficiency, inspired by the concept "Bilingual Cognitive Advantage". Specifically, SSR prompts the model to generate OCR text before producing translation text, which allows the model to leverage its strong monolingual OCR ability while learning to translate text across languages. Comprehensive experiments demonstrate the proposed SSR learning helps mitigate catastrophic forgetting, improving the generalization ability of MLLMs on both OCR and DIMT tasks.

[29] CRMAgent: A Multi-Agent LLM System for E-Commerce CRM Message Template Generation

Yinzhu Quan,Xinrui Li,Ying Chen

Main category: cs.CL

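TL;DR: This paper proposes CRMAgent, an LLM-based multi-agent system that generates message templates for e-commerce private-domain channels and improves marketing effectiveness in experiments.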

Details

Motivation: Most merchants struggle to write persuasive messages for private-domain channels because they lack expertise and scalable tools, so an efficient message generation solution is needed. Method: The authors build a multi-agent system with three modes (group-based learning, retrieval-and-adaptation, and a rule-based fallback) and validate it through extensive experiments. Result: CRMAgent outperforms merchants' original templates on several metrics, including the key audience-match and marketing-effectiveness metrics. Conclusion: CRMAgent is a multi-agent system built on large language models that helps e-commerce merchants produce high-quality message templates and writing guidance, improving marketing effectiveness. Abstract: In e-commerce private-domain channels such as instant messaging and e-mail, merchants engage customers directly as part of their Customer Relationship Management (CRM) programmes to drive retention and conversion. While a few top performers excel at crafting outbound messages, most merchants struggle to write persuasive copy because they lack both expertise and scalable tools. We introduce CRMAgent, a multi-agent system built on large language models (LLMs) that generates high-quality message templates and actionable writing guidance through three complementary modes. First, group-based learning enables the agent to learn from a merchant's own top-performing messages within the same audience segment and rewrite low-performing ones. Second, retrieval-and-adaptation fetches templates that share the same audience segment and exhibit high similarity in voucher type and product category, learns their successful patterns, and adapts them to the current campaign. Third, a rule-based fallback provides a lightweight zero-shot rewrite when no suitable references are available. Extensive experiments show that CRMAgent consistently outperforms merchants' original templates, delivering significant gains in both audience-match and marketing-effectiveness metrics.

[30] MK2 at PBIG Competition: A Prompt Generation Solution

Yuzheng Xu,Tosho Hirasawa,Seiya Kawano,Shota Kato,Tadashi Kozuno

Main category: cs.CL

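TL;DR: MK2 is a prompt-driven pipeline that needs no extra training data; through iterative refinement and idea selection, it performs strongly at generating product ideas from patents.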

Details

Motivation: The task requires systems to turn real patents into product ideas viable within three years, which calls for an efficient solution that needs no extra training data. Method: MK2 uses Gemini 2.5 to iteratively edit a prompt, grafting useful fragments from weaker outputs; GPT-4.1 uses that prompt to generate one idea per patent; and an Elo loop judged by Qwen3-8B selects the best prompt. Result: Across three domains, two evaluator types, and six criteria, MK2 topped the automatic leaderboard and won 25 of 36 tests, lagging only in materials chemistry. Conclusion: MK2 already delivers competitive, commercially relevant ideas from patents through lightweight prompt engineering, but the materials-chemistry track still needs deeper domain grounding. Abstract: The Patent-Based Idea Generation task asks systems to turn real patents into product ideas viable within three years. We propose MK2, a prompt-centric pipeline: Gemini 2.5 drafts and iteratively edits a prompt, grafting useful fragments from weaker outputs; GPT-4.1 then uses this prompt to create one idea per patent, and an Elo loop judged by Qwen3-8B selects the best prompt, all without extra training data. Across three domains, two evaluator types, and six criteria, MK2 topped the automatic leaderboard and won 25 of 36 tests. Only the materials-chemistry track lagged, indicating the need for deeper domain grounding; yet, the results show that lightweight prompt engineering has already delivered competitive, commercially relevant ideation from patents.
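
The abstract for MK2 mentions an Elo loop that selects the best prompt but gives no details. A minimal sketch of how such a loop could look is given below, using the standard Elo update. The function names, the K-factor of 32, the starting rating of 1000, and the `judge` callback (standing in for an LLM judge such as Qwen3-8B) are all assumptions made for illustration, not the authors' implementation.

```python
import itertools

def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated ratings; score_a is 1.0 (A wins), 0.5 (tie) or 0.0 (B wins)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

def select_best_prompt(prompts, judge, rounds=1, initial=1000.0):
    """Round-robin Elo tournament over candidate prompts.

    `judge(a, b)` is a hypothetical callback returning the score for `a`;
    in practice it would wrap a call to a judging model.
    """
    ratings = {p: initial for p in prompts}
    for _ in range(rounds):
        for a, b in itertools.combinations(prompts, 2):
            score_a = judge(a, b)
            ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)
    best = max(ratings, key=ratings.get)
    return best, ratings

# Toy judge for demonstration only: prefers the longer prompt.
def toy_judge(a, b):
    return 1.0 if len(a) > len(b) else 0.0 if len(a) < len(b) else 0.5

best, ratings = select_best_prompt(["a", "bbb", "cc"], toy_judge)
```

A pairwise rating scheme of this kind only needs relative judgments from the judge, which may be why it suits a small judging model; that motivation is a guess, since the abstract does not explain the choice.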

[31] Distillation versus Contrastive Learning: How to Train Your Rerankers

Zhichao Xu,Zhiqi Huang,Shengyao Zhuang,Ashim Gupta,Vivek Srikumar

Main category: cs.CL

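TL;DR: The paper compares contrastive learning and knowledge distillation for training text rerankers and finds that distillation, especially from a larger teacher, generally outperforms contrastive learning, while contrastive learning remains a strong and reliable alternative when no larger teacher is available.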

Details

Motivation: Contrastive learning and knowledge distillation are both widely used to train text rerankers, but a clear comparison of their effectiveness for training cross-encoder rerankers under practical conditions is lacking. Method: The paper empirically compares rerankers of different sizes and architectures trained on the same data with knowledge distillation and with contrastive learning, using a strong contrastive learning model as the distillation teacher. Result: Distilling from a larger teacher generally yields better in-domain and out-of-domain ranking performance than contrastive learning, and this holds across student sizes and architectures. When teacher and student are the same size, however, the advantage is not evident, particularly on out-of-domain tasks. Conclusion: Knowledge distillation is the recommended way to train smaller rerankers when a larger, more powerful teacher is available; otherwise contrastive learning is a strong and more reliable alternative. Abstract: Training text rerankers is crucial for information retrieval. Two primary strategies are widely used: contrastive learning (optimizing directly on ground-truth labels) and knowledge distillation (transferring knowledge from a larger reranker). While both have been studied in the literature, a clear comparison of their effectiveness for training cross-encoder rerankers under practical conditions is needed. This paper empirically compares these strategies by training rerankers of different sizes and architectures using both methods on the same data, with a strong contrastive learning model acting as the distillation teacher. Our results show that knowledge distillation generally yields better in-domain and out-of-domain ranking performance than contrastive learning when distilling from a larger teacher model. This finding is consistent across student model sizes and architectures. However, distilling from a teacher of the same capacity does not provide the same advantage, particularly for out-of-domain tasks. These findings offer practical guidance for choosing a training strategy based on available teacher models. Therefore, we recommend using knowledge distillation to train smaller rerankers if a larger, more powerful teacher is accessible; in its absence, contrastive learning provides a strong and more reliable alternative.
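
To make the two training signals concrete, the sketch below computes both losses for a single query with a handful of candidate passages. It assumes a common formulation (softmax cross-entropy over one positive and several negatives for contrastive learning, and KL divergence from teacher to student score distributions for distillation). The abstract does not state the exact losses used in the paper, so this is an illustration of the general idea only, and all names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of relevance scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_loss(student_scores, positive_index):
    """Cross-entropy against the ground-truth positive passage."""
    probs = softmax(student_scores)
    return -math.log(probs[positive_index])

def distillation_loss(student_scores, teacher_scores):
    """KL(teacher || student) over the candidate list for one query."""
    p_teacher = softmax(teacher_scores)
    p_student = softmax(student_scores)
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)

# One query, three candidates; index 0 is the labelled positive.
student = [2.0, 0.5, -1.0]
teacher = [3.0, 2.5, -2.0]   # the teacher also rates candidate 1 highly
cl = contrastive_loss(student, positive_index=0)
kd = distillation_loss(student, teacher)
```

The contrastive loss sees only the hard label, while the distillation loss also passes on the teacher's graded preferences among the unlabelled candidates.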

[32] What Factors Affect LLMs and RLLMs in Financial Question Answering?

Peng Wang,Xuesi Hu,Jiageng Wu,Yuntao Zou,Qiancheng Zhang,Dagang Li

Main category: cs.CL

TL;DR: This paper evaluates how prompting, agentic frameworks, and multilingual methods affect LLMs and RLLMs in finance tasks, showing that RLLMs naturally excel at Long CoT, limiting added value from traditional approaches.

Details

Motivation: There is limited research exploring how different methods can fully unlock the potential of LLMs and RLLMs in the financial domain, despite their growing importance. Method: The study uses five LLMs and three RLLMs to evaluate the effects of prompting methods, agentic frameworks, and multilingual alignment techniques on financial question-answering tasks. Result: 1) Prompting methods and agent frameworks improve LLM performance by simulating Long CoT; 2) RLLMs inherently possess strong Long CoT abilities; 3) Multilingual alignment methods primarily enhance LLMs by extending reasoning length. Conclusion: RLLMs have inherent Long CoT capabilities, making conventional methods less effective in enhancing their performance. Multilingual alignment methods mainly benefit LLMs by extending reasoning length, offering minimal gains for RLLMs. Abstract: Recently, the development of large language models (LLMs) and reasoning large language models (RLLMs) has gained considerable attention from many researchers. RLLMs enhance the reasoning capabilities of LLMs through Long Chain-of-Thought (Long CoT) processes, significantly improving the performance of LLMs in addressing complex problems. However, there are few works that systematically explore what methods can fully unlock the performance of LLMs and RLLMs within the financial domain. To investigate the impact of various methods on LLMs and RLLMs, we utilize five LLMs and three RLLMs to assess the effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question-answering tasks. Our research findings indicate: (1) Current prompting methods and agent frameworks enhance the performance of LLMs in financial question answering by simulating Long CoT; (2) RLLMs possess inherent Long CoT capabilities, which limits the effectiveness of conventional methods in further enhancing their performance; (3) Current advanced multilingual alignment methods primarily improve the multilingual performance of LLMs by extending the reasoning length, which yields minimal benefits for RLLMs. We hope that this study can serve as an important reference for LLMs and RLLMs in the field of financial question answering.

[33] Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization

Itai Mondshine,Tzuf Paz-Argaman,Reut Tsarfaty

Main category: cs.CL

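TL;DR: This paper studies how well automatic evaluation metrics work across languages, showing that n-gram-based metrics perform poorly for fusional languages and that neural metrics are better suited to low-resource languages.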

Details

Motivation: Although automatic n-gram metrics such as ROUGE are widely used to evaluate generation tasks such as summarization, their suitability for languages other than English remains unclear. Method: The paper designs a large-scale evaluation suite covering eight languages from different typological families and compares how well n-gram-based and neural metrics correlate with human judgments. Result: n-gram metrics correlate less well with human assessments in fusional languages, and proper tokenization can mitigate the problem. In addition, neural metrics trained specifically for evaluation (such as COMET) perform better in low-resource languages. Conclusion: The paper concludes that n-gram metrics are limited for fusional languages and advocates greater investment in neural metrics trained for evaluation tasks. Abstract: Automatic n-gram based metrics such as ROUGE are widely used for evaluating generative tasks such as summarization. While these metrics are considered indicative (even if imperfect) of human evaluation for English, their suitability for other languages remains unclear. To address this, we systematically assess evaluation metrics for generation, both n-gram-based and neural-based, to evaluate their effectiveness across languages and tasks. Specifically, we design a large-scale evaluation suite across eight languages from four typological families: agglutinative, isolating, low-fusional, and high-fusional, spanning both low- and high-resource settings, to analyze their correlation with human judgments. Our findings highlight the sensitivity of evaluation metrics to the language type. For example, in fusional languages, n-gram-based metrics show lower correlation with human assessments compared to isolating and agglutinative languages. We also demonstrate that proper tokenization can significantly mitigate this issue for morphologically rich fusional languages, sometimes even reversing negative trends. Additionally, we show that neural-based metrics specifically trained for evaluation, such as COMET, consistently outperform other neural metrics and better correlate with human judgments in low-resource languages. Overall, our analysis highlights the limitations of n-gram metrics for fusional languages and advocates for greater investment in neural-based metrics trained for evaluation tasks.
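
The sensitivity of n-gram metrics to tokenization can be illustrated with a simplified ROUGE-N recall. The sketch below is a bare-bones version (no stemming, no F-measure, no multi-reference handling), and the English example with whitespace versus character tokenization is an assumption chosen only to show the effect of inflected word forms; it is not the tokenization studied in the paper.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference_tokens, candidate_tokens, n=1):
    """Simplified ROUGE-N recall: clipped n-gram overlap / reference n-grams."""
    ref = ngrams(reference_tokens, n)
    cand = ngrams(candidate_tokens, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "the cats walked home"
candidate = "the cat walks home"

# Whole-word tokens: inflected forms ("cats"/"cat") count as complete misses.
word_recall = rouge_n_recall(reference.split(), candidate.split(), n=1)

# Character tokens: shared stems still contribute to the overlap.
char_recall = rouge_n_recall(list(reference), list(candidate), n=1)
```

Here the word-level score penalizes every changed inflection in full, while a finer-grained tokenization recovers the shared material, which is the kind of effect the abstract attributes to proper tokenization in morphologically rich languages.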

[34] Exploring Design of Multi-Agent LLM Dialogues for Research Ideation

Keisuke Ueda,Wataru Hirota,Takuto Asakura,Takahiro Omi,Kosuke Takahashi,Kosuke Arima,Tatsuya Ishigaki

Main category: cs.CL

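TL;DR: This paper studies how to design multi-agent LLM systems for research ideation, finding that more agents, deeper interaction, and more diverse personas increase idea diversity, while more diverse critics improve the feasibility of proposals.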

Details

Motivation: Although recent work shows that structured dialogues between LLMs can improve the novelty and feasibility of generated ideas, the optimal design of such interactions remains unclear. Method: The study compares how different agent-role configurations, numbers of agents, and dialogue depths affect the novelty and feasibility of generated ideas, using a setup in which one agent generates ideas and another critiques them to enable iterative improvement. Result: Enlarging the agent cohort, deepening interaction, and broadening agent persona heterogeneity each enrich the diversity of generated ideas. In addition, increasing critic-side diversity within the ideation-critique-revision loop further boosts the feasibility of final proposals. Conclusion: The study offers practical guidelines for building effective multi-agent LLM systems for scientific ideation: increase the number of agents, interaction depth, and persona heterogeneity to raise idea diversity, and increase critic-side diversity within the ideation-critique-revision loop to raise the feasibility of final proposals. Abstract: Large language models (LLMs) are increasingly used to support creative tasks such as research idea generation. While recent work has shown that structured dialogues between LLMs can improve the novelty and feasibility of generated ideas, the optimal design of such interactions remains unclear. In this study, we conduct a comprehensive analysis of multi-agent LLM dialogues for scientific ideation. We compare different configurations of agent roles, number of agents, and dialogue depth to understand how these factors influence the novelty and feasibility of generated ideas. Our experimental setup includes settings where one agent generates ideas and another critiques them, enabling iterative improvement. Our results show that enlarging the agent cohort, deepening the interaction depth, and broadening agent persona heterogeneity each enrich the diversity of generated ideas. Moreover, specifically increasing critic-side diversity within the ideation-critique-revision loop further boosts the feasibility of the final proposals. Our findings offer practical guidelines for building effective multi-agent LLM systems for scientific ideation. Our code is available at https://github.com/g6000/MultiAgent-Research-Ideator.

[35] The Curious Case of Factuality Finetuning: Models' Internal Beliefs Can Improve Factuality

Benjamin Newman,Abhilasha Ravichander,Jaehun Jung,Rui Xin,Hamish Ivison,Yegor Kuznetsov,Pang Wei Koh,Yejin Choi

Main category: cs.CL

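TL;DR: This paper examines how the choice of finetuning data affects a language model's ability to generate factual content, finding that finetuning on model-generated data the model itself believes to be factual works best.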

Details

Motivation: To reduce "hallucination" in language model generation (producing untrue information) and to identify the most effective type of finetuning data. Method: The study compares how different types of finetuning data (such as high-quality factual data and model-generated data) affect hallucination, and evaluates several filtering strategies. Result: Finetuning on model-generated data that the model believes to be factual significantly improves the factual accuracy of generated content across multiple domains. Conclusion: A model's own judgments can serve as an effective signal for filtering finetuning data, helping improve the factuality of generated content. Abstract: Language models are prone to hallucination - generating text that is factually incorrect. Finetuning models on high-quality factual information can potentially reduce hallucination, but concerns remain; obtaining factual gold data can be expensive and training on correct but unfamiliar data may potentially lead to even more downstream hallucination. What data should practitioners finetune on to mitigate hallucinations in language models? In this work, we study the relationship between the factuality of finetuning data and the prevalence of hallucinations in long-form generation tasks. Counterintuitively, we find that finetuning on factual gold data is not as helpful as finetuning on model-generated data that models believe to be factual. Next, we evaluate filtering strategies applied on both factual gold data and model-generated data, and find that finetuning on model-generated data that is filtered by models' own internal judgments often leads to better overall factuality compared to other configurations: training on gold data filtered by models' judgments, training on gold data alone, or training on model-generated data that is supported by gold data. These factuality improvements transfer across three domains we study, suggesting that a model's own beliefs can provide a powerful signal for factuality.

[36] A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities

Lu Xiang,Yang Zhao,Yaping Zhang,Chengqing Zong

Main category: cs.CL

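TL;DR: This survey reviews the application of large language models (LLMs) in interdisciplinary research, analyzes their technical methods and practical contributions, and outlines future research directions.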

Details

Motivation: Although LLMs show interdisciplinary potential, their systematic integration into different disciplines remains understudied, so a comprehensive review of technical developments and applications is needed. Method: From a technical perspective, the survey analyzes key techniques including supervised fine-tuning, retrieval-augmented generation, agent-based approaches, and tool-use integration; from an application perspective, it examines the role of LLMs in mathematics, physics, chemistry, biology, and the humanities and social sciences. Result: The survey summarizes the key technical methods and application areas of LLMs in interdisciplinary research and shows their contributions to discipline-specific tasks. Conclusion: The survey aims to give researchers a comprehensive overview of the technical developments and applications of LLMs in interdisciplinary fields, emphasizing their adaptability and effectiveness across disciplines and identifying current challenges and future research directions. Abstract: Large Language Models (LLMs) have demonstrated their transformative potential across numerous disciplinary studies, reshaping the existing research methodologies and fostering interdisciplinary collaboration. However, a systematic understanding of their integration into diverse disciplines remains underexplored. This survey paper provides a comprehensive overview of the application of LLMs in interdisciplinary studies, categorising research efforts from both a technical perspective and with regard to their applicability. From a technical standpoint, key methodologies such as supervised fine-tuning, retrieval-augmented generation, agent-based approaches, and tool-use integration are examined, which enhance the adaptability and effectiveness of LLMs in discipline-specific contexts. From the perspective of their applicability, this paper explores how LLMs are contributing to various disciplines including mathematics, physics, chemistry, biology, and the humanities and social sciences, demonstrating their role in discipline-specific tasks. The prevailing challenges are critically examined and the promising research directions are highlighted alongside the recent advances in LLMs. By providing a comprehensive overview of the technical developments and applications in this field, this survey aims to serve as an invaluable resource for the researchers who are navigating the complex landscape of LLMs in the context of interdisciplinary studies.

[37] ChainEdit: Propagating Ripple Effects in LLM Knowledge Editing through Logical Rule-Guided Chains

Zilu Dong,Xiangqing Shen,Zinong Yang,Rui Xia

Main category: cs.CL

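TL;DR: Proposes ChainEdit, which addresses logical consistency in LLM knowledge editing, significantly improving logical generalization while preserving editing quality.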

Details

Motivation: Current knowledge editing methods for large language models (LLMs) struggle to maintain logical consistency when propagating ripple effects to related facts. Method: By automatically extracting logical patterns from structured knowledge bases and aligning them with the LLM's internal logic, ChainEdit dynamically generates and edits logically connected knowledge clusters. Result: Experiments show that ChainEdit improves logical generalization by more than 30% over baselines while preserving editing reliability and specificity, and sets a new state of the art on ripple-effect tasks. Conclusion: ChainEdit is a framework that combines knowledge-graph-derived logical rules with LLM logical reasoning to perform systematic chain updates and maintain internal logical consistency after editing. Abstract: Current knowledge editing methods for large language models (LLMs) struggle to maintain logical consistency when propagating ripple effects to associated facts. We propose ChainEdit, a framework that synergizes knowledge graph-derived logical rules with LLM logical reasoning capabilities to enable systematic chain updates. By automatically extracting logical patterns from structured knowledge bases and aligning them with LLMs' internal logics, ChainEdit dynamically generates and edits logically connected knowledge clusters. Experiments demonstrate an improvement of more than 30% in logical generalization over baselines while preserving editing reliability and specificity. We further address evaluation biases in existing benchmarks through knowledge-aware protocols that disentangle external dependencies. This work establishes new state-of-the-art performance on ripple effect while ensuring internal logical consistency after knowledge editing.

[38] Finding Common Ground: Using Large Language Models to Detect Agreement in Multi-Agent Decision Conferences

Selina Heller,Mohamed Ibrahim,David Antony Selby,Sebastian Vollmer

Main category: cs.CL

TL;DR: This paper presents an LLM-based multi-agent system that simulates decision conferences by detecting agreement among agents, showing that these systems can improve group decision-making and support expert elicitation in various domains.

Details

Motivation: The motivation stems from the need to simulate real-world decision conferences using advanced AI techniques, particularly collaborative multi-agent systems, to enhance decision-making processes in complex scenarios. Method: The authors evaluate six distinct LLMs on stance detection and stance polarity detection tasks and incorporate an agreement-detection agent within the system to assess its effectiveness in simulating group decision-making processes. Result: LLMs were found to reliably detect agreement in nuanced debates, and incorporating an agreement-detection agent improved the efficiency and coherence of group deliberations, making them comparable to real-world decision conferences. Conclusion: The study concludes that LLM-based multi-agent systems can effectively simulate decision conferences by reliably detecting agreements in dynamic debates and improving the efficiency and quality of deliberations. Abstract: Decision conferences are structured, collaborative meetings that bring together experts from various fields to address complex issues and reach a consensus on recommendations for future actions or policies. These conferences often rely on facilitated discussions to ensure productive dialogue and collective agreement. Recently, Large Language Models (LLMs) have shown significant promise in simulating real-world scenarios, particularly through collaborative multi-agent systems that mimic group interactions. In this work, we present a novel LLM-based multi-agent system designed to simulate decision conferences, specifically focusing on detecting agreement among the participant agents. To achieve this, we evaluate six distinct LLMs on two tasks: stance detection, which identifies the position an agent takes on a given issue, and stance polarity detection, which identifies the sentiment as positive, negative, or neutral. These models are further assessed within the multi-agent system to determine their effectiveness in complex simulations. Our results indicate that LLMs can reliably detect agreement even in dynamic and nuanced debates. Incorporating an agreement-detection agent within the system can also improve the efficiency of group debates and enhance the overall quality and coherence of deliberations, making them comparable to real-world decision conferences regarding outcome and decision-making. These findings demonstrate the potential for LLM-based multi-agent systems to simulate group decision-making processes. They also highlight that such systems could be instrumental in supporting decision-making with expert elicitation workshops across various domains.

[39] Diagnosing Failures in Large Language Models' Answers: Integrating Error Attribution into Evaluation Framework

Zishan Xu,Shuyi Xie,Qingsong Lv,Shupei Xiao,Linlin Song,Sui Wenjuan,Fan Lin

Main category: cs.CL

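TL;DR: This paper proposes a comprehensive framework and dataset for error attribution in large language models, and develops the first general-purpose judge model that can simultaneously generate scores, error attributions, and feedback.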

Details

Motivation: To analyze model performance efficiently and diagnose failures in answers, an automated framework is needed to systematically categorize and attribute errors, a capability that existing evaluation models lack. Method: The authors establish a comprehensive error attribution framework with 6 primary and 15 secondary categories, design AttriData, a dataset dedicated to error attribution based on this framework, and propose MisAttributionLLM, a model fine-tuned on AttriData. Result: Extensive experiments and analyses confirm the effectiveness and robustness of the proposed method. Conclusion: The study proposes a comprehensive framework for error attribution in large language models and develops the first general-purpose judge model that can simultaneously generate scores, error attributions, and feedback. Abstract: With the widespread application of Large Language Models (LLMs) in various tasks, the mainstream LLM platforms generate massive user-model interactions daily. In order to efficiently analyze the performance of models and diagnose failures in their answers, it is essential to develop an automated framework to systematically categorize and attribute errors. However, existing evaluation models lack error attribution capability. In this work, we establish a comprehensive Misattribution Framework with 6 primary and 15 secondary categories to facilitate in-depth analysis. Based on this framework, we present AttriData, a dataset specifically designed for error attribution, encompassing misattribution, along with the corresponding scores and feedback. We also propose MisAttributionLLM, a fine-tuned model on AttriData, which is the first general-purpose judge model capable of simultaneously generating score, misattribution, and feedback. Extensive experiments and analyses are conducted to confirm the effectiveness and robustness of our proposed method.

[40] Using Large Language Models for Legal Decision-Making in Austrian Value-Added Tax Law: An Experimental Study

Marina Luketina,Andrea Benkel,Christoph G. Schuetz

Main category: cs.CL

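TL;DR: This paper examines the ability of large language models (LLMs) to assist legal decision-making within Austrian and European Union VAT law, evaluating LLMs experimentally with fine-tuning and retrieval-augmented generation (RAG); the results indicate that LLMs can effectively support tax professionals in VAT tasks and provide legally grounded decisions, but full automation has not yet been achieved.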

Details

Motivation: Clients usually describe cases in natural language, making LLMs a prime candidate for supporting automated decision-making and reducing the workload of tax professionals. Method: LLMs are evaluated experimentally using fine-tuning and retrieval-augmented generation (RAG). Result: The findings highlight the potential of LLMs to support tax consultants by automating routine tasks and providing initial analyses, although challenges remain in handling implicit client knowledge and context-specific documentation. Conclusion: Properly configured LLMs can effectively support tax professionals in VAT tasks and provide legally grounded decisions, but current prototypes are not ready for full automation. Abstract: This paper provides an experimental evaluation of the capability of large language models (LLMs) to assist in legal decision-making within the framework of Austrian and European Union value-added tax (VAT) law. In tax consulting practice, clients often describe cases in natural language, making LLMs a prime candidate for supporting automated decision-making and reducing the workload of tax professionals. Given the requirement for legally grounded and well-justified analyses, the propensity of LLMs to hallucinate presents a considerable challenge. The experiments focus on two common methods for enhancing LLM performance: fine-tuning and retrieval-augmented generation (RAG). In this study, these methods are applied on both textbook cases and real-world cases from a tax consulting firm to systematically determine the best configurations of LLM-based systems and assess the legal-reasoning capabilities of LLMs. The findings highlight the potential of using LLMs to support tax consultants by automating routine tasks and providing initial analyses, although current prototypes are not ready for full automation due to the sensitivity of the legal domain. The findings indicate that LLMs, when properly configured, can effectively support tax professionals in VAT tasks and provide legally grounded justifications for decisions. However, limitations remain regarding the handling of implicit client knowledge and context-specific documentation, underscoring the need for future integration of structured background information.

[41] ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition

Qingliang Meng,Hao Wu,Wei Liang,Wei Xu,Qing Zhao

Main category: cs.CL

TL;DR: This research proposes Iterative LoRA Training (ILT) with Iterative Pseudo Labeling to improve model performance by overcoming overfitting issues in Low-Rank Adaptation during supervised fine-tuning, showing promising results in the Interspeech 2025 challenge.

Details

Motivation: To overcome the overfitting issue commonly observed in Low-Rank Adaptation during the supervised fine-tuning stage and enhance the theoretical upper bound of model performance. Method: An innovative training paradigm called Iterative LoRA Training (ILT) combined with Iterative Pseudo Labeling is proposed to address overfitting in Low-Rank Adaptation during supervised fine-tuning. Systematic experiments are conducted using a three-stage training process: Focus Training, Feed Back Training, and Fix Training, based on Whisper-large-v3 and Qwen2-Audio. Result: Experimental results demonstrate the effectiveness of the proposed method. The technique achieved 4th place in Track 1 (Multilingual ASR Task) and 1st place in Track 2 (Speech Separation and Recognition Task) in the Interspeech 2025 Multilingual Conversational Speech Language Modeling Challenge. Conclusion: The proposed Iterative LoRA Training (ILT) combined with an Iterative Pseudo Labeling strategy effectively enhances model performance and demonstrates practical feasibility and strong application potential, as evidenced by the results in the Interspeech 2025 challenge. Abstract: The deep integration of large language models and automatic speech recognition systems has become a promising research direction with high practical value. To address the overfitting issue commonly observed in Low-Rank Adaptation (LoRA) during the supervised fine-tuning (SFT) stage, this work proposes an innovative training paradigm Iterative LoRA Training (ILT) in combination with an Iterative Pseudo Labeling strategy, effectively enhancing the theoretical upper bound of model performance. Based on Whisper-large-v3 and Qwen2-Audio, we conduct systematic experiments using a three-stage training process: Focus Training, Feed Back Training, and Fix Training. Experimental results demonstrate the effectiveness of the proposed method. Furthermore, the MegaAIS research team applied this technique in the Interspeech 2025 Multilingual Conversational Speech Language Modeling Challenge (MLC-SLM), achieving 4th place in Track 1 (Multilingual ASR Task) and 1st place in Track 2 (Speech Separation and Recognition Task), showcasing the practical feasibility and strong application potential of our approach.

[42] Enhancing Essay Cohesion Assessment: A Novel Item Response Theory Approach

Bruno Alexandre Rosa,Hilário Oliveira,Luiz Rodrigues,Eduardo Araujo Oliveira,Rafael Ferreira Mello

Main category: cs.CL

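TL;DR: This study incorporates item response theory to improve machine learning models' automatic assessment of essay cohesion, achieving better results.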

Details

Motivation: Automatically assessing cohesion in essays is a challenge in educational artificial intelligence, and conventional machine learning algorithms usually do not consider the individual characteristics of the instances in the analysed corpus. Method: 325 linguistic features were extracted from 6,563 ENEM-style essays and 1,235 Brazilian Portuguese narrative essays, the problem was treated as a machine learning regression task, and item response theory was introduced to adjust the scores. Result: Experiments show that the approach outperforms conventional machine learning models and ensemble methods on several evaluation metrics. Conclusion: The study proposes a cohesion score prediction approach based on item response theory and shows that it outperforms conventional machine learning models and ensemble methods. Abstract: Essays are considered a valuable mechanism for evaluating learning outcomes in writing. Textual cohesion is an essential characteristic of a text, as it facilitates the establishment of meaning between its parts. Automatically scoring cohesion in essays presents a challenge in the field of educational artificial intelligence. The machine learning algorithms used to evaluate texts generally do not consider the individual characteristics of the instances that comprise the analysed corpus. In this sense, item response theory can be adapted to the context of machine learning, characterising the ability, difficulty and discrimination of the models used. This work proposes and analyses the performance of a cohesion score prediction approach based on item response theory to adjust the scores generated by machine learning models. In this study, the corpus selected for the experiments consisted of the extended Essay-BR, which includes 6,563 essays in the style of the National High School Exam (ENEM), and the Brazilian Portuguese Narrative Essays, comprising 1,235 essays written by 5th to 9th grade students from public schools. We extracted 325 linguistic features and treated the problem as a machine learning regression task. The experimental results indicate that the proposed approach outperforms conventional machine learning models and ensemble methods in several evaluation metrics. This research explores a potential approach for improving the automatic evaluation of cohesion in educational essays.
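
The abstract refers to ability, difficulty, and discrimination, which are the parameters of the standard two-parameter logistic (2PL) model from item response theory. The sketch below shows that standard model only. How the paper maps models and essays onto respondents and items, and how it uses the fitted parameters to adjust scores, is not stated in the abstract, so the function name and the example values are purely illustrative assumptions.

```python
import math

def two_pl_probability(ability, difficulty, discrimination):
    """Standard 2PL IRT model: probability of a correct response.

    P = 1 / (1 + exp(-a * (theta - b))), with ability theta,
    item difficulty b, and item discrimination a.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# Illustrative values (assumed): one hard and one easy item,
# answered by a weaker and a stronger respondent.
hard_item = {"difficulty": 1.5, "discrimination": 2.0}
easy_item = {"difficulty": -1.0, "discrimination": 2.0}

p_weak_hard = two_pl_probability(0.0, **hard_item)
p_strong_hard = two_pl_probability(2.0, **hard_item)
p_weak_easy = two_pl_probability(0.0, **easy_item)
```

Under this model a higher discrimination makes the probability change more sharply around the item's difficulty, so highly discriminating items separate respondents of similar ability more clearly.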

[43] A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench

David Schlangen,Sherzod Hakimov,Jonathan Jordan,Philipp Sadler

Main category: cs.CL

TL;DR: This paper introduces clembench, a flexible and reusable framework for evaluating large language models through dialogue games, combining control and real-world relevance.

Details

Motivation: Current evaluation paradigms for LLMs have limitations in combining control, ecological validity, and goal-directed interactions, which dialogue game-based evaluation aims to address. Method: Introducing clembench, a framework in continuous development since 2023, optimized for general use and extensibility. Result: clembench enables users to benchmark models using predefined game instances and extend the framework with custom tests. Conclusion: clembench provides a mature and easily reusable implementation for dialogue game-based evaluation of LLMs, enabling both benchmarking and extensibility. Abstract: There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation. The first, carried over from the evaluation of machine learning models in general, relies on pre-defined task instances, for which reference task executions are available. The second, best exemplified by the LM-arena, relies on (often self-selected) users bringing their own intents to a site that routes these to several models in parallel, among whose responses the user then selects their most preferred one. The former paradigm hence excels at control over what is tested, while the latter comes with higher ecological validity, testing actual use cases interactively. Recently, a third complementary paradigm has emerged that combines some of the strengths of these approaches, offering control over multi-turn, reference-free, repeatable interactions, while stressing goal-directedness: dialogue game based evaluation. While the utility of this approach has been shown by several projects, its adoption has been held back by the lack of a mature, easily re-usable implementation. In this paper, we present clembench, which has been in continuous development since 2023 and has in its latest release been optimized for ease of general use. We describe how it can be used to benchmark one's own models (using a provided set of benchmark game instances in English), as well as how easily the benchmark itself can be extended with new, tailor-made targeted tests.

[44] LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning

Shibo Sun,Xue Li,Donglin Di,Mingjie Wei,Lanshun Nie,Wei-Nan Zhang,Dechen Zhan,Yang Song,Lei Fan

Main category: cs.CL

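TL;DR: This paper proposes LLaPa, a new framework that combines vision-language models with two auxiliary modules and effectively improves the quality of multimodal procedural planning.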

Details

Motivation: Large language models (LLMs) have advanced procedural planning for embodied AI systems through strong reasoning abilities, but the integration of multimodal inputs and counterfactual reasoning remains underexplored. Method: Vision-language models (VLMs) are used to generate executable action sequences from textual task descriptions and visual environment images, and two auxiliary modules, the Task-Environment Reranker (TER) and the Counterfactual Activities Retriever (CAR), are introduced to improve procedural planning. Result: Experiments show that LLaPa outperforms advanced models in LCS and correctness, generating higher-quality plans. Conclusion: LLaPa is a vision-language model framework for multimodal procedural planning whose two auxiliary modules improve plan quality on the ActPlan-1K and ALFRED benchmarks. Abstract: While large language models (LLMs) have advanced procedural planning for embodied AI systems through strong reasoning abilities, the integration of multimodal inputs and counterfactual reasoning remains underexplored. To tackle these challenges, we introduce LLaPa, a vision-language model framework designed for multimodal procedural planning. LLaPa generates executable action sequences from textual task descriptions and visual environmental images using vision-language models (VLMs). Furthermore, we enhance LLaPa with two auxiliary modules to improve procedural planning. The first module, the Task-Environment Reranker (TER), leverages task-oriented segmentation to create a task-sensitive feature space, aligning textual descriptions with visual environments and emphasizing critical regions for procedural execution. The second module, the Counterfactual Activities Retriever (CAR), identifies and emphasizes potential counterfactual conditions, enhancing the model's reasoning capability in counterfactual scenarios. Extensive experiments on ActPlan-1K and ALFRED benchmarks demonstrate that LLaPa generates higher-quality plans with superior LCS and correctness, outperforming advanced models. The code and models are available at https://github.com/sunshibo1234/LLaPa.

[45] Semantic-Augmented Latent Topic Modeling with LLM-in-the-Loop

Mengze Hong,Chen Jason Zhang,Di Jiang

Main category: cs.CL

TL;DR: This paper investigates integrating Large Language Models (LLMs) into Latent Dirichlet Allocation (LDA) for improved topic modeling, finding that LLM-based post-correction enhances coherence while LLM-guided initialization does not improve long-term convergence.

Details

Motivation: The motivation stems from the high dependency of LDA on initialization quality and the growing influence of LLMs in natural language processing. The authors aim to enhance traditional topic modeling with LLM capabilities and explore whether such integration leads to better performance. Method: The paper proposes two strategies for integrating Large Language Models (LLMs) into the Latent Dirichlet Allocation (LDA) process: LLM-guided initialization of the Gibbs sampling algorithm and LLM-enabled post-correction. The effectiveness is evaluated through extensive experiments on topic coherence. Result: Experimental results show that the LLM-guided initialization improves early iterations of LDA but does not affect convergence and performs worse than baselines. Conversely, LLM-enabled post-correction achieves a 5.86% improvement in coherence evaluation. Conclusion: The study concludes that while LLM-guided initialization does not significantly impact LDA convergence, LLM-enabled post-correction improves topic coherence, challenging the assumption that LLMs are always superior in text mining tasks. Abstract: Latent Dirichlet Allocation (LDA) is a prominent generative probabilistic model used for uncovering abstract topics within document collections. In this paper, we explore the effectiveness of augmenting topic models with Large Language Models (LLMs) through integration into two key phases: Initialization and Post-Correction. Since the LDA is highly dependent on the quality of its initialization, we conduct extensive experiments on the LLM-guided topic clustering for initializing the Gibbs sampling algorithm. Interestingly, the experimental results reveal that while the proposed initialization strategy improves the early iterations of LDA, it has no effect on the convergence and yields the worst performance compared to the baselines. The LLM-enabled post-correction, on the other hand, achieved a promising improvement of 5.86% in the coherence evaluation. 
These results highlight the practical benefits of the LLM-in-the-loop approach and challenge the belief that LLMs are always the superior text mining alternative.
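
As a rough illustration of what guided initialization of Gibbs sampling can look like, the sketch below seeds the topic assignments from a word-to-topic mapping instead of drawing them all at random. The mapping `word_to_topic` and both function names are hypothetical; how the paper actually derives its LLM-guided clusters is not described in the abstract.

```python
import random
from collections import Counter


def init_assignments(docs, num_topics, word_to_topic=None, seed=0):
    """Return initial topic assignments z[d][i] for collapsed Gibbs sampling.

    word_to_topic is a hypothetical hint table (for example, built from
    LLM-produced word clusters). Words without a hint get a random topic,
    which is the usual uninformed initialization.
    """
    rng = random.Random(seed)
    hints = word_to_topic or {}
    assignments = []
    for doc in docs:
        row = []
        for word in doc:
            if word in hints:
                row.append(hints[word])
            else:
                row.append(rng.randrange(num_topics))
        assignments.append(row)
    return assignments


def count_tables(docs, assignments, num_topics):
    """Build the document-topic and topic-word counts the sampler updates."""
    doc_topic = [Counter(row) for row in assignments]
    topic_word = [Counter() for _ in range(num_topics)]
    for doc, row in zip(docs, assignments):
        for word, topic in zip(doc, row):
            topic_word[topic][word] += 1
    return doc_topic, topic_word
```

The sampler itself is unchanged by this choice; only its starting state differs, which is consistent with the reported effect being limited to early iterations.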

[46] PromotionGo at SemEval-2025 Task 11: A Feature-Centric Framework for Cross-Lingual Multi-Emotion Detection in Short Texts

Ziyi Huang,Xia Cui

Main category: cs.CL

TL;DR: This paper proposes a feature-centric framework for multi-label emotion detection in SemEval 2025 Task 11.

Details

Motivation: To address the challenges that linguistic diversity and resource constraints pose for text-based emotion detection. Method: The authors design a feature-centric framework that dynamically adapts document representations and learning algorithms, and evaluate three components: document representation, dimensionality reduction, and model training. Result: TF-IDF is highly effective for low-resource languages, while embedding methods such as FastText and Sentence-BERT show language-specific strengths; PCA reduces training time without compromising performance. Conclusion: The framework provides a scalable solution for multilingual emotion detection. Abstract: This paper presents our system for SemEval 2025 Task 11: Bridging the Gap in Text-Based Emotion Detection (Track A), which focuses on multi-label emotion detection in short texts. We propose a feature-centric framework that dynamically adapts document representations and learning algorithms to optimize language-specific performance. Our study evaluates three key components: document representation, dimensionality reduction, and model training in 28 languages, highlighting five for detailed analysis. The results show that TF-IDF remains highly effective for low-resource languages, while contextual embeddings like FastText and transformer-based document representations, such as those produced by Sentence-BERT, exhibit language-specific strengths. Principal Component Analysis (PCA) reduces training time without compromising performance, particularly benefiting FastText and neural models such as Multi-Layer Perceptrons (MLP). Computational efficiency analysis underscores the trade-off between model complexity and processing cost. Our framework provides a scalable solution for multilingual emotion detection, addressing the challenges of linguistic diversity and resource constraints.

[47] The AI Language Proficiency Monitor -- Tracking the Progress of LLMs on Multilingual Benchmarks

David Pomerenke,Jonas Nothnagel,Simon Ostermann

Main category: cs.CL

TL;DR: This paper introduces the AI Language Proficiency Monitor, a benchmark for evaluating large language models across up to 200 languages, especially emphasizing low-resource ones.

Details

Motivation: To ensure equitable access to the benefits of large language models (LLMs) across the world's languages, particularly focusing on low-resource languages. Method: A comprehensive multilingual benchmark that aggregates diverse tasks like translation, question answering, math, and reasoning using datasets such as FLORES+, MMLU, GSM8K, TruthfulQA, and ARC. Result: An open-source, auto-updating leaderboard and dashboard offering descriptive insights into model performance, including a global proficiency map and trends over time. Conclusion: The AI Language Proficiency Monitor aims to foster transparency, inclusivity, and progress in multilingual AI by evaluating LLMs across up to 200 languages. Abstract: To ensure equitable access to the benefits of large language models (LLMs), it is essential to evaluate their capabilities across the world's languages. We introduce the AI Language Proficiency Monitor, a comprehensive multilingual benchmark that systematically assesses LLM performance across up to 200 languages, with a particular focus on low-resource languages. Our benchmark aggregates diverse tasks including translation, question answering, math, and reasoning, using datasets such as FLORES+, MMLU, GSM8K, TruthfulQA, and ARC. We provide an open-source, auto-updating leaderboard and dashboard that supports researchers, developers, and policymakers in identifying strengths and gaps in model performance. In addition to ranking models, the platform offers descriptive insights such as a global proficiency map and trends over time. By complementing and extending prior multilingual benchmarks, our work aims to foster transparency, inclusivity, and progress in multilingual AI. The system is available at https://huggingface.co/spaces/fair-forward/evals-for-every-language.

[48] DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures

Benno Uthayasooriyar,Antoine Ly,Franck Vermet,Caio Corro

Main category: cs.CL

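TL;DR: DocPolarBERT is a new layout-aware BERT model whose self-attention uses text block positions in a relative polar coordinate system, removing the need for absolute 2D positional embeddings and achieving state-of-the-art results with a smaller pre-training dataset.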

Details

Motivation: To eliminate the need for absolute 2D positional embeddings and to explore a model design that performs well even with a smaller pre-training dataset. Method: The paper introduces DocPolarBERT, a layout-aware BERT model that extends self-attention to take text block positions into account in a relative polar coordinate system rather than a Cartesian one. Result: Although its pre-training dataset is less than one-sixth the size of the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. Conclusion: Through a carefully designed attention mechanism, DocPolarBERT offers an efficient and effective alternative for document understanding. Abstract: We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take into account text block positions in a relative polar coordinate system rather than the Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
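
To make the idea of a relative polar position concrete, here is a minimal sketch that converts a pair of bounding boxes into a distance and an angle between their centres. The function names, the `(x0, y0, x1, y1)` box format, and the bucketing of angles into sectors are assumptions for illustration; the abstract does not specify how DocPolarBERT encodes these quantities inside attention.

```python
import math


def relative_polar(box_a, box_b):
    """Distance and angle from the centre of box_a to the centre of box_b.

    Boxes are (x0, y0, x1, y1) tuples. The angle is returned in radians
    in the range [-pi, pi], as produced by math.atan2.
    """
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    dx, dy = bx - ax, by - ay
    return math.hypot(dx, dy), math.atan2(dy, dx)


def angle_bucket(theta, num_buckets=8):
    """Quantise an angle into equal sectors (a hypothetical discretisation
    that could, for instance, index a learned embedding table)."""
    fraction = (theta % (2 * math.pi)) / (2 * math.pi)
    return int(fraction * num_buckets) % num_buckets
```

Because both quantities depend only on the offset between two blocks, translating the whole page leaves them unchanged, which is the property that makes a relative encoding independent of absolute position.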

[49] A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

Marcin Pietroń,Rafał Olszowski,Jakub Gomułka,Filip Gampel,Andrzej Tomski

Main category: cs.CL

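TL;DR: This study evaluates several large language models on argument classification, finding that ChatGPT-4o and Deepseek-R1 perform best, while also identifying their error patterns and the limitations of prompt algorithms.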

Details

Motivation: To fill the research gap on how large language models operate on publicly available argument classification databases, and to evaluate these models against traditional methods and other deep learning models. Method: The study uses diverse datasets such as Args.me and UKP and tests several large language models, including versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thoughts algorithm. Result: ChatGPT-4o performs best on the argument classification benchmarks, while Deepseek-R1 is the strongest among models with reasoning capabilities. Even the best models still make common errors, and the study also identifies weaknesses of existing prompt algorithms. Conclusion: Although large language models such as ChatGPT-4o and Deepseek-R1 perform well on argument classification benchmarks, they still make errors. Known prompt algorithms also show weaknesses in argument analysis, and the paper indicates directions for their improvement. Abstract: Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which have enhanced the efficiency of analyzing and extracting argument semantics compared to traditional methods and other deep learning models. There are many benchmarks for testing and verifying the quality of LLMs, but there is still a lack of research and results on the operation of these models in publicly available argument classification databases. This paper presents a study of a selection of LLMs, using diverse datasets such as Args.me and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thoughts algorithm. The results indicate that ChatGPT-4o outperforms the others in the argument classification benchmarks. In the case of models incorporating reasoning capabilities, Deepseek-R1 shows its superiority. However, despite their superiority, GPT-4o and Deepseek-R1 still make errors. The most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLMs and prompt algorithms. 
The work also shows some weaknesses of known prompt algorithms in argument analysis, while indicating directions for their improvement. The added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.

[50] The Impact of Automatic Speech Transcription on Speaker Attribution

Cristina Aggazzotti,Matthew Wiesner,Elizabeth Allyn Smith,Nicholas Andrews

Main category: cs.CL

TL;DR: This paper studies the impact of ASR transcription errors on speaker attribution, finding that such errors can improve attribution by capturing speaker-specific features, making ASR-based attribution as effective as human-based methods.

Details

Motivation: Speaker attribution from speech transcripts is crucial when audio is unavailable or unreliable, but real-world scenarios often involve error-prone ASR-generated transcripts rather than human-annotated ones. Method: Comprehensive study comparing speaker attribution performance using human-transcribed data and ASR-transcribed data, analyzing the impact of transcription errors on attribution. Result: Speaker attribution is resilient to transcription errors, and ASR errors may actually aid in identifying speakers by capturing speaker-specific features. Conclusion: ASR transcription errors might reveal speaker-specific features that aid in speaker attribution. Abstract: Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g. deleted) or unreliable (e.g. anonymized speech). Prior work in this area has primarily focused on the feasibility of attributing speakers using transcripts produced by human annotators. However, in real-world settings, one often only has more errorful transcripts produced by automatic speech recognition (ASR) systems. In this paper, we conduct what is, to our knowledge, the first comprehensive study of the impact of automatic transcription on speaker attribution performance. In particular, we study the extent to which speaker attribution performance degrades in the face of transcription errors, as well as how properties of the ASR system impact attribution. We find that attribution is surprisingly resilient to word-level transcription errors and that the objective of recovering the true transcript is minimally correlated with attribution performance. 
Overall, our findings suggest that speaker attribution on more errorful transcripts produced by ASR is as good as, if not better than, attribution based on human-transcribed data, possibly because ASR transcription errors can capture speaker-specific features revealing of speaker identity.

[51] KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment

Jiyao Zhang,Chengli Zhong,Hui Xu,Qige Li,Yi Zhou

Main category: cs.CL

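TL;DR: This paper proposes KELPS, a new neuro-symbolic framework for translating informal mathematical data into multiple formal languages (Lean, Coq, and Isabelle), addressing the bottleneck caused by the limited quantity and quality of multilingual parallel corpora.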

Details

Motivation: Modern large language models (LLMs) have made notable progress in formalizing informal mathematics into machine-verifiable theorems, but these methods remain bottlenecked by the limited quantity and quality of multilingual parallel corpora. Method: KELPS is an iterative framework for translating, synthesizing, and filtering informal data into multiple formal languages. Natural language is first translated into Knowledge Equations (KEs), which are then converted to the target languages through rigorously defined rules that preserve both syntactic structure and semantic meaning. Result: The framework achieves 88.9% syntactic accuracy (pass@1) on MiniF2F, surpassing state-of-the-art models such as Deepseek-V3 (81%) and Herald (81.3%). Conclusion: KELPS offers an effective way to formalize informal mathematics into machine-verifiable theorems and performs well across multiple datasets. Abstract: Modern large language models (LLMs) show promising progress in formalizing informal mathematics into machine-verifiable theorems. However, these methods still face bottlenecks due to the limited quantity and quality of multilingual parallel corpora. In this paper, we propose a novel neuro-symbolic framework KELPS (Knowledge-Equation based Logical Processing System) to address these problems. KELPS is an iterative framework for translating, synthesizing, and filtering informal data into multiple formal languages (Lean, Coq, and Isabelle). First, we translate natural language into Knowledge Equations (KEs), a novel language that we designed, theoretically grounded in assertional logic. Next, we convert them to target languages through rigorously defined rules that preserve both syntactic structure and semantic meaning. This process yielded a parallel corpus of over 60,000 problems. Our framework achieves 88.9% syntactic accuracy (pass@1) on MiniF2F, outperforming SOTA models such as Deepseek-V3 (81%) and Herald (81.3%) across multiple datasets. All datasets and codes are available in the supplementary materials.

[52] KG-Attention: Knowledge Graph-Guided Attention at Test-Time via Bidirectional Information Aggregation

Songlin Zhai,Guilin Qi,Yuan Meng

Main category: cs.CL

TL;DR: This paper proposes KGA, a test-time knowledge graph-augmented framework for large language models that enables dynamic knowledge fusion through outward and inward aggregation pathways without parameter updates.

Details

Motivation: Existing KG-enhanced approaches for LLMs rely on parameter-intensive fine-tuning, which risks catastrophic forgetting and limits adaptability to real-time knowledge updates. This work aims to overcome these limitations by proposing a test-time framework that avoids parameter updates. Method: The method introduces a knowledge graph-guided attention (KGA) module that dynamically fuses external knowledge into input representations through outward and inward aggregation pathways. Outward aggregation integrates knowledge via input-driven KG fusion, while inward aggregation filters and refines representations based on KG guidance. Result: Extensive experiments on five benchmarks verify the effectiveness of the KGA module in achieving comparable knowledge fusion performance without modifying model parameters. Conclusion: The proposed KGA module enables real-time knowledge fusion for LLMs at test-time without requiring parameter updates, addressing issues of catastrophic forgetting and limited adaptability in existing methods. Abstract: Knowledge graphs (KGs) play a critical role in enhancing large language models (LLMs) by introducing structured and grounded knowledge into the learning process. However, most existing KG-enhanced approaches rely on parameter-intensive fine-tuning, which risks catastrophic forgetting and degrades the pretrained model's generalization. Moreover, they exhibit limited adaptability to real-time knowledge updates due to their static integration frameworks. To address these issues, we introduce the first test-time KG-augmented framework for LLMs, built around a dedicated knowledge graph-guided attention (KGA) module that enables dynamic knowledge fusion without any parameter updates. The proposed KGA module augments the standard self-attention mechanism with two synergistic pathways: outward and inward aggregation. 
Specifically, the outward pathway dynamically integrates external knowledge into input representations via input-driven KG fusion. The inward aggregation complements the outward pathway by refining input representations through KG-guided filtering, suppressing task-irrelevant signals and amplifying knowledge-relevant patterns. Importantly, while the outward pathway handles knowledge fusion, the inward path selects the most relevant triples and feeds them back into the fusion process, forming a closed-loop enhancement mechanism. By synergistically combining these two pathways, the proposed method supports real-time knowledge fusion exclusively at test-time, without any parameter modification. Extensive experiments on five benchmarks verify the comparable knowledge fusion performance of KGA.

[53] Multilingual Multimodal Software Developer for Code Generation

Linzheng Chai,Jian Yang,Shukai Liu,Wei Zhang,Liran Wang,Ke Jin,Tao Sun,Congnan Liu,Chenchen Zhang,Hualei Zhu,Jiaheng Liu,Xianjie Wu,Ge Zhang,Tianyu Liu,Zhoujun Li

Main category: cs.CL

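TL;DR: This paper introduces MM-Coder, a multilingual multimodal software developer that combines visual design inputs such as Unified Modeling Language (UML) diagrams and flowcharts with textual instructions to improve code generation accuracy and architectural alignment.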

Details

Motivation: Most large language models (LLMs) remain text-only, neglecting visual aids such as diagrams and flowcharts that are used in real-world software development. Method: The authors develop MMc-Instruct, a diverse multimodal instruction-tuning dataset that includes visual-workflow-based code generation, and introduce MMEval, a new benchmark for evaluating multimodal code generation. Result: Evaluations show that models still face significant challenges in precise visual information capture, instruction following, and advanced programming knowledge. Conclusion: MM-Coder aims to transform industrial programming by interpreting and implementing complex specifications conveyed through both text and visual designs. Abstract: The rapid advancement of Large Language Models (LLMs) has significantly improved code generation, yet most models remain text-only, neglecting crucial visual aids like diagrams and flowcharts used in real-world software development. To bridge this gap, we introduce MM-Coder, a Multilingual Multimodal software developer. MM-Coder integrates visual design inputs, namely Unified Modeling Language (UML) diagrams and flowcharts (termed Visual Workflow), with textual instructions to enhance code generation accuracy and architectural alignment. To enable this, we developed MMc-Instruct, a diverse multimodal instruction-tuning dataset including visual-workflow-based code generation, allowing MM-Coder to synthesize textual and graphical information like human developers, distinct from prior work on narrow tasks. Furthermore, we introduce MMEval, a new benchmark for evaluating multimodal code generation, addressing existing text-only limitations. Our evaluations using MMEval highlight significant remaining challenges for models in precise visual information capture, instruction following, and advanced programming knowledge. Our work aims to revolutionize industrial programming by enabling LLMs to interpret and implement complex specifications conveyed through both text and visual designs.

[54] KV Cache Steering for Inducing Reasoning in Small Language Models

Max Belitsky,Dawid J. Kopiczko,Michael Dorkenwald,M. Jehanzeb Mirza,Cees G. M. Snoek,Yuki M. Asano

Main category: cs.CL

TL;DR: This paper introduces 'cache steering,' a lightweight and efficient technique to steer language models toward multi-step reasoning by applying one-shot interventions to the key-value cache, eliminating the need for fine-tuning or prompt changes.

Details

Motivation: To induce chain-of-thought reasoning in small language models through an efficient and lightweight method that avoids continuous interventions and complex adjustments. Method: The study proposes cache steering, which applies a one-shot intervention directly to the key-value cache to implicitly steer language models. It uses GPT-4o-generated reasoning traces to create steering vectors that encourage multi-step reasoning without requiring fine-tuning or prompt modifications. Result: Experimental evaluations show that cache steering improves both the qualitative structure of model reasoning and quantitative task performance across diverse reasoning benchmarks. Conclusion: Cache steering proves to be a robust and practical solution for controlled generation, offering advantages in hyperparameter stability, inference-time efficiency, and ease of integration compared to prior activation steering techniques. Abstract: We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of hyperparameter stability, inference-time efficiency, and ease of integration, making it a more robust and practical solution for controlled generation.
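
The one-shot nature of the intervention can be sketched with plain Python lists standing in for cached key or value vectors. Everything here is an assumption made for illustration: the abstract does not say how the steering vectors are built, so the difference of mean activations between prompts with and without reasoning traces is a swapped-in, commonly used construction, and the function names, `position`, and `strength` are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(with_reasoning, without_reasoning):
    """Assumed construction: mean activation on prompts that include a
    reasoning trace minus mean activation on prompts that do not."""
    positive = mean_vector(with_reasoning)
    negative = mean_vector(without_reasoning)
    return [p - n for p, n in zip(positive, negative)]


def steer_once(cache, vector, position=-1, strength=1.0):
    """Return a copy of the cache (one vector per token) with the steering
    vector added at a single position. It is applied once, before
    generation, rather than at every decoding step."""
    steered = [list(row) for row in cache]
    steered[position] = [
        value + strength * shift
        for value, shift in zip(steered[position], vector)
    ]
    return steered
```

In this sketch the cache is modified a single time and then reused unchanged during decoding, which is the contrast the abstract draws with activation steering methods that intervene continuously.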

cs.CV [Back]

[55] CuriosAI Submission to the EgoExo4D Proficiency Estimation Challenge 2025

Hayato Tanoue,Hiroki Nishihara,Yuma Suzuki,Takayuki Hori,Hiroki Takushima,Aiswariya Manojkumar,Yuki Shibata,Mitsuru Takeda,Fumika Beppu,Zhao Hengwei,Yuto Kanda,Daichi Yamaga

Main category: cs.CV

TL;DR: This paper explores two approaches for multi-view skill assessment, with the two-stage pipeline showing superior performance in estimating proficiency.

Details

Motivation: To address the EgoExo4D Proficiency Estimation Challenge at CVPR 2025 by developing effective methods for multi-view skill assessment. Method: Two methods were proposed: (1) a multi-task learning framework using Sapiens-2B to jointly predict proficiency and scenario labels, and (2) a two-stage pipeline combining zero-shot scenario recognition with view-specific VideoMAE classifiers. Result: The two-stage pipeline achieves better accuracy (47.8%) compared to the multi-task learning framework (43.6%). Conclusion: The two-stage pipeline approach outperforms the multi-task learning framework in multi-view skill assessment for proficiency estimation. Abstract: This report presents the CuriosAI team's submission to the EgoExo4D Proficiency Estimation Challenge at CVPR 2025. We propose two methods for multi-view skill assessment: (1) a multi-task learning framework using Sapiens-2B that jointly predicts proficiency and scenario labels (43.6% accuracy), and (2) a two-stage pipeline combining zero-shot scenario recognition with view-specific VideoMAE classifiers (47.8% accuracy). The superior performance of the two-stage approach demonstrates the effectiveness of scenario-conditioned modeling for proficiency estimation.

[56] Self-Consistency in Vision-Language Models for Precision Agriculture: Multi-Response Consensus for Crop Disease Management

Mihir Gupta,Abhay Mangla,Ross Greer,Pratik Desai

Main category: cs.CV

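TL;DR: This paper proposes a domain-aware framework for agricultural image processing that combines prompt-based expert evaluation with a self-consistency mechanism to improve the reliability of vision-language models in precision agriculture. Applied to maize leaf disease identification, the method improves diagnostic accuracy from 82.2% to 87.8%, along with symptom analysis and treatment recommendation.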

Details

Motivation: Existing vision-language models underperform in agricultural applications, so a more reliable approach is needed for precision agriculture. Method: Two innovations are proposed: (1) a prompt-based evaluation protocol that configures a language model as an expert plant pathologist, and (2) a cosine-consistency self-voting mechanism that generates multiple candidate responses and selects the most semantically coherent diagnosis. Result: On maize leaf disease identification, diagnostic accuracy rises from 82.2% to 87.8%, symptom analysis from 38.9% to 52.2%, and treatment recommendation from 27.8% to 43.3%. Conclusion: The framework shows significant potential for precision agriculture, particularly for supporting real-time decision-making in resource-constrained environments. Abstract: Precision agriculture relies heavily on accurate image analysis for crop disease identification and treatment recommendation, yet existing vision-language models (VLMs) often underperform in specialized agricultural domains. This work presents a domain-aware framework for agricultural image processing that combines prompt-based expert evaluation with self-consistency mechanisms to enhance VLM reliability in precision agriculture applications. We introduce two key innovations: (1) a prompt-based evaluation protocol that configures a language model as an expert plant pathologist for scalable assessment of image analysis outputs, and (2) a cosine-consistency self-voting mechanism that generates multiple candidate responses from agricultural images and selects the most semantically coherent diagnosis using domain-adapted embeddings. Applied to maize leaf disease identification from field images using a fine-tuned PaliGemma model, our approach improves diagnostic accuracy from 82.2% to 87.8%, symptom analysis from 38.9% to 52.2%, and treatment recommendation from 27.8% to 43.3% compared to standard greedy decoding. The system remains compact enough for deployment on mobile devices, supporting real-time agricultural decision-making in resource-constrained environments. These results demonstrate significant potential for AI-driven precision agriculture tools that can operate reliably in diverse field conditions.
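
A cosine-consistency vote over candidate responses might look like the sketch below, which picks the candidate whose embedding agrees most with the others. The scoring rule (mean pairwise cosine similarity) and the function names are assumptions; the abstract states only that the most semantically coherent diagnosis is selected using domain-adapted embeddings, which are represented here as plain lists of floats.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_most_consistent(embeddings):
    """Return the index of the candidate with the highest mean cosine
    similarity to all other candidates (assumed voting rule)."""
    if not embeddings:
        raise ValueError("at least one candidate is required")
    if len(embeddings) == 1:
        return 0
    best_index, best_score = 0, float("-inf")
    for i, current in enumerate(embeddings):
        similarities = [
            cosine(current, other)
            for j, other in enumerate(embeddings)
            if j != i
        ]
        score = sum(similarities) / len(similarities)
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

Unlike majority voting over exact strings, this rule works on free-text responses, since agreement is measured in embedding space rather than by identical wording.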

[57] Development of a Canada-Wide Morphology Map for the ITU-R P. 1411 Propagation Model

Jennifer P. T. Nguyen

Main category: cs.CV

TL;DR: This paper presents an automated machine learning method to classify Canadian regions into distinct environmental types, improving the accuracy of radio wave propagation predictions.

Details

Motivation: The motivation behind this research is to address the qualitative nature of environment-type descriptors in the ITU-R Recommendation, aiming to improve the accuracy of path loss estimations for outdoor short-range propagation across various frequencies. Method: The study employs a machine learning approach to automate the classification of regions into residential, urban low-rise, and urban high-rise environments, following the ITU-R P.1411-12 propagation model guidelines. Result: The result of the research is the creation of a Canada-wide morphology map with optimized classification accuracy, enhancing the precision of path loss estimations for outdoor short-range propagation at frequencies from 300 MHz to 100 GHz. Conclusion: The paper concludes that the developed machine learning approach successfully automates the classification process, leading to more accurate path loss estimations for outdoor short-range propagation across a wide range of frequencies in Canada. Abstract: This paper outlines the development of a Canada-wide morphology map classifying regions into residential, urban low-rise, and urban high-rise environments, following the ITU-R P.1411-12 propagation model guidelines. To address the qualitative nature of the environment-type descriptors found in the Recommendation, a machine learning approach is employed to automate the classification process. Extensive experimentation optimized classification accuracy, resulting in a Canada-wide morphology map that ensures more accurate path loss estimations for outdoor short-range propagation at frequencies ranging from 300 MHz to 100 GHz.

[58] Towards Evaluating Robustness of Prompt Adherence in Text to Image Models

Sujith Vemishetty,Advitiya Arora,Anupama Sharma

Main category: cs.CV

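TL;DR: This paper proposes a new method for evaluating text-to-image models and finds that these models still struggle to generate images that adhere to the prompt.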

Details

Motivation: Multimodal LLMs and text-to-image models have advanced rapidly, but their reliability and robustness remain under-studied, so a comprehensive evaluation framework is needed. Method: The authors build a new dataset, use gpt-4o to generate text descriptions, generate images from those descriptions with Stable Diffusion and Janus models, and then use gpt-4o again to compare the generated images against the original descriptions. Result: Current models perform poorly at generating images that conform to the input prompt, especially when controlling the factors of variation in simple binary images. Conclusion: Text-to-image models still have significant problems with prompt adherence, particularly when controlling factors of variation such as a simple geometric shape and its location. Abstract: The advancements in the domain of LLMs in recent years have surprised many, showcasing their remarkable capabilities and diverse applications. Their potential applications in various real-world scenarios have led to significant research on their reliability and effectiveness. On the other hand, multimodal LLMs and Text-to-Image models have only recently gained prominence, especially when compared to text-only LLMs. Their reliability remains constrained due to insufficient research on assessing their performance and robustness. This paper aims to establish a comprehensive evaluation framework for Text-to-Image models, concentrating particularly on their adherence to prompts. We created a novel dataset that aimed to assess the robustness of these models in generating images that conform to the specified factors of variation in the input text prompts. Our evaluation studies present findings on three variants of Stable Diffusion models: Stable Diffusion 3 Medium, Stable Diffusion 3.5 Large, and Stable Diffusion 3.5 Large Turbo, and two variants of Janus models: Janus Pro 1B and Janus Pro 7B. We introduce a pipeline that leverages text descriptions generated by the gpt-4o model for our ground-truth images, which are then used to generate artificial images by passing these descriptions to the Text-to-Image models. We then pass these generated images again through gpt-4o using the same system prompt and compare the variation between the two descriptions. Our results reveal that these models struggle to create simple binary images with only two factors of variation: a simple geometric shape and its location. 
We also show, using pre-trained VAEs on our dataset, that they fail to generate images that follow our input dataset distribution.

[59] ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints

Debasmit Das,Hyoungwoo Park,Munawar Hayat,Seokeon Choi,Sungrack Yun,Fatih Porikli

Main category: cs.CV

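TL;DR: This paper proposes CNTLoRA, a training-free method for LoRA weight initialization that improves the final performance and convergence speed of fine-tuned models.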

Details

Motivation: Existing LoRA weight initialization is typically random with a fixed rank, which fails to exploit information linking the pre-training and fine-tuning stages and hurts convergence and performance. Method: LoRA initialization is expressed as a domain shift problem, and multiple constraints are used to derive a closed-form weight estimate that is computed from pre-training weights and fine-tuning activation vectors. Result: On downstream tasks such as image generation, image classification, and image understanding, CNTLoRA outperforms standard and data-driven initialization methods both quantitatively and qualitatively, and extensive analyses support the framework's design choices. Conclusion: CNTLoRA provides a training-free initialization method that improves the convergence speed and performance of LoRA fine-tuning through a closed-form weight estimate and a flexible, variable-rank decomposition. Abstract: Foundation models are pre-trained on large-scale datasets and subsequently fine-tuned on small-scale datasets using parameter-efficient fine-tuning (PEFT) techniques like low-rank adapters (LoRA). In most previous works, LoRA weight matrices are randomly initialized with a fixed rank across all attachment points. In this paper, we improve convergence and final performance of LoRA fine-tuning, using our proposed data-driven weight initialization method, ConsNoTrainLoRA (CNTLoRA). We express LoRA initialization as a domain shift problem where we use multiple constraints relating the pre-training and fine-tuning activations. By reformulating these constraints, we obtain a closed-form estimate of LoRA weights that depends on pre-training weights and fine-tuning activation vectors and hence requires no training during initialization. This weight estimate is decomposed to initialize the up and down matrices with proposed flexibility of variable ranks. With the proposed initialization method, we fine-tune on downstream tasks such as image generation, image classification and image understanding. Both quantitative and qualitative results demonstrate that CNTLoRA outperforms standard and data-driven weight initialization methods. Extensive analyses and ablations further elucidate the design choices of our framework, providing an optimal recipe for faster convergence and enhanced performance.

[60] A Hybrid Multilayer Extreme Learning Machine for Image Classification with an Application to Quadcopters

Rolando A. Hernandez-Hernandez,Adrian Rubio-Solis

Main category: cs.CV

TL;DR: This paper introduces HML-ELM, a hybrid learning framework combining ELM autoencoders and Interval Type-2 fuzzy logic, which achieves better performance in image classification and UAV-based object transport than existing methods.

Details

Motivation: To enhance the efficiency of active image classification and improve the performance of UAVs in autonomous tasks, leveraging the strengths of ELM-AE and Interval Type-2 fuzzy logic theory. Method: The paper proposes a Hybrid Multilayer Extreme Learning Machine (HML-ELM) combining ELM-based autoencoders for unsupervised feature extraction and Simplified Interval Type-2 Fuzzy ELM for supervised classification. It also introduces an improved fast output reduction algorithm based on SC and COSTRWSR. Result: Experiments show that HML-ELM outperforms other methods such as ML-ELM, ML-FELM, and ELM in both benchmark image classification tasks and real-world UAV applications. Conclusion: The proposed HML-ELM method demonstrates superior efficiency in image classification and UAV-based object transport compared to existing techniques like ML-ELM, ML-FELM, and ELM. Abstract: Multilayer Extreme Learning Machine (ML-ELM) and its variants have proven to be an effective technique for the classification of different natural signals such as audio, video, acoustic and images. In this paper, a Hybrid Multilayer Extreme Learning Machine (HML-ELM) that is based on ELM-based autoencoder (ELM-AE) and an Interval Type-2 fuzzy Logic theory is suggested for active image classification and applied to Unmanned Aerial Vehicles (UAVs). The proposed methodology is a hierarchical ELM learning framework that consists of two main phases: 1) self-taught feature extraction and 2) supervised feature classification. First, unsupervised multilayer feature encoding is achieved by stacking a number of ELM-AEs, in which input data is projected into a number of high-level representations. At the second phase, the final features are classified using a novel Simplified Interval Type-2 Fuzzy ELM (SIT2-FELM) with a fast output reduction layer based on the SC algorithm; an improved version of the algorithm Center of Sets Type Reducer without Sorting Requirement (COSTRWSR). 
To validate the efficiency of the HML-ELM, two types of experiments for the classification of images are suggested. First, the HML-ELM is applied to solve a number of benchmark problems for image classification. Secondly, a number of real experiments on the active classification and transport of four different objects between two predefined locations using a UAV are implemented. Experiments demonstrate that the proposed HML-ELM delivers superior efficiency compared to other similar methodologies such as ML-ELM, Multilayer Fuzzy Extreme Learning Machine (ML-FELM) and ELM.

[61] Lightweight Cloud Masking Models for On-Board Inference in Hyperspectral Imaging

Mazen Ali,António Pereira,Fabio Gentile,Aser Cortines,Sam Mugel,Román Orús,Stelios P. Neophytides,Michalis Mavrovouniotis

Main category: cs.CV

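TL;DR: This study finds that lightweight CNN models offer the best balance of accuracy and computational efficiency for cloud and cloud shadow masking in hyperspectral images.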

Details

Motivation: Cloud and cloud shadow masking is a key preprocessing step in hyperspectral satellite imaging that enables the extraction of high-quality, analysis-ready data. Method: The study evaluates several machine learning approaches, including XGBoost, LightGBM, and convolutional neural networks (CNNs), with particular attention to CNN models with feature reduction. Result: All boosting and CNN models exceeded 93% accuracy, and the CNN with feature reduction performed best in terms of accuracy, storage requirements, and inference time. Conclusion: Lightweight AI models show potential for real-time hyperspectral image processing, supporting the development of on-board satellite AI systems for space-based applications. Abstract: Cloud and cloud shadow masking is a crucial preprocessing step in hyperspectral satellite imaging, enabling the extraction of high-quality, analysis-ready data. This study evaluates various machine learning approaches, including gradient boosting methods such as XGBoost and LightGBM as well as convolutional neural networks (CNNs). All boosting and CNN models achieved accuracies exceeding 93%. Among the investigated models, the CNN with feature reduction emerged as the most efficient, offering a balance of high accuracy, low storage requirements, and rapid inference times on both CPUs and GPUs. Variations of this version, with only up to 597 trainable parameters, demonstrated the best trade-off in terms of deployment feasibility, accuracy, and computational efficiency. These results demonstrate the potential of lightweight artificial intelligence (AI) models for real-time hyperspectral image processing, supporting the development of on-board satellite AI systems for space-based applications.

[62] The relative importance of being Gaussian

F. Alberto Grünbaum,Tondgi Xu

Main category: cs.CV

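TL;DR: This study examines whether diffusion model algorithms designed for Gaussian noise remain effective with other noise types (such as uniform or Beta-distributed noise), finds that their performance may be limited, and suggests that future work focus on how well the algorithms adapt to different noise settings.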

Details

Motivation: Inspired by the remarkable denoising results achieved by diffusion models in computer vision, this note investigates whether these algorithms remain effective under noise conditions very different from the Gaussian setting they were designed for. Method: The paper tests current diffusion model algorithms designed for Gaussian noise with other noise types (uniform, Beta-distributed, and a random superposition of two Gaussians), directly evaluating their denoising performance without any modification. The experiments are run on a laptop with the smallest possible image size. Result: The paper indicates that performance may degrade when diffusion model algorithms designed for Gaussian noise are used with non-Gaussian noise. Because the experiments were carried out under limited conditions (a small laptop and the smallest image size), how these algorithms behave in broader settings remains an interesting challenge. Conclusion: Although diffusion model algorithms perform very well under Gaussian noise, their performance is challenged when they are applied to non-Gaussian noise such as uniform, Beta, or two-Gaussian mixture noise. The authors stress the importance of further verifying these observations under different noise conditions and point to directions for future research. Abstract: The remarkable results for denoising in computer vision using diffusion models given in \cite{SDWMG,HJA,HHG} yield a robust mathematical justification for algorithms based on crucial properties of a sequence of Gaussian independent $N(0,1)$ random variables. In particular the derivations use the fact that a Gaussian distribution is determined by its mean and variance and that the sum of two Gaussians is another Gaussian. The issue raised in this short note is the following: suppose we use the algorithm without any changes but replace the nature of the noise and use, for instance, uniformly distributed noise or noise with a Beta distribution, or noise which is a random superposition of two Gaussians with very different variances. One could, of course, try to modify the algorithm keeping in mind the nature of the noise, but this is not what we do. Instead we study the performance of the algorithm when used with noise that is very far in nature from the Gaussian case, where it is designed to work well. Usually these algorithms are implemented on very powerful computers. Our experiments are all carried out on a small laptop and for the smallest possible image size. Exploring how our observations are confirmed or changed when dealing with different situations remains an interesting challenge.
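The noise substitution described in this entry can be illustrated with a minimal sketch. It assumes the common DDPM-style forward step x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, which the abstract does not spell out, and it assumes the replacement noise is rescaled to mean 0 and variance 1. The function names and the Beta shape parameters are hypothetical; this is not the authors' code.

```python
import math
import random


def standardized_noise(kind, rng):
    """Draw one noise sample rescaled to mean 0 and variance 1.

    Standardizing the non-Gaussian noise is an assumption of this sketch.
    """
    if kind == "gaussian":
        return rng.gauss(0.0, 1.0)
    if kind == "uniform":
        # Uniform on [-sqrt(3), sqrt(3)] has mean 0 and variance 1.
        half_width = math.sqrt(3.0)
        return rng.uniform(-half_width, half_width)
    if kind == "beta":
        a, b = 2.0, 5.0  # hypothetical shape parameters
        mean = a / (a + b)
        var = a * b / ((a + b) ** 2 * (a + b + 1.0))
        return (rng.betavariate(a, b) - mean) / math.sqrt(var)
    raise ValueError(f"unknown noise kind: {kind}")


def forward_noise(x0, alpha_bar, kind, rng):
    """Noise a flat list of pixel values with the chosen noise family."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * v + noise_scale * standardized_noise(kind, rng)
            for v in x0]


rng = random.Random(0)
noisy = forward_noise([0.2, -0.4, 0.9], alpha_bar=0.5, kind="uniform", rng=rng)
```

Matching the first two moments keeps the noise schedule unchanged, so only the shape of the distribution differs from the Gaussian case.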

[63] An Object-Based Deep Learning Approach for Building Height Estimation from Single SAR Images

Babak Memar,Luigi Russo,Silvia Liberata Ullo,Paolo Gamba

Main category: cs.CV

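TL;DR: This paper proposes a deep learning method for automatically estimating building heights from very high resolution synthetic aperture radar images, and validates its robustness on a cross-city, cross-continental dataset.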

Details

Motivation: Accurate building height estimation is essential for a range of urban applications, particularly when working with very high resolution synthetic aperture radar imagery. Method: An object-based regression approach consisting of bounding box detection followed by height estimation, evaluated with a cross-validation strategy. Result: The model performs well on European test cities, with a mean absolute error of about one building story (2.20 m in Munich), significantly outperforming recent state-of-the-art methods. Variability increases when generalizing to cities on other continents, especially in Asia, but the results still indicate good performance. Conclusion: The study highlights the considerable potential of deep learning for cross-city and cross-continental transfer learning, and in particular its robustness for building height estimation from single very high resolution SAR data. Abstract: Accurate estimation of building heights using very high resolution (VHR) synthetic aperture radar (SAR) imagery is crucial for various urban applications. This paper introduces a Deep Learning (DL)-based methodology for automated building height estimation from single VHR COSMO-SkyMed images: an object-based regression approach based on bounding box detection followed by height estimation. This model was trained and evaluated on a unique multi-continental dataset comprising eight geographically diverse cities across Europe, North and South America, and Asia, employing a cross-validation strategy to explicitly assess out-of-distribution (OOD) generalization. The results demonstrate highly promising performance, particularly on European cities where the model achieves a Mean Absolute Error (MAE) of approximately one building story (2.20 m in Munich), significantly outperforming recent state-of-the-art methods in similar OOD scenarios. Despite the increased variability observed when generalizing to cities in other continents, particularly in Asia with its distinct urban typologies and prevalence of high-rise structures, this study underscores the significant potential of DL for robust cross-city and cross-continental transfer learning in building height estimation from single VHR SAR data.

[64] RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration

Chong Cheng,Yu Hu,Sicheng Yu,Beizhen Zhao,Zijian Wang,Hao Wang

Main category: cs.CV

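TL;DR: This paper proposes RegGS, a framework based on 3D Gaussian registration for reconstructing scenes from unposed sparse views. By using an entropy-regularized Sinkhorn algorithm to solve the optimal transport Mixture 2-Wasserstein distance, it aligns local 3D Gaussians effectively, achieving precise pose estimation and high-quality novel-view synthesis.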

Details

Motivation: Existing optimization-based 3D Gaussian Splatting methods perform poorly on sparse views because they lack prior knowledge, while feed-forward Gaussian methods are constrained by input formats and struggle to incorporate more views. RegGS is proposed to address these problems. Method: The paper proposes RegGS, a 3D Gaussian registration framework that uses an entropy-regularized Sinkhorn algorithm to efficiently solve the optimal transport Mixture 2-Wasserstein distance, and designs a joint 3D Gaussian registration module that combines the Mixture 2-Wasserstein distance, photometric consistency, and depth geometry to reconstruct scenes from sparse views. Result: Experiments on the RE10K and ACID datasets show that RegGS effectively registers local Gaussians, achieving high-fidelity scene reconstruction, precise pose estimation, and high-quality novel-view synthesis. Conclusion: RegGS is an effective 3D Gaussian registration framework that delivers high-quality scene reconstruction and precise pose estimation from unposed sparse views. Abstract: 3D Gaussian Splatting (3DGS) has demonstrated its potential in reconstructing scenes from unposed images. However, optimization-based 3DGS methods struggle with sparse views due to limited prior knowledge. Meanwhile, feed-forward Gaussian approaches are constrained by input formats, making it challenging to incorporate more input views. To address these challenges, we propose RegGS, a 3D Gaussian registration-based framework for reconstructing unposed sparse views. RegGS aligns local 3D Gaussians generated by a feed-forward network into a globally consistent 3D Gaussian representation. Technically, we implement an entropy-regularized Sinkhorn algorithm to efficiently solve the optimal transport Mixture 2-Wasserstein $(\text{MW}_2)$ distance, which serves as an alignment metric for Gaussian mixture models (GMMs) in $\mathrm{Sim}(3)$ space. Furthermore, we design a joint 3DGS registration module that integrates the $\text{MW}_2$ distance, photometric consistency, and depth geometry. This enables a coarse-to-fine registration process while accurately estimating camera poses and aligning the scene. Experiments on the RE10K and ACID datasets demonstrate that RegGS effectively registers local Gaussians with high fidelity, achieving precise pose estimation and high-quality novel-view synthesis. Project page: https://3dagentworld.github.io/reggs/.
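The MW2 computation named in this entry can be sketched in a few lines. The sketch below assumes diagonal covariances, for which the squared 2-Wasserstein distance between two Gaussians has a simple closed form, and it uses a plain (not log-domain) Sinkhorn iteration. The function names and the (weight, mean, std) tuple layout are hypothetical. It does not cover the Sim(3) alignment or the joint registration module, and it is not the RegGS implementation.

```python
import math


def w2_squared_diag(mean_a, std_a, mean_b, std_b):
    """Squared 2-Wasserstein distance between Gaussians with diagonal covariances."""
    mean_term = sum((p - q) ** 2 for p, q in zip(mean_a, mean_b))
    std_term = sum((s - t) ** 2 for s, t in zip(std_a, std_b))
    return mean_term + std_term


def sinkhorn_plan(cost, a, b, eps=0.05, iters=500):
    """Entropy-regularized transport plan between weight vectors a and b."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def mw2_squared(gmm_a, gmm_b, eps=0.05, iters=500):
    """Approximate MW2^2 between GMMs given as lists of (weight, mean, std)."""
    cost = [[w2_squared_diag(ma, sa, mb, sb) for (_, mb, sb) in gmm_b]
            for (_, ma, sa) in gmm_a]
    plan = sinkhorn_plan(cost, [w for w, _, _ in gmm_a],
                         [w for w, _, _ in gmm_b], eps, iters)
    return sum(plan[i][j] * cost[i][j]
               for i in range(len(gmm_a)) for j in range(len(gmm_b)))


gmm_a = [(0.5, (0.0, 0.0), (0.1, 0.1)), (0.5, (5.0, 0.0), (0.1, 0.1))]
gmm_b = [(0.5, (0.5, 0.0), (0.1, 0.1)), (0.5, (5.5, 0.0), (0.1, 0.1))]
distance_sq = mw2_squared(gmm_a, gmm_b)
```

With a small `eps` the plain iteration can converge slowly or underflow when costs are large, so a log-domain variant is the usual choice in practice.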

[65] Temporally Consistent Amodal Completion for 3D Human-Object Interaction Reconstruction

Hyungjun Doh,Dong In Lee,Seunggeun Chi,Pin-Hao Huang,Kwonjoon Lee,Sangpil Kim,Karthik Ramani

Main category: cs.CV

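TL;DR: A new framework for reconstructing dynamic human-object interactions is proposed that effectively addresses occlusion and temporal inconsistency in monocular video.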

Details

Motivation: Traditional 3D reconstruction methods usually assume that objects are static or that dynamic subjects are fully visible; performance degrades when these assumptions fail, especially under mutual occlusion. Method: A template-free strategy that incorporates temporal context and incrementally refines and stabilizes the reconstruction to infer the complete structure of partially occluded regions. Result: Validation with 3D Gaussian Splatting shows higher precision than existing techniques in handling occlusions and maintaining temporal stability. Conclusion: The framework performs well at reconstructing dynamic human-object interactions from monocular video, particularly in coping with occlusion and temporal inconsistency. Abstract: We introduce a novel framework for reconstructing dynamic human-object interactions from monocular video that overcomes challenges associated with occlusions and temporal inconsistencies. Traditional 3D reconstruction methods typically assume static objects or full visibility of dynamic subjects, leading to degraded performance when these assumptions are violated, particularly in scenarios where mutual occlusions occur. To address this, our framework leverages amodal completion to infer the complete structure of partially obscured regions. Unlike conventional approaches that operate on individual frames, our method integrates temporal context, enforcing coherence across video sequences to incrementally refine and stabilize reconstructions. This template-free strategy adapts to varying conditions without relying on predefined models, significantly enhancing the recovery of intricate details in dynamic scenes. We validate our approach using 3D Gaussian Splatting on challenging monocular videos, demonstrating superior precision in handling occlusions and maintaining temporal stability compared to existing techniques.

[66] Adaptive Diffusion Denoised Smoothing : Certified Robustness via Randomized Smoothing with Differentially Private Guided Denoising Diffusion

Frederick Shpilevskiy,Saiyue Lyu,Krishnamurthy Dj Dvijotham,Mathias Lécuyer,Pierre-André Noël

Main category: cs.CV

TL;DR: The paper proposes Adaptive Diffusion Denoised Smoothing, which interprets guided denoising as adaptive privacy mechanisms, providing improved accuracy and robustness certification against adversarial examples.

Details

Motivation: To develop a technique that can certify the predictions of a vision model against adversarial examples while adapting to different inputs. Method: The paper introduces Adaptive Diffusion Denoised Smoothing, reinterpreting guided denoising diffusion as adaptive Gaussian Differentially Private mechanisms to analyze robustness and provide provable certification. Result: The proposed method improves certified and standard accuracy on ImageNet for an ℓ₂ threat model by using a specific guiding strategy. Conclusion: Adaptive Diffusion Denoised Smoothing is a successful method for certifying vision model predictions against adversarial examples, enhancing both certified and standard accuracy. Abstract: We propose Adaptive Diffusion Denoised Smoothing, a method for certifying the predictions of a vision model against adversarial examples, while adapting to the input. Our key insight is to reinterpret a guided denoising diffusion model as a long sequence of adaptive Gaussian Differentially Private (GDP) mechanisms refining a pure noise sample into an image. We show that these adaptive mechanisms can be composed through a GDP privacy filter to analyze the end-to-end robustness of the guided denoising process, yielding a provable certification that extends the adaptive randomized smoothing analysis. We demonstrate that our design, under a specific guiding strategy, can improve both certified accuracy and standard accuracy on ImageNet for an $\ell_2$ threat model.
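As background for this entry, the sketch below computes the standard non-adaptive ℓ2 certified radius for Gaussian randomized smoothing, R = (σ/2)(Φ⁻¹(p_A) − Φ⁻¹(p_B)), where p_A is a lower bound on the top-class probability and p_B an upper bound on the runner-up probability. This is the baseline analysis only; it does not implement the paper's adaptive GDP privacy-filter composition. The function name and the choice to return 0.0 when no certificate holds are assumptions of this sketch.

```python
from statistics import NormalDist


def certified_l2_radius(p_a_lower, p_b_upper, sigma):
    """Certified l2 radius for Gaussian smoothing with noise level sigma.

    Returns 0.0 (no certificate) when the bounds do not separate the classes.
    """
    if not 0.0 < p_b_upper < p_a_lower < 1.0:
        return 0.0
    inv_cdf = NormalDist().inv_cdf  # standard normal quantile function
    return 0.5 * sigma * (inv_cdf(p_a_lower) - inv_cdf(p_b_upper))


radius = certified_l2_radius(p_a_lower=0.9, p_b_upper=0.1, sigma=0.5)
```

The radius grows with both the noise level and the gap between the two class probabilities, which is why a denoiser that raises p_A under noise can improve certified accuracy.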

[67] An Embedded Real-time Object Alert System for Visually Impaired: A Monocular Depth Estimation based Approach through Computer Vision

Jareen Anjom,Rashik Iram Chowdhury,Tarbia Hasan,Md. Ishan Arefin Hossain

Main category: cs.CV

TL;DR: A lightweight real-time object detection and depth estimation system was developed using transfer learning and quantization to help visually impaired individuals navigate urban environments safely.

Details

Motivation: Visually impaired people face significant challenges commuting in urban areas of Bangladesh due to frequent road accidents caused by obstacles. There is a critical need for an alert system that can detect nearby objects in real time to prevent collisions and improve safety. Method: The study employed transfer learning for training models on depth estimation and object detection, which were combined into a single system. Quantization techniques were applied to optimize the models for efficiency and lightweight deployment on embedded systems. Result: The proposed system achieved an mAP50 score of 0.801, demonstrating its effectiveness in real-time object detection and depth estimation while remaining optimized for deployment on embedded systems. Conclusion: The research successfully developed a lightweight real-time depth estimation and object detection model using transfer learning and quantization techniques to assist visually impaired individuals in navigating urban environments safely. Abstract: Visually impaired people face significant challenges in their day-to-day commutes in the urban cities of Bangladesh due to the vast number of obstructions on every path. With many injuries taking place through road accidents on a daily basis, it is paramount for a system to be developed that can alert the visually impaired of objects at close distance beforehand. To overcome this issue, a novel alert system is proposed in this research to assist the visually impaired in commuting through these busy streets without colliding with any objects. The proposed system can alert the individual to objects that are present at a close distance. It utilizes transfer learning to train models for depth estimation and object detection, and combines both models to introduce a novel system. The models are optimized through the utilization of quantization techniques to make them lightweight and efficient, allowing them to be easily deployed on embedded systems. 
The proposed solution achieved a lightweight real-time depth estimation and object detection model with an mAP50 of 0.801.

[68] HNOSeg-XS: Extremely Small Hartley Neural Operator for Efficient and Resolution-Robust 3D Image Segmentation

Ken C. L. Wong,Hongzhi Wang,Tanveer Syeda-Mahmood

Main category: cs.CV

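TL;DR: HNOSeg-XS, a new architecture for medical image segmentation, is proposed. It offers strong resolution robustness, fast inference, and high memory efficiency, and is validated on multiple datasets.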

Details

Motivation: To overcome the limitations of convolutional neural networks and transformers in medical image segmentation, including high computational cost, large memory footprint, and suboptimal results caused by input size reduction. Method: The HNOSeg-XS model is created by replacing the Fourier transform with the Hartley transform and reformulating the problem in the frequency domain. Result: On the BraTS'23, KiTS'23, and MVSeg'23 datasets, HNOSeg-XS showed superior resolution robustness with fewer than 34.7k parameters, and achieved the best inference time (< 0.24 s) and memory efficiency (< 1.8 GiB). Conclusion: HNOSeg-XS performs well on medical image segmentation tasks, with good resolution robustness, fast inference, and high memory efficiency. Abstract: In medical image segmentation, convolutional neural networks (CNNs) and transformers are dominant. For CNNs, given the local receptive fields of convolutional layers, long-range spatial correlations are captured through consecutive convolutions and pooling. However, as the computational cost and memory footprint can be prohibitively large, 3D models can only afford fewer layers than 2D models with reduced receptive fields and abstract levels. For transformers, although long-range correlations can be captured by multi-head attention, its quadratic complexity with respect to input size is computationally demanding. Therefore, either model may require input size reduction to allow more filters and layers for better segmentation. Nevertheless, given their discrete nature, models trained with patch-wise training or image downsampling may produce suboptimal results when applied on higher resolutions. To address this issue, here we propose the resolution-robust HNOSeg-XS architecture. We model image segmentation by learnable partial differential equations through the Fourier neural operator which has the zero-shot super-resolution property. By replacing the Fourier transform with the Hartley transform and reformulating the problem in the frequency domain, we created the HNOSeg-XS model, which is resolution robust, fast, memory efficient, and extremely parameter efficient. When tested on the BraTS'23, KiTS'23, and MVSeg'23 datasets with a Tesla V100 GPU, HNOSeg-XS showed its superior resolution robustness with fewer than 34.7k model parameters.
It also achieved the overall best inference time (< 0.24 s) and memory efficiency (< 1.8 GiB) compared to the tested CNN and transformer models.
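To make the Fourier-to-Hartley substitution in this entry concrete, here is a minimal 1-D discrete Hartley transform. It uses the kernel cas(θ) = cos(θ) + sin(θ), maps real input to real output, and is its own inverse up to a factor of 1/N. The direct O(N²) loop and the function names are choices of this sketch; it does not reproduce the neural operator or the HNOSeg-XS architecture.

```python
import math


def dht(x):
    """Discrete Hartley transform of a real sequence (direct O(N^2) form)."""
    n = len(x)
    out = []
    for k in range(n):
        total = 0.0
        for t in range(n):
            angle = 2.0 * math.pi * k * t / n
            total += x[t] * (math.cos(angle) + math.sin(angle))  # cas kernel
        out.append(total)
    return out


def idht(h):
    """Inverse transform: apply the same transform again and divide by N."""
    n = len(h)
    return [value / n for value in dht(h)]


signal = [1.0, 2.0, -0.5, 3.0, 0.0]
spectrum = dht(signal)      # real-valued, unlike a complex Fourier spectrum
recovered = idht(spectrum)  # matches `signal` up to rounding error
```

Because the spectrum stays real, frequency-domain weights can be real as well, which is one reason a Hartley-based operator can be lighter than a complex Fourier one.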

[69] SurfDist: Interpretable Three-Dimensional Instance Segmentation Using Curved Surface Patches

Jackson Borchardt,Saul Kato

Main category: cs.CV

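TL;DR: SurfDist is a convolutional neural network architecture for three-dimensional volumetric instance segmentation that predicts instances as closed surfaces composed of smooth parametric surface patches.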

Details

Motivation: To break the coupling between instance parameterization dimension and voxel resolution in the StarDist-3D model, and to avoid voxelization artifacts when upsampling. Method: SurfDist modifies the StarDist-3D architecture so that it predicts closed-surface instances that can be upsampled to arbitrarily high resolutions without voxelization artifacts. Result: For datasets with blob-shaped instances, which are common in biomedical imaging, SurfDist can outperform StarDist-3D with more compact instance parameterizations. Conclusion: Experiments show that, compared with StarDist-3D, SurfDist can effectively learn interpretable instance surface models and performs better on some datasets. Abstract: We present SurfDist, a convolutional neural network architecture for three-dimensional volumetric instance segmentation. SurfDist enables prediction of instances represented as closed surfaces composed of smooth parametric surface patches, specifically bicubic Bézier triangles. SurfDist is a modification of the popular model architecture StarDist-3D which breaks StarDist-3D's coupling of instance parameterization dimension and instance voxel resolution, and it produces predictions which may be upsampled to arbitrarily high resolutions without introduction of voxelization artifacts. For datasets with blob-shaped instances, common in biomedical imaging, SurfDist can outperform StarDist-3D with more compact instance parameterizations. We detail SurfDist's technical implementation and show one synthetic and one real-world dataset for which it outperforms StarDist-3D. These results demonstrate that interpretable instance surface models can be learned effectively alongside instance membership.

[70] Car Object Counting and Position Estimation via Extension of the CLIP-EBC Framework

Seoik Jung,Taekyung Song

Main category: cs.CV

TL;DR: This paper adapts the CLIP-EBC framework for car object counting and localization, achieving strong results and introducing a new clustering method for position estimation.

Details

Motivation: The motivation is to explore whether the CLIP-EBC framework, initially designed for crowd counting, can be effectively adapted for car object counting and potentially extended to localization tasks. Method: The study applies the CLIP-EBC framework to car object counting using the CARPK dataset and proposes a K-means weighted clustering method to estimate object positions from predicted density maps. Result: The model achieved second-best performance compared to existing methods, and the proposed K-means weighted clustering method demonstrated potential for estimating object positions based on density maps. Conclusion: The CLIP-EBC framework shows potential for car object counting and localization tasks, achieving second-best performance and introducing a K-means weighted clustering method for position estimation. Abstract: In this paper, we investigate the applicability of the CLIP-EBC framework, originally designed for crowd counting, to car object counting using the CARPK dataset. Experimental results show that our model achieves second-best performance compared to existing methods. In addition, we propose a K-means weighted clustering method to estimate object positions based on predicted density maps, indicating the framework's potential extension to localization tasks.
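One plausible reading of the density-map clustering step in this entry is sketched below: each pixel is a point weighted by its predicted density, the number of clusters is the rounded sum of the map, and cluster centers are density-weighted means. The deterministic seeding rule and the function names are assumptions of this sketch, not details taken from the paper.

```python
def estimated_count(density):
    """Object count implied by a density map: the rounded sum of all pixels."""
    return round(sum(sum(row) for row in density))


def weighted_kmeans(density, k, iters=20):
    """Estimate k object positions (row, col) from a 2D density map."""
    points = [(r, c, w) for r, row in enumerate(density)
              for c, w in enumerate(row) if w > 0]
    if k <= 0 or not points:
        return []

    def dist_sq(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    # Seeding (an assumption): start at the densest pixel, then repeatedly add
    # the pixel with the largest weight times squared distance to the seeds.
    centers = [max(points, key=lambda p: p[2])[:2]]
    while len(centers) < k:
        centers.append(max(
            points,
            key=lambda p: p[2] * min(dist_sq(p, c) for c in centers))[:2])

    for _ in range(iters):
        sums = [[0.0, 0.0, 0.0] for _ in centers]
        for r, c, w in points:
            j = min(range(k), key=lambda i: dist_sq((r, c), centers[i]))
            sums[j][0] += w * r
            sums[j][1] += w * c
            sums[j][2] += w
        # Move each center to the density-weighted mean of its pixels.
        centers = [(s[0] / s[2], s[1] / s[2]) if s[2] > 0 else centers[i]
                   for i, s in enumerate(sums)]
    return centers


density_map = [[0.0] * 6 for _ in range(6)]
density_map[1][1], density_map[1][2], density_map[0][1] = 0.6, 0.2, 0.2
density_map[4][4] = 1.0
positions = weighted_kmeans(density_map, estimated_count(density_map))
```

Weighting by density pulls each center toward the peak of its blob rather than toward the geometric middle of the nonzero pixels.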

[71] Transfer Learning and Mixup for Fine-Grained Few-Shot Fungi Classification

Jason Kahei Tam,Murilo Gustineli,Anthony Miyaguchi

Main category: cs.CV

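TL;DR: This study explores approaches to fungi species identification in the FungiCLEF 2025 competition, experimenting with vision transformers, data augmentation, and multi-modal learning strategies; the results show that vision-based models outperform generative AI models.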

Details

Motivation: Accurately identifying fungi species is a distinctive challenge in computer vision because of subtle inter-species differences and high intra-species variation. This study aims to address that challenge and explore methods suited to fine-grained visual categorization. Method: The approach includes multiple vision transformer models, data augmentation, weighted sampling, incorporation of textual information, and an exploration of generative AI models for zero-shot classification. Result: The final model outperformed the competition baselines and ranked 35/74 on the private test set in post-completion evaluation, which indicates that the approach is effective to a degree but leaves room for improvement. Conclusion: Vision-based models outperformed generative AI models in the FungiCLEF 2025 competition, and domain-specific pretraining and balanced sampling strategies proved effective. The ranking also suggests room for improvement in metadata selection and domain-adapted multi-modal learning. Abstract: Accurate identification of fungi species presents a unique challenge in computer vision due to fine-grained inter-species variation and high intra-species variation. This paper presents our approach for the FungiCLEF 2025 competition, which focuses on few-shot fine-grained visual categorization (FGVC) using the FungiTastic Few-Shot dataset. Our team (DS@GT) experimented with multiple vision transformer models, data augmentation, weighted sampling, and incorporating textual information. We also explored generative AI models for zero-shot classification using structured prompting but found them to significantly underperform relative to vision-based models. Our final model outperformed both competition baselines and highlighted the effectiveness of domain-specific pretraining and balanced sampling strategies. Our approach ranked 35/74 on the private test set in post-completion evaluation, which suggests additional work can be done on metadata selection and domain-adapted multi-modal learning. Our code is available at https://github.com/dsgt-arc/fungiclef-2025.

[72] Portable Biomechanics Laboratory: Clinically Accessible Movement Analysis from a Handheld Smartphone

J. D. Peiffer,Kunal Shah,Irina Djuraskovic,Shawana Anarwala,Kayan Abdou,Rujvee Patel,Prakash Jayabalan,Brenton Pennicooke,R. James Cotton

Main category: cs.CV

TL;DR: The study introduces a smartphone-based system called the Portable Biomechanics Laboratory (PBL), which accurately captures biomechanical data and shows promise as a reliable and accessible tool for monitoring mobility impairments in clinical settings.

Details

Motivation: Movement is an important indicator of neurological and musculoskeletal health, but there is currently a lack of accessible and validated methods to objectively measure movement in clinical settings. This gap limits the use of biomechanical measurements for early detection and sensitive outcome tracking. Method: The researchers developed a secure, cloud-enabled smartphone app and a novel algorithm to fit biomechanical models to collected data. They validated the PBL's biomechanical measures using a large, clinically representative dataset and tested its usability in neurosurgery and sports medicine clinics. Result: Joint angle errors were within 3 degrees across diverse participant groups. Gait metrics computed by PBL showed high reliability and sensitivity to clinical differences. For example, they correlated with mJOA scores and were more responsive to surgical intervention than patient-reported outcomes. Conclusion: The study concludes that the Portable Biomechanics Laboratory (PBL) offers a scalable, low-burden solution for capturing clinically meaningful biomechanical data using smartphone video, making it a promising tool for accessible monitoring of mobility impairments. Abstract: The way a person moves is a direct reflection of their neurological and musculoskeletal health, yet it remains one of the most underutilized vital signs in clinical practice. Although clinicians visually observe movement impairments, they lack accessible and validated methods to objectively measure movement in routine care. This gap prevents wider use of biomechanical measurements in practice, which could enable more sensitive outcome measures or earlier identification of impairment. We present our Portable Biomechanics Laboratory (PBL), which includes a secure, cloud-enabled smartphone app for data collection and a novel algorithm for fitting biomechanical models to this data. 
We extensively validated PBL's biomechanical measures using a large, clinically representative dataset. Next, we tested the usability and utility of our system in neurosurgery and sports medicine clinics. We found joint angle errors within 3 degrees across participants with neurological injury, lower-limb prosthesis users, pediatric inpatients, and controls. In addition to being easy to use, gait metrics computed from the PBL showed high reliability and were sensitive to clinical differences. For example, in individuals undergoing decompression surgery for cervical myelopathy, the mJOA score is a common patient-reported outcome measure; we found that PBL gait metrics correlated with mJOA scores and demonstrated greater responsiveness to surgical intervention than the patient-reported outcomes. These findings support the use of handheld smartphone video as a scalable, low-burden tool for capturing clinically meaningful biomechanical data, offering a promising path toward accessible monitoring of mobility impairments. We release the first clinically validated method for measuring whole-body kinematics from handheld smartphone video at https://intelligentsensingandrehabilitation.github.io/MonocularBiomechanics/ .

[73] Cross-Resolution SAR Target Detection Using Structural Hierarchy Adaptation and Reliable Adjacency Alignment

Jiang Qin,Bin Zou,Haolin Li,Lamei Zhang

Main category: cs.CV

TL;DR: This paper proposes CR-Net, a novel SAR target detection method that uses structure priors and evidential learning theory to improve domain adaptation across resolutions, achieving state-of-the-art performance.

Details

Motivation: Increasing SAR resolution leads to challenges in generalization ability of target detection models due to discrepancies in scattering characteristics. Domain adaptation is a potential solution, but resolution differences cause blind feature adaptation and unreliable semantic propagation. Method: The paper proposes CR-Net, which integrates Structure-induced Hierarchical Feature Adaptation (SHFA) and Reliable Structural Adjacency Alignment (RSAA). SHFA establishes structural correlations for feature adaptation, while RSAA improves semantic alignment using secure adjacency sets. Result: Based on experimental results from different-resolution datasets, CR-Net significantly enhances cross-resolution adaptation and achieves SOTA performance in cross-resolution SAR target detection. Conclusion: CR-Net achieves state-of-the-art (SOTA) performance in cross-resolution SAR target detection by preserving intra-domain structures and improving discriminability. Abstract: In recent years, continuous improvements in SAR resolution have significantly benefited applications such as urban monitoring and target detection. However, the improvement in resolution leads to increased discrepancies in scattering characteristics, posing challenges to the generalization ability of target detection models. While domain adaptation technology is a potential solution, the inevitable discrepancies caused by resolution differences often lead to blind feature adaptation and unreliable semantic propagation, ultimately degrading the domain adaptation performance. To address these challenges, this paper proposes a novel SAR target detection method (termed CR-Net), that incorporates structure priors and evidential learning theory into the detection model, enabling reliable domain adaptation for cross-resolution detection. To be specific, CR-Net integrates Structure-induced Hierarchical Feature Adaptation (SHFA) and Reliable Structural Adjacency Alignment (RSAA). 
The SHFA module is introduced to establish structural correlations between targets and achieve structure-aware feature adaptation, thereby enhancing the interpretability of the feature adaptation process. Afterwards, the RSAA module is proposed to enhance reliable semantic alignment by leveraging the secure adjacency set to transfer valuable discriminative knowledge from the source domain to the target domain. This further improves the discriminability of the detection model in the target domain. Based on experimental results from different-resolution datasets, the proposed CR-Net significantly enhances cross-resolution adaptation by preserving intra-domain structures and improving discriminability. It achieves state-of-the-art (SOTA) performance in cross-resolution SAR target detection.

[74] M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation

Kui Jiang,Shiyu Liu,Junjun Jiang,Xin Yang,Hongxun Yang,Xiaopeng Fan

Main category: cs.CV

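TL;DR: This paper proposes M2DAO-Talker, a new audio-driven talking-head generation framework that improves the quality and realism of generated video through multi-granular motion decoupling and an alternating optimization strategy.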

Details

Motivation: Existing 3D methods are limited in representing stable, fine-grained motion fields, which leads to rendering artifacts such as motion blur, temporal jitter, and local penetration. Method: The authors systematically analyze talking-head generation and reformulate it as a unified framework with three steps: video preprocessing, motion representation, and rendering reconstruction. They design a new 2D portrait preprocessing pipeline, adopt a multi-granular motion decoupling strategy, and devise a motion consistency constraint and an alternating optimization strategy. Result: Experiments show that M2DAO-Talker improves generation quality by 2.43 dB PSNR and user-evaluated video realness by 0.64, with an inference speed of 150 frames per second. Conclusion: M2DAO-Talker is an effective audio-driven talking-head generation method that addresses the limitations of current approaches and has strong application potential. Abstract: Audio-driven talking head generation holds significant potential for film production. While existing 3D methods have advanced motion modeling and content synthesis, they often produce rendering artifacts, such as motion blur, temporal jitter, and local penetration, due to limitations in representing stable, fine-grained motion fields. Through systematic analysis, we reformulate talking head generation into a unified framework comprising three steps: video preprocessing, motion representation, and rendering reconstruction. This framework underpins our proposed M2DAO-Talker, which addresses current limitations via multi-granular motion decoupling and alternating optimization. Specifically, we devise a novel 2D portrait preprocessing pipeline to extract frame-wise deformation control conditions (motion region segmentation masks, and camera parameters) to facilitate motion representation. To ameliorate motion modeling, we elaborate a multi-granular motion decoupling strategy, which independently models non-rigid (oral and facial) and rigid (head) motions for improved reconstruction accuracy. Meanwhile, a motion consistency constraint is developed to ensure head-torso kinematic consistency, thereby mitigating penetration artifacts caused by motion aliasing.
In addition, an alternating optimization strategy is designed to iteratively refine facial and oral motion parameters, enabling more realistic video generation. Experiments across multiple datasets show that M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness over TalkingGaussian, while running at 150 FPS inference speed. Our project homepage is https://m2dao-talker.github.io/M2DAO-Talk.github.io

[75] Cross-Domain Identity Representation for Skull to Face Matching with Benchmark DataSet

Ravi Shankar Prasad,Dinesh Singh

Main category: cs.CV

TL;DR: This paper presents a Siamese network-based approach for craniofacial reconstruction in forensic science, enabling identification of individuals from skull X-ray images by comparing them with known face images.

Details

Motivation: Craniofacial reconstruction is essential in forensic science for identifying victims, and recent advancements like deep learning offer promising tools for this purpose. Method: The authors employed convolutional Siamese networks to create a feature space for comparing skull X-ray images with optical face images. They used Euclidean distance minimization and maximization strategies during training. Result: The experimental results demonstrated satisfactory performance on identifying individuals from given skull X-ray images when compared to known face images. Conclusion: The study concludes that the proposed Siamese network framework can effectively identify individuals from skull X-ray images using cross-domain identity representation. Abstract: Craniofacial reconstruction in forensic science is crucial for the identification of the victims of crimes and disasters. The objective is to map a given skull to its corresponding face in a corpus of faces with known identities using recent advancements in computer vision, such as deep learning. In this paper, we present a framework for the identification of a person given the X-ray image of a skull using convolutional Siamese networks for cross-domain identity representation. Siamese networks are twin networks that share the same architecture and can be trained to discover a feature space where similar observations are grouped together and dissimilar observations are moved apart. To do this, the network is exposed to pairs of similar and dissimilar data. The Euclidean distance is then minimized between similar pairs and maximized between dissimilar ones. Since obtaining pairs of skull and face images is difficult, we prepared our own dataset of 40 volunteers whose front and side skull X-ray images and optical face images were collected. Experiments were conducted on the collected cross-domain dataset to train and validate the Siamese networks. 
The experiments yield satisfactory results on the identification of a person from the given skull.
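
The abstract states only that the Euclidean distance is minimized for similar pairs and maximized for dissimilar ones. A minimal sketch of one common way to express that objective is the margin-based contrastive loss below. The margin formulation, the function names, and the toy embeddings are assumptions for illustration; the paper's exact loss is not given in this digest.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_identity, margin=1.0):
    """Margin-based contrastive loss for one pair (assumed formulation).

    Similar pairs are penalized by their squared distance, which pulls them
    together. Dissimilar pairs are penalized only while they are closer than
    `margin`, which pushes them apart up to that margin.
    """
    d = euclidean_distance(emb_a, emb_b)
    if same_identity:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Hypothetical embeddings of a skull X-ray and a face photo.
skull = [0.0, 0.0]
face = [3.0, 4.0]
loss_if_same = contrastive_loss(skull, face, same_identity=True)
loss_if_different = contrastive_loss(skull, face, same_identity=False)
```

In a real Siamese setup both embeddings would come from the twin networks, and the loss would be averaged over a batch of pairs.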

[76] Interpretability-Aware Pruning for Efficient Medical Image Analysis

Nikita Malik,Pratinav Seth,Neeraj Kumar Singh,Chintan Chitroda,Vinay Kumar Sankarapu

Main category: cs.CV

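TL;DR: This paper introduces a new interpretability-guided pruning method that effectively reduces the complexity of deep learning models while preserving predictive performance and transparency, making it suitable for medical image analysis.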

Details

Motivation: Although deep learning has driven significant advances in medical image analysis, its adoption in clinical practice remains constrained by the large size and lack of transparency of modern models. Method: The authors introduce a pruning framework based on interpretability techniques (such as DL-Backtrace, Layer-wise Relevance Propagation, and Integrated Gradients) that achieves targeted compression by selectively retaining the most important parts of each layer. Result: Experiments on multiple medical image classification benchmarks show that the approach achieves high compression rates with minimal loss in accuracy. Conclusion: The paper proposes an interpretability-guided pruning framework that reduces model complexity while preserving predictive performance and transparency, paving the way for practical deployment of lightweight, interpretable models in healthcare settings. Abstract: Deep learning has driven significant advances in medical image analysis, yet its adoption in clinical practice remains constrained by the large size and lack of transparency in modern models. Advances in interpretability techniques such as DL-Backtrace, Layer-wise Relevance Propagation, and Integrated Gradients make it possible to assess the contribution of individual components within neural networks trained on medical imaging tasks. In this work, we introduce an interpretability-guided pruning framework that reduces model complexity while preserving both predictive performance and transparency. By selectively retaining only the most relevant parts of each layer, our method enables targeted compression that maintains clinically meaningful representations. Experiments across multiple medical image classification benchmarks demonstrate that this approach achieves high compression rates with minimal loss in accuracy, paving the way for lightweight, interpretable models suited for real-world deployment in healthcare settings.
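
To illustrate the idea of retaining only the most relevant parts of each layer, the sketch below ranks the units of one layer by a relevance score and keeps the top fraction. It assumes that per-unit relevance scores have already been computed by some attribution method and aggregated over a dataset. The `prune_layer` function, the `keep_ratio` parameter, and the sample scores are hypothetical and are not taken from the paper.

```python
import math


def prune_layer(relevance, keep_ratio):
    """Return the sorted indices of the units to keep in one layer.

    relevance:  list of non-negative relevance scores, one per unit
                (assumed to come from an attribution method).
    keep_ratio: fraction of units to retain, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(relevance) * keep_ratio))
    # Rank unit indices from most to least relevant.
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical relevance scores for a four-unit layer.
scores = [0.1, 0.9, 0.3, 0.7]
kept = prune_layer(scores, keep_ratio=0.5)
```

A per-layer ratio is only one possible criterion; a global relevance threshold across layers would be another reasonable choice.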

[77] CoCo-Bot: Energy-based Composable Concept Bottlenecks for Interpretable Generative Models

Sangwon Kim,In-su Jang,Pyongkun Kim,Kwang-Ju Kim

Main category: cs.CV

TL;DR: CoCo-Bot is a post-hoc, composable concept bottleneck generative model that improves concept-level controllability and interpretability without relying on auxiliary cues.

Details

Motivation: Previous generative Concept Bottleneck Models (CBMs) rely on auxiliary visual cues that undermine interpretability and compositionality. The authors aim to eliminate the need for these cues. Method: CoCo-Bot uses diffusion-based energy functions to guide generative modeling through explicit, human-understandable concepts. Result: Experiments using StyleGAN2 pre-trained on CelebA-HQ show that CoCo-Bot improves concept-level controllability and interpretability while maintaining competitive visual quality. Conclusion: CoCo-Bot, a post-hoc, composable concept bottleneck generative model, improves concept-level controllability and interpretability without the need for auxiliary cues. Abstract: Concept Bottleneck Models (CBMs) provide interpretable and controllable generative modeling by routing generation through explicit, human-understandable concepts. However, previous generative CBMs often rely on auxiliary visual cues at the bottleneck to compensate for information not captured by the concepts, which undermines interpretability and compositionality. We propose CoCo-Bot, a post-hoc, composable concept bottleneck generative model that eliminates the need for auxiliary cues by transmitting all information solely through explicit concepts. Guided by diffusion-based energy functions, CoCo-Bot supports robust post-hoc interventions, such as concept composition and negation, across arbitrary concepts. Experiments using StyleGAN2 pre-trained on CelebA-HQ show that CoCo-Bot improves concept-level controllability and interpretability, while maintaining competitive visual quality.

[78] Single-Domain Generalization for Multimodal Cross-Cancer Prognosis via Dirac Rebalancer and Distribution Entanglement

Jia-Xuan Jiang,Jiashuai Liu,Hongtao Wu,Yifeng Wu,Zhong Wang,Qi Bi,Yefeng Zheng

Main category: cs.CV

TL;DR: This paper proposes a novel approach to improve the cross-cancer generalization of multimodal prognosis models using two new modules, SDIR and CADE, achieving better performance in unseen cancer types.

Details

Motivation: The motivation was to address the lack of robust generalization in existing multimodal prognosis models when applied to different cancer types, which is critical for clinical practice. Method: The authors introduced two modules: Sparse Dirac Information Rebalancer (SDIR) for enhancing weaker modality signals and Cancer-aware Distribution Entanglement (CADE) for synthesizing target domain distribution by fusing morphological cues and gene expression data. Result: Experiments on a four-cancer-type benchmark showed superior generalization performance of the proposed method, establishing a foundation for practical cross-cancer multimodal prognosis. Conclusion: The authors concluded that their proposed method enhances the generalization of multimodal prognosis models across different cancer types, addressing challenges related to feature degradation and ineffective integration. Abstract: Deep learning has shown remarkable performance in integrating multimodal data for survival prediction. However, existing multimodal methods mainly focus on single cancer types and overlook the challenge of generalization across cancers. In this work, we are the first to reveal that multimodal prognosis models often generalize worse than unimodal ones in cross-cancer scenarios, despite the critical need for such robustness in clinical practice. To address this, we propose a new task: Cross-Cancer Single Domain Generalization for Multimodal Prognosis, which evaluates whether models trained on a single cancer type can generalize to unseen cancers. We identify two key challenges: degraded features from weaker modalities and ineffective multimodal integration. To tackle these, we introduce two plug-and-play modules: Sparse Dirac Information Rebalancer (SDIR) and Cancer-aware Distribution Entanglement (CADE). SDIR mitigates the dominance of strong features by applying Bernoulli-based sparsification and Dirac-inspired stabilization to enhance weaker modality signals. 
CADE, designed to synthesize the target domain distribution, fuses local morphological cues and global gene expression in latent space. Experiments on a four-cancer-type benchmark demonstrate superior generalization, laying the foundation for practical, robust cross-cancer multimodal prognosis. Code is available at https://github.com/HopkinsKwong/MCCSDG

[79] Towards Imperceptible JPEG Image Hiding: Multi-range Representations-driven Adversarial Stego Generation

Junxue Yang,Xin Liao,Weixuan Tang,Jianhua Yang,Zheng Qin

Main category: cs.CV

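TL;DR: This paper proposes MRAG, a novel deep hiding framework that combines the strengths of convolution and transformers and uses frequency decomposition and a new loss function to optimize hiding, significantly improving resistance to detection.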

Details

Motivation: Existing deep learning hiding methods are easily detected because of their large payloads, their reliance on feature extraction from only convolution or only transformer operators, and their pixel-level loss constraints, so their resistance to detection needs to improve. Method: The authors propose MRAG, a multi-range representations-driven adversarial stego generation framework, which takes coarse-grained and fine-grained frequency decompositions as inputs and uses a features angle-norm disentanglement loss to optimize the hiding process. Result: Experiments show that MRAG achieves state-of-the-art performance in color JPEG image hiding. Conclusion: By combining the characteristics of convolution and transformers and introducing an angle-norm disentanglement loss, the MRAG framework strengthens resistance to detection while maintaining information hiding effectiveness. Abstract: Deep hiding has been exploring the hiding capability of deep learning-based models, aiming to conceal image-level messages into cover images and reveal them from generated stego images. Existing schemes are easily detected by steganalyzers due to their large payloads and their limitation to feature extraction based solely on either pure convolution or pure transformer operators within a single range, as well as pixel-level loss constraints. To address the issue, in this paper, we introduce generation-based adversarial attacks into color JPEG image deep hiding and propose a multi-range representations-driven adversarial stego generation framework called MRAG from a steganalysis perspective. Specifically, we integrate the local-range neighbor reception characteristic of the convolution and the global-range dependency modeling of the transformer to construct MRAG. Meanwhile, we use the transformed images obtained through coarse-grained and fine-grained frequency decomposition as inputs, introducing multi-grained information. Furthermore, a features angle-norm disentanglement loss is designed to constrain the generated stegos closer to covers in the angle and norm space of the steganalyzer's classified features. Consequently, small yet effective adversarial perturbations can be injected into the process of generating stegos, ensuring that stegos maintain favorable secret restorability and imperceptibility. Extensive experiments demonstrate that MRAG can achieve state-of-the-art performance.

[80] MM-Gesture: Towards Precise Micro-Gesture Recognition through Multimodal Fusion

Jihao Gu,Fei Wang,Kun Li,Yanyan Wei,Zhiliang Wu,Dan Guo

Main category: cs.CV

TL;DR: This paper introduces MM-Gesture, a highly effective multimodal fusion framework for recognizing micro-gestures, which ranked 1st in the 3rd MiGA Challenge at IJCAI 2025 with a top-1 accuracy of 73.213%.

Details

Motivation: Micro-gestures (MGs) are subtle and short-duration movements that are difficult to recognize. The motivation behind this work is to improve recognition performance by leveraging multiple modalities and advanced deep learning techniques. Method: The method integrates complementary cues from joint, limb, RGB video, Taylor-series video, optical-flow video, and depth video modalities. It employs PoseConv3D and Video Swin Transformer architectures along with a novel modality-weighted ensemble strategy. Transfer learning pre-trained on the MA-52 dataset was used to enhance RGB modality performance. Result: Extensive experiments on the iMiGUE benchmark validated the effectiveness of the proposed approach, achieving superior performance compared to existing methods and securing 1st place in the challenge. Conclusion: MM-Gesture, a multimodal fusion framework developed by the HFUT-VUT team, achieved 1st place in the micro-gesture classification track of the 3rd MiGA Challenge at IJCAI 2025 with a top-1 accuracy of 73.213%, outperforming previous state-of-the-art methods. Abstract: In this paper, we present MM-Gesture, the solution developed by our team HFUT-VUT, which ranked 1st in the micro-gesture classification track of the 3rd MiGA Challenge at IJCAI 2025, achieving superior performance compared to previous state-of-the-art methods. MM-Gesture is a multimodal fusion framework designed specifically for recognizing subtle and short-duration micro-gestures (MGs), integrating complementary cues from joint, limb, RGB video, Taylor-series video, optical-flow video, and depth video modalities. Utilizing PoseConv3D and Video Swin Transformer architectures with a novel modality-weighted ensemble strategy, our method further enhances RGB modality performance through transfer learning pre-trained on the larger MA-52 dataset. 
Extensive experiments on the iMiGUE benchmark, including ablation studies across different modalities, validate the effectiveness of our proposed approach, achieving a top-1 accuracy of 73.213%.
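
The abstract mentions a modality-weighted ensemble but this digest does not describe how the weights are chosen or applied. As a rough sketch of the general idea, the code below fuses per-modality class probabilities with a normalized weighted average. The function names, the modality keys, and all numbers are illustrative assumptions, not the authors' configuration.

```python
def weighted_ensemble(probs_by_modality, weights):
    """Fuse class-probability vectors from several modalities.

    probs_by_modality: dict mapping modality name -> list of class probabilities.
    weights:           dict mapping modality name -> non-negative weight.
    Weights are normalized over the modalities present, so the fused vector
    still sums to 1 when every input vector does.
    """
    total = sum(weights[m] for m in probs_by_modality)
    n_classes = len(next(iter(probs_by_modality.values())))
    fused = [0.0] * n_classes
    for modality, probs in probs_by_modality.items():
        w = weights[modality] / total
        for c, p in enumerate(probs):
            fused[c] += w * p
    return fused


def predict(fused):
    """Index of the class with the highest fused probability."""
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical two-class outputs from two modalities.
outputs = {"rgb": [0.9, 0.1], "joint": [0.4, 0.6]}
equal = weighted_ensemble(outputs, {"rgb": 1.0, "joint": 1.0})
joint_heavy = weighted_ensemble(outputs, {"rgb": 1.0, "joint": 9.0})
```

With equal weights the RGB stream dominates this toy example, while weighting the joint stream more heavily flips the predicted class, which is the kind of effect a modality-weighting scheme is meant to control.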

[81] Cycle Context Verification for In-Context Medical Image Segmentation

Shishuai Hu,Zehui Liao,Liangli Zhen,Huazhu Fu,Yong Xia

Main category: cs.CV

TL;DR: This paper proposes Cycle Context Verification (CCV), a novel framework that improves in-context learning (ICL) for medical image segmentation by enabling self-verification of predictions and better alignment between query and in-context pairs.

Details

Motivation: In-context learning (ICL) shows promise for universal medical image segmentation but suffers from performance sensitivity to the alignment between query images and in-context pairs. The scarcity of annotated medical data and the limitations of fine-tuning further necessitate a more effective and efficient approach. Method: The CCV framework uses a cyclic pipeline where the model generates a segmentation mask for a query image, then swaps roles with an in-context pair to validate its prediction. A query-specific prompt is introduced to enhance alignment between the query and in-context pairs based on this self-verification process. Result: Cycle Context Verification demonstrated superior performance over existing methods across seven medical image segmentation datasets using two ICL foundation models, highlighting its effectiveness in enhancing segmentation accuracy and robustness. Conclusion: The proposed Cycle Context Verification (CCV) framework significantly enhances ICL-based medical image segmentation by improving contextual alignment and robustness, making it a promising solution for universal medical image segmentation. Abstract: In-context learning (ICL) is emerging as a promising technique for achieving universal medical image segmentation, where a variety of objects of interest across imaging modalities can be segmented using a single model. Nevertheless, its performance is highly sensitive to the alignment between the query image and in-context image-mask pairs. In a clinical scenario, the scarcity of annotated medical images makes it challenging to select optimal in-context pairs, and fine-tuning foundation ICL models on contextual data is infeasible due to computational costs and the risk of catastrophic forgetting. 
To address this challenge, we propose Cycle Context Verification (CCV), a novel framework that enhances ICL-based medical image segmentation by enabling self-verification of predictions and accordingly enhancing contextual alignment. Specifically, CCV employs a cyclic pipeline in which the model initially generates a segmentation mask for the query image. Subsequently, the roles of the query and an in-context pair are swapped, allowing the model to validate its prediction by predicting the mask of the original in-context image. The accuracy of this secondary prediction serves as an implicit measure of the initial query segmentation. A query-specific prompt is introduced to alter the query image and updated to improve the measure, thereby enhancing the alignment between the query and in-context pairs. We evaluated CCV on seven medical image segmentation datasets using two ICL foundation models, demonstrating its superiority over existing methods. Our results highlight CCV's ability to enhance ICL-based segmentation, making it a robust solution for universal medical image segmentation. The code will be available at https://github.com/ShishuaiHu/CCV.

[82] Understanding Driving Risks using Large Language Models: Toward Elderly Driver Assessment

Yuki Yoshihara,Linjing Jiang,Nihan Karatas,Hitoshi Kanamori,Asuka Harada,Takahiro Tanaka

Main category: cs.CV

TL;DR: This paper demonstrates that ChatGPT-4o can assist in driving risk assessments for elderly drivers when supported by effective prompt design, showing improved performance across key metrics.

Details

Motivation: This research explores whether multimodal large language models (LLMs) can perform human-like interpretations of traffic scenes from static dashcam images, focusing on tasks relevant to elderly driver assessments that require contextual reasoning. Method: The study evaluated the performance of ChatGPT-4o using zero-shot, few-shot, and multi-shot prompting strategies on three traffic-related judgment tasks: assessing traffic density, intersection visibility, and stop sign recognition. Human annotations were used as the reference standard, and precision, recall, and F1-score were used as evaluation metrics. Result: Prompt design significantly affected model performance. For intersection visibility, recall increased from 21.7% (zero-shot) to 57.0% (multi-shot). Traffic density agreement rose from 53.5% to 67.6%. Stop-sign detection showed high precision (up to 86.3%) but lower recall (~76.7%), indicating conservative responses. Both humans and the model struggled with ambiguous scenes, though the model’s explanations aligned with its predictions, improving interpretability. Conclusion: With well-designed prompts, LLMs like ChatGPT-4o show potential as supportive tools for scene-level driving risk assessments, particularly for elderly drivers. However, scalability and performance improvements require further study with larger datasets, diverse annotators, and advanced model architectures. Abstract: This study investigates the potential of a multimodal large language model (LLM), specifically ChatGPT-4o, to perform human-like interpretations of traffic scenes using static dashcam images. Herein, we focus on three judgment tasks relevant to elderly driver assessments: evaluating traffic density, assessing intersection visibility, and recognizing stop signs. These tasks require contextual reasoning rather than simple object detection. 
Using zero-shot, few-shot, and multi-shot prompting strategies, we evaluated the performance of the model with human annotations serving as the reference standard. Evaluation metrics included precision, recall, and F1-score. Results indicate that prompt design considerably affects performance, with recall for intersection visibility increasing from 21.7% (zero-shot) to 57.0% (multi-shot). For traffic density, agreement increased from 53.5% to 67.6%. In stop-sign detection, the model demonstrated high precision (up to 86.3%) but a lower recall (approximately 76.7%), indicating a conservative response tendency. Output stability analysis revealed that humans and the model faced difficulties interpreting structurally ambiguous scenes. However, the model's explanatory texts corresponded with its predictions, enhancing interpretability. These findings suggest that, with well-designed prompts, LLMs hold promise as supportive tools for scene-level driving risk assessments. Future studies should explore scalability using larger datasets, diverse annotators, and next-generation model architectures for elderly driver assessments.
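
For readers unfamiliar with the metrics named above, the sketch below computes precision, recall, and F1-score for a binary task from reference labels and model predictions. The function name and the sample labels are made up for illustration and are not the study's data.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical human annotations (reference) and model outputs.
reference = [1, 1, 1, 0, 0]
predicted = [1, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(reference, predicted)
```

A model that rarely answers "positive" tends to show high precision and low recall, which is the pattern the abstract describes as a conservative response tendency.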

[83] Unsupervised Methods for Video Quality Improvement: A Survey of Restoration and Enhancement Techniques

Alexandra Malyugina,Yini Li,Joanne Lin,Nantheera Anantrasirichai

Main category: cs.CV

TL;DR: This survey reviews unsupervised video restoration and enhancement techniques, categorizing methods and loss functions while emphasizing the importance of synthetic data for evaluation.

Details

Motivation: Video restoration and enhancement are essential pre-processing steps for improving visual quality and enhancing the performance of various computer vision applications. Understanding the strengths and limitations of current unsupervised techniques is vital for advancing this field. Method: The authors conducted a comprehensive review of existing literature on video restoration and enhancement, focusing on unsupervised approaches. They categorized these methods based on their fundamental strategies and loss functions, and discussed the use of synthetic datasets for evaluation. Result: The survey provides an organized overview of unsupervised video restoration and enhancement methods, categorizing them into domain translation, self-supervision signal design, and blind spot or noise-based approaches. It also outlines different loss functions and highlights the importance of synthetic datasets for objective evaluation. Conclusion: The paper concludes that unsupervised video restoration and enhancement techniques are crucial for improving visual quality and boosting the performance of computer vision tasks, with several challenges and opportunities identified for future research. Abstract: Video restoration and enhancement are critical not only for improving visual quality, but also as essential pre-processing steps to boost the performance of a wide range of downstream computer vision tasks. This survey presents a comprehensive review of video restoration and enhancement techniques with a particular focus on unsupervised approaches. We begin by outlining the most common video degradations and their underlying causes, followed by a review of early conventional and deep learning-based methods, highlighting their strengths and limitations. We then present an in-depth overview of unsupervised methods, categorised by their fundamental approaches, including domain translation, self-supervision signal design and blind spot or noise-based methods. 
We also provide a categorization of loss functions employed in unsupervised video restoration and enhancement, and discuss the role of paired synthetic datasets in enabling objective evaluation. Finally, we identify key challenges and outline promising directions for future research in this field.

[84] From Enhancement to Understanding: Build a Generalized Bridge for Low-light Vision via Semantically Consistent Unsupervised Fine-tuning

Sen Wang,Shao Zeng,Tianjun Gu,Zhizhong Zhang,Ruixin Zhang,Shouhong Ding,Jingyun Zhang,Jun Wang,Xin Tan,Yuan Xie,Lizhuang Ma

Main category: cs.CV

TL;DR: This paper introduces GEFU, a new paradigm for low-light vision that bridges enhancement and understanding, achieving superior performance through SCUF and leveraging diffusion models.

Details

Motivation: Traditional approaches treat low-level enhancement and high-level visual understanding separately, limiting generalization and scalability. The authors aim to create a unified approach that enhances performance across various low-light vision tasks. Method: The paper proposes GEFU, combining generalized enhancement with understanding. SCUF is introduced, using an illumination-aware image prompt and a cycle-attention adapter, along with caption and reflectance consistency. Result: Extensive experiments show that the proposed GEFU method surpasses existing techniques in both traditional image quality metrics and downstream tasks like classification, detection, and segmentation. Conclusion: The paper concludes that GEFU improves both generalization and scalability in low-light vision tasks, outperforming current state-of-the-art methods. Abstract: Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). 
Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation.

[85] Smelly, dense, and spreaded: The Object Detection for Olfactory References (ODOR) dataset

Mathias Zinnen,Prathmesh Madhu,Inger Leemans,Peter Bell,Azhar Hussian,Hang Tran,Ali Hürriyetoğlu,Andreas Maier,Vincent Christlein

Main category: cs.CV

TL;DR: The ODOR dataset enhances computer vision research in the humanities with 38,116 detailed annotations across 139 fine-grained categories, addressing limitations in existing datasets and enabling exploration of object recognition and smell perception.

Details

Motivation: Existing datasets are biased towards the image center and lack detailed object class annotations, limiting their applicability in real-world humanities tasks involving artistic abstraction and subtle class differences. This gap motivated the creation of the ODOR dataset. Method: The authors conducted a statistical analysis of the dataset to showcase its properties, provided baseline analyses for object detection models, and highlighted challenges through secondary studies. Result: The proposed ODOR dataset contains 38,116 object-level annotations across 4,712 images, spanning 139 fine-grained categories. It demonstrates challenging properties like dense overlapping objects, spatial distribution across the image canvas, and detailed category distinctions. Conclusion: The ODOR dataset contributes to the field of computer vision in the humanities by addressing gaps in existing datasets, offering a comprehensive set of object-level annotations with fine-grained categories, and challenging researchers to explore new areas such as the intersection of object recognition and smell perception. Abstract: Real-world applications of computer vision in the humanities require algorithms to be robust against artistic abstraction, peripheral objects, and subtle differences between fine-grained target classes. Existing datasets provide instance-level annotations on artworks but are generally biased towards the image centre and limited with regard to detailed object classes. The proposed ODOR dataset fills this gap, offering 38,116 object-level annotations across 4712 images, spanning an extensive set of 139 fine-grained categories. Conducting a statistical analysis, we showcase challenging dataset properties, such as a detailed set of categories, dense and overlapping objects, and spatial distribution over the whole image canvas. 
Furthermore, we provide an extensive baseline analysis for object detection models and highlight the challenging properties of the dataset through a set of secondary studies. Inspiring further research on artwork object detection and broader visual cultural heritage studies, the dataset challenges researchers to explore the intersection of object recognition and smell perception.

[86] Subject-Consistent and Pose-Diverse Text-to-Image Generation

Zhanxin Gao,Beier Zhu,Liang Yao,Jian Yang,Ying Tai

Main category: cs.CV

TL;DR: This paper proposes CoDi, a new text-to-image generation framework that achieves pose and layout diversity while maintaining subject consistency.

Details

Motivation: Existing training-free SCG methods often sacrifice layout and pose diversity to maintain consistency, hindering expressive visual storytelling. Method: The authors propose a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT uses optimal transport in the early denoising steps to transfer identity features in a pose-aware manner; IR selects the most salient identity features in the later steps to further refine subject details. Result: Extensive qualitative and quantitative experiments across multiple metrics show that CoDi performs well in both visual perception and measured performance. Conclusion: CoDi is an effective text-to-image generation framework that achieves pose and layout diversity while maintaining subject identity consistency. Abstract: Subject-consistent generation (SCG), which aims to maintain a consistent subject identity across diverse scenes, remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address the limitation, we propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi, that enables consistent subject generation with diverse pose and layout. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better visual perception and stronger performance across all metrics. The code is provided at https://github.com/NJU-PCALab/CoDi.

[87] PanMatch: Unleashing the Potential of Large Vision Models for Unified Matching Models

Yongjian Zhang,Longguang Wang,Kunhong Li,Ye Zhang,Yun Wang,Liang Lin,Yulan Guo

Main category: cs.CV

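TL;DR: PanMatch is a general-purpose foundation model for a variety of correspondence matching tasks, achieving strong cross-task and zero-shot performance without task-specific architectures.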

Details

Motivation: Traditional methods rely on task-specific architectures and domain-specific fine-tuning, whereas PanMatch aims to handle multiple correspondence matching tasks with a unified model and improve generalization. Method: The authors propose a 2D displacement estimation framework and design a feature transformation pipeline that uses all-purpose features from Large Vision Models for matching. They also assemble a cross-domain dataset of nearly 1.8 million samples for pretraining. Result: PanMatch outperforms UniMatch and Flow-Anything on cross-task evaluations, performs comparably to most state-of-the-art task-specific algorithms on task-oriented benchmarks, and shows excellent zero-shot performance in abnormal scenarios such as rainy days and satellite imagery. Conclusion: PanMatch is a versatile foundation model for robust correspondence matching. It achieves multi-task integration using the same model weights and shows unprecedented zero-shot performance in abnormal scenarios. Abstract: This work presents PanMatch, a versatile foundation model for robust correspondence matching. Unlike previous methods that rely on task-specific architectures and domain-specific fine-tuning to support tasks like stereo matching, optical flow or feature matching, our key insight is that any two-frame correspondence matching task can be addressed within a 2D displacement estimation framework using the same model weights. Such a formulation eliminates the need for designing specialized unified architectures or task-specific ensemble models. Instead, it achieves multi-task integration by endowing displacement estimation algorithms with unprecedented generalization capabilities. To this end, we highlight the importance of a robust feature extractor applicable across multiple domains and tasks, and propose a feature transformation pipeline that leverages all-purpose features from Large Vision Models to endow matching baselines with zero-shot cross-view matching capabilities. Furthermore, we assemble a cross-domain dataset with nearly 1.8 million samples from stereo matching, optical flow, and feature matching domains to pretrain PanMatch. We demonstrate the versatility of PanMatch across a wide range of domains and downstream tasks using the same model weights. Our model outperforms UniMatch and Flow-Anything on cross-task evaluations, and achieves comparable performance to most state-of-the-art task-specific algorithms on task-oriented benchmarks. 
Additionally, PanMatch presents unprecedented zero-shot performance in abnormal scenarios, such as rainy days and satellite imagery, where most existing robust algorithms fail to yield meaningful results.

[88] Deep Hashing with Semantic Hash Centers for Image Retrieval

Li Chen,Rui Liu,Yuxiang Zhou,Xudong Ma,Yong Chen,Dell Zhang

Main category: cs.CV

TL;DR: This paper proposes SHC, a deep hashing method using semantic hash centers that consider class relationships, achieving significant improvements in large-scale image retrieval tasks.

Details

Motivation: Existing deep hashing methods generate data-independent hash centers, ignoring semantic relationships between classes, which can degrade retrieval performance. Method: A three-stage framework is proposed: (1) classification network to identify class similarities, (2) optimization algorithm for generating semantic hash centers, and (3) deep hashing network trained on these centers for binary code generation. Result: On public datasets, SHC achieved improvements of +7.26%, +7.62%, and +11.71% in MAP@100, MAP@1000, and MAP@ALL metrics respectively over state-of-the-art methods. Conclusion: SHC, a new deep hashing framework, significantly enhances image retrieval performance by generating semantic hash centers that preserve semantic relationships among classes. Abstract: Deep hashing is an effective approach for large-scale image retrieval. Current methods are typically classified by their supervision types: point-wise, pair-wise, and list-wise. Recent point-wise techniques (e.g., CSQ, MDS) have improved retrieval performance by pre-assigning a hash center to each class, enhancing the discriminability of hash codes across various datasets. However, these methods rely on data-independent algorithms to generate hash centers, which neglect the semantic relationships between classes and may degrade retrieval performance. This paper introduces the concept of semantic hash centers, building on the idea of traditional hash centers. We hypothesize that hash centers of semantically related classes should have closer Hamming distances, while those of unrelated classes should be more distant. To this end, we propose a three-stage framework, SHC, to generate hash codes that preserve semantic structure. First, we develop a classification network to identify semantic similarities between classes using a data-dependent similarity calculation that adapts to varying data distributions. 
Second, we introduce an optimization algorithm to generate semantic hash centers, preserving semantic relatedness while enforcing a minimum distance between centers to avoid excessively similar hash codes. Finally, a deep hashing network is trained using these semantic centers to convert images into binary hash codes. Experimental results on large-scale retrieval tasks across several public datasets show that SHC significantly improves retrieval performance. Specifically, SHC achieves average improvements of +7.26%, +7.62%, and +11.71% in MAP@100, MAP@1000, and MAP@ALL metrics, respectively, over state-of-the-art methods.
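
The hypothesis above, that hash centers of related classes should be closer in Hamming distance while all centers stay at least some minimum distance apart, can be illustrated with a minimal sketch. The class names, the 8-bit codes, and the helper functions below are hypothetical and chosen only for illustration; they are not taken from the SHC paper or its code.

```python
from itertools import combinations


def hamming(a, b):
    """Number of bit positions where two equal-length binary codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(centers):
    """Smallest Hamming distance between any two hash centers."""
    return min(hamming(a, b) for a, b in combinations(centers.values(), 2))


# Hypothetical 8-bit centers: "cat" and "dog" are treated as semantically
# related, "truck" as unrelated to both.
centers = {
    "cat":   [1, 1, 1, 1, 0, 0, 0, 0],
    "dog":   [1, 1, 1, 0, 0, 0, 0, 1],
    "truck": [0, 0, 0, 0, 1, 1, 1, 1],
}

d_related = hamming(centers["cat"], centers["dog"])      # differs in 2 bits
d_unrelated = hamming(centers["cat"], centers["truck"])  # differs in all 8 bits
d_min = min_pairwise_distance(centers)                   # guards against near-duplicate centers
```

In this toy setup the related pair is closer than the unrelated pair, and `d_min` is the quantity a minimum-distance constraint would bound from below.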

Shijun Yang,Xiang Zhang,Wanqing Zhao,Hangzai Luo,Sheng Zhong,Jinye Peng,Jianping Fan

Main category: cs.CV

TL;DR: This paper introduces MuGCP, a new method for conditional prompt generation that uses Multi-modal Large Language Models and an Attention Mutual-Guidance module to improve the performance of Vision-Language Models on multi-modal tasks.

Details

Motivation: Prompt learning faces two significant challenges: (1) inadequate modeling of class embedding distributions for unseen instances, leading to suboptimal generalization on novel classes; (2) prevailing methodologies predominantly confine cross-modal alignment to the final output layer of vision and text encoders. Method: We introduce MuGCP (Multi-modal Mutual-Guidance Conditional Prompt Learning), a novel paradigm designed for conditional prompt generation. MuGCP leverages Multi-modal Large Language Models (MLLMs) as conditional prompt learners to adaptively generate Semantic Conditional Prompts (SCP). We also introduce the Attention Mutual-Guidance (AMG) module and a Multi-Prompt Fusion (MPF) mechanism. Result: MuGCP ensures effective alignment and interaction across the multi-modal space of Vision-Language Models (VLMs). It enhances the model's performance in multi-modal tasks and improves the modeling of class embeddings and instance-specific knowledge. Conclusion: MuGCP outperforms existing state-of-the-art methods on 14 different datasets. Abstract: Prompt learning facilitates the efficient adaptation of Vision-Language Models (VLMs) to various downstream tasks. However, it faces two significant challenges: (1) inadequate modeling of class embedding distributions for unseen instances, leading to suboptimal generalization on novel classes; (2) prevailing methodologies predominantly confine cross-modal alignment to the final output layer of vision and text encoders, which fundamentally limits their capacity to preserve topological consistency with pre-trained multi-modal embedding spaces. To this end, we introduce MuGCP (Multi-modal Mutual-Guidance Conditional Prompt Learning), a novel paradigm designed for conditional prompt generation. 
MuGCP leverages Multi-modal Large Language Models (MLLMs) as conditional prompt learners to adaptively generate Semantic Conditional Prompts (SCP) that incorporate rich, fine-grained high-level semantic knowledge for image instances. To ensure effective alignment and interaction across the multi-modal space of Vision-Language Models (VLMs), we introduce the Attention Mutual-Guidance (AMG) module, which facilitates interactions between visual and semantic information. Through mutual guidance, the AMG module generates Visual Conditional Prompts (VCP), enhancing the model's performance in multi-modal tasks. Additionally, we present a Multi-Prompt Fusion (MPF) mechanism that integrates SCP and VCP with contextual prompts, ensuring seamless coordination among the different prompts and enhancing the modeling of class embeddings and instance-specific knowledge. Our MuGCP outperforms existing state-of-the-art methods on 14 different datasets. The code will be made available after publication.

[90] InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes

Zesong Yang,Bangbang Yang,Wenqi Dong,Chenxuan Cao,Liyuan Cui,Yuewen Ma,Zhaopeng Cui,Hujun Bao

Main category: cs.CV

TL;DR: This paper proposes InstaScene for holistic 3D perception, achieving accurate decomposition and complete reconstruction of complex scenes by combining spatial contrastive learning and in-situ generation.

Details

Motivation: The motivation is to equip robotics with the human-like ability to identify and mentally complete occluded objects in cluttered environments, which current advanced reconstruction techniques fail to achieve because they model scenes as undifferentiated wholes. Method: InstaScene uses spatial contrastive learning by tracing rasterization of each instance across views to enhance semantic supervision. It also introduces in-situ generation to reconstruct complete instances by leveraging observations and geometric cues. Result: Experiments show that the method excels in scene decomposition and object completion in both complex real-world and synthetic scenes. Conclusion: The paper concludes that InstaScene achieves superior decomposition accuracy while generating geometrically faithful and visually intact objects, demonstrating its effectiveness in holistic 3D perception of complex scenes. Abstract: Humans can naturally identify and mentally complete occluded objects in cluttered environments. However, imparting similar cognitive ability to robotics remains challenging even with advanced reconstruction techniques, which model scenes as undifferentiated wholes and fail to recognize complete objects from partial observations. In this paper, we propose InstaScene, a new paradigm towards holistic 3D perception of complex scenes with a primary goal: decomposing arbitrary instances while ensuring complete reconstruction. To achieve precise decomposition, we develop a novel spatial contrastive learning by tracing rasterization of each instance across views, significantly enhancing semantic supervision in cluttered scenes. To overcome incompleteness from limited observations, we introduce in-situ generation that harnesses valuable observations and geometric cues, effectively guiding 3D generative models to reconstruct complete instances that seamlessly align with the real world. 
Experiments on scene decomposition and object completion across complex real-world and synthetic scenes demonstrate that our method achieves superior decomposition accuracy while producing geometrically faithful and visually intact objects.

[91] Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers

Wongi Jeong,Kyungryeol Lee,Hoigi Seo,Se Young Chun

Main category: cs.CV

TL;DR: RALU accelerates diffusion transformer inference by optimizing spatial computation through mixed-resolution sampling and noise-timestep rescheduling, achieving significant speed improvements without sacrificing quality.

Details

Motivation: Diffusion transformers face deployment challenges due to heavy computation; existing acceleration methods focus on the temporal dimension, but spatial acceleration remains underexplored. Method: Region-Adaptive Latent Upsampling (RALU) performs mixed-resolution sampling across three stages: low-resolution denoising latent diffusion, region-adaptive upsampling on artifact-prone regions, and full-resolution detail refinement, combined with noise-timestep rescheduling for stability. Result: RALU achieves up to 7.0× speed-up on FLUX and 3.0× on Stable Diffusion 3 with minimal degradation in image quality. Conclusion: RALU is a training-free framework that accelerates inference along the spatial dimension, significantly reducing computation while preserving image quality and complementing existing temporal acceleration methods. Abstract: Diffusion transformers have emerged as an alternative to U-net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension, for example by reusing cached features across diffusion timesteps. Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free framework that accelerates inference along the spatial dimension. RALU performs mixed-resolution sampling across three stages: 1) low-resolution denoising latent diffusion to efficiently capture global semantic structure, 2) region-adaptive upsampling on specific regions prone to artifacts at full-resolution, and 3) all latent upsampling at full-resolution for detail refinement. To stabilize generations across resolution transitions, we leverage noise-timestep rescheduling to adapt the noise level across varying resolutions. 
Our method significantly reduces computation while preserving image quality by achieving up to 7.0$\times$ speed-up on FLUX and 3.0$\times$ on Stable Diffusion 3 with minimal degradation. Furthermore, RALU is complementary to existing temporal accelerations such as caching methods, thus can be seamlessly integrated to further reduce inference latency without compromising generation quality.

[92] RePaintGS: Reference-Guided Gaussian Splatting for Realistic and View-Consistent 3D Scene Inpainting

Ji Hyun Seo,Byounhyun Yoo,Gerard Jounghyun Kim

Main category: cs.CV

TL;DR: This paper introduces a novel 3D scene inpainting technique that ensures realistic and consistent results by using a reference view to guide optimization.

Details

Motivation: Existing image inpainting methods produce inconsistent results across viewpoints, leading to unnatural appearances when removing objects in radiance field methods. Method: Estimates inpainting similarity across views to adjust their contribution, constructs geometry tailored to the reference view, and warps the reference inpainting to other views as pseudo-ground truth for optimization. Result: Comparative evaluations show improved geometric fidelity and perceptual consistency in inpainted 3D scenes, even for complex cases. Conclusion: The proposed 3D scene inpainting method effectively improves geometric fidelity and appearance consistency in complex scenes by leveraging a reference view. Abstract: Radiance field methods, such as Neural Radiance Field or 3D Gaussian Splatting, have emerged as seminal 3D representations for synthesizing realistic novel views. For practical applications, there is ongoing research on flexible scene editing techniques, among which object removal is a representative task. However, removing objects exposes occluded regions, often leading to unnatural appearances. Thus, studies have employed image inpainting techniques to replace such regions with plausible content - a task referred to as 3D scene inpainting. However, image inpainting methods produce one of many plausible completions for each view, leading to inconsistencies between viewpoints. A widely adopted approach leverages perceptual cues to blend inpainted views smoothly. However, it is prone to detail loss and can fail when there are perceptual inconsistencies across views. In this paper, we propose a novel 3D scene inpainting method that reliably produces realistic and perceptually consistent results even for complex scenes by leveraging a reference view. Given the inpainted reference view, we estimate the inpainting similarity of the other views to adjust their contribution in constructing an accurate geometry tailored to the reference. 
This geometry is then used to warp the reference inpainting to other views as pseudo-ground truth, guiding the optimization to match the reference appearance. Comparative evaluation studies have shown that our approach improves both the geometric fidelity and appearance consistency of inpainted scenes.

[93] Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

Anlin Zheng,Xin Wen,Xuanyang Zhang,Chuofan Ma,Tiancai Wang,Gang Yu,Xiangyu Zhang,Xiaojuan Qi

Main category: cs.CV

TL;DR: This paper introduces VFMTok, an image tokenizer built atop pre-trained vision foundation models, achieving significant improvements in image generation and reconstruction with enhanced efficiency.

Details

Motivation: The motivation is to explore a novel direction by leveraging pre-trained vision foundation models for building an image tokenizer, which is an underexplored area. Method: The method involves using a frozen vision foundation model as the encoder of the tokenizer, incorporating a region-adaptive quantization framework and a semantic reconstruction objective to improve effectiveness. Result: The result is improved image reconstruction and generation quality, enhanced token efficiency, faster model convergence, and high-fidelity class-conditional synthesis without needing classifier-free guidance (CFG). VFMTok achieves a gFID of 2.07 on ImageNet benchmarks. Conclusion: VFMTok, the proposed image tokenizer based on pre-trained vision foundation models, achieves substantial improvements in image reconstruction and generation quality while enhancing token efficiency. Abstract: Leveraging the powerful representations of pre-trained vision foundation models -- traditionally used for visual comprehension -- we explore a novel direction: building an image tokenizer directly atop such models, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. 
It further boosts autoregressive (AR) generation -- achieving a gFID of 2.07 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code will be released publicly to benefit the community.

[94] Review of Feed-forward 3D Reconstruction: From DUSt3R to VGGT

Wei Zhang,Yihang Wu,Songhua Li,Wenjie Ma,Xin Ma,Qiang Li,Qi Wang

Main category: cs.CV

TL;DR: This paper surveys feed-forward deep learning models like DUSt3R that revolutionize 3D reconstruction by efficiently inferring camera poses and dense geometry from images, offering a contrast to traditional methods and identifying future research directions.

Details

Motivation: Traditional methods for 3D reconstruction, such as SfM and MVS, face limitations in workflow complexity, computational cost, and robustness. Deep learning-based feed-forward models offer a paradigm shift, prompting the need to understand and explore this emerging domain. Method: The paper systematically reviews emerging feed-forward deep learning models for 3D reconstruction, analyzing their technical frameworks, comparing them with traditional and earlier learning-based methods, and discussing datasets, evaluation metrics, and future opportunities. Result: A comprehensive survey is provided on feed-forward models for 3D reconstruction, highlighting their use of Transformer-based correspondence modeling, joint pose and geometry regression, scalability strategies, and potential for advancing various applications. Conclusion: The paper concludes that the new feed-forward deep learning models, like DUSt3R, are transforming 3D reconstruction by enabling efficient and robust recovery of 3D structures from images, offering promising applications and posing new research challenges. Abstract: 3D reconstruction, which aims to recover the dense three-dimensional structure of a scene, is a cornerstone technology for numerous applications, including augmented/virtual reality, autonomous driving, and robotics. While traditional pipelines like Structure from Motion (SfM) and Multi-View Stereo (MVS) achieve high precision through iterative optimization, they are limited by complex workflows, high computational cost, and poor robustness in challenging scenarios like texture-less regions. Recently, deep learning has catalyzed a paradigm shift in 3D reconstruction. A new family of models, exemplified by DUSt3R, has pioneered a feed-forward approach. These models employ a unified deep network to jointly infer camera poses and dense geometry directly from an Unconstrained set of images in a single forward pass. 
This survey provides a systematic review of this emerging domain. We begin by dissecting the technical framework of these feed-forward models, including their Transformer-based correspondence modeling, joint pose and geometry regression mechanisms, and strategies for scaling from two-view to multi-view scenarios. To highlight the disruptive nature of this new paradigm, we contrast it with both traditional pipelines and earlier learning-based methods like MVSNet. Furthermore, we provide an overview of relevant datasets and evaluation metrics. Finally, we discuss the technology's broad application prospects and identify key future challenges and opportunities, such as model accuracy and scalability, and handling dynamic scenes.

[95] A document is worth a structured record: Principled inductive bias design for document recognition

Benjamin Meyer,Lukas Tuggener,Sascha Hänzi,Daniel Schmid,Erdal Ayfer,Benjamin F. Grewe,Ahmed Abdulkadir,Thilo Stadelmann

Main category: cs.CV

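TL;DR: This paper introduces a new approach that frames document recognition as a transcription task from a document to a record, emphasizing the use of documents' intrinsic structural properties to improve the design of machine learning models.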

Details

Motivation: Current state-of-the-art approaches treat document recognition as a pure computer vision problem, neglecting the underlying structural properties of specific document types, which makes them dependent on sub-optimal heuristic post-processing. Method: The paper proposes a method for designing structure-specific inductive biases, together with a corresponding base transformer architecture that is adapted to different structures. Result: By integrating an inductive bias for unrestricted graph structures, the authors train the first model that successfully transcribes engineering drawings end to end. Conclusion: The paper proposes an approach to document recognition grounded in intrinsic structure, offering a new perspective and guidance for designing more effective document recognition systems. Abstract: Many document types use intrinsic, convention-driven structures that serve to encode precise and structured information, such as the conventions governing engineering drawings. However, state-of-the-art approaches treat document recognition as a mere computer vision problem, neglecting these underlying document-type-specific structural properties, making them dependent on sub-optimal heuristic post-processing and rendering many less frequent or more complicated document types inaccessible to modern document recognition. We suggest a novel perspective that frames document recognition as a transcription task from a document to a record. This implies a natural grouping of documents based on the intrinsic structure inherent in their transcription, where related document types can be treated (and learned) similarly. We propose a method to design structure-specific inductive biases for the underlying machine-learned end-to-end document recognition systems, and a respective base transformer architecture that we successfully adapt to different structures. We demonstrate the effectiveness of the so-found inductive biases in extensive experiments with progressively complex record structures from monophonic sheet music, shape drawings, and simplified engineering drawings. By integrating an inductive bias for unrestricted graph structures, we train the first-ever successful end-to-end model to transcribe engineering drawings to their inherently interlinked information. Our approach is relevant to inform the design of document recognition systems for document types that are less well understood than standard OCR, OMR, etc., and serves as a guide to unify the design of future document foundation models.

[96] F3-Net: Foundation Model for Full Abnormality Segmentation of Medical Images with Flexible Input Modality Requirement

Seyedeh Sahar Taheri Otaghsara,Reza Rahmanzadeh

Main category: cs.CV

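TL;DR: F3-Net is a new foundation model that addresses key problems in medical image segmentation, with strong generalization and practical applicability.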

Details

Motivation: Existing medical image segmentation models depend on complete multimodal inputs, generalize poorly, and are narrowly task-specific, so a more practical solution is needed. Method: Through flexible synthetic modality training and a zero-image strategy, F3-Net maintains robust performance even when MRI sequences are missing, and it uses a unified architecture for multi-pathology segmentation. Result: On multiple datasets (such as BraTS 2021, BraTS 2024, and ISLES 2022), F3-Net shows strong resilience to domain shifts and clinical heterogeneity and achieves high Dice Similarity Coefficients (DSCs). Conclusion: F3-Net is a general-purpose medical image segmentation model that effectively handles a range of challenges in clinical medical image segmentation and has broad application prospects. Abstract: F3-Net is a foundation model designed to overcome persistent challenges in clinical medical image segmentation, including reliance on complete multimodal inputs, limited generalizability, and narrow task specificity. Through flexible synthetic modality training, F3-Net maintains robust performance even in the presence of missing MRI sequences, leveraging a zero-image strategy to substitute absent modalities without relying on explicit synthesis networks, thereby enhancing real-world applicability. Its unified architecture supports multi-pathology segmentation across glioma, metastasis, stroke, and white matter lesions without retraining, outperforming CNN-based and transformer-based models that typically require disease-specific fine-tuning. Evaluated on diverse datasets such as BraTS 2021, BraTS 2024, and ISLES 2022, F3-Net demonstrates strong resilience to domain shifts and clinical heterogeneity. On the whole pathology dataset, F3-Net achieves average Dice Similarity Coefficients (DSCs) of 0.94 for BraTS-GLI 2024, 0.82 for BraTS-MET 2024, 0.94 for BraTS 2021, and 0.79 for ISLES 2022. This positions it as a versatile, scalable solution bridging the gap between deep learning research and practical clinical deployment.
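
For readers unfamiliar with the metric reported above, the Dice Similarity Coefficient for two binary masks P and T is 2|P ∩ T| / (|P| + |T|). The following is a minimal sketch on flattened toy masks; the function name, the mask values, and the convention of returning 1.0 when both masks are empty are assumptions made here, not details from the F3-Net paper.

```python
def dice_coefficient(pred, target):
    """Dice score for two flat binary masks given as lists of 0/1 values."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is a
        # common convention, assumed here.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical 6-voxel prediction and ground-truth masks.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 0, 1]

# Overlap is 2 voxels; each mask has 3 foreground voxels: 2*2 / (3+3).
score = dice_coefficient(pred, target)
```

A score of 1.0 means the masks coincide exactly and 0.0 means they share no foreground voxel, so the reported averages of 0.79 to 0.94 sit toward the high-overlap end of that range.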

[97] Dual Dimensions Geometric Representation Learning Based Document Dewarping

Heng Li,Qingcai Chen,Xiangping Wu

Main category: cs.CV

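TL;DR: This paper proposes D2Dewarp, a document dewarping method based on dual-dimension perception that combines horizontal and vertical features and uses a newly generated large-scale annotated dataset, achieving better performance on public datasets.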

Details

Motivation: Although existing methods have improved through text line awareness, most focus only on a single horizontal dimension and lack effective modeling of a document's overall two-dimensional deformation. Method: A dual-dimension (horizontal and vertical) perception model is proposed, with an X-Y coordinate fusion module that strengthens feature interaction and constraint between the two directions; it is trained on automatically generated annotated data. Result: Quantitative and qualitative results on public Chinese and English benchmarks show that the method outperforms current state-of-the-art document dewarping methods. Conclusion: By exploiting horizontal and vertical features of documents, D2Dewarp achieves better dewarping results than existing state-of-the-art methods, and the paper also proposes an automatic fine-grained annotation method for generating a large-scale training dataset. Abstract: Document image dewarping remains a challenging task in the deep learning era. While existing methods have improved by leveraging text line awareness, they typically focus only on a single horizontal dimension. In this paper, we propose D2Dewarp, a fine-grained deformation perception model that focuses on the Dual Dimensions of document horizontal and vertical lines to improve document Dewarping. It can perceive distortion trends in different directions across document details. To combine the horizontal and vertical granularity features, an effective fusion module based on X and Y coordinates is designed to facilitate interaction and constraint between the two dimensions for feature complementarity. Due to the lack of annotated line features in current public dewarping datasets, we also propose an automatic fine-grained annotation method using public document texture images and an automatic rendering engine to build a new large-scale distortion training dataset. The code and dataset will be publicly released. On public Chinese and English benchmarks, both quantitative and qualitative results show that our method achieves better rectification results compared with the state-of-the-art methods. The dataset will be publicly available at https://github.com/xiaomore/DocDewarpHV

[98] Unified People Tracking with Graph Neural Networks

Martin Engilberge,Ivan Vrkic,Friedrich Wilke Grosche,Julien Pilet,Engin Turetken,Pascal Fua

Main category: cs.CV

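TL;DR: This paper proposes a new multi-people tracking model that combines a dynamic spatiotemporal graph with scene-specific information to improve occlusion handling, and releases a large-scale dataset to advance research in the field.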

Details

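Motivation: To address the reliance of traditional multi-people tracking methods on pre-computed tracklets and their degraded performance under occlusion. Method: A dynamic spatiotemporal graph is built that aggregates spatial, contextual, and temporal information, and scene-specific information is introduced to improve occlusion handling. Result: Experiments show the model achieves state-of-the-art performance on multiple public benchmarks and the newly proposed dataset, with flexibility across diverse conditions. Conclusion: The paper proposes a unified, fully differentiable model for multi-people tracking that associates detections into trajectories without relying on pre-computed tracklets. Together with a new large-scale dataset with 25 partially overlapping views, it achieves state-of-the-art performance on public benchmarks and the new dataset while improving occlusion handling. Abstract: This work presents a unified, fully differentiable model for multi-people tracking that learns to associate detections into trajectories without relying on pre-computed tracklets. The model builds a dynamic spatiotemporal graph that aggregates spatial, contextual, and temporal information, enabling seamless information propagation across entire sequences. To improve occlusion handling, the graph can also encode scene-specific information. We also introduce a new large-scale dataset with 25 partially overlapping views, detailed scene reconstructions, and extensive occlusions. Experiments show the model achieves state-of-the-art performance on public benchmarks and the new dataset, with flexibility across diverse conditions. Both the dataset and approach will be publicly released to advance research in multi-people tracking.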

[99] Occlusion-Guided Feature Purification Learning via Reinforced Knowledge Distillation for Occluded Person Re-Identification

Yufei Zheng,Wenjun Wang,Wenjun Gan,Jiawei Liu

Main category: cs.CV

TL;DR: This paper introduces OGFR, a novel approach for occluded person re-identification that leverages reinforced knowledge distillation and vision transformers to address challenges like occlusion variability and feature contamination.

Details

Motivation: Existing methods for occluded person re-identification struggle with unseen occlusion scenarios and feature contamination from holistic images, which OGFR aims to overcome. Method: OGFR uses a teacher-student distillation architecture with an Occlusion-Aware Vision Transformer and a Feature Erasing and Purification Module, incorporating reinforced knowledge distillation to purify and transfer holistic knowledge. Result: The experimental results demonstrate that OGFR enables the student branch to learn robust representations by effectively absorbing purified holistic knowledge, improving performance in occluded person re-identification. Conclusion: The proposed OGFR method effectively addresses the challenges of handling diverse occlusion scenarios and feature contamination in occluded person re-identification. Abstract: Occluded person re-identification aims to retrieve holistic images based on occluded ones. Existing methods often rely on aligning visible body parts, applying occlusion augmentation, or complementing missing semantics using holistic images. However, they face challenges in handling diverse occlusion scenarios not seen during training and the issue of feature contamination from holistic images. To address these limitations, we propose Occlusion-Guided Feature Purification Learning via Reinforced Knowledge Distillation (OGFR), which simultaneously mitigates these challenges. OGFR adopts a teacher-student distillation architecture that effectively incorporates diverse occlusion patterns into feature representation while transferring the purified discriminative holistic knowledge from the holistic to the occluded branch through reinforced knowledge distillation. Specifically, an Occlusion-Aware Vision Transformer is designed to leverage learnable occlusion pattern embeddings to explicitly model such diverse occlusion types, thereby guiding occlusion-aware robust feature representation. 
Moreover, we devise a Feature Erasing and Purification Module within the holistic branch, in which an agent is employed to identify low-quality patch tokens of holistic images that contain noisy negative information via deep reinforcement learning, and substitute these patch tokens with learnable embedding tokens to avoid feature contamination and further excavate identity-related discriminative clues. Afterward, with the assistance of knowledge distillation, the student branch effectively absorbs the purified holistic knowledge to precisely learn robust representation regardless of the interference of occlusions.

[100] RadiomicsRetrieval: A Customizable Framework for Medical Image Retrieval Using Radiomics Features

Inye Na,Nejung Rue,Jiwon Chung,Hyunjin Park

Main category: cs.CV

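TL;DR: RadiomicsRetrieval is a novel 3D framework for medical image retrieval that combines handcrafted radiomics descriptors with deep learning embeddings to improve retrieval flexibility and accuracy.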

Details

Motivation: Current medical image retrieval methods primarily support 2D images and require fully annotated queries, limiting clinical flexibility. A more flexible approach is therefore needed to make full use of the spatial context in medical images. Method: A promptable segmentation model (e.g., SAM) is used to generate tumor-specific image embeddings, which are aligned via contrastive learning with radiomics features extracted from the same tumor. These representations are further enriched by anatomical positional embedding (APE). Result: Experiments on public lung CT and brain MRI datasets show that radiomics features significantly improve retrieval specificity, while APE provides global anatomical context for location-based searches. In addition, the framework requires only minimal user prompts (e.g., a single point), minimizing segmentation overhead. Conclusion: RadiomicsRetrieval is a 3D content-based retrieval framework that combines handcrafted radiomics descriptors with deep learning-based embeddings, with potential benefits for diagnosis, treatment planning, and research on large-scale medical imaging repositories. Abstract: Medical image retrieval is a valuable field for supporting clinical decision-making, yet current methods primarily support 2D images and require fully annotated queries, limiting clinical flexibility. To address this, we propose RadiomicsRetrieval, a 3D content-based retrieval framework bridging handcrafted radiomics descriptors with deep learning-based embeddings at the tumor level. Unlike existing 2D approaches, RadiomicsRetrieval fully exploits volumetric data to leverage richer spatial context in medical images. We employ a promptable segmentation model (e.g., SAM) to derive tumor-specific image embeddings, which are aligned with radiomics features extracted from the same tumor via contrastive learning. These representations are further enriched by anatomical positional embedding (APE). As a result, RadiomicsRetrieval enables flexible querying based on shape, location, or partial feature sets. Extensive experiments on both lung CT and brain MRI public datasets demonstrate that radiomics features significantly enhance retrieval specificity, while APE provides global anatomical context essential for location-based searches. Notably, our framework requires only minimal user prompts (e.g., a single point), minimizing segmentation overhead and supporting diverse clinical scenarios. The capability to query using either image embeddings or selected radiomics attributes highlights its adaptability, potentially benefiting diagnosis, treatment planning, and research on large-scale medical imaging repositories. Our code is available at https://github.com/nainye/RadiomicsRetrieval.
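
Once image embeddings and radiomics features live in a shared space, retrieval reduces to ranking stored embeddings by similarity to a query. The sketch below assumes cosine similarity as the ranking score and uses hypothetical 3-dimensional embeddings and tumor identifiers; the abstract does not state the similarity measure, and none of these names come from the RadiomicsRetrieval code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, database, top_k=2):
    """Return the names of the top_k database entries most similar to the query."""
    scored = [(cosine(query, emb), name) for name, emb in database.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Hypothetical tumor-level embeddings in a shared image/radiomics space.
database = {
    "tumor_a": [0.9, 0.1, 0.0],
    "tumor_b": [0.1, 0.9, 0.2],
    "tumor_c": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

results = retrieve(query, database)
```

Because the query can be any vector in the shared space, the same ranking step would serve a query built from an image embedding or from selected radiomics attributes, which is the flexibility the abstract describes.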

[101] SAM2RL: Towards Reinforcement Learning Memory Control in Segment Anything Model 2

Alen Adamyan,Tomáš Čížek,Matej Straka,Klara Janouskova,Martin Schmid

Main category: cs.CV

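TL;DR: This paper proposes a reinforcement learning approach to optimizing memory updates in SAM 2 to improve video object tracking. In an overfitting setup, its relative improvement over SAM 2 is more than three times the gains of existing hand-crafted update rules.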

Details

Motivation: Recent methods augment SAM 2 with hand-crafted update rules to handle distractors, occlusions, and object motion, but these rules have limitations, so a more effective alternative is needed. Method: Memory control is framed as a sequential decision-making problem and optimized with reinforcement learning, with a separate agent trained on each video to learn a more precise memory update policy. Result: In an overfitting setup, the method achieves a relative improvement over SAM 2 that is more than three times the gains of existing heuristics, demonstrating the untapped potential of the memory bank and the advantage of reinforcement learning for memory control. Conclusion: Compared with hand-crafted update rules, optimizing memory updates with reinforcement learning is a more promising alternative, offering a new direction for visual object tracking. Abstract: Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks and has become the state-of-the-art for visual object tracking. The model stores information from previous frames in a memory bank, enabling temporal consistency across video sequences. Recent methods augment SAM 2 with hand-crafted update rules to better handle distractors, occlusions, and object motion. We propose a fundamentally different approach using reinforcement learning for optimizing memory updates in SAM 2 by framing memory control as a sequential decision-making problem. In an overfitting setup with a separate agent per video, our method achieves a relative improvement over SAM 2 that exceeds by more than three times the gains of existing heuristics. These results reveal the untapped potential of the memory bank and highlight reinforcement learning as a powerful alternative to hand-crafted update rules for memory control in visual object tracking.

[102] Image Translation with Kernel Prediction Networks for Semantic Segmentation

Cristina Mata,Michael S. Ryoo,Henrik Turbell

Main category: cs.CV

TL;DR: The paper proposes DA-KPN, an unpaired image translation method that ensures semantic consistency, improving the performance of semantic segmentation models trained on synthetic data.

Details

Motivation: Semantic segmentation typically requires many pixel-wise annotations, which are difficult to obtain in real-world data. Practitioners often use synthetic datasets, but this introduces a domain gap. Existing unpaired image translation methods using GANs do not ensure semantic consistency, which affects segmentation performance. Method: Domain Adversarial Kernel Prediction Network (DA-KPN) estimates pixel-wise input transformation parameters of a lightweight translation function and uses multi-scale discriminators to distinguish between translated and target samples, guaranteeing semantic matching between synthetic labels and translations. Result: DA-KPN guarantees semantic matching and generates more realistic training data compared to existing GAN-based methods, leading to improved performance on semantic segmentation tasks when real image labels are limited. Conclusion: DA-KPN outperforms previous GAN-based methods on syn2real benchmarks for semantic segmentation with limited access to real image labels and achieves comparable performance on face parsing. Abstract: Semantic segmentation relies on many dense pixel-wise annotations to achieve the best performance, but owing to the difficulty of obtaining accurate annotations for real world data, practitioners train on large-scale synthetic datasets. Unpaired image translation is one method used to address the ensuing domain gap by generating more realistic training data in low-data regimes. Current methods for unpaired image translation train generative adversarial networks (GANs) to perform the translation and enforce pixel-level semantic matching through cycle consistency. These methods do not guarantee that the semantic matching holds, posing a problem for semantic segmentation where performance is sensitive to noisy pixel labels. 
We propose a novel image translation method, Domain Adversarial Kernel Prediction Network (DA-KPN), that guarantees semantic matching between the synthetic label and translation. DA-KPN estimates pixel-wise input transformation parameters of a lightweight and simple translation function. To ensure the pixel-wise transformation is realistic, DA-KPN uses multi-scale discriminators to distinguish between translated and target samples. We show DA-KPN outperforms previous GAN-based methods on syn2real benchmarks for semantic segmentation with limited access to real image labels and achieves comparable performance on face parsing.

[103] Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion

Enyu Liu,En Yu,Sijia Chen,Wenbing Tao

Main category: cs.CV

TL;DR: DISC improves 3D semantic scene completion using class-level learning, achieving top performance on major benchmarks with significant gains in instance understanding.

Details

Motivation: Traditional methods focus on voxel-level features, limiting the use of crucial class-level information which is vital for detailed scene completion. Method: The method uses a dual-stream approach, replacing voxel queries with discriminative class queries, incorporating class-specific geometric and semantic priors, and designing specialized decoding modules for efficient class-level interaction. Result: DISC achieves SOTA performance on SemanticKITTI and SSCBench-KITTI-360 benchmarks with mIoU scores of 17.35 and 20.55, significantly outperforming existing methods in instance category performance. Conclusion: DISC is a novel dual-stream paradigm that enhances 3D Semantic Scene Completion by leveraging class-level information, achieving state-of-the-art results on major benchmarks. Abstract: 3D Semantic Scene Completion (SSC) has gained increasing attention due to its pivotal role in 3D perception. Recent advancements have primarily focused on refining voxel-level features to construct 3D scenes. However, treating voxels as the basic interaction units inherently limits the utilization of class-level information, which is proven critical for enhancing the granularity of completion results. To address this, we propose Disentangling Instance and Scene Contexts (DISC), a novel dual-stream paradigm that enhances learning for both instance and scene categories through separated optimization. Specifically, we replace voxel queries with discriminative class queries, which incorporate class-specific geometric and semantic priors. Additionally, we exploit the intrinsic properties of classes to design specialized decoding modules, facilitating targeted interactions and efficient class-level information flow. Experimental results demonstrate that DISC achieves state-of-the-art (SOTA) performance on both SemanticKITTI and SSCBench-KITTI-360 benchmarks, with mIoU scores of 17.35 and 20.55, respectively. 
Remarkably, DISC even outperforms multi-frame SOTA methods using only single-frame input and significantly improves instance category performance, surpassing both single-frame and multi-frame SOTA instance mIoU by 17.9% and 11.9%, respectively, on the SemanticKITTI hidden test. The code is available at https://github.com/Enyu-Liu/DISC.

Mingda Zhang,Kaiwen Pan

Main category: cs.CV

TL;DR: This paper proposes a multi-modal fusion framework for brain tumor segmentation using bidirectional interactive attention mechanisms, achieving improved accuracy and boundary delineation compared to existing methods.

Details

Motivation: This study aims to develop a novel multi-modal fusion framework for brain tumor segmentation that integrates spatial-language-vision information through bidirectional interactive attention mechanisms to improve segmentation accuracy and boundary delineation. Method: We propose two core components: Multi-modal Semantic Fusion Adapter (MSFA) integrating 3D MRI data with clinical text descriptions through hierarchical semantic decoupling, and Bidirectional Interactive Visual-semantic Attention (BIVA) enabling iterative information exchange between modalities. Result: The proposed method achieved average Dice coefficient of 0.8505 and 95% Hausdorff distance of 2.8256mm across enhancing tumor, tumor core, and whole tumor regions, outperforming state-of-the-art methods including SCAU-Net, CA-Net, and 3D U-Net. Ablation studies confirmed critical contributions of semantic and spatial modules to boundary precision. Conclusion: Multi-modal semantic fusion combined with bidirectional interactive attention significantly enhances brain tumor segmentation performance, establishing new paradigms for integrating clinical knowledge into medical image analysis. Abstract: This study aims to develop a novel multi-modal fusion framework for brain tumor segmentation that integrates spatial-language-vision information through bidirectional interactive attention mechanisms to improve segmentation accuracy and boundary delineation. Methods: We propose two core components: Multi-modal Semantic Fusion Adapter (MSFA) integrating 3D MRI data with clinical text descriptions through hierarchical semantic decoupling, and Bidirectional Interactive Visual-semantic Attention (BIVA) enabling iterative information exchange between modalities. The framework was evaluated on BraTS 2020 dataset comprising 369 multi-institutional MRI scans. 
Results: The proposed method achieved average Dice coefficient of 0.8505 and 95% Hausdorff distance of 2.8256mm across enhancing tumor, tumor core, and whole tumor regions, outperforming state-of-the-art methods including SCAU-Net, CA-Net, and 3D U-Net. Ablation studies confirmed critical contributions of semantic and spatial modules to boundary precision. Conclusion: Multi-modal semantic fusion combined with bidirectional interactive attention significantly enhances brain tumor segmentation performance, establishing new paradigms for integrating clinical knowledge into medical image analysis.
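
As a brief illustration of the headline metric, the Dice coefficient measures the overlap between a predicted mask P and a reference mask T as 2|P ∩ T| / (|P| + |T|). The sketch below is a minimal pure-Python version on flattened binary masks. The function name and the toy masks are illustrative assumptions and are not taken from the paper's code.

```python
def dice_coefficient(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Count voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention (assumption): two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (hypothetical): 3 predicted voxels, 3 reference voxels, 2 shared.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 1, 0]
print(round(dice_coefficient(pred, target), 4))
```

In a BraTS-style evaluation the score would presumably be computed separately for each region (enhancing tumor, tumor core, whole tumor) and then averaged, which matches how the abstract reports a single average value.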

[105] BayesTTA: Continual-Temporal Test-Time Adaptation for Vision-Language Models via Gaussian Discriminant Analysis

Shuang Cui,Jinglin Xu,Yi Li,Xiongxin Tang,Jiangmeng Li,Jiahuan Zhou,Fanjiang Xu,Fuchun Sun,Hui Xiong

Main category: cs.CV

TL;DR: BayesTTA addresses continual-temporal test-time adaptation by dynamically adapting visual representations and maintaining prediction consistency over time using a Bayesian framework.

Details

Motivation: Existing continual test-time adaptation methods are ineffective under gradual temporal distribution shifts (e.g., illumination or seasonal changes), as they assume sudden and severe shifts, leading to catastrophic forgetting, unreliable confidence estimates, and misaligned visual representations. Method: BayesTTA uses a Bayesian framework to estimate class-conditional Gaussian mixtures, adaptively selects covariance structures, and applies Gaussian discriminant analysis for calibrated inference. It supervises normalization layer adaptation through self-paced learning. Result: BayesTTA outperforms state-of-the-art methods on four temporally evolving datasets and generalizes well across ten standard TTA datasets, while maintaining computational efficiency. Conclusion: BayesTTA provides superior performance in continual-temporal test-time adaptation by maintaining temporally consistent predictions and dynamically aligning visual representations without storing raw data. Abstract: Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but degrade significantly under temporally evolving distribution shifts common in real-world scenarios (e.g., gradual illumination or seasonal changes). Existing continual test-time adaptation (CTTA) methods are typically built around sudden and severe distribution shifts and neglect temporal continuity, leading to three core defects: limited memory cache restricts long-range distribution modeling, causing catastrophic forgetting; entropy-based confidence becomes unreliable under temporal drift, worsening error accumulation; and static visual representations misalign with evolving inputs. We formalize this practical problem as Continual-Temporal Test-Time Adaptation (CT-TTA), where test distributions evolve gradually over time. 
To address it, we propose BayesTTA, a Bayesian adaptation framework that enforces temporally consistent predictions and dynamically aligns visual representations. Specifically, BayesTTA incrementally estimates class-conditional Gaussian mixture distributions without storing raw data, adaptively selects covariance structures through statistical hypothesis testing, and performs calibrated inference using Gaussian discriminant analysis (GDA). These calibrated predictions supervise self-paced adaptation of normalization layers, ensuring efficient and stable representation alignment. We establish a comprehensive CT-TTA benchmark across four temporally evolving datasets and further evaluate generalization on ten standard TTA datasets. Extensive experiments show that BayesTTA consistently outperforms state-of-the-art methods, achieving significant gains while maintaining efficiency. Code is available at https://github.com/cuishuang99/BayesTTA.
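
The idea of estimating class-conditional Gaussians incrementally, without storing raw data, and then classifying with Gaussian discriminant analysis can be sketched as follows. This is a deliberately simplified illustration and not the BayesTTA implementation: it assumes one Gaussian per class with a diagonal covariance, uses Welford's online update, takes the class labels as given (in a test-time setting they would come from the model's own predictions), and omits the hypothesis test for covariance selection. All names and the toy stream are hypothetical.

```python
import math


class RunningGaussian:
    """Per-class diagonal Gaussian updated one sample at a time (Welford)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations

    def update(self, x):
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (xi - self.mean[i])

    def variance(self, eps=1e-6):
        # eps keeps the variance positive when a class has few samples.
        return [m2 / max(self.n - 1, 1) + eps for m2 in self.m2]

    def log_likelihood(self, x):
        ll = 0.0
        for xi, mu, var in zip(x, self.mean, self.variance()):
            ll += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
        return ll


def gda_predict(classes, x):
    """Return the label maximising log prior + Gaussian log-likelihood."""
    total = sum(g.n for g in classes.values())
    scores = {
        label: math.log(g.n / total) + g.log_likelihood(x)
        for label, g in classes.items()
        if g.n > 0  # skip classes that have not been observed yet
    }
    return max(scores, key=scores.get)


# Toy feature stream (hypothetical labels and 2-D features).
classes = {"day": RunningGaussian(2), "night": RunningGaussian(2)}
stream = [
    ("day", [1.0, 0.9]),
    ("night", [-1.0, -1.1]),
    ("day", [1.2, 1.1]),
    ("night", [-0.8, -0.9]),
]
for label, feature in stream:
    classes[label].update(feature)  # only summary statistics are kept

print(gda_predict(classes, [0.9, 1.0]))
```

Because each class keeps only a count, a mean, and a sum of squared deviations, memory use stays constant as the stream grows, which is the property the abstract contrasts with a limited memory cache.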

[106] NeuralOS: Towards Simulating Operating Systems via Neural Generative Models

Luke Rivard,Sun Sun,Hongyu Guo,Wenhu Chen,Yuntian Deng

Main category: cs.CV

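TL;DR: NeuralOS is a neural framework that combines an RNN with a diffusion model to predict screen frames from user inputs, enabling realistic GUI simulation and interaction.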

Details

Motivation: The goal is to simulate the graphical user interface (GUI) of an operating system by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. Method: NeuralOS combines a recurrent neural network (RNN), which tracks the computer state, with a diffusion-based neural renderer that generates screen images. Result: Experiments show that NeuralOS renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions such as application launches. Precise modeling of fine-grained keyboard interactions remains challenging. Conclusion: NeuralOS is a neural framework for simulating operating-system GUIs and marks an important step toward fully adaptive, generative neural interfaces. Abstract: We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a large-scale dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Although modeling fine-grained keyboard interactions precisely remains challenging, NeuralOS offers a step toward creating fully adaptive, generative neural interfaces for future human-computer interaction systems.

[107] Normalized vs Diplomatic Annotation: A Case Study of Automatic Information Extraction from Handwritten Uruguayan Birth Certificates

Natalia Bottaioli,Solène Tarride,Jérémy Anger,Seginus Mowlavi,Marina Gardella,Antoine Tadros,Gabriele Facciolo,Rafael Grompone von Gioi,Christopher Kermorvant,Jean-Michel Morel,Javier Preciozzi

Main category: cs.CV

TL;DR: This study compares annotation strategies for improving the extraction of key-value information from handwritten Uruguayan birth certificates using the Document Attention Network (DAN), showing that the choice of annotation method depends on the type of information being extracted.

Details

Motivation: The research aims to extract key-value information from Uruguayan birth certificates that are handwritten in Spanish, focusing on minimizing training data and annotation effort. Method: The study investigates two annotation strategies for transcribing handwritten documents and evaluates the Document Attention Network (DAN) with minimal training data. Result: Experiments on two datasets showed varying effectiveness of annotation methods depending on the type of information: normalized annotation was better for standardized fields, while diplomatic annotation was superior for names and surnames. Conclusion: Normalized annotation is more effective for standardizable fields like dates and places of birth, whereas diplomatic annotation performs better for non-standardizable fields such as names and surnames. Abstract: This study evaluates the recently proposed Document Attention Network (DAN) for extracting key-value information from Uruguayan birth certificates, handwritten in Spanish. We investigate two annotation strategies for automatically transcribing handwritten documents, fine-tuning DAN with minimal training data and annotation effort. Experiments were conducted on two datasets containing the same images (201 scans of birth certificates written by more than 15 different writers) but with different annotation methods. Our findings indicate that normalized annotation is more effective for fields that can be standardized, such as dates and places of birth, whereas diplomatic annotation performs much better for fields containing names and surnames, which can not be standardized.

[108] OnlineBEV: Recurrent Temporal Fusion in Bird's Eye View Representations for Multi-Camera 3D Perception

Junho Koh,Youngwoo Lee,Jungho Kim,Dongyoung Lee,Jun Won Choi

Main category: cs.CV

TL;DR: This paper proposes OnlineBEV, a new method for multi-view camera-based 3D perception that improves performance by effectively combining temporal BEV features through a recurrent structure and motion-guided alignment.

Details

Motivation: The motivation is to enhance multi-view camera-based 3D perception by overcoming limitations in temporal aggregation of BEV features caused by dynamic changes due to object motion. Method: The paper introduces OnlineBEV, a novel temporal 3D perception method using a recurrent structure to combine BEV features over time. It employs the Motion-guided BEV Fusion Network (MBFNet) for temporal feature alignment and uses Temporal Consistency Learning Loss to enforce alignment explicitly. Result: Experiments on the nuScenes benchmark demonstrate that OnlineBEV significantly outperforms existing methods, achieving state-of-the-art results with 63.9% NDS on the test set. Conclusion: OnlineBEV achieves significant performance gains over the current best method, SOLOFusion, and records state-of-the-art performance in camera-only 3D object detection on the nuScenes benchmark with 63.9% NDS. Abstract: Multi-view camera-based 3D perception can be conducted using bird's eye view (BEV) features obtained through perspective view-to-BEV transformations. Several studies have shown that the performance of these 3D perception methods can be further enhanced by combining sequential BEV features obtained from multiple camera frames. However, even after compensating for the ego-motion of an autonomous agent, the performance gain from temporal aggregation is limited when combining a large number of image frames. This limitation arises due to dynamic changes in BEV features over time caused by object motion. In this paper, we introduce a novel temporal 3D perception method called OnlineBEV, which combines BEV features over time using a recurrent structure. This structure increases the effective number of combined features with minimal memory usage. However, it is critical to spatially align the features over time to maintain strong performance. OnlineBEV employs the Motion-guided BEV Fusion Network (MBFNet) to achieve temporal feature alignment. 
MBFNet extracts motion features from consecutive BEV frames and dynamically aligns historical BEV features with current ones using these motion features. To enforce temporal feature alignment explicitly, we use Temporal Consistency Learning Loss, which captures discrepancies between historical and target BEV features. Experiments conducted on the nuScenes benchmark demonstrate that OnlineBEV achieves significant performance gains over the current best method, SOLOFusion. OnlineBEV achieves 63.9% NDS on the nuScenes test set, recording state-of-the-art performance in the camera-only 3D object detection task.

[109] DatasetAgent: A Novel Multi-Agent System for Auto-Constructing Datasets from Real-World Images

Haoran Sun,Haoyu Bian,Shaoning Zeng,Yunbo Rao,Xu Xu,Lin Mei,Jianping Gou

Main category: cs.CV

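TL;DR: The paper proposes DatasetAgent, a multi-agent collaborative system that turns real-world images into high-quality image datasets suitable for a range of vision tasks.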

Details

Motivation: Manually collecting and annotating image datasets is inefficient. Large models offer an alternative through data generation, but real-world data are more valuable for constructing image datasets. Method: Four different agents equipped with multi-modal large language models (MLLMs), together with a tool package for image optimization, convert real-world images into datasets. Result: Two types of experiments were conducted, expanding existing datasets and creating new datasets from scratch, and multiple image datasets constructed by DatasetAgent were successfully used to train a variety of computer vision models. Conclusion: DatasetAgent is a new method that automatically constructs high-quality image datasets according to user requirements through a multi-agent collaborative system. Abstract: Common knowledge indicates that the process of constructing image datasets usually depends on the time-intensive and inefficient method of manual collection and annotation. Large models offer a solution via data generation. Nonetheless, real-world data are more valuable than AI-generated data, particularly in constructing image datasets. For this reason, we propose a novel method for auto-constructing datasets from real-world images with a multi-agent collaborative system named DatasetAgent. By coordinating four different agents equipped with Multi-modal Large Language Models (MLLMs), as well as a tool package for image optimization, DatasetAgent is able to construct high-quality image datasets according to user-specified requirements. In particular, two types of experiments are conducted on a variety of open-source datasets: expanding existing datasets and creating new ones from scratch. In both cases, multiple image datasets constructed by DatasetAgent are used to train various vision models for image classification, object detection, and image segmentation.

[110] Generalizable 7T T1-map Synthesis from 1.5T and 3T T1 MRI with an Efficient Transformer Model

Zach Eidex,Mojtaba Safari,Tonghe Wang,Vanessa Wildman,David S. Yu,Hui Mao,Erik Middlebrooks,Aparna Kesewala,Xiaofeng Yang

Main category: cs.CV

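TL;DR: The paper proposes an efficient transformer-based model (7T-Restormer) that synthesizes 7T-quality T1 maps from routine 1.5T or 3T MRI images and outperforms existing techniques.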

Details

Motivation: Ultra-high-field 7T MRI offers better resolution and contrast, but the scanners are costly and prone to susceptibility artifacts. 7T-Restormer is proposed to overcome these problems and make 7T MRI more accessible in clinical workflows. Method: The model was trained on 105 training cases (19,204 slices) and validated and tested on 19 validation cases (3,476 slices) and 17 test cases (3,145 slices). It was compared against the ResViT and ResShift models. Result: 7T-Restormer achieved a PSNR of 26.0 +/- 4.6 dB, SSIM of 0.861 +/- 0.072, and NMSE of 0.019 +/- 0.011 for 1.5T inputs, and a PSNR of 25.9 +/- 4.9 dB and SSIM of 0.866 +/- 0.077 for 3T inputs. NMSE was reduced by 64% and 41% relative to ResShift and ResViT, respectively. Conclusion: 7T-Restormer is a new method for predicting high-quality 7T MP2RAGE maps from 1.5T and 3T T1W scans, and it outperforms current state-of-the-art methods. Abstract: Purpose: Ultra-high-field 7T MRI offers improved resolution and contrast over standard clinical field strengths (1.5T, 3T). However, 7T scanners are costly, scarce, and introduce additional challenges such as susceptibility artifacts. We propose an efficient transformer-based model (7T-Restormer) to synthesize 7T-quality T1-maps from routine 1.5T or 3T T1-weighted (T1W) images. Methods: Our model was validated on 35 1.5T and 108 3T T1w MRI paired with corresponding 7T T1 maps of patients with confirmed MS. A total of 141 patient cases (32,128 slices) were randomly divided into 105 (25; 80) training cases (19,204 slices), 19 (5; 14) validation cases (3,476 slices), and 17 (5; 14) test cases (3,145 slices) where (X; Y) denotes the patients with 1.5T and 3T T1W scans, respectively. The synthetic 7T T1 maps were compared against the ResViT and ResShift models. Results: The 7T-Restormer model achieved a PSNR of 26.0 +/- 4.6 dB, SSIM of 0.861 +/- 0.072, and NMSE of 0.019 +/- 0.011 for 1.5T inputs, and a PSNR of 25.9 +/- 4.9 dB and SSIM of 0.866 +/- 0.077 for 3T inputs. Using 10.5M parameters, our model reduced NMSE by 64% relative to the 56.7M-parameter ResShift (0.019 vs 0.052, p < .001) and by 41% relative to the 70.4M-parameter ResViT (0.019 vs 0.032, p < .001) at 1.5T, with similar advantages at 3T (0.021 vs 0.060 and 0.033; p < .001). Training with a mixed 1.5T + 3T corpus was superior to single-field strategies. 
Restricting the model to 1.5T increased the 1.5T NMSE from 0.019 to 0.021 (p = 1.1E-3) while training solely on 3T resulted in lower performance on input 1.5T T1W MRI. Conclusion: We propose a novel method for predicting quantitative 7T MP2RAGE maps from 1.5T and 3T T1W scans with higher quality than existing state-of-the-art methods. Our approach makes the benefits of 7T MRI more accessible to standard clinical workflows.
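
For readers less familiar with the image-quality metrics quoted above, the following sketch shows common definitions of PSNR and NMSE on flattened intensity lists. The exact normalization used by the authors is not stated in the abstract, so the choices here (NMSE as squared error divided by the energy of the reference, and an explicit `data_range` argument for PSNR) are assumptions, as are the function names and toy values.

```python
import math


def mse(pred, ref):
    """Mean squared error between two equally sized intensity lists."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)


def psnr(pred, ref, data_range):
    """Peak signal-to-noise ratio in dB; data_range is the maximum possible value."""
    err = mse(pred, ref)
    return math.inf if err == 0 else 10.0 * math.log10(data_range ** 2 / err)


def nmse(pred, ref):
    """Squared error normalised by the energy of the reference signal."""
    num = sum((p - r) ** 2 for p, r in zip(pred, ref))
    den = sum(r ** 2 for r in ref)
    return num / den


# Toy intensities scaled to [0, 1] (hypothetical values).
ref = [0.0, 0.5, 1.0, 0.5]
pred = [0.1, 0.5, 0.9, 0.5]
print(round(psnr(pred, ref, data_range=1.0), 2), round(nmse(pred, ref), 4))
```

Higher PSNR and lower NMSE both indicate a closer match to the reference 7T map, which is why the abstract reports NMSE reductions as improvements.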

[111] ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way

Rajarshi Roy,Devleena Das,Ankesh Banerjee,Arjya Bhattacharjee,Kousik Dasgupta,Subarna Tripathi

Main category: cs.CV

TL;DR: ByDeWay enhances Multimodal Large Language Models using a depth-aware prompting strategy that improves spatial reasoning without training, resulting in more accurate and less hallucinated responses.

Details

Motivation: The motivation is to improve spatial reasoning and grounding in MLLMs without modifying their parameters, aiming for a lightweight and modular solution compatible with black-box systems. Method: ByDeWay uses Layered-Depth-Based Prompting (LDP) to segment scenes into layers and generate region-specific captions with a grounded vision-language model, appended to the image-question prompt. Result: Experiments show consistent improvements across multiple MLLMs on hallucination-sensitive and reasoning-intensive benchmarks, validating the effectiveness of the zero-training depth-aware prompting approach. Conclusion: ByDeWay effectively enhances the performance of MLLMs by utilizing a novel prompting strategy that incorporates spatial context through depth-aware captions, leading to more grounded responses. Abstract: We introduce ByDeWay, a training-free framework designed to enhance the performance of Multimodal Large Language Models (MLLMs). ByDeWay uses a novel prompting strategy called Layered-Depth-Based Prompting (LDP), which improves spatial reasoning and grounding without modifying any model parameters. It segments the scene into closest, mid-range, and farthest layers using monocular depth estimation, then generates region-specific captions with a grounded vision-language model. These structured, depth-aware captions are appended to the image-question prompt, enriching it with spatial context. This guides MLLMs to produce more grounded and less hallucinated responses. Our method is lightweight, modular, and compatible with black-box MLLMs. Experiments on hallucination-sensitive (POPE) and reasoning-intensive (GQA) benchmarks show consistent improvements across multiple MLLMs, validating the effectiveness of depth-aware prompting in a zero-training setting.
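
The layering step described in the abstract (closest, mid-range, farthest) followed by prompt enrichment can be sketched as below. This is an illustrative assumption about one way to do it, not the authors' code: the depth map is split at its terciles, smaller values are assumed to mean nearer pixels (many monocular estimators output inverse depth, in which case the order flips), and the prompt template, function names, and captions are hypothetical. In the real pipeline the captions would come from a grounded vision-language model.

```python
def split_depth_layers(depth_map):
    """Assign each pixel to 'closest', 'mid-range' or 'farthest' by depth tercile."""
    values = sorted(v for row in depth_map for v in row)
    n = len(values)
    near_cut = values[n // 3]        # boundary between closest and mid-range
    far_cut = values[(2 * n) // 3]   # boundary between mid-range and farthest

    def layer(v):
        if v < near_cut:
            return "closest"
        if v < far_cut:
            return "mid-range"
        return "farthest"

    return [[layer(v) for v in row] for row in depth_map]


def build_prompt(question, captions):
    """Append one caption per depth layer to the original question."""
    lines = [question]
    for name in ("closest", "mid-range", "farthest"):
        lines.append(f"{name.capitalize()} layer: {captions[name]}")
    return "\n".join(lines)


# Toy 2x3 depth map in metres (hypothetical values).
depth = [[0.5, 1.0, 4.0],
         [0.6, 2.0, 9.0]]
layers = split_depth_layers(depth)

# Placeholder captions standing in for the output of a captioning model.
captions = {
    "closest": "a red mug on a desk",
    "mid-range": "a person sitting on a sofa",
    "farthest": "a window with trees outside",
}
print(build_prompt("Is there a mug in the image?", captions))
```

Since only the text prompt changes, a scheme of this kind needs no access to model weights, which is consistent with the abstract's claim of compatibility with black-box MLLMs.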

Debashis Gupta,Aditi Golder,Rongkhun Zhu,Kangning Cui,Wei Tang,Fan Yang,Ovidiu Csillik,Sarra Alaqahtani,V. Paul Pauca

Main category: cs.CV

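TL;DR: This paper proposes MoSAiC, a unified contrastive learning framework for multi-modal satellite imagery that improves representation learning in low-label and high-class-overlap settings.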

Details

Motivation: Contrastive learning performs well in computer vision, but Earth system observation poses challenges such as high inter-class similarity, scene clutter, and ambiguous boundaries, and existing methods offer insufficient support for multi-label alignment and cross-modal semantic precision. Method: MoSAiC combines intra-modality and inter-modality contrastive learning and is optimized with a multi-label supervised contrastive loss. Result: Experiments on two datasets, BigEarthNet V2.0 and Sent12MS, show that MoSAiC outperforms existing methods in accuracy, cluster coherence, and generalization in low-label and high-class-overlap scenarios. Conclusion: MoSAiC is a unified framework designed for multi-modal satellite imagery; by jointly optimizing intra- and inter-modality contrastive learning with a multi-label supervised contrastive loss, it achieves finer semantic disentanglement and more robust representation learning. Abstract: Contrastive learning (CL) has emerged as a powerful paradigm for learning transferable representations without the reliance on large labeled datasets. Its ability to capture intrinsic similarities and differences among data samples has led to state-of-the-art results in computer vision tasks. These strengths make CL particularly well-suited for Earth System Observation (ESO), where diverse satellite modalities such as optical and SAR imagery offer naturally aligned views of the same geospatial regions. However, ESO presents unique challenges, including high inter-class similarity, scene clutter, and ambiguous boundaries, which complicate representation learning -- especially in low-label, multi-label settings. Existing CL frameworks often focus on intra-modality self-supervision or lack mechanisms for multi-label alignment and semantic precision across modalities. In this work, we introduce MoSAiC, a unified framework that jointly optimizes intra- and inter-modality contrastive learning with a multi-label supervised contrastive loss. Designed specifically for multi-modal satellite imagery, MoSAiC enables finer semantic disentanglement and more robust representation learning across spectrally similar and spatially complex classes. Experiments on two benchmark datasets, BigEarthNet V2.0 and Sent12MS, show that MoSAiC consistently outperforms both fully supervised and self-supervised baselines in terms of accuracy, cluster coherence, and generalization in low-label and high-class-overlap scenarios.

[113] An Efficient Approach for Muscle Segmentation and 3D Reconstruction Using Keypoint Tracking in MRI Scan

Mengyuan Liu,Jeongkyu Lee

Main category: cs.CV

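TL;DR: This study develops a training-free MRI muscle segmentation method based on keypoint tracking and optical flow that maintains high accuracy while substantially reducing computational demands and improving interpretability.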

Details

Motivation: Existing CNN-based methods suffer from substantial computational overhead, limited generalizability, and poor interpretability across diverse populations. Method: Keypoint selection is combined with Lucas-Kanade optical flow for training-free segmentation. Result: The proposed model achieves a mean Dice similarity coefficient (DSC) between 0.6 and 0.7, depending on the keypoint selection strategy, performing comparably to current state-of-the-art CNN models. Conclusion: The study proposes a new keypoint-tracking-based method for MRI muscle segmentation with good performance, lower computational demands, and better interpretability. Abstract: Magnetic resonance imaging (MRI) enables non-invasive, high-resolution analysis of muscle structures. However, automated segmentation remains limited by high computational costs, reliance on large training datasets, and reduced accuracy in segmenting smaller muscles. Convolutional neural network (CNN)-based methods, while powerful, often suffer from substantial computational overhead, limited generalizability, and poor interpretability across diverse populations. This study proposes a training-free segmentation approach based on keypoint tracking, which integrates keypoint selection with Lucas-Kanade optical flow. The proposed method achieves a mean Dice similarity coefficient (DSC) ranging from 0.6 to 0.7, depending on the keypoint selection strategy, performing comparably to state-of-the-art CNN-based models while substantially reducing computational demands and enhancing interpretability. This scalable framework presents a robust and explainable alternative for muscle segmentation in clinical and research applications.
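
To make the tracking step more concrete, the sketch below implements a single, non-pyramidal Lucas-Kanade update for one keypoint in pure Python. It is a textbook illustration under stated assumptions and not the authors' pipeline: the window size, the synthetic quadratic image, and all names are hypothetical, and a practical system would more likely rely on an existing pyramidal implementation such as OpenCV's `cv2.calcOpticalFlowPyrLK`.

```python
def lucas_kanade_step(prev, curr, cx, cy, half=2):
    """Estimate the (u, v) displacement of the window centred at (cx, cy)."""
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            # Central-difference spatial gradients of the first image.
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0
            # Temporal difference between the two images.
            it = curr[y][x] - prev[y][x]
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-9:
        raise ValueError("window has too little texture to track")
    # Solve the 2x2 normal equations [sxx sxy; sxy syy] [u v]^T = -[sxt syt]^T.
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


def intensity(x, y):
    """Smooth synthetic image (assumption) used only for this demo."""
    return float(x * x + y * y)


dx, dy = 0.3, -0.2  # true sub-pixel shift applied to the second image
size = 20
prev = [[intensity(x, y) for x in range(size)] for y in range(size)]
curr = [[intensity(x - dx, y - dy) for x in range(size)] for y in range(size)]

u, v = lucas_kanade_step(prev, curr, cx=10, cy=10)
tracked = (10 + u, 10 + v)  # keypoint position carried into the second image
print(round(u, 2), round(v, 2))
```

A segmentation contour can then be propagated by applying such an update to every keypoint on the boundary and moving each point by its estimated displacement; no training data is involved, which is the property the abstract emphasizes.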

[114] L-CLIPScore: a Lightweight Embedding-based Captioning Metric for Evaluating and Training

Li Li,Yingzhe Peng,Xu Yang,Ruoxi Cheng,Haiyang Xu,Ming Yan,Fei Huang

Main category: cs.CV

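TL;DR: This paper proposes a lightweight CLIP model, L-CLIP, and a corresponding captioning metric, L-CLIPScore, which substantially reduce computational cost while preserving multi-modal alignment performance, and demonstrates their effectiveness for caption evaluation and training.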

Details

Motivation: To address the heavy computational cost and low efficiency of conventional CLIP when evaluating and training captioning models, the authors propose a lighter and more efficient alternative. Method: The paper proposes a new metric, L-CLIPScore, based on a compressed and distilled lightweight CLIP (L-CLIP). Compression uses weight multiplexing and matrix decomposition, and distillation introduces a new multi-modal Similarity Regulator (SR) loss to strengthen the transfer of vision-language alignment knowledge. Result: L-CLIP matches the original CLIP in multi-modal alignment ability while requiring fewer computational resources and less running time; L-CLIPScore is effective for evaluating caption quality and, when combined with an n-gram metric, can successfully guide captioning model training. Conclusion: L-CLIPScore is an efficient embedding-based captioning metric that can also be used to train captioning models. By using the lightweight L-CLIP and combining it with an n-gram metric during training, the method is more efficient in computation and running time while retaining multi-modal alignment ability comparable to the original CLIP. Abstract: We propose a novel embedding-based captioning metric termed L-CLIPScore that can be used for efficiently evaluating caption quality and training captioning models. L-CLIPScore is calculated from a lightweight CLIP (L-CLIP), which is a dual-encoder architecture compressed and distilled from CLIP. To compress, we apply two powerful techniques, weight multiplexing and matrix decomposition, to reduce the parameters of the encoders and the word embedding matrix, respectively. To distill, we design a novel multi-modal Similarity Regulator (SR) loss to transfer more vision-language alignment knowledge. Specifically, the SR loss amplifies the multi-modal embedding similarity if the given image-text pair is matched and diminishes the similarity if the pair is non-matched. With this compression and with distillation using the novel SR loss, L-CLIP achieves multi-modal alignment ability comparable to the original CLIP while requiring fewer computational resources and less running time. We carry out exhaustive experiments to validate the efficiency and effectiveness of L-CLIPScore when using it as the judge to evaluate caption quality. We also find that when L-CLIPScore is used as the supervisor to train a captioning model, it should be mixed with an n-gram-based metric, and we analyze why training with L-CLIPScore alone fails.
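
As a rough illustration of how an embedding-based captioning score of this kind is computed, the sketch below follows the form of the original CLIPScore, `w * max(cos(image, text), 0)` with `w = 2.5`. Whether L-CLIPScore uses exactly this form and weight is an assumption made here for illustration; the embeddings are toy vectors rather than outputs of L-CLIP, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def clip_style_score(image_emb, text_emb, w=2.5):
    """Rescaled, clipped cosine similarity between image and caption embeddings."""
    return w * max(cosine(image_emb, text_emb), 0.0)


# Toy embeddings (hypothetical): one caption aligned with the image, one opposed.
image_emb = [0.6, 0.8, 0.0]
matched_caption = [0.6, 0.8, 0.0]
mismatched_caption = [-0.6, -0.8, 0.0]

print(round(clip_style_score(image_emb, matched_caption), 3))
print(clip_style_score(image_emb, mismatched_caption))
```

A score of this form needs only one forward pass per encoder and a dot product, so the cost of the metric is dominated by the encoders themselves, which is where a compressed model such as L-CLIP would save computation.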

[115] SGPMIL: Sparse Gaussian Process Multiple Instance Learning

Andreas Lolos,Stergios Christodoulidis,Maria Vakalopoulou,Jose Dolz,Aris Moustakas

Main category: cs.CV

TL;DR: The paper introduces SGPMIL, a probabilistic attention-based MIL framework using Sparse Gaussian Processes, that addresses the lack of uncertainty quantification in instance-level attention scores while preserving bag-level performance.

Details

Motivation: Deterministic attention-based MIL approaches achieve strong bag-level performance but often overlook the uncertainty inherent in instance relevance. This paper aims to address the lack of uncertainty quantification in instance-level attention scores. Method: Introducing SGPMIL, a new probabilistic attention-based MIL framework grounded in Sparse Gaussian Processes (SGP), which learns a posterior distribution over attention scores. Result: SGPMIL enables principled uncertainty estimation, resulting in more reliable and calibrated instance relevance maps; it preserves competitive bag-level performance and significantly improves the quality and interpretability of instance-level predictions under uncertainty. Conclusion: SGPMIL extends prior work by introducing feature scaling in the SGP predictive mean function, leading to faster training, improved efficiency, and enhanced instance-level performance. Abstract: Multiple Instance Learning (MIL) offers a natural solution for settings where only coarse, bag-level labels are available, without having access to instance-level annotations. This is usually the case in digital pathology, which consists of gigapixel sized images. While deterministic attention-based MIL approaches achieve strong bag-level performance, they often overlook the uncertainty inherent in instance relevance. In this paper, we address the lack of uncertainty quantification in instance-level attention scores by introducing \textbf{SGPMIL}, a new probabilistic attention-based MIL framework grounded in Sparse Gaussian Processes (SGP). By learning a posterior distribution over attention scores, SGPMIL enables principled uncertainty estimation, resulting in more reliable and calibrated instance relevance maps. Our approach not only preserves competitive bag-level performance but also significantly improves the quality and interpretability of instance-level predictions under uncertainty. 
SGPMIL extends prior work by introducing feature scaling in the SGP predictive mean function, leading to faster training, improved efficiency, and enhanced instance-level performance. Extensive experiments on multiple well-established digital pathology datasets highlight the effectiveness of our approach across both bag- and instance-level evaluations. Our code will be made publicly available.

[116] Unreal is all you need: Multimodal ISAC Data Simulation with Only One Engine

Kongwu Huang,Shiyi Mu,Jun Jiang,Yuan Gao,Shugong Xu

Main category: cs.CV

TL;DR: This paper presents the Great-X multimodal simulation platform and the Great-MSD dataset to advance research on low-altitude UAV 3D localization.

Details

Motivation: Exploring the potential of scaling laws in ISAC research requires an efficient multimodal data simulation platform and associated datasets. Method: The authors build Great-X, a multimodal data twin platform based on Unreal Engine, and propose a CSI-based UAV 3D localization algorithm. Result: The Great-X platform and the Great-MSD dataset were developed, and the feasibility and generalizability of the CSI-based localization algorithm were verified. Conclusion: The Great-X platform and the Great-MSD dataset provide effective tools and data support for ISAC research and advance low-altitude UAV 3D localization. Abstract: Scaling laws have achieved success in LLMs and foundation models. To explore their potential in ISAC research, we propose Great-X. This single-engine multimodal data twin platform reconstructs the ray-tracing computation of Sionna within Unreal Engine and is deeply integrated with autonomous driving tools. This enables efficient and synchronized simulation of multimodal data, including CSI, RGB, Radar, and LiDAR. Based on this platform, we construct an open-source, large-scale, low-altitude UAV multimodal synaesthesia dataset named Great-MSD, and propose a baseline CSI-based UAV 3D localization algorithm, demonstrating its feasibility and generalizability across different CSI simulation engines. The related code and dataset are publicly available at: https://github.com/hkw-xg/Great-MCD.

[117] RoundaboutHD: High-Resolution Real-World Urban Environment Benchmark for Multi-Camera Vehicle Tracking

Yuqiang Lin,Sam Lockyer,Mingxuan Sui,Li Gan,Florian Stanek,Markus Zarbock,Wenbin Li,Adrian Evans,Nic Zhang

Main category: cs.CV

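TL;DR: This paper introduces RoundaboutHD, a high-resolution dataset designed for multi-camera vehicle tracking, which aims to address the limitations of existing datasets and support research on smart city applications.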

Details

Motivation: Currently available public multi-camera vehicle tracking datasets are limited by overly simplistic scenarios, low-resolution footage, and insufficiently diverse conditions, leaving a considerable gap between academic research and real-world applications. Method: A total of 40 minutes of labelled video footage is captured by four non-overlapping high-resolution cameras, and baseline results are provided for vehicle detection, single-camera tracking, image-based vehicle re-identification, and multi-camera tracking. Result: The authors release RoundaboutHD, a high-resolution dataset with 512 annotated unique vehicle identities, temporally consistent video footage, and vehicle model and camera modelling information to support further analysis. Conclusion: RoundaboutHD is a comprehensive, high-resolution multi-camera vehicle tracking benchmark dataset intended to bridge the gap between academic research and real-world scenarios. Abstract: The multi-camera vehicle tracking (MCVT) framework holds significant potential for smart city applications, including anomaly detection, traffic density estimation, and suspect vehicle tracking. However, current publicly available datasets exhibit limitations, such as overly simplistic scenarios, low-resolution footage, and insufficiently diverse conditions, creating a considerable gap between academic research and real-world scenarios. To fill this gap, we introduce RoundaboutHD, a comprehensive, high-resolution multi-camera vehicle tracking benchmark dataset specifically designed to represent real-world roundabout scenarios. RoundaboutHD provides a total of 40 minutes of labelled video footage captured by four non-overlapping, high-resolution (4K resolution, 15 fps) cameras. In total, 512 unique vehicle identities are annotated across different camera views, offering rich cross-camera association data. RoundaboutHD offers temporally consistent video footage and enhanced challenges, including increased occlusions and nonlinear movement inside the roundabout. In addition to the full MCVT dataset, several subsets are also available for object detection, single camera tracking, and image-based vehicle re-identification (ReID) tasks. Vehicle model information and camera modelling/geometry information are also included to support further analysis. We provide baseline results for vehicle detection, single-camera tracking, image-based vehicle re-identification, and multi-camera tracking. The dataset and the evaluation code are publicly available at: https://github.com/siri-rouser/RoundaboutHD.git

[118] Ensemble of Weak Spectral Total Variation Learners: a PET-CT Case Study

Anna Rosenberg,John Kennedy,Zohar Keidar,Yehoshua Y. Zeevi,Guy Gilboa

Main category: cs.CV

TL;DR: This paper proposes the use of ensembles of weak learners based on STV features to overcome the challenge of insufficient training data in computer vision, particularly showing effectiveness in predicting high uptake in PET using CT data.

Details

Motivation: The motivation was to address the lack of sufficient training data in solving computer vision problems through machine learning. Method: Ensembles of weak learners based on spectral total-variation (STV) features were designed and tested on a medical imaging dataset consisting of 457 scans with 1524 unique pairs of registered CT and PET slices. Result: STV learners outperformed neural nets and Radiomics features with an AUC of 0.87, indicating their effectiveness in predictive analysis for medical imaging. Conclusion: The study concludes that ensembles of weak learners based on STV features perform best in predicting high uptake in PET using CT data, compared to deep-learning methods and Radiomics features. Abstract: Solving computer vision problems through machine learning, one often encounters a lack of sufficient training data. To mitigate this, we propose the use of ensembles of weak learners based on spectral total-variation (STV) features (Gilboa 2014). The features are related to nonlinear eigenfunctions of the total-variation subgradient and can characterize textures well at various scales. It was shown (Burger et al. 2016) that, in the one-dimensional case, orthogonal features are generated, whereas in two dimensions the features are empirically lowly correlated. Ensemble learning theory advocates the use of lowly correlated weak learners. We thus propose here to design ensembles using learners based on STV features. To show the effectiveness of this paradigm we examine a hard real-world medical imaging problem: the predictive value of computed tomography (CT) data for high uptake in positron emission tomography (PET) for patients suspected of skeletal metastases. The database consists of 457 scans with 1524 unique pairs of registered CT and PET slices. Our approach is compared to deep-learning methods and to Radiomics features, showing STV learners perform best (AUC=0.87), compared to neural nets (AUC=0.75) and Radiomics (AUC=0.79). We observe that fine STV scales in CT images are especially indicative of the presence of high uptake in PET.

[119] HieraRS: A Hierarchical Segmentation Paradigm for Remote Sensing Enabling Multi-Granularity Interpretation and Cross-Domain Transfer

Tianlong Ai,Tianzhu Liu,Haochen Jiang,Yanfeng Gu

Main category: cs.CV

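TL;DR: This paper proposes HieraRS, a new hierarchical interpretation paradigm, and TransLU, a cross-domain transfer framework, to address the limitations of existing deep learning methods in multi-granularity classification and cross-domain transfer for remote sensing imagery.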

Details

Motivation: Existing deep learning methods are limited in classifying remote sensing imagery at multiple levels of semantic granularity and perform poorly when models are transferred across domains. Method: The paper introduces the Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM) and TransLU, a dual-branch cross-domain transfer framework comprising Cross-Domain Knowledge Sharing (CDKS) and Cross-Domain Semantic Alignment (CDSA). Result: The HieraRS and TransLU methods are proposed, and MM-5B, a large-scale multi-modal hierarchical land use dataset, is constructed. Conclusion: The paper proposes HieraRS, a new hierarchical interpretation paradigm that enables multi-granularity predictions and supports the effective transfer of LCLU models to cross-domain tasks with heterogeneous tree-structured hierarchies. Abstract: Hierarchical land cover and land use (LCLU) classification aims to assign pixel-wise labels with multiple levels of semantic granularity to remote sensing (RS) imagery. However, existing deep learning-based methods face two major challenges: 1) They predominantly adopt a flat classification paradigm, which limits their ability to generate end-to-end multi-granularity hierarchical predictions aligned with tree-structured hierarchies used in practice. 2) Most cross-domain studies focus on performance degradation caused by sensor or scene variations, with limited attention to transferring LCLU models to cross-domain tasks with heterogeneous hierarchies (e.g., LCLU to crop classification). These limitations hinder the flexibility and generalization of LCLU models in practical applications. To address these challenges, we propose HieraRS, a novel hierarchical interpretation paradigm that enables multi-granularity predictions and supports the efficient transfer of LCLU models to cross-domain tasks with heterogeneous tree-structured hierarchies. We introduce the Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM), which can be seamlessly integrated into mainstream flat classification models to generate hierarchical predictions, while improving both semantic consistency and classification accuracy. Furthermore, we present TransLU, a dual-branch cross-domain transfer framework comprising two key components: Cross-Domain Knowledge Sharing (CDKS) and Cross-Domain Semantic Alignment (CDSA). TransLU supports dynamic category expansion and facilitates the effective adaptation of LCLU models to heterogeneous hierarchies. In addition, we construct MM-5B, a large-scale multi-modal hierarchical land use dataset featuring pixel-wise annotations. The code and MM-5B dataset will be released at: https://github.com/AI-Tianlong/HieraRS.
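
The difference between flat and hierarchy-aware prediction can be shown with a small Python sketch. This is not BHCCM, whose details the abstract does not give; it is a generic illustration that assumes a two-level tree stored in a hypothetical `parent_of` mapping and per-pixel fine-class probabilities in `fine_probs`. Coarse probabilities are obtained by summing children, and the label pair is decoded top-down so that the fine label always lies under the chosen coarse label.

```python
def aggregate_to_parents(fine_probs, parent_of):
    """Sum fine-class probabilities into their parent (coarse) classes."""
    coarse_probs = {}
    for fine_label, prob in fine_probs.items():
        parent = parent_of[fine_label]
        coarse_probs[parent] = coarse_probs.get(parent, 0.0) + prob
    return coarse_probs


def decode_top_down(fine_probs, parent_of):
    """Pick the most probable coarse class, then its best child.

    The returned pair is consistent with the tree by construction.
    """
    coarse_probs = aggregate_to_parents(fine_probs, parent_of)
    coarse_label = max(coarse_probs, key=coarse_probs.get)
    children = [f for f in fine_probs if parent_of[f] == coarse_label]
    fine_label = max(children, key=fine_probs.get)
    return coarse_label, fine_label


# Hypothetical labels, for illustration only.
parent_of = {"wheat": "cropland", "maize": "cropland", "road": "built-up"}
fine_probs = {"road": 0.35, "wheat": 0.33, "maize": 0.32}
flat_choice = max(fine_probs, key=fine_probs.get)
hier_choice = decode_top_down(fine_probs, parent_of)
```

In this toy case the flat argmax selects `road`, while the aggregated coarse probabilities favour `cropland`, so the two strategies disagree. That kind of disagreement is what a hierarchical consistency constraint is meant to resolve.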

[120] Geo-ORBIT: A Federated Digital Twin Framework for Scene-Adaptive Lane Geometry Detection

Rei Tamaru,Pei Li,Bin Ran

Main category: cs.CV

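TL;DR: Geo-ORBIT is a digital twin framework for transportation systems that combines real-time lane detection, meta-learning, and federated learning to achieve efficient, scalable, and privacy-preserving dynamic modeling.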

Details

Motivation: Existing digital twin approaches for transportation systems rely on static maps or costly sensors and face challenges in privacy, communication, and computational efficiency; Geo-ORBIT aims to provide a more scalable and adaptable solution. Method: Geo-ORBIT combines real-time lane detection (GeoLane), personalized meta-learning (Meta-GeoLane), and federated meta-learning (FedMeta-GeoLane), and is integrated with CARLA and SUMO to create a high-fidelity digital twin. Result: In experiments across diverse urban scenes, FedMeta-GeoLane outperforms the compared approaches, with lower geometric error, stronger generalization, and substantially reduced communication overhead. Conclusion: The Geo-ORBIT framework offers an efficient, scalable, and privacy-preserving solution for roadway geometry sensing in digital twins, showing potential for high-fidelity dynamic modeling in traffic management and operations. Abstract: Digital Twins (DT) have the potential to transform traffic management and operations by creating dynamic, virtual representations of transportation systems that sense conditions, analyze operations, and support decision-making. A key component for DT of the transportation system is dynamic roadway geometry sensing. However, existing approaches often rely on static maps or costly sensors, limiting scalability and adaptability. Additionally, large-scale DTs that collect and analyze data from multiple sources face challenges in privacy, communication, and computational efficiency. To address these challenges, we introduce Geo-ORBIT (Geometrical Operational Roadway Blueprint with Integrated Twin), a unified framework that combines real-time lane detection, DT synchronization, and federated meta-learning. At the core of Geo-ORBIT is GeoLane, a lightweight lane detection model that learns lane geometries from vehicle trajectory data using roadside cameras. We extend this model through Meta-GeoLane, which learns to personalize detection parameters for local entities, and FedMeta-GeoLane, a federated learning strategy that ensures scalable and privacy-preserving adaptation across roadside deployments. Our system is integrated with CARLA and SUMO to create a high-fidelity DT that renders highway scenarios and captures traffic flows in real-time. Extensive experiments across diverse urban scenes show that FedMeta-GeoLane consistently outperforms baseline and meta-learning approaches, achieving lower geometric error and stronger generalization to unseen locations while drastically reducing communication overhead. This work lays the foundation for flexible, context-aware infrastructure modeling in DTs. The framework is publicly available at https://github.com/raynbowy23/FedMeta-GeoLane.git.
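
As background for the federated component, the following Python sketch shows generic federated averaging (FedAvg): each roadside site trains locally and shares only a parameter vector, which a server combines weighted by local sample counts. The abstract does not describe FedMeta-GeoLane's actual aggregation rule, so this is an assumption for illustration, and `federated_average`, `client_params`, and `client_sizes` are hypothetical names.

```python
def federated_average(client_params, client_sizes):
    """Weighted average of client parameter vectors (generic FedAvg).

    client_params: list of equal-length lists of floats, one per client.
    client_sizes: number of local training samples per client.
    Only parameters are exchanged; raw camera data stays on each client.
    """
    if len(client_params) != len(client_sizes):
        raise ValueError("one size is required per client")
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]
```

A call such as `federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])` weights the second client three times as heavily as the first.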

[121] Compress Any Segment Anything Model (SAM)

Juntong Fan,Zhiwei Hao,Jianqiang Shen,Shang-Ling Jui,Yi Zhang,Jing-Xiao Liao,Feng-Lei Fan

Main category: cs.CV

TL;DR: Birkhoff is a data-free compression algorithm that effectively compresses SAM and its variants, achieving high compression ratios and fast inference.

Details

Motivation: As the Segment Anything Model (SAM) and its variants become widely used across many fields, compressing these models effectively is increasingly important. Method: Birkhoff introduces a new compression algorithm, Hyper-Compression, which turns a high-dimensional parameter vector into a low-dimensional scalar, and designs the HyperLinear operator, which fuses decompression and matrix multiplication to accelerate inference. Result: In experiments on 18 SAMs across the COCO, LVIS, and SA-1B datasets, Birkhoff performs consistently and competitively in compression time, compression ratio, post-compression performance, and inference speed. For example, compression finishes within 60 seconds for all models, and on SAM2-B it achieves a 5.17x compression ratio with less than 1% performance drop. Conclusion: Birkhoff is a data-free compression algorithm for SAM and its variants that offers versatility across model types, agility in deployment, faithfulness to the original model, and compactness in model size. Abstract: Due to the excellent performance in yielding high-quality, zero-shot segmentation, Segment Anything Model (SAM) and its variants have been widely applied in diverse scenarios such as healthcare and intelligent manufacturing. Therefore, effectively compressing SAMs has become an increasingly pressing practical need. In this study, we propose Birkhoff, a novel data-free compression algorithm for SAM and its variants. Unlike quantization, pruning, distillation, and other compression methods, Birkhoff embodies versatility across model types, agility in deployment, faithfulness to the original model, and compactness in model size. Specifically, Birkhoff introduces a novel compression algorithm: Hyper-Compression, whose core principle is to find a dense trajectory to turn a high-dimensional parameter vector into a low-dimensional scalar. Furthermore, Birkhoff designs a dedicated linear layer operator, HyperLinear, to fuse decompression and matrix multiplication to significantly accelerate inference of the compressed SAMs. Extensive experiments on 18 SAMs in the COCO, LVIS, and SA-1B datasets show that Birkhoff performs consistently and competitively in compression time, compression ratio, post-compression performance, and inference speed. For example, Birkhoff can achieve a compression ratio of 5.17x on SAM2-B, with less than 1% performance drop without using any fine-tuning data. Moreover, the compression is finished within 60 seconds for all models.

[122] A Hybrid Multi-Well Hopfield-CNN with Feature Extraction and K-Means for MNIST Classification

Ahmed Farooq

Main category: cs.CV

TL;DR: A hybrid model combining CNNs with a multi-well Hopfield network achieves 99.2% accuracy on MNIST by extracting deep features and using them in an interpretable energy-based classification process.

Details

Motivation: The motivation behind this study is to develop a model that effectively handles intraclass variability in handwritten digits while maintaining interpretability through an energy-based decision framework. Method: The method involves using a CNN to extract features from MNIST images, followed by k-means clustering to generate class-specific prototypes. These prototypes are used in a multi-well Hopfield network, which performs classification by minimizing an energy function balancing feature similarity and class assignment. Result: Through systematic optimization of the CNN architecture and number of wells, the model achieves a high test accuracy of 99.2% on 10,000 MNIST images. Conclusion: The study concludes that the hybrid model combining CNNs with a multi-well Hopfield network is effective for image classification tasks, achieving high test accuracy on MNIST and demonstrating robustness to intraclass variability. Abstract: This study presents a hybrid model for classifying handwritten digits in the MNIST dataset, combining convolutional neural networks (CNNs) with a multi-well Hopfield network. The approach employs a CNN to extract high-dimensional features from input images, which are then clustered into class-specific prototypes using k-means clustering. These prototypes serve as attractors in a multi-well energy landscape, where a Hopfield network performs classification by minimizing an energy function that balances feature similarity and class assignment. The model's design enables robust handling of intraclass variability, such as diverse handwriting styles, while providing an interpretable framework through its energy-based decision process. Through systematic optimization of the CNN architecture and the number of wells, the model achieves a high test accuracy of 99.2% on 10,000 MNIST images, demonstrating its effectiveness for image classification tasks. The findings highlight the critical role of deep feature extraction and sufficient prototype coverage in achieving high performance, with potential for broader applications in pattern recognition.
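
The prototype-and-energy pipeline described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the paper's model: features are taken as given (no CNN), each well's energy is assumed to be the squared Euclidean distance to its prototype, and the class-assignment term of the paper's energy function is omitted. The names `kmeans`, `build_wells`, and `classify` are hypothetical.

```python
import math


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Lloyd's k-means with a deterministic init (the first k points)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def build_wells(features_by_class, wells_per_class):
    """Cluster each class separately so every class owns several wells."""
    return {
        label: kmeans(feats, wells_per_class)
        for label, feats in features_by_class.items()
    }


def classify(feature, wells):
    """Return (label, energy) of the lowest-energy well.

    Energy is assumed to be the squared distance to the prototype.
    """
    best_label, best_energy = None, math.inf
    for label, prototypes in wells.items():
        for proto in prototypes:
            energy = squared_distance(feature, proto)
            if energy < best_energy:
                best_label, best_energy = label, energy
    return best_label, best_energy
```

Using several wells per class is what lets a single class cover distinct styles: a query only needs to be close to one of the class prototypes, not to the class mean.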

[123] From One to More: Contextual Part Latents for 3D Generation

Shaocong Dong,Lihe Ding,Xiao Chen,Yaokun Li,Yuxin Wang,Yucheng Wang,Qi Wang,Jaehyeok Kim,Chenjian Gao,Zhanpeng Huang,Zibin Wang,Tianfan Xue,Dan Xu

Main category: cs.CV

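TL;DR: CoPart is a new 3D generation framework that achieves high-quality generation and editing of multi-part 3D objects through part awareness and relationship modeling.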

Details

Motivation: Current 3D generation methods have three main problems: single-latent representations fail to preserve the details of complex multi-part geometries, holistic coding neglects part independence and interrelationships, and global conditioning mechanisms lack fine-grained control. Method: CoPart uses part latent encoding and a mutual guidance strategy, fine-tunes pre-trained diffusion models for joint part latent denoising, and constructs the large-scale dataset Partverse for training. Result: CoPart shows strong controllability and generation quality in part-level editing, articulated object generation, and scene composition, and outperforms existing methods particularly on complex multi-part structures. Conclusion: CoPart proposes a part-aware diffusion framework that decomposes 3D objects and models the relationships between their parts, addressing the shortcomings of existing 3D generation methods in detail preservation, part independence, and controllability. Abstract: Recent advances in 3D generation have transitioned from multi-view 2D rendering approaches to 3D-native latent diffusion frameworks that exploit geometric priors in ground truth data. Despite progress, three key limitations persist: (1) Single-latent representations fail to capture complex multi-part geometries, causing detail degradation; (2) Holistic latent coding neglects part independence and interrelationships critical for compositional design; (3) Global conditioning mechanisms lack fine-grained controllability. Inspired by human 3D design workflows, we propose CoPart - a part-aware diffusion framework that decomposes 3D objects into contextual part latents for coherent multi-part generation. This paradigm offers three advantages: i) Reduces encoding complexity through part decomposition; ii) Enables explicit part relationship modeling; iii) Supports part-level conditioning. We further develop a mutual guidance strategy to fine-tune pre-trained diffusion models for joint part latent denoising, ensuring both geometric coherence and foundation model priors. To enable large-scale training, we construct Partverse - a novel 3D part dataset derived from Objaverse through automated mesh segmentation and human-verified annotations. Extensive experiments demonstrate CoPart's superior capabilities in part-level editing, articulated object generation, and scene composition with unprecedented controllability.

[124] CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering

Zhengqing Wang,Yuefan Wu,Jiacheng Chen,Fuyang Zhang,Yasutaka Furukawa

Main category: cs.CV

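TL;DR: This paper proposes CLiFT, a neural rendering method that achieves efficient and flexible scene representation and rendering through compressed light-field tokens.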

Details

Motivation: Conventional methods often face high computational cost and large storage requirements on complex scenes, so a neural rendering method is needed that represents scenes efficiently and adapts to different compute budgets. Method: The paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens (CLiFTs)". A multi-view encoder first tokenizes the images, latent-space K-means then produces centroid tokens, and a multi-view "condenser" constructs the CLiFTs. At test time, the system selects the number of tokens that matches the target view and compute budget and synthesizes a novel view. Result: Experiments show that the method achieves the best rendering score on the RealEstate10K and DL3DV datasets, significantly reduces data size while maintaining rendering quality, and can flexibly adjust its performance under different compute budgets. Conclusion: CLiFT achieves efficient scene representation and rendering, reducing data size through compressed light-field tokens while maintaining high-quality rendering and offering trade-offs among data size, rendering quality, and rendering speed. Abstract: This paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens (CLiFTs)", retaining rich appearance and geometric information of a scene. CLiFT enables compute-efficient rendering by compressed tokens, while being capable of changing the number of tokens to represent a scene or render a novel view with one trained network. Concretely, given a set of images, a multi-view encoder tokenizes the images with the camera poses. Latent-space K-means selects a reduced set of rays as cluster centroids using the tokens. The multi-view "condenser" compresses the information of all the tokens into the centroid tokens to construct CLiFTs. At test time, given a target view and a compute budget (i.e., the number of CLiFTs), the system collects the specified number of nearby tokens and synthesizes a novel view using a compute-adaptive renderer. Extensive experiments on RealEstate10K and DL3DV datasets quantitatively and qualitatively validate our approach, achieving significant data reduction with comparable rendering quality and the highest overall rendering score, while providing trade-offs of data size, rendering quality, and rendering speed.
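
The test-time step, collecting a budgeted number of nearby tokens, can be sketched as a nearest-neighbour selection. The abstract does not say how "nearby" is measured, so this Python sketch assumes each token carries a 3D position and that proximity means Euclidean distance to the target camera position; `select_tokens`, `token_positions`, and `target_position` are hypothetical names.

```python
def select_tokens(token_positions, target_position, budget):
    """Return indices of the `budget` tokens closest to the target view.

    Proximity is assumed to be squared Euclidean distance between a
    token's position and the target camera position.
    """
    def sq_dist(position):
        return sum((a - b) ** 2 for a, b in zip(position, target_position))

    order = sorted(range(len(token_positions)), key=lambda i: sq_dist(token_positions[i]))
    return order[:budget]
```

Raising or lowering `budget` changes how many tokens reach the renderer, which is the knob behind the trade-off among data size, rendering quality, and rendering speed.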

[125] Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

Hangjie Yuan,Weihua Chen,Jun Cen,Hu Yu,Jingyun Liang,Shuning Chang,Zhihui Lin,Tao Feng,Pengwei Liu,Jiazheng Xing,Hao Luo,Jiasheng Tang,Fan Wang,Yi Yang

Main category: cs.CV

TL;DR: This paper proposes Lumos-1, an efficient autoregressive video generator based on modified LLM architecture, achieving competitive performance with state-of-the-art models.

Details

Motivation: Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency. The authors aim to develop a more efficient and effective autoregressive video generation method. Method: The paper introduces 3D RoPE and MM-RoPE for modeling multimodal spatiotemporal data, uses a token dependency strategy for intra-frame bidirectionality and inter-frame temporal causality, and proposes Autoregressive Discrete Diffusion Forcing (AR-DF) to solve frame-wise loss imbalance. Result: Lumos-1 was pre-trained on only 48 GPUs and achieved performance comparable to EMU3, COSMOS-Video2World, and OpenSoraPlan on multiple benchmarks. Conclusion: Lumos-1 is an autoregressive video generator that retains the LLM architecture with minimal modifications and achieves performance comparable to other state-of-the-art models. Abstract: Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. To inject spatiotemporal correlations in LLMs, we identify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. Therefore, we propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 resorts to a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. 
Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and solve it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. By using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at https://github.com/alibaba-damo-academy/Lumos.
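
The token dependency strategy (bidirectional within a frame, causal across frames) corresponds to a block-causal attention mask. The Python sketch below builds such a mask under simplifying assumptions: every frame has the same number of tokens, text tokens are ignored, and the temporal tube masking of AR-DF is not modelled. The function name `frame_attention_mask` is hypothetical.

```python
def frame_attention_mask(num_frames, tokens_per_frame):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Tokens attend to every token in their own frame (bidirectional) and to
    all tokens in earlier frames (temporal causality), never to later frames.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for query in range(total):
        query_frame = query // tokens_per_frame
        row = [(key // tokens_per_frame) <= query_frame for key in range(total)]
        mask.append(row)
    return mask
```

Compared with a standard token-level causal mask, the only change is that the comparison is made on frame indices rather than token indices, so each diagonal block is fully visible.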

Table of Contents

cs.CL [Back]

[1] RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning

[2] MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model

[3] Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking

[4] Assessing the Capabilities and Limitations of FinGPT Model in Financial NLP Applications

[5] Mechanistic Indicators of Understanding in Large Language Models

[6] Review, Remask, Refine (R3): Process-Guided Block Diffusion for Text Generation

[7] Signal or Noise? Evaluating Large Language Models in Resume Screening Across Contextual Variations and Human Expert Benchmarks

[8] Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation

[9] Unveiling Effective In-Context Configurations for Image Captioning: An External & Internal Analysis

[10] "Amazing, They All Lean Left" -- Analyzing the Political Temperaments of Current LLMs

[11] Better Together: Quantifying the Benefits of AI-Assisted Recruitment

[12] A Systematic Analysis of Declining Medical Safety Messaging in Generative AI Models

[13] Beyond Scale: Small Language Models are Comparable to GPT-4 in Mental Health Understanding

[14] Integrating External Tools with Large Language Models to Improve Accuracy

[15] Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights

[16] CRISP: Complex Reasoning with Interpretable Step-based Plans

[17] AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

[18] Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing

[19] GRASP: Generic Reasoning And SPARQL Generation across Knowledge Graphs

[20] Audit, Alignment, and Optimization of LM-Powered Subroutines with Application to Public Comment Processing

[21] Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores

[22] Distilling Empathy from Large Language Models

[23] TruthTorchLM: A Comprehensive Library for Predicting Truthfulness in LLM Outputs

[24] Simple Mechanistic Explanations for Out-Of-Context Reasoning

[25] Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension?

[26] Exploring Gender Differences in Chronic Pain Discussions on Reddit

[27] KAT-V1: Kwai-AutoThink Technical Report

[28] Improving MLLM's Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency

[29] CRMAgent: A Multi-Agent LLM System for E-Commerce CRM Message Template Generation

[30] MK2 at PBIG Competition: A Prompt Generation Solution

[31] Distillation versus Contrastive Learning: How to Train Your Rerankers

[32] What Factors Affect LLMs and RLLMs in Financial Question Answering?

[33] Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization

[34] Exploring Design of Multi-Agent LLM Dialogues for Research Ideation

[35] The Curious Case of Factuality Finetuning: Models' Internal Beliefs Can Improve Factuality

[36] A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities

[37] ChainEdit: Propagating Ripple Effects in LLM Knowledge Editing through Logical Rule-Guided Chains

[38] Finding Common Ground: Using Large Language Models to Detect Agreement in Multi-Agent Decision Conferences

[39] Diagnosing Failures in Large Language Models' Answers: Integrating Error Attribution into Evaluation Framework

[40] Using Large Language Models for Legal Decision-Making in Austrian Value-Added Tax Law: An Experimental Study

[41] ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition

[42] Enhancing Essay Cohesion Assessment: A Novel Item Response Theory Approach

[43] A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench

[44] LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning

[45] Semantic-Augmented Latent Topic Modeling with LLM-in-the-Loop

[46] PromotionGo at SemEval-2025 Task 11: A Feature-Centric Framework for Cross-Lingual Multi-Emotion Detection in Short Texts

[47] The AI Language Proficiency Monitor -- Tracking the Progress of LLMs on Multilingual Benchmarks

[48] DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures

[49] A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

[50] The Impact of Automatic Speech Transcription on Speaker Attribution

[51] KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment

[52] KG-Attention: Knowledge Graph-Guided Attention at Test-Time via Bidirectional Information Aggregation

[53] Multilingual Multimodal Software Developer for Code Generation

[54] KV Cache Steering for Inducing Reasoning in Small Language Models

cs.CV [Back]

[55] CuriosAI Submission to the EgoExo4D Proficiency Estimation Challenge 2025

[56] Self-Consistency in Vision-Language Models for Precision Agriculture: Multi-Response Consensus for Crop Disease Management

[57] Development of a Canada-Wide Morphology Map for the ITU-R P. 1411 Propagation Model

[58] Towards Evaluating Robustness of Prompt Adherence in Text to Image Models

[59] ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints

[60] A Hybrid Multilayer Extreme Learning Machine for Image Classification with an Application to Quadcopters

[61] Lightweight Cloud Masking Models for On-Board Inference in Hyperspectral Imaging

[62] The relative importance of being Gaussian

[63] An Object-Based Deep Learning Approach for Building Height Estimation from Single SAR Images

[64] RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration

[65] Temporally Consistent Amodal Completion for 3D Human-Object Interaction Reconstruction

[66] Adaptive Diffusion Denoised Smoothing : Certified Robustness via Randomized Smoothing with Differentially Private Guided Denoising Diffusion

[67] An Embedded Real-time Object Alert System for Visually Impaired: A Monocular Depth Estimation based Approach through Computer Vision

[68] HNOSeg-XS: Extremely Small Hartley Neural Operator for Efficient and Resolution-Robust 3D Image Segmentation

[69] SurfDist: Interpretable Three-Dimensional Instance Segmentation Using Curved Surface Patches

[70] Car Object Counting and Position Estimation via Extension of the CLIP-EBC Framework

[71] Transfer Learning and Mixup for Fine-Grained Few-Shot Fungi Classification

[72] Portable Biomechanics Laboratory: Clinically Accessible Movement Analysis from a Handheld Smartphone

[73] Cross-Resolution SAR Target Detection Using Structural Hierarchy Adaptation and Reliable Adjacency Alignment

[74] M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation

[75] Cross-Domain Identity Representation for Skull to Face Matching with Benchmark DataSet

[76] Interpretability-Aware Pruning for Efficient Medical Image Analysis

[77] CoCo-Bot: Energy-based Composable Concept Bottlenecks for Interpretable Generative Models