Motivation: Large language models perform well in general domains but poorly in specialized domains such as law because they lack domain-specific pretraining. Legal texts are also typically long and complex, which makes them hard to process effectively.

Method: Zero-shot experiments on three Indian legal judgment prediction datasets explore three approaches: reorganizing documents by rhetorical role, defining rhetorical roles to introduce legal terminology, and emulating the step-by-step reasoning of courts.

Result: Organizing the data or explaining key legal terms significantly improves model performance, with F1 gains over the baseline ranging from about 1.5% at minimum to 4.36% at maximum.

Conclusion: Presenting information in a structured way and supplying knowledge of legal terminology can strengthen LLM understanding and reasoning in the legal domain without additional domain fine-tuning.

Abstract: Large Language Models (LLMs), trained on extensive datasets from the web, exhibit remarkable general reasoning skills. Despite this, they often struggle in specialized areas like law, mainly because they lack domain-specific pretraining. The legal field presents unique challenges, as legal documents are generally long and intricate, making it hard for models to process the full text efficiently. Previous studies have examined in-context approaches to address the knowledge gap, boosting model performance in new domains without full domain alignment. In our paper, we analyze model behavior on legal tasks by conducting experiments in three areas: (i) reorganizing documents based on rhetorical roles to assess how structured information affects long context processing and model decisions, (ii) defining rhetorical roles to familiarize the model with legal terminology, and (iii) emulating the step-by-step reasoning of courts regarding rhetorical roles to enhance model reasoning. These experiments are conducted in a zero-shot setting across three Indian legal judgment prediction datasets. Our results reveal that organizing data or explaining key legal terms significantly boosts model performance, with a minimum increase of ~1.5% and a maximum improvement of 4.36% in F1 score compared to the baseline.

[58] Intriguing Properties of Dynamic Sampling Networks

Dario Morle,Reid Zaffino

Main category: cs.CV

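TL;DR: This paper proposes a new operator called "warping" that unifies the dynamic sampling methods used in deep learning and analyzes it theoretically, revealing an asymmetry between the forward and backward passes and a fundamental difference from traditional convolution operators. It also examines the conditions for stable training of dynamic sampling networks and the effects of discretization, and introduces a novel loss landscape visualization that uses gradient update information.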

Details

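Motivation: Existing dynamic sampling mechanisms perform well in many computer vision models but lack a unified framework for theoretical analysis. The authors aim to build a general formal tool that connects and explains the different dynamic sampling methods.

Method: Proposes and analyzes a generalized operator called "warping", which can reconstruct several existing architectures (deformable convolutions, active convolutional units, and spatial transformer networks). The operator is analyzed statistically by modeling the inputs as IID variables and as homogeneous random fields. A loss landscape visualization based on gradient updates is also introduced.

Result: Shows that these mechanisms form a class of orthogonal operators entirely different from the traditional translationally invariant operators defined by convolutions; finds a unique asymmetry between the forward and backward passes; identifies the conditions for stable training of dynamic sampling networks; analyzes the statistical impact of discretization; and presents a new loss landscape visualization technique.

Conclusion: Dynamic sampling mechanisms represent a distinct class of operators that calls for a new theoretical perspective for understanding and optimization. The formal framework provided here lays a foundation for the design and analysis of future models of this kind.

Abstract: Dynamic sampling mechanisms in deep learning architectures have demonstrated utility across many computer vision models, though the theoretical analysis of these structures has not yet been unified. In this paper we connect the various dynamic sampling methods by developing and analyzing a novel operator which generalizes existing methods, which we term "warping". Warping provides a minimal implementation of dynamic sampling which is amenable to analysis, and can be used to reconstruct existing architectures including deformable convolutions, active convolutional units, and spatial transformer networks. Using our formalism, we provide statistical analysis of the operator by modeling the inputs as both IID variables and homogeneous random fields. Extending this analysis, we discover a unique asymmetry between the forward and backward pass of the model training. We demonstrate that these mechanisms represent an entirely different class of orthogonal operators to the traditional translationally invariant operators defined by convolutions. With a combination of theoretical analysis and empirical investigation, we find the conditions necessary to ensure stable training of dynamic sampling networks. In addition, statistical analysis of discretization effects is studied. Finally, we introduce a novel loss landscape visualization which utilizes gradient update information directly, to better understand learning behavior.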
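To make the idea of dynamic sampling concrete, here is a minimal one-dimensional sketch in Python. It is not the paper's definition of the warping operator: the function name `warp_1d`, the linear interpolation, and the clamping boundary rule are all assumptions chosen for illustration. It only shows the general pattern shared by deformable convolutions and spatial transformers, namely reading the input at data-dependent, fractional positions.

```python
import math


def warp_1d(signal, offsets):
    """Sample `signal` at positions i + offsets[i] with linear interpolation.

    Hypothetical helper: positions are clamped to the valid index range,
    which is an assumed boundary rule, not one taken from the paper.
    """
    n = len(signal)
    out = []
    for i, delta in enumerate(offsets):
        pos = min(max(i + delta, 0.0), n - 1.0)  # clamp to [0, n - 1]
        lo = int(math.floor(pos))                # left neighbour index
        hi = min(lo + 1, n - 1)                  # right neighbour index
        frac = pos - lo                          # weight of the right neighbour
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out


signal = [0.0, 10.0, 20.0, 30.0]
identity = warp_1d(signal, [0.0, 0.0, 0.0, 0.0])  # zero offsets: plain copy
shifted = warp_1d(signal, [0.5, 0.5, 0.5, 0.5])   # half-step shift to the right
```

With zero offsets the operator reduces to the identity, while non-zero offsets read between grid points. In a network the offsets would themselves be predicted from the input, which is what makes the sampling dynamic.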

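[59] $Δ$-NeRF: Incremental Refinement of Neural Radiance Fields through Residual Control and Knowledge Transfer

Kriti Ghosh,Devjyoti Chakraborty,Lakshmish Ramaswamy,Suchendra M. Bhandarkar,In Kee Kim,Nancy O'Hare,Deepak Mishra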

Main category: cs.CV

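TL;DR: This paper presents Δ-NeRF, a modular residual framework for incremental NeRF refinement, aimed at settings where data arrives sequentially (such as satellite remote sensing). Using a residual controller, an uncertainty-aware gating mechanism, and a view selection strategy, it updates the model efficiently without retraining and without forgetting past information. Knowledge distillation then compresses the model, substantially reducing training time while maintaining strong performance.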

Details

Motivation: Existing NeRF methods usually need to be retrained when new views are added incrementally and are prone to catastrophic forgetting, so they adapt poorly to applications where data arrives sequentially, such as satellite observation. An incremental NeRF framework that supports continual learning is therefore needed.

Method: Proposes Δ-NeRF, which pairs a frozen base NeRF with a residual controller that injects per-layer corrections; introduces an uncertainty-aware gating mechanism that adaptively combines base and refined predictions; designs a view selection strategy that reduces the amount of training data; and uses knowledge distillation to compress the enhanced model into a small student network.

Result: Experiments on satellite imagery show that Δ-NeRF performs comparably to joint training while cutting training time by 30-42%. It improves PSNR by up to 43.5% over naive fine-tuning and surpasses joint training on some metrics. The model can be compressed to 20% of its original size.

Conclusion: Δ-NeRF effectively addresses catastrophic forgetting when updating NeRFs in incremental settings and delivers efficient, compact, high-performing continual 3D scene modeling. It is particularly suited to long time-series observation tasks such as terrain monitoring.

Abstract: Neural Radiance Fields (NeRFs) have demonstrated remarkable capabilities in 3D reconstruction and novel view synthesis. However, most existing NeRF frameworks require complete retraining when new views are introduced incrementally, limiting their applicability in domains where data arrives sequentially. This limitation is particularly problematic in satellite-based terrain analysis, where regions are repeatedly observed over time. Incremental refinement of NeRFs remains underexplored, and naive approaches suffer from catastrophic forgetting when past data is unavailable. We propose $Δ$-NeRF, a unique modular residual framework for incremental NeRF refinement. $Δ$-NeRF introduces several novel techniques including: (1) a residual controller that injects per-layer corrections into a frozen base NeRF, enabling refinement without access to past data; (2) an uncertainty-aware gating mechanism that prevents overcorrection by adaptively combining base and refined predictions; and (3) a view selection strategy that reduces training data by up to 47% while maintaining performance. Additionally, we employ knowledge distillation to compress the enhanced model into a compact student network (20% of original size). Experiments on satellite imagery demonstrate that $Δ$-NeRF achieves performance comparable to joint training while reducing training time by 30-42%. $Δ$-NeRF consistently outperforms existing baselines, achieving an improvement of up to 43.5% in PSNR over naive fine-tuning and surpassing joint training on some metrics.
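The abstract does not give the form of the uncertainty-aware gate, so the following Python sketch is purely illustrative. The exponential gate, the `sharpness` parameter, and the function names `gate` and `blend` are all assumptions. It only demonstrates the stated behaviour: the refined prediction is trusted when its uncertainty is low, and the output falls back to the frozen base prediction when uncertainty is high, which limits overcorrection.

```python
import math


def gate(uncertainty, sharpness=4.0):
    """Hypothetical gate: maps uncertainty in [0, inf) to a weight in (0, 1].

    Low uncertainty gives a weight near 1 (trust the refinement).
    The exponential form is an assumption, not taken from the paper.
    """
    return math.exp(-sharpness * uncertainty)


def blend(base, refined, uncertainty):
    """Convex combination of base and refined predictions, weighted by the gate."""
    g = gate(uncertainty)
    return (1.0 - g) * base + g * refined


confident = blend(base=0.2, refined=0.8, uncertainty=0.0)   # follows the refinement
uncertain = blend(base=0.2, refined=0.8, uncertainty=10.0)  # stays near the base
```

Because the output is a convex combination, it can never move further from the base prediction than the refined prediction does, whatever gate function is used.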

[60] Layer-Aware Video Composition via Split-then-Merge

Ozgur Kara,Yujia Chen,Ming-Hsuan Yang,James M. Rehg,Wen-Sheng Chu,Du Tran

Main category: cs.CV

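TL;DR: Proposes the Split-then-Merge (StM) framework, which improves control and data efficiency in generative video composition by splitting unlabeled videos into layers and self-composing them.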

Details

Motivation: To remove the dependence of generative video composition on annotated data or handcrafted rules and to address its data scarcity problem.

Method: Splits a large corpus of unlabeled videos into dynamic foreground and background layers and learns how dynamic subjects interact with scenes through self-composition; introduces a transformation-aware training pipeline, multi-layer fusion and augmentation, and an identity-preservation loss.

Result: Outperforms existing state-of-the-art methods on quantitative benchmarks and in human and VLLM-based qualitative evaluations.

Conclusion: StM effectively learns complex compositional dynamics in video, enabling more realistic and controllable video generation.

Abstract: We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes a multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show StM outperforms SoTA methods in both quantitative benchmarks and human/VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io

[61] SPHINX: A Synthetic Environment for Visual Perception and Reasoning

Md Tanvirul Alam,Saksham Aggarwal,Justin Yang Chae,Nidhi Rastogi

Main category: cs.CV

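TL;DR: Sphinx is a synthetic environment for visual perception and reasoning comprising 25 task types. Evaluation shows that current large models perform far below humans, while reinforcement learning with verifiable rewards substantially improves performance.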

Details

Motivation: To build a visual reasoning environment that targets core cognitive abilities and provides verifiable ground-truth solutions, supporting precise evaluation and large-scale dataset construction.

Method: Procedurally generates puzzles from motifs, tiles, charts, icons, and geometric primitives across 25 types of visual reasoning task, and applies reinforcement learning with verifiable rewards (RLVR) to improve model performance.

Result: State-of-the-art LVLMs such as GPT-5 reach only 51.1% accuracy on the benchmark, well below human level. With RLVR, model accuracy improves substantially, and gains also appear on external visual reasoning benchmarks.

Conclusion: Sphinx offers a challenging testbed for visual reasoning, and RLVR is a promising approach for improving multimodal reasoning.

Abstract: We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
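The pairing of procedural generation with a verifiable reward can be sketched in a few lines of Python. This is a toy stand-in, not Sphinx's generator: the binary grid, the left-right symmetry question, and the names `make_symmetry_puzzle` and `verifiable_reward` are assumptions made for illustration. The point is that the label is computed from the generated instance, so a reward can be checked exactly without a human grader.

```python
import random


def is_symmetric(grid):
    """True if every row reads the same left-to-right and right-to-left."""
    return all(row == row[::-1] for row in grid)


def make_symmetry_puzzle(rng, size=4):
    """Generate a binary grid plus its ground-truth label (hypothetical task)."""
    half = [[rng.randint(0, 1) for _ in range(size // 2)] for _ in range(size)]
    grid = [row + row[::-1] for row in half]  # symmetric by construction
    if rng.random() < 0.5:
        r, c = rng.randrange(size), rng.randrange(size)
        grid[r][c] ^= 1  # flip one cell to break the symmetry
    return grid, is_symmetric(grid)  # label is derived from the final grid


def verifiable_reward(prediction, label):
    """Binary reward: 1.0 for a correct answer, 0.0 otherwise."""
    return 1.0 if prediction == label else 0.0


rng = random.Random(0)
grid, label = make_symmetry_puzzle(rng)
reward_correct = verifiable_reward(label, label)
reward_wrong = verifiable_reward(not label, label)
```

In an RLVR setup, `prediction` would come from parsing the model's answer, and the scalar reward would feed the policy update.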

[62] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

Samuele Dell'Erba,Andrew D. Bagdanov

Main category: cs.CV

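TL;DR: This paper proposes OVI, a training-free and data-free optimization method that replaces the expensive diffusion prior network in text-to-image generation, and introduces two new constraints that improve generation quality. Experiments show performance comparable to existing state-of-the-art methods.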

Details

Motivation: Existing text-to-image diffusion models rely on prior networks that are computationally expensive and require training on large amounts of data. This paper questions whether such a prior is necessary and explores a more efficient alternative.

Method: Proposes Optimization-based Visual Inversion (OVI), which initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize cosine similarity with the text prompt embedding. Two new constraints, a Mahalanobis-based loss and a Nearest-Neighbor loss, regularize the optimization.

Result: Experiments on Kandinsky 2.2 show that OVI can serve as an alternative to traditional priors. The analysis finds a flaw in current benchmarks such as T2I-CompBench++, where using the text embedding alone as the prior already yields high scores. The proposed constraints, especially the Nearest-Neighbor loss, improve visual fidelity over that baseline, with quantitative scores comparable to or higher than the state-of-the-art data-efficient prior.

Conclusion: OVI shows promise as a training-free replacement for a learned prior, current evaluation benchmarks need to be re-examined, and the direction merits further study.

Abstract: Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with the input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
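The core loop described above (random initialization, then iterative maximization of cosine similarity with a text embedding) can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the authors' implementation: a 4-dimensional vector stands in for the real embedding, the optimizer is plain gradient ascent with an assumed step size, the Mahalanobis and Nearest-Neighbor constraints are omitted, and all function names are hypothetical.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def cosine_grad(z, t):
    """Gradient of cosine(z, t) with respect to z."""
    nz, nt = norm(z), norm(t)
    c = cosine(z, t)
    return [b / (nz * nt) - c * a / (nz * nz) for a, b in zip(z, t)]


def invert(target, steps=300, lr=0.5, seed=0):
    """Gradient ascent on cosine similarity from a random starting vector."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in target]  # random initialization
    start = cosine(z, target)
    for _ in range(steps):
        g = cosine_grad(z, target)
        z = [a + lr * b for a, b in zip(z, g)]  # ascent step
    return z, start, cosine(z, target)


text_embedding = [0.3, -1.2, 0.8, 0.5]  # toy stand-in for a prompt embedding
z, start_sim, final_sim = invert(text_embedding)
```

Unconstrained, this loop simply aligns the latent with the text embedding, which mirrors the abstract's observation that the text embedding alone is a strong baseline on current benchmarks. The paper's two constraints are what pull the result toward the distribution of realistic images.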

Yusuf Dalva,Guocheng Gordon Qian,Maya Goldenberg,Tsai-Shien Chen,Kfir Aberman,Sergey Tulyakov,Pinar Yanardag,Kuan-Chieh Jackson Wang

Main category: cs.CV

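TL;DR: This paper proposes Canvas-to-Image, a unified framework that encodes heterogeneous control signals (text prompts, reference images, spatial layouts, and so on) into a single composite canvas image to achieve high-fidelity, multimodal image generation.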

Details

Motivation: Existing diffusion models struggle to keep generated images faithful and compositional when handling several control signals at once, such as text, reference images, pose, and layout.

Method: Consolidates the control signals into a single canvas and applies a Multi-Task Canvas Training strategy that jointly optimizes the model's understanding and integration of heterogeneous controls within a unified learning paradigm.

Result: Experiments show that the method significantly outperforms existing state-of-the-art methods on multi-task benchmarks (multi-person composition, pose control, layout constraints, and multi-control generation), with particularly strong identity preservation and control adherence.

Conclusion: Canvas-to-Image models diverse user intent precisely, supports high-fidelity image generation in complex scenes, and generalizes well.

Abstract: While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.

Table of Contents

cs.CL

[1] Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability

[2] Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology

[3] A centroid based framework for text classification in itsm environments

[4] PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation

[5] Structured Definitions and Segmentations for Legal Reasoning in LLMs: A Study on Indian Legal Data

[6] MindSET: Advancing Mental Health Benchmarking through Large-Scale Social Media Data

[7] Semantics Meet Signals: Dual Codebook Representation Learning for Generative Recommendation

[8] Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic

[9] Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes

[10] Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches

[11] LLMs-Powered Accurate Extraction, Querying and Intelligent Management of Literature derived 2D Materials Data

[12] Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models

[13] SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

[14] Structured Prompting Enables More Robust, Holistic Evaluation of Language Models

[15] Length-MAX Tokenizer for Language Models

[16] Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

[17] Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation

[18] Emergence and Localisation of Semantic Role Circuits in LLMs

[19] Chatty-KG: A Multi-Agent AI System for On-Demand Conversational Question Answering over Knowledge Graphs

[20] TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models

[21] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels

[22] Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection

[23] Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning

[24] Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models

[25] ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

[26] MortgageLLM: Domain-Adaptive Pretraining with Residual Instruction Transfer, Alignment Tuning, and Task-Specific Routing

[27] Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

[28] Can Finetuning LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?

[29] Developing an Open Conversational Speech Corpus for the Isan Language

[30] PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

[31] Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text

[32] Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model

[33] Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?

[34] Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation

[35] Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning

[36] A Systematic Study of Model Merging Techniques in Large Language Models

[37] Hierarchical Ranking Neural Network for Long Document Readability Assessment

[38] Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

[39] Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects

[40] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions

[41] Auxiliary Metrics Help Decoding Skill Neurons in the Wild

[42] Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

[43] The author is dead, but what if they never lived? A reception experiment on Czech AI- and human-authored poetry

[44] Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

[45] ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

[46] Revisiting Generalization Across Difficulty Levels: It's Not So Easy

cs.CV

[47] Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?

[48] Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

[49] Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection?

[50] DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving

[51] Foundry: Distilling 3D Foundation Models for the Edge

[52] DinoLizer: Learning from the Best for Generative Inpainting Localization

[53] CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design

[54] Text-Guided Semantic Image Encoder

[55] One Patch is All You Need: Joint Surface Material Reconstruction and Classification from Minimal Visual Cues

[56] LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

[57] Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models

[58] Intriguing Properties of Dynamic Sampling Networks

[59] $Δ$-NeRF: Incremental Refinement of Neural Radiance Fields through Residual Control and Knowledge Transfer

[60] Layer-Aware Video Composition via Split-then-Merge

[61] SPHINX: A Synthetic Environment for Visual Perception and Reasoning

[62] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

[63] RefTr: Recurrent Refinement of Confluent Trajectories for 3D Vascular Tree Centerline Graphs

[64] MODEST: Multi-Optics Depth-of-Field Stereo Dataset

[65] Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

[66] Estimating Fog Parameters from a Sequence of Stereo Images

[67] V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

[68] Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation

[69] GaINeR: Geometry-Aware Implicit Network Representation

[70] A deep learning model to reduce agent dose for contrast-enhanced MRI of the cerebellopontine angle cistern

[71] Smooth regularization for efficient video recognition

[72] Open Vocabulary Compositional Explanations for Neuron Alignment

[73] UruDendro4: A Benchmark Dataset for Automatic Tree-Ring Detection in Cross-Section Images of Pinus taeda L

[74] BUSTR: Breast Ultrasound Text Reporting with a Descriptor-Aware Vision-Language Model

[75] Beyond Realism: Learning the Art of Expressive Composition with StickerNet

[76] TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs

[77] Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI