cs.CL

[1] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici,Eric Bieber,Mike Schaekermann,Ice Pasupat,Noveen Sachdeva,Inderjit Dhillon,Marcel Blistein,Ori Ram,Dan Zhang,Evan Rosen,Luke Marris,Sam Petulla,Colin Gaffney,Asaf Aharoni,Nathan Lintz,Tiago Cardal Pais,Henrik Jacobsson,Idan Szpektor,Nan-Jiang Jiang,Krishna Haridasan,Ahmed Omran,Nikunj Saunshi,Dara Bahri,Gaurav Mishra,Eric Chu,Toby Boyd,Brad Hekman,Aaron Parisi,Chaoyi Zhang,Kornraphop Kawintiranon,Tania Bedrax-Weiss,Oliver Wang,Ya Xu,Ollie Purkiss,Uri Mendlovic,Ilaï Deutel,Nam Nguyen,Adam Langley,Flip Korn,Lucia Rossazza,Alexandre Ramé,Sagar Waghmare,Helen Miller,Vaishakh Keshava,Ying Jian,Xiaofan Zhang,Raluca Ada Popa,Kedar Dhamdhere,Blaž Bratanič,Kyuyeun Kim,Terry Koo,Ferran Alet,Yi-ting Chen,Arsha Nagrani,Hannah Muckenhirn,Zhiyuan Zhang,Corbin Quick,Filip Pavetić,Duc Dung Nguyen,Joao Carreira,Michael Elabd,Haroon Qureshi,Fabian Mentzer,Yao-Yuan Yang,Danielle Eisenbud,Anmol Gulati,Ellie Talius,Eric Ni,Sahra Ghalebikesabi,Edouard Yvinec,Alaa Saade,Thatcher Ulrich,Lorenzo Blanco,Dan A. 
Calian,Muhuan Huang,Aäron van den Oord,Naman Goyal,Terry Chen,Praynaa Rawlani,Christian Schallhart,Swachhand Lokhande,Xianghong Luo,Jyn Shan,Ceslee Montgomery,Victoria Krakovna,Federico Piccinini,Omer Barak,Jingyu Cui,Yiling Jia,Mikhail Dektiarev,Alexey Kolganov,Shiyu Huang,Zhe Chen,Xingyu Wang,Jessica Austin,Peter de Boursac,Evgeny Sluzhaev,Frank Ding,Huijian Li,Surya Bhupatiraju,Mohit Agarwal,Sławek Kwasiborski,Paramjit Sandhu,Patrick Siegler,Ahmet Iscen,Eyal Ben-David,Shiraz Butt,Miltos Allamanis,Seth Benjamin,Robert Busa-Fekete,Felix Hernandez-Campos,Sasha Goldshtein,Matt Dibb,Weiyang Zhang,Annie Marsden,Carey Radebaugh,Stephen Roller,Abhishek Nayyar,Jacob Austin,Tayfun Terzi,Bhargav Kanagal Shamanna,Pete Shaw,Aayush Singh,Florian Luisier,Artur Mendonça,Vaibhav Aggarwal,Larisa Markeeva,Claudio Fantacci,Sergey Brin,HyunJeong Choe,Guanyu Wang,Hartwig Adam,Avigail Dabush,Tatsuya Kiyono,Eyal Marcus,Jeremy Cole,Theophane Weber,Hongrae Lee,Ronny Huang,Alex Muzio,Leandro Kieliger,Maigo Le,Courtney Biles,Long Le,Archit Sharma,Chengrun Yang,Avery Lamp,Dave Dopson,Nate Hurley,Katrina,Xu,Zhihao Shan,Shuang Song,Jiewen Tan,Alexandre Senges,George Zhang,Chong You,Yennie Jun,David Raposo,Susanna Ricco,Xuan Yang,Weijie Chen,Prakhar Gupta,Arthur Szlam,Kevin Villela,Chun-Sung Ferng,Daniel Kasenberg,Chen Liang,Rui Zhu,Arunachalam Narayanaswamy,Florence Perot,Paul Pucciarelli,Anna Shekhawat,Alexey Stern,Rishikesh Ingale,Stefani Karp,Sanaz Bahargam,Adrian Goedeckemeyer,Jie Han,Sicheng Li,Andrea Tacchetti,Dian Yu,Abhishek Chakladar,Zhiying Zhang,Mona El Mahdy,Xu Gao,Dale Johnson,Samrat Phatale,AJ Piergiovanni,Hyeontaek Lim,Clement Farabet,Carl Lebsack,Theo Guidroz,John Blitzer,Nico Duduta,David Madras,Steve Li,Daniel von Dincklage,Xin Li,Mahdis Mahdieh,George Tucker,Ganesh Jawahar,Owen Xiao,Danny Tarlow,Robert Geirhos,Noam Velan,Daniel Vlasic,Kalesha Bullard,SK Park,Nishesh Gupta,Kellie Webster,Ayal Hitron,Jieming Mao,Julian Eisenschlos,Laurel Prince,Nina D'Souza,Kelvin Zheng,Sara 
Nasso,Gabriela Botea,Carl Doersch,Caglar Unlu,Chris Alberti,Alexey Svyatkovskiy,Ankita Goel,Krzysztof Choromanski,Pan-Pan Jiang,Richard Nguyen,Four Flynn,Daria Ćurko,Peter Chen,Nicholas Roth,Kieran Milan,Caleb Habtegebriel,Shashi Narayan,Michael Moffitt,Jake Marcus,Thomas Anthony,Brendan McMahan,Gowoon Cheon,Ruibo Liu,Megan Barnes,Lukasz Lew,Rebeca Santamaria-Fernandez,Mayank Upadhyay,Arjun Akula,Arnar Mar Hrafnkelsson,Alvaro Caceres,Andrew Bunner,Michal Sokolik,Subha Puttagunta,Lawrence Moore,Berivan Isik,Weilun Chen,Jay Hartford,Lawrence Chan,Pradeep Shenoy,Dan Holtmann-Rice,Jane Park,Fabio Viola,Alex Salcianu,Sujeevan Rajayogam,Ian Stewart-Binks,Zelin Wu,Richard Everett,Xi Xiong,Pierre-Antoine Manzagol,Gary Leung,Carl Saroufim,Bo Pang,Dawid Wegner,George Papamakarios,Jennimaria Palomaki,Helena Pankov,Guangda Lai,Guilherme Tubone,Shubin Zhao,Theofilos Strinopoulos,Seth Neel,Mingqiu Wang,Joe Kelley,Li Li,Pingmei Xu,Anitha Vijayakumar,Andrea D'olimpio,Omer Levy,Massimo Nicosia,Grigory Rozhdestvenskiy,Ni Lao,Sirui Xie,Yash Katariya,Jon Simon,Sanjiv Kumar,Florian Hartmann,Michael Kilgore,Jinhyuk Lee,Aroma Mahendru,Roman Ring,Tom Hennigan,Fiona Lang,Colin Cherry,David Steiner,Dawsen Hwang,Ray Smith,Pidong Wang,Jeremy Chen,Ming-Hsuan Yang,Sam Kwei,Philippe Schlattner,Donnie Kim,Ganesh Poomal Girirajan,Nikola Momchev,Ayushi Agarwal,Xingyi Zhou,Ilkin Safarli,Zachary Garrett,AJ Pierigiovanni,Sarthak Jauhari,Alif Raditya Rochman,Shikhar Vashishth,Quan Yuan,Christof Angermueller,Jon Blanton,Xinying Song,Nitesh Bharadwaj Gundavarapu,Thi Avrahami,Maxine Deines,Subhrajit Roy,Manish Gupta,Christopher Semturs,Shobha Vasudevan,Aditya Srikanth Veerubhotla,Shriya Sharma,Josh Jacob,Zhen Yang,Andreas Terzis,Dan Karliner,Auriel Wright,Tania Rojas-Esponda,Ashley Brown,Abhijit Guha Roy,Pawan Dogra,Andrei Kapishnikov,Peter Young,Wendy Kan,Vinodh Kumar Rajendran,Maria Ivanova,Salil Deshmukh,Chia-Hua Ho,Mike Kwong,Stav Ginzburg,Annie Louis,KP Sawhney,Slav Petrov,Jing Xie,Yunfei Bai,Georgi 
Stoyanov,Alex Fabrikant,Rajesh Jayaram,Yuqi Li,Joe Heyward,Justin Gilmer,Yaqing Wang,Radu Soricut,Luyang Liu,Qingnan Duan,Jamie Hayes,Maura O'Brien,Gaurav Singh Tomar,Sivan Eiger,Bahar Fatemi,Jeffrey Hui,Catarina Barros,Adaeze Chukwuka,Alena Butryna,Saksham Thakur,Austin Huang,Zhufeng Pan,Haotian Tang,Serkan Cabi,Tulsee Doshi,Michiel Bakker,Sumit Bagri,Ruy Ley-Wild,Adam Lelkes,Jennie Lees,Patrick Kane,David Greene,Shimu Wu,Jörg Bornschein,Gabriela Surita,Sarah Hodkinson,Fangtao Li,Chris Hidey,Sébastien Pereira,Sean Ammirati,Phillip Lippe,Adam Kraft,Pu Han,Sebastian Gerlach,Zifeng Wang,Liviu Panait,Feng Han,Brian Farris,Yingying Bi,Hannah DeBalsi,Miaosen Wang,Gladys Tyen,James Cohan,Susan Zhang,Jarred Barber,Da-Woon Chung,Jaeyoun Kim,Markus Kunesch,Steven Pecht,Nami Akazawa,Abe Friesen,James Lyon,Ali Eslami,Junru Wu,Jie Tan,Yue Song,Ravi Kumar,Chris Welty,Ilia Akolzin,Gena Gibson,Sean Augenstein,Arjun Pillai,Nancy Yuen,Du Phan,Xin Wang,Iain Barr,Heiga Zen,Nan Hua,Casper Liu,Jilei,Wang,Tanuj Bhatia,Hao Xu,Oded Elyada,Pushmeet Kohli,Mirek Olšák,Ke Chen,Azalia Mirhoseini,Noam Shazeer,Shoshana Jakobovits,Maggie Tran,Nolan Ramsden,Tarun Bharti,Fred Alcober,Yunjie Li,Shilpa Shetty,Jing Chen,Dmitry Kalashnikov,Megha Nawhal,Sercan Arik,Hanwen Chen,Michiel Blokzijl,Shubham Gupta,James Rubin,Rigel Swavely,Sophie Bridgers,Ian Gemp,Chen Su,Arun Suggala,Juliette Pluto,Mary Cassin,Alain Vaucher,Kaiyang Ji,Jiahao Cai,Andrew Audibert,Animesh Sinha,David Tian,Efrat Farkash,Amy Hua,Jilin Chen,Duc-Hieu Tran,Edward Loper,Nicole Brichtova,Lara McConnaughey,Ballie Sandhu,Robert Leland,Doug DeCarlo,Andrew Over,James Huang,Xing Wu,Connie Fan,Eric Li,Yun Lei,Deepak Sharma,Cosmin Paduraru,Luo Yu,Matko Bošnjak,Phuong Dao,Min Choi,Sneha Kudugunta,Jakub Adamek,Carlos Guía,Ali Khodaei,Jie Feng,Wenjun Zeng,David Welling,Sandeep Tata,Christina Butterfield,Andrey Vlasov,Seliem El-Sayed,Swaroop Mishra,Tara Sainath,Shentao Yang,RJ Skerry-Ryan,Jeremy Shar,Robert Berry,Arunkumar Rajendran,Arun 
Kandoor,Andrea Burns,Deepali Jain,Tom Stone,Wonpyo Park,Shibo Wang,Albin Cassirer,Guohui Wang,Hayato Kobayashi,Sergey Rogulenko,Vineetha Govindaraj,Mikołaj Rybiński,Nadav Olmert,Colin Evans,Po-Sen Huang,Kelvin Xu,Premal Shah,Terry Thurk,Caitlin Sikora,Mu Cai,Jin Xie,Elahe Dabir,Saloni Shah,Norbert Kalb,Carrie Zhang,Shruthi Prabhakara,Amit Sabne,Artiom Myaskovsky,Vikas Raunak,Blanca Huergo,Behnam Neyshabur,Jon Clark,Ye Zhang,Shankar Krishnan,Eden Cohen,Dinesh Tewari,James Lottes,Yumeya Yamamori,Hui,Li,Mohamed Elhawaty,Ada Maksutaj Oflazer,Adrià Recasens,Sheryl Luo,Duy Nguyen,Taylor Bos,Kalyan Andra,Ana Salazar,Ed Chi,Jeongwoo Ko,Matt Ginsberg,Anders Andreassen,Anian Ruoss,Todor Davchev,Elnaz Davoodi,Chenxi Liu,Min Kim,Santiago Ontanon,Chi Ming To,Dawei Jia,Rosemary Ke,Jing Wang,Anna Korsun,Moran Ambar,Ilya Kornakov,Irene Giannoumis,Toni Creswell,Denny Zhou,Yi Su,Ishaan Watts,Aleksandr Zaks,Evgenii Eltyshev,Ziqiang Feng,Sidharth Mudgal,Alex Kaskasoli,Juliette Love,Kingshuk Dasgupta,Sam Shleifer,Richard Green,Sungyong Seo,Chansoo Lee,Dale Webster,Prakash Shroff,Ganna Raboshchuk,Isabel Leal,James Manyika,Sofia Erell,Daniel Murphy,Zhisheng Xiao,Anton Bulyenov,Julian Walker,Mark Collier,Matej Kastelic,Nelson George,Sushant Prakash,Sailesh Sidhwani,Alexey Frolov,Steven Hansen,Petko Georgiev,Tiberiu Sosea,Chris Apps,Aishwarya Kamath,David Reid,Emma Cooney,Charlotte Magister,Oriana Riva,Alec Go,Pu-Chin Chen,Sebastian Krause,Nir Levine,Marco Fornoni,Ilya Figotin,Nick Roy,Parsa Mahmoudieh,Vladimir Magay,Mukundan Madhavan,Jin Miao,Jianmo Ni,Yasuhisa Fujii,Ian Chou,George Scrivener,Zak Tsai,Siobhan Mcloughlin,Jeremy Selier,Sandra Lefdal,Jeffrey Zhao,Abhijit Karmarkar,Kushal Chauhan,Shivanker Goel,Zhaoyi Zhang,Vihan Jain,Parisa Haghani,Mostafa Dehghani,Jacob Scott,Erin Farnese,Anastasija Ilić,Steven Baker,Julia Pawar,Li Zhong,Josh Camp,Yoel Zeldes,Shravya Shetty,Anand Iyer,Vít Listík,Jiaxian Guo,Luming Tang,Mark Geller,Simon Bucher,Yifan Ding,Hongzhi Shi,Carrie Muir,Dominik 
Grewe,Ramy Eskander,Octavio Ponce,Boqing Gong,Derek Gasaway,Samira Khan,Umang Gupta,Angelos Filos,Weicheng Kuo,Klemen Kloboves,Jennifer Beattie,Christian Wright,Leon Li,Alicia Jin,Sandeep Mariserla,Miteyan Patel,Jens Heitkaemper,Dilip Krishnan,Vivek Sharma,David Bieber,Christian Frank,John Lambert,Paul Caron,Martin Polacek,Mai Giménez,Himadri Choudhury,Xing Yu,Sasan Tavakkol,Arun Ahuja,Franz Och,Rodolphe Jenatton,Wojtek Skut,Bryan Richter,David Gaddy,Andy Ly,Misha Bilenko,Megh Umekar,Ethan Liang,Martin Sevenich,Mandar Joshi,Hassan Mansoor,Rebecca Lin,Sumit Sanghai,Abhimanyu Singh,Xiaowei Li,Sudheendra Vijayanarasimhan,Zaheer Abbas,Yonatan Bitton,Hansa Srinivasan,Manish Reddy Vuyyuru,Alexander Frömmgen,Yanhua Sun,Ralph Leith,Alfonso Castaño,DJ Strouse,Le Yan,Austin Kyker,Satish Kambala,Mary Jasarevic,Thibault Sellam,Chao Jia,Alexander Pritzel,Raghavender R,Huizhong Chen,Natalie Clay,Sudeep Gandhe,Sean Kirmani,Sayna Ebrahimi,Hannah Kirkwood,Jonathan Mallinson,Chao Wang,Adnan Ozturel,Kuo Lin,Shyam Upadhyay,Vincent Cohen-Addad,Sean Purser-haskell,Yichong Xu,Ebrahim Songhori,Babi Seal,Alberto Magni,Almog Gueta,Tingting Zou,Guru Guruganesh,Thais Kagohara,Hung Nguyen,Khalid Salama,Alejandro Cruzado Ruiz,Justin Frye,Zhenkai Zhu,Matthias Lochbrunner,Simon Osindero,Wentao Yuan,Lisa Lee,Aman Prasad,Lam Nguyen Thiet,Daniele Calandriello,Victor Stone,Qixuan Feng,Han Ke,Maria Voitovich,Geta Sampemane,Lewis Chiang,Ling Wu,Alexander Bykovsky,Matt Young,Luke Vilnis,Ishita Dasgupta,Aditya Chawla,Qin Cao,Bowen Liang,Daniel Toyama,Szabolcs Payrits,Anca Stefanoiu,Dimitrios Vytiniotis,Ankesh Anand,Tianxiao Shen,Blagoj Mitrevski,Michael Tschannen,Sreenivas Gollapudi,Aishwarya P S,José Leal,Zhe Shen,Han Fu,Wei Wang,Arvind Kannan,Doron Kukliansky,Sergey Yaroshenko,Svetlana Grant,Umesh Telang,David Wood,Alexandra Chronopoulou,Alexandru Ţifrea,Tao Zhou,Tony,Nguy\~ên,Muge Ersoy,Anima Singh,Meiyan Xie,Emanuel Taropa,Woohyun Han,Eirikur Agustsson,Andrei Sozanschi,Hui Peng,Alex Chen,Yoel 
Drori,Efren Robles,Yang Gao,Xerxes Dotiwalla,Ying Chen,Anudhyan Boral,Alexei Bendebury,John Nham,Chris Tar,Luis Castro,Jiepu Jiang,Canoee Liu,Felix Halim,Jinoo Baek,Andy Wan,Jeremiah Liu,Yuan Cao,Shengyang Dai,Trilok Acharya,Ruoxi Sun,Fuzhao Xue,Saket Joshi,Morgane Lustman,Yongqin Xian,Rishabh Joshi,Deep Karkhanis,Nora Kassner,Jamie Hall,Xiangzhuo Ding,Gan Song,Gang Li,Chen Zhu,Yana Kulizhskaya,Bin Ni,Alexey Vlaskin,Solomon Demmessie,Lucio Dery,Salah Zaiem,Yanping Huang,Cindy Fan,Felix Gimeno,Ananth Balashankar,Koji Kojima,Hagai Taitelbaum,Maya Meng,Dero Gharibian,Sahil Singla,Wei Chen,Ambrose Slone,Guanjie Chen,Sujee Rajayogam,Max Schumacher,Suyog Kotecha,Rory Blevins,Qifei Wang,Mor Hazan Taege,Alex Morris,Xin Liu,Fayaz Jamil,Richard Zhang,Pratik Joshi,Ben Ingram,Tyler Liechty,Ahmed Eleryan,Scott Baird,Alex Grills,Gagan Bansal,Shan Han,Kiran Yalasangi,Shawn Xu,Majd Al Merey,Isabel Gao,Felix Weissenberger,Igor Karpov,Robert Riachi,Ankit Anand,Gautam Prasad,Kay Lamerigts,Reid Hayes,Jamie Rogers,Mandy Guo,Ashish Shenoy,Qiong,Hu,Kyle He,Yuchen Liu,Polina Zablotskaia,Sagar Gubbi,Yifan Chang,Jay Pavagadhi,Kristian Kjems,Archita Vadali,Diego Machado,Yeqing Li,Renshen Wang,Dipankar Ghosh,Aahil Mehta,Dana Alon,George Polovets,Alessio Tonioni,Nate Kushman,Joel D'sa,Lin Zhuo,Allen Wu,Rohin Shah,John Youssef,Jiayu Ye,Justin Snyder,Karel Lenc,Senaka Buthpitiya,Matthew Tung,Jichuan Chang,Tao Chen,David Saxton,Jenny Lee,Lydia Lihui Zhang,James Qin,Prabakar Radhakrishnan,Maxwell Chen,Piotr Ambroszczyk,Metin Toksoz-Exley,Yan Zhong,Nitzan Katz,Brendan O'Donoghue,Tamara von Glehn,Adi Gerzi Rosenthal,Aga Świetlik,Xiaokai Zhao,Nick Fernando,Jinliang Wei,Jieru Mei,Sergei Vassilvitskii,Diego Cedillo,Pranjal Awasthi,Hui Zheng,Koray Kavukcuoglu,Itay Laish,Joseph Pagadora,Marc Brockschmidt,Christopher A. 
Choquette-Choo,Arunkumar Byravan,Yifeng Lu,Xu Chen,Mia Chen,Kenton Lee,Rama Pasumarthi,Sijal Bhatnagar,Aditya Shah,Qiyin Wu,Zhuoyuan Chen,Zack Nado,Bartek Perz,Zixuan Jiang,David Kao,Ganesh Mallya,Nino Vieillard,Lantao Mei,Sertan Girgin,Mandy Jordan,Yeongil Ko,Alekh Agarwal,Yaxin Liu,Yasemin Altun,Raoul de Liedekerke,Anastasios Kementsietsidis,Daiyi Peng,Dangyi Liu,Utku Evci,Peter Humphreys,Austin Tarango,Xiang Deng,Yoad Lewenberg,Kevin Aydin,Chengda Wu,Bhavishya Mittal,Tsendsuren Munkhdalai,Kleopatra Chatziprimou,Rodrigo Benenson,Uri First,Xiao Ma,Jinning Li,Armand Joulin,Hamish Tomlinson,Tingnan Zhang,Milad Nasr,Zhi Hong,Michaël Sander,Lisa Anne Hendricks,Anuj Sharma,Andrew Bolt,Eszter Vértes,Jiri Simsa,Tomer Levinboim,Olcan Sercinoglu,Divyansh Shukla,Austin Wu,Craig Swanson,Danny Vainstein,Fan Bu,Bo Wang,Ryan Julian,Charles Yoon,Sergei Lebedev,Antonious Girgis,Bernd Bandemer,David Du,Todd Wang,Xi Chen,Ying Xiao,Peggy Lu,Natalie Ha,Vlad Ionescu,Simon Rowe,Josip Matak,Federico Lebron,Andreas Steiner,Lalit Jain,Manaal Faruqui,Nicolas Lacasse,Georgie Evans,Neesha Subramaniam,Dean Reich,Giulia Vezzani,Aditya Pandey,Joe Stanton,Tianhao Zhou,Liam McCafferty,Henry Griffiths,Verena Rieser,Soheil Hassas Yeganeh,Eleftheria Briakou,Lu Huang,Zichuan Wei,Liangchen Luo,Erik Jue,Gabby Wang,Victor Cotruta,Myriam Khan,Jongbin Park,Qiuchen Guo,Peiran Li,Rong Rong,Diego Antognini,Anastasia Petrushkina,Chetan Tekur,Eli Collins,Parul Bhatia,Chester Kwak,Wenhu Chen,Arvind Neelakantan,Immanuel Odisho,Sheng Peng,Vincent Nallatamby,Vaibhav Tulsyan,Fabian Pedregosa,Peng Xu,Raymond Lin,Yulong Wang,Emma Wang,Sholto Douglas,Reut Tsarfaty,Elena Gribovskaya,Renga Aravamudhan,Manu Agarwal,Mara Finkelstein,Qiao Zhang,Elizabeth Cole,Phil Crone,Sarmishta Velury,Anil Das,Chris Sauer,Luyao Xu,Danfeng Qin,Chenjie Gu,Dror Marcus,CJ Zheng,Wouter Van Gansbeke,Sobhan Miryoosefi,Haitian Sun,YaGuang Li,Charlie Chen,Jae Yoo,Pavel Dubov,Alex Tomala,Adams Yu,Paweł Wesołowski,Alok Gunjan,Eddie Cao,Jiaming 
Luo,Nikhil Sethi,Arkadiusz Socala,Laura Graesser,Tomas Kocisky,Arturo BC,Minmin Chen,Edward Lee,Sophie Wang,Weize Kong,Qiantong Xu,Nilesh Tripuraneni,Yiming Li,Xinxin Yu,Allen Porter,Paul Voigtlaender,Biao Zhang,Arpi Vezer,Sarah York,Qing Wei,Geoffrey Cideron,Mark Kurzeja,Seungyeon Kim,Benny Li,Angéline Pouget,Hyo Lee,Kaspar Daugaard,Yang Li,Dave Uthus,Aditya Siddhant,Paul Cavallaro,Sriram Ganapathy,Maulik Shah,Rolf Jagerman,Jeff Stanway,Piermaria Mendolicchio,Li Xiao,Kayi Lee,Tara Thompson,Shubham Milind Phal,Jason Chase,Sun Jae Lee,Adrian N Reyes,Disha Shrivastava,Zhen Qin,Roykrong Sukkerd,Seth Odoom,Lior Madmoni,John Aslanides,Jonathan Herzig,Elena Pochernina,Sheng Zhang,Parker Barnes,Daisuke Ikeda,Qiujia Li,Shuo-yiin Chang,Shakir Mohamed,Jim Sproch,Richard Powell,Bidisha Samanta,Domagoj Ćevid,Anton Kovsharov,Shrestha Basu Mallick,Srinivas Tadepalli,Anne Zheng,Kareem Ayoub,Andreas Noever,Christian Reisswig,Zhuo Xu,Junhyuk Oh,Martin Matysiak,Tim Blyth,Shereen Ashraf,Julien Amelot,Boone Severson,Michele Bevilacqua,Motoki Sano,Ethan Dyer,Ofir Roval,Anu Sinha,Yin Zhong,Sagi Perel,Tea Sabolić,Johannes Mauerer,Willi Gierke,Mauro Verzetti,Rodrigo Cabrera,Alvin Abdagic,Steven Hemingray,Austin Stone,Jong Lee,Farooq Ahmad,Karthik Raman,Lior Shani,Jonathan Lai,Orhan Firat,Nathan Waters,Eric Ge,Mo Shomrat,Himanshu Gupta,Rajeev Aggarwal,Tom Hudson,Bill Jia,Simon Baumgartner,Palak Jain,Joe Kovac,Junehyuk Jung,Ante Žužul,Will Truong,Morteza Zadimoghaddam,Songyou Peng,Marco Liang,Rachel Sterneck,Balaji Lakshminarayanan,Machel Reid,Oliver Woodman,Tong Zhou,Jianling Wang,Vincent Coriou,Arjun Narayanan,Jay Hoover,Yenai Ma,Apoorv Jindal,Clayton Sanford,Doug Reid,Swaroop Ramaswamy,Alex Kurakin,Roland Zimmermann,Yana Lunts,Dragos Dena,Zalán Borsos,Vered Cohen,Shujian Zhang,Will Grathwohl,Robert Dadashi,Morgan Redshaw,Joshua Kessinger,Julian Odell,Silvano Bonacina,Zihang Dai,Grace Chen,Ayush Dubey,Pablo Sprechmann,Mantas Pajarskas,Wenxuan Zhou,Niharika Ahuja,Tara Thomas,Martin 
Nikoltchev,Matija Kecman,Bharath Mankalale,Andrey Ryabtsev,Jennifer She,Christian Walder,Jiaming Shen,Lu Li,Carolina Parada,Sheena Panthaplackel,Okwan Kwon,Matt Lawlor,Utsav Prabhu,Yannick Schroecker,Marc'aurelio Ranzato,Pete Blois,Iurii Kemaev,Ting Yu,Dmitry,Lepikhin,Hao Xiong,Sahand Sharifzadeh,Oleaser Johnson,Jeremiah Willcock,Rui Yao,Greg Farquhar,Sujoy Basu,Hidetoshi Shimokawa,Nina Anderson,Haiguang Li,Khiem Pham,Yizhong Liang,Sebastian Borgeaud,Alexandre Moufarek,Hideto Kazawa,Blair Kutzman,Marcin Sieniek,Sara Smoot,Ruth Wang,Natalie Axelsson,Nova Fallen,Prasha Sundaram,Yuexiang Zhai,Varun Godbole,Petros Maniatis,Alek Wang,Ilia Shumailov,Santhosh Thangaraj,Remi Crocker,Nikita Gupta,Gang Wu,Phil Chen,Gellért Weisz,Celine Smith,Mojtaba Seyedhosseini,Boya Fang,Xiyang Luo,Roey Yogev,Zeynep Cankara,Andrew Hard,Helen Ran,Rahul Sukthankar,George Necula,Gaël Liu,Honglong Cai,Praseem Banzal,Daniel Keysers,Sanjay Ghemawat,Connie Tao,Emma Dunleavy,Aditi Chaudhary,Wei Li,Maciej Mikuła,Chen-Yu Lee,Tiziana Refice,Krishna Somandepalli,Alexandre Fréchette,Dan Bahir,John Karro,Keith Rush,Sarah Perrin,Bill Rosgen,Xiaomeng Yang,Clara Huiyi Hu,Mahmoud Alnahlawi,Justin Mao-Jones,Roopal Garg,Hoang Nguyen,Bat-Orgil Batsaikhan,Iñaki Iturrate,Anselm Levskaya,Avi Singh,Ashyana Kachra,Tony Lu,Denis Petek,Zheng Xu,Mark Graham,Lukas Zilka,Yael Karov,Marija Kostelac,Fangyu Liu,Yaohui Guo,Weiyue Wang,Bernd Bohnet,Emily Pitler,Tony Bruguier,Keisuke Kinoshita,Chrysovalantis Anastasiou,Nilpa Jha,Ting Liu,Jerome Connor,Phil Wallis,Philip Pham,Eric Bailey,Shixin Li,Heng-Tze Cheng,Sally Ma,Haiqiong Li,Akanksha Maurya,Kate Olszewska,Manfred Warmuth,Christy Koh,Dominik Paulus,Siddhartha Reddy Jonnalagadda,Enrique Piqueras,Ali Elqursh,Geoff Brown,Hadar Shemtov,Loren Maggiore,Fei Xia,Ryan Foley,Beka Westberg,George van den Driessche,Livio Baldini Soares,Arjun Kar,Michael Quinn,Siqi Zuo,Jialin Wu,Kyle Kastner,Anna Bortsova,Aijun Bai,Ales Mikhalap,Luowei Zhou,Jennifer Brennan,Vinay Ramasesh,Honglei 
Zhuang,John Maggs,Johan Schalkwyk,Yuntao Xu,Hui Huang,Andrew Howard,Sasha Brown,Linting Xue,Gloria Shen,Brian Albert,Neha Jha,Daniel Zheng,Varvara Krayvanova,Spurthi Amba Hombaiah,Olivier Lacombe,Gautam Vasudevan,Dan Graur,Tian Xie,Meet Gandhi,Bangju Wang,Dustin Zelle,Harman Singh,Dahun Kim,Sébastien Cevey,Victor Ungureanu,Natasha Noy,Fei Liu,Annie Xie,Fangxiaoyu Feng,Katerina Tsihlas,Daniel Formoso,Neera Vats,Quentin Wellens,Yinan Wang,Niket Kumar Bhumihar,Samrat Ghosh,Matt Hoffman,Tom Lieber,Oran Lang,Kush Bhatia,Tom Paine,Aroonalok Pyne,Ronny Votel,Madeleine Clare Elish,Benoit Schillings,Alex Panagopoulos,Haichuan Yang,Adam Raveret,Zohar Yahav,Shuang Liu,Warren Chen,Dalia El Badawy,Nishant Agrawal,Mohammed Badawi,Mahdi Mirzazadeh,Carla Bromberg,Fan Ye,Chang Liu,Tatiana Sholokhova,George-Cristian Muraru,Gargi Balasubramaniam,Jonathan Malmaud,Alen Carin,Danilo Martins,Irina Jurenka,Pankil Botadra,Dave Lacey,Richa Singh,Mariano Schain,Dan Zheng,Isabelle Guyon,Victor Lavrenko,Seungji Lee,Xiang Zhou,Demis Hassabis,Jeshwanth Challagundla,Derek Cheng,Nikhil Mehta,Matthew Mauger,Michela Paganini,Pushkar Mishra,Kate Lee,Zhang Li,Lexi Baugher,Ondrej Skopek,Max Chang,Amir Zait,Gaurav Menghani,Lizzetth Bellot,Guangxing Han,Jean-Michel Sarr,Sharat Chikkerur,Himanshu Sahni,Rohan Anil,Arun Narayanan,Chandu Thekkath,Daniele Pighin,Hana Strejček,Marko Velic,Fred Bertsch,Manuel Tragut,Keran Rong,Alicia Parrish,Kai Bailey,Jiho Park,Isabela Albuquerque,Abhishek Bapna,Rajesh Venkataraman,Alec Kosik,Johannes Griesser,Zhiwei Deng,Alek Andreev,Qingyun Dou,Kevin Hui,Fanny Wei,Xiaobin Yu,Lei Shu,Avia Aharon,David Barker,Badih Ghazi,Sebastian Flennerhag,Chris Breaux,Yuchuan Liu,Matthew Bilotti,Josh Woodward,Uri Alon,Stephanie Winkler,Tzu-Kuo Huang,Kostas Andriopoulos,João Gabriel Oliveira,Penporn Koanantakool,Berkin Akin,Michael Wunder,Cicero Nogueira dos Santos,Mohammad Hossein Bateni,Lin Yang,Dan Horgan,Beer Changpinyo,Keyvan Amiri,Min Ma,Dayeong Lee,Lihao Liang,Anirudh Baddepudi,Tejasi 
Latkar,Raia Hadsell,Jun Xu,Hairong Mu,Michael Han,Aedan Pope,Snchit Grover,Frank Kim,Ankit Bhagatwala,Guan Sun,Yamini Bansal,Amir Globerson,Alireza Nazari,Samira Daruki,Hagen Soltau,Jane Labanowski,Laurent El Shafey,Matt Harvey,Yanif Ahmad,Elan Rosenfeld,William Kong,Etienne Pot,Yi-Xuan Tan,Aurora Wei,Victoria Langston,Marcel Prasetya,Petar Veličković,Richard Killam,Robin Strudel,Darren Ni,Zhenhai Zhu,Aaron Archer,Kavya Kopparapu,Lynn Nguyen,Emilio Parisotto,Hussain Masoom,Sravanti Addepalli,Jordan Grimstad,Hexiang Hu,Joss Moore,Avinatan Hassidim,Le Hou,Mukund Raghavachari,Jared Lichtarge,Adam R. Brown,Hilal Dib,Natalia Ponomareva,Justin Fu,Yujing Zhang,Altaf Rahman,Joana Iljazi,Edouard Leurent,Gabriel Dulac-Arnold,Cosmo Du,Chulayuth Asawaroengchai,Larry Jin,Ela Gruzewska,Ziwei Ji,Benigno Uria,Daniel De Freitas,Paul Barham,Lauren Beltrone,Víctor Campos,Jun Yan,Neel Kovelamudi,Arthur Nguyen,Elinor Davies,Zhichun Wu,Zoltan Egyed,Kristina Toutanova,Nithya Attaluri,Hongliang Fei,Peter Stys,Siddhartha Brahma,Martin Izzard,Siva Velusamy,Scott Lundberg,Vincent Zhuang,Kevin Sequeira,Adam Santoro,Ehsan Amid,Ophir Aharoni,Shuai Ye,Mukund Sundararajan,Lijun Yu,Yu-Cheng Ling,Stephen Spencer,Hugo Song,Josip Djolonga,Christo Kirov,Sonal Gupta,Alessandro Bissacco,Clemens Meyer,Mukul Bhutani,Andrew Dai,Weiyi Wang,Siqi Liu,Ashwin Sreevatsa,Qijun Tan,Maria Wang,Lucy Kim,Yicheng Wang,Alex Irpan,Yang Xiao,Stanislav Fort,Yifan He,Alex Gurney,Bryan Gale,Yue Ma,Monica Roy,Viorica Patraucean,Taylan Bilal,Golnaz Ghiasi,Anahita Hosseini,Melvin Johnson,Zhuowan Li,Yi Tay,Benjamin Beyret,Katie Millican,Josef Broder,Mayank Lunayach,Danny Swisher,Eugen Vušak,David Parkinson,MH Tessler,Adi Mayrav Gilady,Richard Song,Allan Dafoe,Yves Raimond,Masa Yamaguchi,Itay Karo,Elizabeth Nielsen,Kevin Kilgour,Mike Dusenberry,Rajiv Mathews,Jiho Choi,Siyuan Qiao,Harsh Mehta,Sahitya Potluri,Chris Knutsen,Jialu Liu,Tat Tan,Kuntal Sengupta,Keerthana Gopalakrishnan,Abodunrinwa Toki,Mencher Chiang,Mike Burrows,Grace 
Vesom,Zafarali Ahmed,Ilia Labzovsky,Siddharth Vashishtha,Preeti Singh,Ankur Sharma,Ada Ma,Jinyu Xie,Pranav Talluri,Hannah Forbes-Pollard,Aarush Selvan,Joel Wee,Loic Matthey,Tom Funkhouser,Parthasarathy Gopavarapu,Lev Proleev,Cheng Li,Matt Thomas,Kashyap Kolipaka,Zhipeng Jia,Ashwin Kakarla,Srinivas Sunkara,Joan Puigcerver,Suraj Satishkumar Sheth,Emily Graves,Chen Wang,Sadh MNM Khan,Kai Kang,Shyamal Buch,Fred Zhang,Omkar Savant,David Soergel,Kevin Lee,Linda Friso,Xuanyi Dong,Rahul Arya,Shreyas Chandrakaladharan,Connor Schenck,Greg Billock,Tejas Iyer,Anton Bakalov,Leslie Baker,Alex Ruiz,Angad Chandorkar,Trieu Trinh,Matt Miecnikowski,Yanqi Zhou,Yangsibo Huang,Jiazhong Nie,Ali Shah,Ashish Thapliyal,Sam Haves,Lun Wang,Uri Shaham,Patrick Morris-Suzuki,Soroush Radpour,Leonard Berrada,Thomas Strohmann,Chaochao Yan,Jingwei Shen,Sonam Goenka,Tris Warkentin,Petar Dević,Dan Belov,Albert Webson,Madhavi Yenugula,Puranjay Datta,Jerry Chang,Nimesh Ghelani,Aviral Kumar,Vincent Perot,Jessica Lo,Yang Song,Herman Schmit,Jianmin Chen,Vasilisa Bashlovkina,Xiaoyue Pan,Diana Mincu,Paul Roit,Isabel Edkins,Andy Davis,Yujia Li,Ben Horn,Xinjian Li,Pradeep Kumar S,Eric Doi,Wanzheng Zhu,Sri Gayatri Sundara Padmanabhan,Siddharth Verma,Jasmine Liu,Heng Chen,Mihajlo Velimirović,Malcolm Reynolds,Priyanka Agrawal,Nick Sukhanov,Abhinit Modi,Siddharth Goyal,John Palowitch,Nima Khajehnouri,Wing Lowe,David Klinghoffer,Sharon Silver,Vinh Tran,Candice Schumann,Francesco Piccinno,Xi Liu,Mario Lučić,Xiaochen Yang,Sandeep Kumar,Ajay Kannan,Ragha Kotikalapudi,Mudit Bansal,Fabian Fuchs,Javad Hosseini,Abdelrahman Abdelhamed,Dawn Bloxwich,Tianhe Yu,Ruoxin Sang,Gregory Thornton,Karan Gill,Yuchi Liu,Virat Shejwalkar,Jason Lin,Zhipeng Yan,Kehang Han,Thomas Buschmann,Michael Pliskin,Zhi Xing,Susheel Tatineni,Junlin Zhang,Sissie Hsiao,Gavin Buttimore,Marcus Wu,Zefei Li,Geza Kovacs,Legg Yeung,Tao Huang,Aaron Cohen,Bethanie Brownfield,Averi Nowak,Mikel Rodriguez,Tianze Shi,Hado van Hasselt,Kevin Cen,Deepanway 
Ghoshal,Kushal Majmundar,Weiren Yu,Warren,Chen,Danila Sinopalnikov,Hao Zhang,Vlado Galić,Di Lu,Zeyu Zheng,Maggie Song,Gary Wang,Gui Citovsky,Swapnil Gawde,Isaac Galatzer-Levy,David Silver,Ivana Balazevic,Dipanjan Das,Kingshuk Majumder,Yale Cong,Praneet Dutta,Dustin Tran,Hui Wan,Junwei Yuan,Daniel Eppens,Alanna Walton,Been Kim,Harry Ragan,James Cobon-Kerr,Lu Liu,Weijun Wang,Bryce Petrini,Jack Rae,Rakesh Shivanna,Yan Xiong,Chace Lee,Pauline Coquinot,Yiming Gu,Lisa Patel,Blake Hechtman,Aviel Boag,Orion Jankowski,Alex Wertheim,Alex Lee,Paul Covington,Hila Noga,Sam Sobell,Shanthal Vasanth,William Bono,Chirag Nagpal,Wei Fan,Xavier Garcia,Kedar Soparkar,Aybuke Turker,Nathan Howard,Sachit Menon,Yuankai Chen,Vikas Verma,Vladimir Pchelin,Harish Rajamani,Valentin Dalibard,Ana Ramalho,Yang Guo,Kartikeya Badola,Seojin Bang,Nathalie Rauschmayr,Julia Proskurnia,Sudeep Dasari,Xinyun Chen,Mikhail Sushkov,Anja Hauth,Pauline Sho,Abhinav Singh,Bilva Chandra,Allie Culp,Max Dylla,Olivier Bachem,James Besley,Heri Zhao,Timothy Lillicrap,Wei Wei,Wael Al Jishi,Ning Niu,Alban Rrustemi,Raphaël Lopez Kaufman,Ryan Poplin,Jewel Zhao,Minh Truong,Shikhar Bharadwaj,Ester Hlavnova,Eli Stickgold,Cordelia Schmid,Georgi Stephanov,Zhaoqi Leng,Frederick Liu,Léonard Hussenot,Shenil Dodhia,Juliana Vicente Franco,Lesley Katzen,Abhanshu Sharma,Sarah Cogan,Zuguang Yang,Aniket Ray,Sergi Caelles,Shen Yan,Ravin Kumar,Daniel Gillick,Renee Wong,Joshua Ainslie,Jonathan Hoech,Séb Arnold,Dan Abolafia,Anca Dragan,Ben Hora,Grace Hu,Alexey Guseynov,Yang Lu,Chas Leichner,Jinmeng Rao,Abhimanyu Goyal,Nagabhushan Baddi,Daniel Hernandez Diaz,Tim McConnell,Max Bain,Jake Abernethy,Qiqi Yan,Rylan Schaeffer,Paul Vicol,Will Thompson,Montse Gonzalez Arenas,Mathias Bellaiche,Pablo Barrio,Stefan Zinke,Riccardo Patana,Pulkit Mehta,JK Kearns,Avraham Ruderman,Scott Pollom,David D'Ambrosio,Cath Hope,Yang Yu,Andrea Gesmundo,Kuang-Huei Lee,Aviv Rosenberg,Yiqian Zhou,Yaoyiran Li,Drew Garmon,Yonghui Wu,Safeen Huda,Gil Fidel,Martin 
Baeuml,Jian Li,Phoebe Kirk,Rhys May,Tao Tu,Sara Mc Carthy,Toshiyuki Fukuzawa,Miranda Aperghis,Chih-Kuan Yeh,Toshihiro Yoshino,Bo Li,Austin Myers,Kaisheng Yao,Ben Limonchik,Changwan Ryu,Rohun Saxena,Alex Goldin,Ruizhe Zhao,Rocky Rhodes,Tao Zhu,Divya Tyam,Heidi Howard,Nathan Byrd,Hongxu Ma,Yan Wu,Ryan Mullins,Qingze Wang,Aida Amini,Sebastien Baur,Yiran Mao,Subhashini Venugopalan,Will Song,Wen Ding,Paul Collins,Sashank Reddi,Megan Shum,Andrei Rusu,Luisa Zintgraf,Kelvin Chan,Sheela Goenka,Mathieu Blondel,Michael Collins,Renke Pan,Marissa Giustina,Nikolai Chinaev,Christian Schuler,Ce Zheng,Jonas Valfridsson,Alyssa Loo,Alex Yakubovich,Jamie Smith,Tao Jiang,Rich Munoz,Gabriel Barcik,Rishabh Bansal,Mingyao Yang,Yilun Du,Pablo Duque,Mary Phuong,Alexandra Belias,Kunal Lad,Zeyu Liu,Tal Schuster,Karthik Duddu,Jieru Hu,Paige Kunkle,Matthew Watson,Jackson Tolins,Josh Smith,Denis Teplyashin,Garrett Bingham,Marvin Ritter,Marco Andreetto,Divya Pitta,Mohak Patel,Shashank Viswanadha,Trevor Strohman,Catalin Ionescu,Jincheng Luo,Yogesh Kalley,Jeremy Wiesner,Dan Deutsch,Derek Lockhart,Peter Choy,Rumen Dangovski,Chawin Sitawarin,Cat Graves,Tanya Lando,Joost van Amersfoort,Ndidi Elue,Zhouyuan Huo,Pooya Moradi,Jean Tarbouriech,Henryk Michalewski,Wenting Ye,Eunyoung Kim,Alex Druinsky,Florent Altché,Xinyi Chen,Artur Dwornik,Da-Cheng Juan,Rivka Moroshko,Horia Toma,Jarrod Kahn,Hai Qian,Maximilian Sieb,Irene Cai,Roman Goldenberg,Praneeth Netrapalli,Sindhu Raghuram,Yuan Gong,Lijie Fan,Evan Palmer,Yossi Matias,Valentin Gabeur,Shreya Pathak,Tom Ouyang,Don Metzler,Geoff Bacon,Srinivasan Venkatachary,Sridhar Thiagarajan,Alex Cullum,Eran Ofek,Vytenis Sakenas,Mohamed Hammad,Cesar Magalhaes,Mayank Daswani,Oscar Chang,Ashok Popat,Ruichao Li,Komal Jalan,Yanhan Hou,Josh Lipschultz,Antoine He,Wenhao Jia,Pier Giuseppe Sessa,Prateek Kolhar,William Wong,Sumeet Singh,Lukas Haas,Jay Whang,Hanna Klimczak-Plucińska,Georges Rotival,Grace Chung,Yiqing Hua,Anfal Siddiqui,Nicolas Serrano,Dongkai Chen,Billy 
Porter,Libin Bai,Keshav Shivam,Sho Arora,Partha Talukdar,Tom Cobley,Sangnie Bhardwaj,Evgeny Gladchenko,Simon Green,Kelvin Guu,Felix Fischer,Xiao Wu,Eric Wang,Achintya Singhal,Tatiana Matejovicova,James Martens,Hongji Li,Roma Patel,Elizabeth Kemp,Jiaqi Pan,Lily Wang,Blake JianHang Chen,Jean-Baptiste Alayrac,Navneet Potti,Erika Gemzer,Eugene Ie,Kay McKinney,Takaaki Saeki,Edward Chou,Pascal Lamblin,SQ Mah,Zach Fisher,Martin Chadwick,Jon Stritar,Obaid Sarvana,Andrew Hogue,Artem Shtefan,Hadi Hashemi,Yang Xu,Jindong Gu,Sharad Vikram,Chung-Ching Chang,Sabela Ramos,Logan Kilpatrick,Weijuan Xi,Jenny Brennan,Yinghao Sun,Abhishek Jindal,Ionel Gog,Dawn Chen,Felix Wu,Jason Lee,Sudhindra Kopalle,Srinadh Bhojanapalli,Oriol Vinyals,Natan Potikha,Burcu Karagol Ayan,Yuan Yuan,Michael Riley,Piotr Stanczyk,Sergey Kishchenko,Bing Wang,Dan Garrette,Antoine Yang,Vlad Feinberg,CJ Carey,Javad Azizi,Viral Shah,Erica Moreira,Chongyang Shi,Josh Feldman,Elizabeth Salesky,Thomas Lampe,Aneesh Pappu,Duhyeon Kim,Jonas Adler,Avi Caciularu,Brian Walker,Yunhan Xu,Yochai Blau,Dylan Scandinaro,Terry Huang,Sam El-Husseini,Abhishek Sinha,Lijie Ren,Taylor Tobin,Patrik Sundberg,Tim Sohn,Vikas Yadav,Mimi Ly,Emily Xue,Jing Xiong,Afzal Shama Soudagar,Sneha Mondal,Nikhil Khadke,Qingchun Ren,Ben Vargas,Stan Bileschi,Sarah Chakera,Cindy Wang,Boyu Wang,Yoni Halpern,Joe Jiang,Vikas Sindhwani,Petre Petrov,Pranavaraj Ponnuramu,Sanket Vaibhav Mehta,Yu Watanabe,Betty Chan,Matheus Wisniewski,Trang Pham,Jingwei Zhang,Conglong Li,Dario de Cesare,Art Khurshudov,Alex Vasiloff,Melissa Tan,Zoe Ashwood,Bobak Shahriari,Maryam Majzoubi,Garrett Tanzer,Olga Kozlova,Robin Alazard,James Lee-Thorp,Nguyet Minh Phu,Isaac Tian,Junwhan Ahn,Andy Crawford,Lauren Lax,Yuan,Shangguan,Iftekhar Naim,David Ross,Oleksandr Ferludin,Tongfei Guo,Andrea Banino,Hubert Soyer,Xiaoen Ju,Dominika Rogozińska,Ishaan Malhi,Marcella Valentine,Daniel Balle,Apoorv Kulshreshtha,Maciej Kula,Yiwen Song,Sophia Austin,John Schultz,Roy Hirsch,Arthur Douillard,Apoorv 
Reddy,Michael Fink,Summer Yue,Khyatti Gupta,Adam Zhang,Norman Rink,Daniel McDuff,Lei Meng,András György,Yasaman Razeghi,Ricky Liang,Kazuki Osawa,Aviel Atias,Matan Eyal,Tyrone Hill,Nikolai Grigorev,Zhengdong Wang,Nitish Kulkarni,Rachel Soh,Ivan Lobov,Zachary Charles,Sid Lall,Kazuma Hashimoto,Ido Kessler,Victor Gomes,Zelda Mariet,Danny Driess,Alessandro Agostini,Canfer Akbulut,Jingcao Hu,Marissa Ikonomidis,Emily Caveness,Kartik Audhkhasi,Saurabh Agrawal,Ioana Bica,Evan Senter,Jayaram Mudigonda,Kelly Chen,Jingchen Ye,Xuanhui Wang,James Svensson,Philipp Fränken,Josh Newlan,Li Lao,Eva Schnider,Sami Alabed,Joseph Kready,Jesse Emond,Afief Halumi,Tim Zaman,Chengxi Ye,Naina Raisinghani,Vilobh Meshram,Bo Chang,Ankit Singh Rawat,Axel Stjerngren,Sergey Levi,Rui Wang,Xiangzhu Long,Mitchelle Rasquinha,Steven Hand,Aditi Mavalankar,Lauren Agubuzu,Sudeshna Roy,Junquan Chen,Jarek Wilkiewicz,Hao Zhou,Michal Jastrzebski,Qiong Hu,Agustin Dal Lago,Ramya Sree Boppana,Wei-Jen Ko,Jennifer Prendki,Yao Su,Zhi Li,Eliza Rutherford,Girish Ramchandra Rao,Ramona Comanescu,Adrià Puigdomènech,Qihang Chen,Dessie Petrova,Christine Chan,Vedrana Milutinovic,Felipe Tiengo Ferreira,Chin-Yi Cheng,Ming Zhang,Tapomay Dey,Sherry Yang,Ramesh Sampath,Quoc Le,Howard Zhou,Chu-Cheng Lin,Hoi Lam,Christine Kaeser-Chen,Kai Hui,Dean Hirsch,Tom Eccles,Basil Mustafa,Shruti Rijhwani,Morgane Rivière,Yuanzhong Xu,Junjie Wang,Xinyang Geng,Xiance Si,Arjun Khare,Cheolmin Kim,Vahab Mirrokni,Kamyu Lee,Khuslen Baatarsukh,Nathaniel Braun,Lisa Wang,Pallavi LV,Richard Tanburn,Yuvein,Zhu,Fangda Li,Setareh Ariafar,Dan Goldberg,Ken Burke,Daniil Mirylenka,Meiqi Guo,Olaf Ronneberger,Hadas Natalie Vogel,Liqun Cheng,Nishita Shetty,Johnson Jia,Thomas Jimma,Corey Fry,Ted Xiao,Martin Sundermeyer,Ryan Burnell,Yannis Assael,Mario Pinto,JD Chen,Rohit Sathyanarayana,Donghyun Cho,Jing Lu,Rishabh Agarwal,Sugato Basu,Lucas Gonzalez,Dhruv Shah,Meng Wei,Dre Mahaarachchi,Rohan Agrawal,Tero Rissa,Yani Donchev,Ramiro Leal-Cavazos,Adrian Hutter,Markus 
Mircea,Alon Jacovi,Faruk Ahmed,Jiageng Zhang,Shuguang Hu,Bo-Juen Chen,Jonni Kanerva,Guillaume Desjardins,Andrew Lee,Nikos Parotsidis,Asier Mujika,Tobias Weyand,Jasper Snoek,Jo Chick,Kai Chen,Paul Chang,Ethan Mahintorabi,Zi Wang,Tolly Powell,Orgad Keller,Abhirut Gupta,Claire Sha,Kanav Garg,Nicolas Heess,Ágoston Weisz,Cassidy Hardin,Bartek Wydrowski,Ben Coleman,Karina Zainullina,Pankaj Joshi,Alessandro Epasto,Terry Spitz,Binbin Xiong,Kai Zhao,Arseniy Klimovskiy,Ivy Zheng,Johan Ferret,Itay Yona,Waleed Khawaja,Jean-Baptiste Lespiau,Maxim Krikun,Siamak Shakeri,Timothee Cour,Bonnie Li,Igor Krivokon,Dan Suh,Alex Hofer,Jad Al Abdallah,Nikita Putikhin,Oscar Akerlund,Silvio Lattanzi,Anurag Kumar,Shane Settle,Himanshu Srivastava,Folawiyo Campbell-Ajala,Edouard Rosseel,Mihai Dorin Istin,Nishanth Dikkala,Anand Rao,Nick Young,Kate Lin,Dhruva Bhaswar,Yiming Wang,Jaume Sanchez Elias,Kritika Muralidharan,James Keeling,Dayou Du,Siddharth Gopal,Gregory Dibb,Charles Blundell,Manolis Delakis,Jacky Liang,Marco Tulio Ribeiro,Georgi Karadzhov,Guillermo Garrido,Ankur Bapna,Jiawei Cao,Adam Sadovsky,Pouya Tafti,Arthur Guez,Coline Devin,Yixian Di,Jinwei Xing,Chuqiao,Xu,Hanzhao Lin,Chun-Te Chu,Sameera Ponda,Wesley Helmholz,Fan Yang,Yue Gao,Sara Javanmardi,Wael Farhan,Alex Ramirez,Ricardo Figueira,Khe Chai Sim,Yuval Bahat,Ashwin Vaswani,Liangzhe Yuan,Gufeng Zhang,Leland Rechis,Hanjun Dai,Tayo Oguntebi,Alexandra Cordell,Eugénie Rives,Kaan Tekelioglu,Naveen Kumar,Bing Zhang,Aurick Zhou,Nikolay Savinov,Andrew Leach,Alex Tudor,Sanjay Ganapathy,Yanyan Zheng,Mirko Rossini,Vera Axelrod,Arnaud Autef,Yukun Zhu,Zheng Zheng,Mingda Zhang,Baochen Sun,Jie Ren,Nenad Tomasev,Nithish Kannan,Amer Sinha,Charles Chen,Louis O'Bryan,Alex Pak,Aditya Kusupati,Weel Yang,Deepak Ramachandran,Patrick Griffin,Seokhwan Kim,Philipp Neubeck,Craig Schiff,Tammo Spalink,Mingyang Ling,Arun Nair,Ga-Young Joung,Linda Deng,Avishkar Bhoopchand,Lora Aroyo,Tom Duerig,Jordan Griffith,Gabe Barth-Maron,Jake Ades,Alex Haig,Ankur 
Taly,Yunting Song,Paul Michel,Dave Orr,Dean Weesner,Corentin Tallec,Carrie Grimes Bostock,Paul Niemczyk,Andy Twigg,Mudit Verma,Rohith Vallu,Henry Wang,Marco Gelmi,Kiranbir Sodhia,Aleksandr Chuklin,Omer Goldman,Jasmine George,Liang Bai,Kelvin Zhang,Petar Sirkovic,Efrat Nehoran,Golan Pundak,Jiaqi Mu,Alice Chen,Alex Greve,Paulo Zacchello,David Amos,Heming Ge,Eric Noland,Colton Bishop,Jeffrey Dudek,Youhei Namiki,Elena Buchatskaya,Jing Li,Dorsa Sadigh,Masha Samsikova,Dan Malkin,Damien Vincent,Robert David,Rob Willoughby,Phoenix Meadowlark,Shawn Gao,Yan Li,Raj Apte,Amit Jhindal,Stein Xudong Lin,Alex Polozov,Zhicheng Wang,Tomas Mery,Anirudh GP,Varun Yerram,Sage Stevens,Tianqi Liu,Noah Fiedel,Charles Sutton,Matthew Johnson,Xiaodan Song,Kate Baumli,Nir Shabat,Muqthar Mohammad,Hao Liu,Marco Selvi,Yichao Zhou,Mehdi Hafezi Manshadi,Chu-ling Ko,Anthony Chen,Michael Bendersky,Jorge Gonzalez Mendez,Nisarg Kothari,Amir Zandieh,Yiling Huang,Daniel Andor,Ellie Pavlick,Idan Brusilovsky,Jitendra Harlalka,Sally Goldman,Andrew Lampinen,Guowang Li,Asahi Ushio,Somit Gupta,Lei Zhang,Chuyuan Kelly Fu,Madhavi Sewak,Timo Denk,Jed Borovik,Brendan Jou,Avital Zipori,Prateek Jain,Junwen Bai,Thang Luong,Jonathan Tompson,Alice Li,Li Liu,George Powell,Jiajun Shen,Alex Feng,Grishma Chole,Da Yu,Yinlam Chow,Tongxin Yin,Eric Malmi,Kefan Xiao,Yash Pande,Shachi Paul,Niccolò Dal Santo,Adil Dostmohamed,Sergio Guadarrama,Aaron Phillips,Thanumalayan Sankaranarayana Pillai,Gal Yona,Amin Ghafouri,Preethi Lahoti,Benjamin Lee,Dhruv Madeka,Eren Sezener,Simon Tokumine,Adrian Collister,Nicola De Cao,Richard Shin,Uday Kalra,Parker Beak,Emily Nottage,Ryo Nakashima,Ivan Jurin,Vikash Sehwag,Meenu Gaba,Junhao Zeng,Kevin R. 
McKee,Fernando Pereira,Tamar Yakar,Amayika Panda,Arka Dhar,Peilin Zhong,Daniel Sohn,Mark Brand,Lars Lowe Sjoesund,Viral Carpenter,Sharon Lin,Shantanu Thakoor,Marcus Wainwright,Ashwin Chaugule,Pranesh Srinivasan,Muye Zhu,Bernett Orlando,Jack Weber,Ayzaan Wahid,Gilles Baechler,Apurv Suman,Jovana Mitrović,Gabe Taubman,Honglin Yu,Helen King,Josh Dillon,Cathy Yip,Dhriti Varma,Tomas Izo,Levent Bolelli,Borja De Balle Pigem,Julia Di Trapani,Fotis Iliopoulos,Adam Paszke,Nishant Ranka,Joe Zou,Francesco Pongetti,Jed McGiffin,Alex Siegman,Rich Galt,Ross Hemsley,Goran Žužić,Victor Carbune,Tao Li,Myle Ott,Félix de Chaumont Quitry,David Vilar Torres,Yuri Chervonyi,Tomy Tsai,Prem Eruvbetine,Samuel Yang,Matthew Denton,Jake Walker,Slavica Andačić,Idan Heimlich Shtacher,Vittal Premachandran,Harshal Tushar Lehri,Cip Baetu,Damion Yates,Lampros Lamprou,Mariko Iinuma,Ioana Mihailescu,Ben Albrecht,Shachi Dave,Susie Sargsyan,Bryan Perozzi,Lucas Manning,Chiyuan Zhang,Denis Vnukov,Igor Mordatch,Raia Hadsell Wolfgang Macherey,Ryan Kappedal,Jim Stephan,Aditya Tripathi,Klaus Macherey,Jun Qian,Abhishek Bhowmick,Shekoofeh Azizi,Rémi Leblond,Shiva Mohan Reddy Garlapati,Timothy Knight,Matthew Wiethoff,Wei-Chih Hung,Anelia Angelova,Georgios Evangelopoulos,Pawel Janus,Dimitris Paparas,Matthew Rahtz,Ken Caluwaerts,Vivek Sampathkumar,Daniel Jarrett,Shadi Noghabi,Antoine Miech,Chak Yeung,Geoff Clark,Henry Prior,Fei Zheng,Jean Pouget-Abadie,Indro Bhattacharya,Kalpesh Krishna,Will Bishop,Zhe Yuan,Yunxiao Deng,Ashutosh Sathe,Kacper Krasowiak,Ciprian Chelba,Cho-Jui Hsieh,Kiran Vodrahalli,Buhuang Liu,Thomas Köppe,Amr Khalifa,Lubo Litchev,Pichi Charoenpanit,Reed Roberts,Sachin Yadav,Yasumasa Onoe,Desi Ivanov,Megha Mohabey,Vighnesh Birodkar,Nemanja Rakićević,Pierre Sermanet,Vaibhav Mehta,Krishan Subudhi,Travis Choma,Will Ng,Luheng He,Kathie Wang,Tasos Kementsietsidis,Shane Gu,Mansi Gupta,Andrew Nystrom,Mehran Kazemi,Timothy Chung,Nacho Cano,Nikhil Dhawan,Yufei Wang,Jiawei Xia,Trevor Yacovone,Eric Jia,Mingqing 
Chen,Simeon Ivanov,Ashrith Sheshan,Sid Dalmia,Paweł Stradomski,Pengcheng Yin,Salem Haykal,Congchao Wang,Dennis Duan,Neslihan Bulut,Greg Kochanski,Liam MacDermed,Namrata Godbole,Shitao Weng,Jingjing Chen,Rachana Fellinger,Ramin Mehran,Daniel Suo,Hisham Husain,Tong He,Kaushal Patel,Joshua Howland,Randall Parker,Kelvin Nguyen,Sharath Maddineni,Chris Rawles,Mina Khan,Shlomi Cohen-Ganor,Amol Mandhane,Xinyi Wu,Chenkai Kuang,Iulia Comşa,Ramya Ganeshan,Hanie Sedghi,Adam Bloniarz,Nuo Wang Pierse,Anton Briukhov,Petr Mitrichev,Anita Gergely,Serena Zhan,Allan Zhou,Nikita Saxena,Eva Lu,Josef Dean,Ashish Gupta,Nicolas Perez-Nieves,Renjie Wu,Cory McLean,Wei Liang,Disha Jindal,Anton Tsitsulin,Wenhao Yu,Kaiz Alarakyia,Tom Schaul,Piyush Patil,Peter Sung,Elijah Peake,Hongkun Yu,Feryal Behbahani,JD Co-Reyes,Alan Ansell,Sean Sun,Clara Barbu,Jonathan Lee,Seb Noury,James Allingham,Bilal Piot,Mohit Sharma,Christopher Yew,Ivan Korotkov,Bibo Xu,Demetra Brady,Goran Petrovic,Shibl Mourad,Claire Cui,Aditya Gupta,Parker Schuh,Saarthak Khanna,Anna Goldie,Abhinav Arora,Vadim Zubov,Amy Stuart,Mark Epstein,Yun Zhu,Jianqiao Liu,Yury Stuken,Ziyue Wang,Karolis Misiunas,Dee Guo,Ashleah Gill,Ale Hartman,Zaid Nabulsi,Aurko Roy,Aleksandra Faust,Jason Riesa,Ben Withbroe,Mengchao Wang,Marco Tagliasacchi,Andreea Marzoca,James Noraky,Serge Toropov,Malika Mehrotra,Bahram Raad,Sanja Deur,Steve Xu,Marianne Monteiro,Zhongru Wu,Yi Luan,Sam Ritter,Nick Li,Håvard Garnes,Yanzhang He,Martin Zlocha,Jifan Zhu,Matteo Hessel,Will Wu,Spandana Raj Babbula,Chizu Kawamoto,Yuanzhen Li,Mehadi Hassen,Yan Wang,Brian Wieder,James Freedman,Yin Zhang,Xinyi Bai,Tianli Yu,David Reitter,XiangHai Sheng,Mateo Wirth,Aditya Kini,Dima Damen,Mingcen Gao,Rachel Hornung,Michael Voznesensky,Brian Roark,Adhi Kuncoro,Yuxiang Zhou,Rushin Shah,Anthony Brohan,Kuangyuan Chen,James Wendt,David Rim,Paul Kishan Rubenstein,Jonathan Halcrow,Michelle Liu,Ty Geri,Yunhsuan Sung,Jane Shapiro,Shaan Bijwadia,Chris Duvarney,Christina Sorokin,Paul Natsev,Reeve 
Ingle,Pramod Gupta,Young Maeng,Ndaba Ndebele,Kexin Zhu,Valentin Anklin,Katherine Lee,Yuan Liu,Yaroslav Akulov,Shaleen Gupta,Guolong Su,Flavien Prost,Tianlin Liu,Vitaly Kovalev,Pol Moreno,Martin Scholz,Sam Redmond,Zongwei Zhou,Alex Castro-Ros,André Susano Pinto,Dia Kharrat,Michal Yarom,Rachel Saputro,Jannis Bulian,Ben Caine,Ji Liu,Abbas Abdolmaleki,Shariq Iqbal,Tautvydas Misiunas,Mikhail Sirotenko,Shefali Garg,Guy Bensky,Huan Gui,Xuezhi Wang,Raphael Koster,Mike Bernico,Da Huang,Romal Thoppilan,Trevor Cohn,Ben Golan,Wenlei Zhou,Andrew Rosenberg,Markus Freitag,Tynan Gangwani,Vincent Tsang,Anand Shukla,Xiaoqi Ren,Minh Giang,Chi Zou,Andre Elisseeff,Charline Le Lan,Dheeru Dua,Shuba Lall,Pranav Shyam,Frankie Garcia,Sarah Nguyen,Michael Guzman,AJ Maschinot,Marcello Maggioni,Ming-Wei Chang,Karol Gregor,Lotte Weerts,Kumaran Venkatesan,Bogdan Damoc,Leon Liu,Jan Wassenberg,Lewis Ho,Becca Roelofs,Majid Hadian,François-Xavier Aubet,Yu Liang,Sami Lachgar,Danny Karmon,Yong Cheng,Amelio Vázquez-Reina,Angie Chen,Zhuyun Dai,Andy Brock,Shubham Agrawal,Chenxi Pang,Peter Garst,Mariella Sanchez-Vargas,Ivor Rendulic,Aditya Ayyar,Andrija Ražnatović,Olivia Ma,Roopali Vij,Neha Sharma,Ashwin Balakrishna,Bingyuan Liu,Ian Mackinnon,Sorin Baltateanu,Petra Poklukar,Gabriel Ibagon,Colin Ji,Hongyang Jiao,Isaac Noble,Wojciech Stokowiec,Zhihao Li,Jeff Dean,David Lindner,Mark Omernick,Kristen Chiafullo,Mason Dimarco,Vitor Rodrigues,Vittorio Selo,Garrett Honke,Xintian,Wu,Wei He,Adam Hillier,Anhad Mohananey,Vihari Piratla,Chang Ye,Chase Malik,Sebastian Riedel,Samuel Albanie,Zi Yang,Kenny Vassigh,Maria Bauza,Sheng Li,Yiqing Tao,Nevan Wichers,Andrii Maksai,Abe Ittycheriah,Ross Mcilroy,Bryan Seybold,Noah Goodman,Romina Datta,Steven M. 
Hernandez,Tian Shi,Yony Kochinski,Anna Bulanova,Ken Franko,Mikita Sazanovich,Nicholas FitzGerald,Praneeth Kacham,Shubha Srinivas Raghvendra,Vincent Hellendoorn,Alexander Grushetsky,Julian Salazar,Angeliki Lazaridou,Jason Chang,Jan-Thorsten Peter,Sushant Kafle,Yann Dauphin,Abhishek Rao,Filippo Graziano,Izhak Shafran,Yuguo Liao,Tianli Ding,Geng Yan,Grace Chu,Zhao Fu,Vincent Roulet,Gabriel Rasskin,Duncan Williams,Shahar Drath,Alex Mossin,Raphael Hoffmann,Jordi Orbay,Francesco Bertolini,Hila Sheftel,Justin Chiu,Siyang Xue,Yuheng Kuang,Ferjad Naeem,Swaroop Nath,Nana Nti,Phil Culliton,Kashyap Krishnakumar,Michael Isard,Pei Sun,Ayan Chakrabarti,Nathan Clement,Regev Cohen,Arissa Wongpanich,GS Oh,Ashwin Murthy,Hao Zheng,Jessica Hamrick,Oskar Bunyan,Suhas Ganesh,Nitish Gupta,Roy Frostig,John Wieting,Yury Malkov,Pierre Marcenac,Zhixin,Lai,Xiaodan Tang,Mohammad Saleh,Fedir Zubach,Chinmay Kulkarni,Huanjie Zhou,Vicky Zayats,Nan Ding,Anshuman Tripathi,Arijit Pramanik,Patrik Zochbauer,Harish Ganapathy,Vedant Misra,Zach Behrman,Hugo Vallet,Mingyang Zhang,Mukund Sridhar,Ye Jin,Mohammad Babaeizadeh,Siim Põder,Megha Goel,Divya Jain,Tajwar Nasir,Shubham Mittal,Tim Dozat,Diego Ardila,Aliaksei Severyn,Fabio Pardo,Sammy Jerome,Siyang Qin,Louis Rouillard,Amir Yazdanbakhsh,Zizhao Zhang,Shivani Agrawal,Kaushik Shivakumar,Caden Lu,Praveen Kallakuri,Rachita Chhaparia,Kanishka Rao,Charles Kwong,Asya Fadeeva,Shitij Nigam,Yan Virin,Yuan Zhang,Balaji Venkatraman,Beliz Gunel,Marc Wilson,Huiyu Wang,Abhinav Gupta,Xiaowei Xu,Adrien Ali Taïga,Kareem Mohamed,Doug Fritz,Daniel Rodriguez,Zoubin Ghahramani,Harry Askham,Lior Belenki,James Zhao,Rahul Gupta,Krzysztof Jastrzębski,Takahiro Kosakai,Kaan Katircioglu,Jon Schneider,Rina Panigrahy,Konstantinos Bousmalis,Peter Grabowski,Prajit Ramachandran,Chaitra Hegde,Mihaela Rosca,Angelo Scorza Scarpati,Kyriakos Axiotis,Ying Xu,Zach Gleicher,Assaf Hurwitz Michaely,Mandar Sharma,Sanil Jain,Christoph Hirnschall,Tal Marian,Xuhui Jia,Kevin Mather,Kilol Gupta,Linhai 
Qiu,Nigamaa Nayakanti,Lucian Ionita,Steven Zheng,Lucia Loher,Kurt Shuster,Igor Petrovski,Roshan Sharma,Rahma Chaabouni,Angel Yeh,James An,Arushi Gupta,Steven Schwarcz,Seher Ellis,Sam Conway-Rahman,Javier Snaider,Alex Zhai,James Atwood,Daniel Golovin,Liqian Peng,Te I,Vivian Xia,Salvatore Scellato,Mahan Malihi,Arthur Bražinskas,Vlad-Doru Ion,Younghoon Jun,James Swirhun,Soroosh Mariooryad,Jiao Sun,Steve Chien,Rey Coaguila,Ariel Brand,Yi Gao,Tom Kwiatkowski,Roee Aharoni,Cheng-Chun Lee,Mislav Žanić,Yichi Zhang,Dan Ethier,Vitaly Nikolaev,Pranav Nair,Yoav Ben Shalom,Hen Fitoussi,Jai Gupta,Hongbin Liu,Dee Cattle,Tolga Bolukbasi,Ben Murdoch,Fantine Huot,Yin Li,Chris Hahn

Main category: cs.CL

TL;DR: This paper introduces the Gemini 2.X model family, comprising Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.0 Flash, and Flash-Lite.

Details Motivation: To provide a family of models that balance capability against cost, and to explore what is possible in complex agentic problem solving. Method: The Gemini 2.X family is developed in several variants: the highly capable Gemini 2.5 Pro, the efficient reasoning model Gemini 2.5 Flash, and the low-latency, low-cost Gemini 2.0 Flash and Flash-Lite. Result: Gemini 2.5 Pro achieves state-of-the-art performance on coding and reasoning benchmarks and can process up to 3 hours of video content; Gemini 2.5 Flash offers excellent reasoning at lower compute and latency requirements; Gemini 2.0 Flash and Flash-Lite deliver high performance at low latency and cost. Conclusion: The Gemini 2.X family spans the full Pareto frontier of model capability versus cost, letting users explore new possibilities in complex agentic problem solving. Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

[2] Humans overrely on overconfident language models, across languages

Neil Rathi,Dan Jurafsky,Kaitlyn Zhou

Main category: cs.CL

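TL;DR: This paper studies the overconfidence of large language models in multilingual settings and finds high-risk reliance behavior in every language examined.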

Details Motivation: As large language models are deployed globally, their expressions of uncertainty must be accurate across languages so that users do not overrely on them. Method: The authors analyze the distribution of LLM-generated epistemic markers and measure user reliance rates in different languages. Result: Overreliance risks are high across all languages; for example, models generate the most markers of uncertainty in Japanese and the most markers of certainty in German and Mandarin. Conclusion: The study highlights the challenges of calibrating multilingual models and the need for model safety evaluations that account for cultural and linguistic context. Abstract: As large language models (LLMs) are deployed globally, it is crucial that their responses are calibrated across languages to accurately convey uncertainty and limitations. Previous work has shown that LLMs are linguistically overconfident in English, leading users to overrely on confident generations. However, the usage and interpretation of epistemic markers (e.g., 'It's definitely,' 'I think') can differ sharply across languages. Here, we study the risks of multilingual linguistic (mis)calibration, overconfidence, and overreliance across five languages to evaluate the safety of LLMs in a global context. We find that overreliance risks are high across all languages. We first analyze the distribution of LLM-generated epistemic markers, and observe that while LLMs are cross-linguistically overconfident, they are also sensitive to documented linguistic variation. For example, models generate the most markers of uncertainty in Japanese and the most markers of certainty in German and Mandarin. We then measure human reliance rates across languages, finding that while users strongly rely on confident LLM generations in all languages, reliance behaviors differ cross-linguistically: for example, users rely significantly more on expressions of uncertainty in Japanese than in English. Taken together, these results indicate high risk of reliance on overconfident model generations across languages. Our findings highlight the challenges of multilingual linguistic calibration and stress the importance of culturally and linguistically contextualized model safety evaluations.

[3] ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time

Kiarash Zahirnia,Zahra Golpayegani,Walid Ahmad,Yang Liu

Main category: cs.CL

TL;DR: This paper introduces ETT, a method to efficiently extend context length in Transformer-based LLMs, achieving better accuracy with manageable computational costs.

Details Motivation: Transformer-based Language Models face quadratic increases in computation and memory overhead as sequence length increases. This poses challenges when using LLMs for processing long sequences. Method: ETT enables extension of context length at test-time through efficient fine-tuning on input context split into overlapping small subsequences. Result: Evaluation on LongBench showed that ETT can extend the context length of GPT-Large and Phi-2 up to 32 times (from 1k to 32k tokens), resulting in up to a 30 percent improvement in model accuracy. Fine-tuning the second layer of FFNs was found to be more effective than full fine-tuning. Conclusion: The paper concludes that ETT is an effective method for extending the context length of short context Transformer-based LLMs with constant memory requirement and linear computation overhead, leading to improved model accuracy. Abstract: Transformer-based Language Models' computation and memory overhead increase quadratically as a function of sequence length. The quadratic cost poses challenges when employing LLMs for processing long sequences. In this work, we introduce ETT (Extend at Test-Time), a method for extending the context length of short context Transformer-based LLMs, with constant memory requirement and linear computation overhead. ETT enables the extension of the context length at test-time by efficiently fine-tuning the model's parameters on the input context, chunked into overlapping small subsequences. We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 up to 32 times, increasing from 1k to 32k tokens. This results in up to a 30 percent improvement in the model's accuracy. We also study how context can be stored in LLM's weights effectively and efficiently. Through a detailed ablation study, we examine which Transformer modules are most beneficial to fine-tune at test-time.
Interestingly, we find that fine-tuning the second layer of the FFNs is more effective than full fine-tuning, leading to a further improvement in the models' accuracy.
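
The chunking step described in the abstract (splitting the input context into overlapping small subsequences) can be illustrated with a short sketch. This is not the authors' implementation: the function name `chunk_with_overlap` and the parameters `chunk_len` and `overlap` are hypothetical, and the abstract does not state the chunk length or overlap that ETT uses.

```python
def chunk_with_overlap(tokens, chunk_len, overlap):
    """Split a token list into chunks of at most `chunk_len` tokens,
    where consecutive chunks share `overlap` tokens.

    Hypothetical helper for illustration; the sizes are assumptions.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")

    stride = chunk_len - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_len])
        # Stop once a chunk has reached the end of the sequence.
        if start + chunk_len >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    # A 10-token context, windows of 4 tokens overlapping by 2.
    for chunk in chunk_with_overlap(list(range(10)), chunk_len=4, overlap=2):
        print(chunk)
```

Each chunk is short enough to fit the base model's original context window, so fine-tuning on the chunks one at a time keeps peak memory independent of the total context length, which matches the constant-memory claim in the abstract.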

[4] Could the Road to Grounded, Neuro-symbolic AI be Paved with Words-as-Classifiers?

Casey Kennington,David Schlangen

Main category: cs.CL

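TL;DR: This paper explores how the words-as-classifiers model can unify formal, distributional, and grounded theories of semantics, overcoming their individual limitations and improving language models.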

Details Motivation: Formal, distributional, and grounded theories of semantics each have strengths and weaknesses, so a way to combine the advantages of all three is needed to advance language models. Method: The paper reviews the literature and runs a small experiment in support of the words-as-classifiers model. Result: The words-as-classifiers model has been well tested in interactive dialogue settings and can be integrated effectively into formal and distributional language models. Conclusion: The words-as-classifiers model offers one potential path toward unifying formal, distributional, and grounded semantics, a case the paper supports with its literature review and experiment. Abstract: Formal, Distributional, and Grounded theories of computational semantics each have their uses and their drawbacks. There has been a shift to ground models of language by adding visual knowledge, and there has been a call to enrich models of language with symbolic methods to gain the benefits from formal, distributional, and grounded theories. In this paper, we attempt to make the case that one potential path forward in unifying all three semantic fields is paved with the words-as-classifiers model, a model of word-level grounded semantics that has been incorporated into formalisms and distributional language models in the literature, and it has been well-tested within interactive dialogue settings. We review that literature, motivate the words-as-classifiers model with an appeal to recent work in cognitive science, and describe a small experiment. Finally, we sketch a model of semantics unified through words-as-classifiers.

[5] Evaluating Morphological Alignment of Tokenizers in 70 Languages

Catherine Arnett,Marisa Hudspeth,Brendan O'Connor

Main category: cs.CL

TL;DR: Expanding MorphScore to 70 languages, this study finds that morphological alignment of tokenizers has limited impact on downstream model performance.

Details Motivation: Tokenization is a key step in language modeling, and there is interest in understanding how well tokenizers preserve linguistically meaningful subwords. However, the impact of morphological alignment on model performance remains unclear. Method: The study expanded MorphScore to cover 70 languages and correlated alignment scores with downstream task performance across five pre-trained language models and seven tasks. Result: Morphological alignment was found to explain very little variance in model performance across the evaluated tasks and languages. Conclusion: Morphological alignment alone does not measure dimensions of tokenization quality relevant to model performance. Abstract: While tokenization is a key step in language modeling, with effects on model training and performance, it remains unclear how to effectively evaluate tokenizer quality. One proposed dimension of tokenizer quality is the extent to which tokenizers preserve linguistically meaningful subwords, aligning token boundaries with morphological boundaries within a word. We expand MorphScore (Arnett & Bergen, 2025), which previously covered 22 languages, to support a total of 70 languages. The updated MorphScore offers more flexibility in evaluation and addresses some of the limitations of the original version. We then correlate our alignment scores with downstream task performance for five pre-trained language models on seven tasks, with at least one task in each of the languages in our sample. We find that morphological alignment does not explain very much variance in model performance, suggesting that morphological alignment alone does not measure dimensions of tokenization quality relevant to model performance.
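
To make the notion of "aligning token boundaries with morphological boundaries" concrete, here is a minimal sketch of an alignment rate. It is a simplification for illustration and not the MorphScore implementation: the function names, the input format (a token list plus one gold morpheme-boundary character offset per word), and the choice to skip words kept as a single token are all assumptions.

```python
def token_boundaries(tokens):
    """Return the character offsets where one token ends and the next begins."""
    offsets, pos = set(), 0
    for tok in tokens[:-1]:  # the end of the last token is the end of the word
        pos += len(tok)
        offsets.add(pos)
    return offsets


def alignment_rate(items):
    """Fraction of multi-token words whose gold morpheme boundary
    coincides with a token boundary.

    `items` is a list of (tokens, boundary_offset) pairs. Words that the
    tokenizer keeps as a single token are skipped (an assumption here).
    """
    scored = [(toks, b) for toks, b in items if len(toks) > 1]
    if not scored:
        return 0.0
    hits = sum(1 for toks, b in scored if b in token_boundaries(toks))
    return hits / len(scored)


if __name__ == "__main__":
    examples = [
        (["walk", "ing"], 4),  # split at walk|ing: aligned
        (["wal", "king"], 4),  # split at wal|king: not aligned
        (["cats"], 3),         # single token: skipped
    ]
    print(alignment_rate(examples))
```

A per-language score of this kind is what the abstract correlates with downstream task performance.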

[6] Hypermagmas and Colored Operads: Heads, Phases, and Theta Roles

Matilde Marcolli,Riny Huijbregts,Richard K. Larson

Main category: cs.CL

TL;DR: This paper proposes an algebraic framework using hypermagmas and colored operads to unify and formalize syntactic structures, including phase organization, movement rules, and theta roles.

Details Motivation: The motivation stems from seeking a deeper structural and algebraic understanding of syntactic operations in linguistics, aiming to unify seemingly disparate syntactic phenomena under a coherent mathematical framework. Method: The authors utilize algebraic structures such as magmas and hypermagmas alongside colored operads to model syntactic constructions. They analyze the compatibility of c-command and m-command relations with these structures and translate syntactic principles into the form of operad generators and filtering rules. Result: The study demonstrates that syntactic objects, head functions, and movement rules can be formalized using hypermagmas and colored operads, revealing deep connections between syntactic structure formation, phase impenetrability, and theta role assignment. Conclusion: The paper concludes that syntactic structures and movement rules can be effectively modeled using hypermagmas, colored operads, and colored Merge, providing a unified framework for understanding the Extended Projection Principle, phase structure, and theta role assignments. Abstract: We show that head functions on syntactic objects extend the magma structure to a hypermagma, with the c-command relation compatible with the magma operation and the m-command relation with the hypermagma. We then show that the structure of head and complement and specifier, additional modifier positions, and the structure of phases in the Extended Projection can be formulated as a bud generating system of a colored operad, in a form similar to the structure of theta roles. We also show that, due to the special form of the colored operad generators, the filtering of freely generated syntactic objects by these coloring rules can be equivalently formulated as a filtering in the course of structure formation via a colored Merge, which can in turn be related to the hypermagma structure. 
The rules on movement by Internal Merge with respect to phases, the Extended Projection Principle, Empty Category Principle, and Phase Impenetrability Condition are all subsumed into the form of the colored operad generators. Movement compatibilities between the phase structure and the theta role assignments can then be formulated in terms of the respective colored operads and a transduction of colored operads.

[7] PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning

Zeming Chen,Angelika Romanou,Gail Weiss,Antoine Bosselut

Main category: cs.CL

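TL;DR: This study proposes PERK, a scalable method that efficiently encodes long input contexts through gradient updates to a lightweight model adapter, addressing the problem of long-context reasoning.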

Details Motivation: Prior work shows that test-time learning, which encodes context directly into model parameters, can effectively enable reasoning over noisy information. However, the meta-learning methods used to enable test-time learning consume too much memory, which prevents their use in long-context settings. Method: PERK is meta-trained with two nested optimization loops: the inner loop rapidly encodes the context into a low-rank adapter (LoRA), and the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context. Result: PERK significantly outperforms the standard prompt-based long-context baseline on several long-context reasoning tasks, with average absolute performance gains of up to 90% for smaller models (GPT-2) and up to 27% for the largest evaluated model (Qwen-2.5-0.5B). Conclusion: PERK is an effective approach to long-context reasoning and is more robust to reasoning complexity, length extrapolation, and the location of relevant information in the context. Abstract: Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. Previous research shows that using test-time learning to encode context directly into model parameters can effectively enable reasoning over noisy information. However, meta-learning methods for enabling test-time learning are prohibitively memory-intensive, preventing their application to long context settings. In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long input contexts using gradient updates to a lightweight model adapter at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context. Our evaluations on several long-context reasoning tasks show that PERK significantly outperforms the standard prompt-based long-context baseline, achieving average absolute performance gains of up to 90% for smaller models (GPT-2) and up to 27% for our largest evaluated model, Qwen-2.5-0.5B. In general, PERK is more robust to reasoning complexity, length extrapolation, and the locations of relevant information in contexts. Finally, we show that while PERK is memory-intensive during training, it scales more efficiently at inference time than prompt-based long-context inference.
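
The two nested loops can be written as a generic bilevel objective. The notation below is an assumption made for illustration and is not taken from the paper: $\theta$ denotes the frozen base model, $\phi$ the LoRA adapter, $\omega$ whatever the outer loop meta-learns (for example the adapter initialization $\phi_0$), $c$ a long context, $(q, a)$ a question and answer about it, $\alpha$ an inner step size, and $K$ the number of inner steps. The specific losses PERK uses are not given in the abstract.

```latex
% Inner loop: encode the context c into the adapter with K gradient steps
\phi_{k+1} = \phi_k - \alpha \, \nabla_{\phi} \, \mathcal{L}_{\text{enc}}(c;\, \theta, \phi_k),
\qquad k = 0, \dots, K-1

% Outer loop: learn to answer from the encoded adapter alone
\min_{\omega} \; \mathbb{E}_{(c,\, q,\, a)}
\left[ \mathcal{L}_{\text{reason}}\!\left(a \mid q;\, \theta, \phi_K(\omega, c)\right) \right]
```

Because only the low-rank adapter $\phi$ is updated in the inner loop, the differentiated state is small compared with the full model, which is consistent with the abstract's framing of the adapter as a parameter-efficient memory module.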

[8] Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling

Pankayaraj Pathmanathan,Furong Huang

Main category: cs.CL

TL;DR: This paper introduces REFORM, a new method for enhancing reward model robustness through self-improvement and adversarial example generation.

Details Motivation: Existing reward models often fail under distributional shifts or adversarial perturbations due to limited dataset coverage and reliance on prior knowledge about preference distributions. Method: REFORM uses a self-improving reward modeling framework that leverages the reward model itself to generate adversarial examples for training data augmentation. Result: REFORM significantly improves robustness without sacrificing reward quality, showing effectiveness on the HH and PKU Beavertails datasets. Conclusion: The proposed REFORM framework effectively enhances the robustness of reward models, preserving performance and improving alignment quality by removing spurious correlations. Abstract: Reward modeling (RM), which captures human preferences to align large language models (LLMs), is increasingly employed in tasks such as model finetuning, response filtering, and ranking. However, due to the inherent complexity of human preferences and the limited coverage of available datasets, reward models often fail under distributional shifts or adversarial perturbations. Existing approaches for identifying such failure modes typically rely on prior knowledge about preference distributions or failure attributes, limiting their practicality in real-world settings where such information is unavailable. In this work, we propose a tractable, preference-distribution agnostic method for discovering reward model failure modes via reward guided controlled decoding. Building on this, we introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward model itself to guide the generation of falsely scored responses. These adversarial examples are then used to augment the training data and patch the reward model's misaligned behavior. 
We evaluate REFORM on two widely used preference datasets, Anthropic Helpful Harmless (HH) and PKU Beavertails, and demonstrate that it significantly improves robustness without sacrificing reward quality. Notably, REFORM preserves performance both in direct evaluation and in downstream policy training, and further improves alignment quality by removing spurious correlations.

[9] Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders

Shun Wang,Tyler Loakman,Youbo Lei,Yi Liu,Bohao Yang,Yuting Zhao,Dong Yang,Chenghua Lin

Main category: cs.CL

TL;DR: By decomposing LLMs and reformulating prompts, this work improves model interpretability and downstream task performance.

Details Motivation: LLMs are traditionally viewed as black-box algorithms, which reduces trust in them and hinders performance improvements. Method: The authors decompose LLMs using dictionary learning with sparse autoencoders and identify model-internal misunderstandings. Result: The approach yields significant performance improvements on tasks such as mathematical reasoning and metaphor detection. Conclusion: By extracting monosemantic features and automatically reformulating prompts, the method improves LLM performance on downstream tasks. Abstract: Large Language Models (LLMs) are traditionally viewed as black-box algorithms, therefore reducing trustworthiness and obscuring potential approaches to increasing performance on downstream tasks. In this work, we apply an effective LLM decomposition method using a dictionary-learning approach with sparse autoencoders. This helps extract monosemantic features from polysemantic LLM neurons. Remarkably, our work identifies model-internal misunderstanding, allowing the automatic reformulation of the prompts with additional annotations to improve the interpretation by LLMs. Moreover, this approach demonstrates a significant performance improvement in downstream tasks, such as mathematical reasoning and metaphor detection.

[10] Temporal Analysis of Climate Policy Discourse: Insights from Dynamic Embedded Topic Modeling

Rafiu Adekoya Badekale,Adewale Akinfaderin

Main category: cs.CL

TL;DR: This paper uses DETM to analyze the evolution of global climate policy discourse through UNFCCC decisions, revealing key thematic transitions over time and demonstrating the effectiveness of unsupervised machine learning in high-volume policy analysis.

Details Motivation: Understanding how policy language evolves over time is essential for evaluating past priorities, identifying emerging themes, and designing governance strategies for global challenges like climate change. Method: Dynamic Embedded Topic Model (DETM) was applied to analyze the evolution of climate policy discourse using UNFCCC policy decisions from 1995 to 2023. The modeling pipeline included preprocessing, training, and visualization of temporal word distributions. Result: The analysis revealed shifts in focus from greenhouse gases and international conventions to implementation, technical collaboration, capacity building, finance, and global agreements. Conclusion: DETM proves to be a scalable and effective tool for analyzing the evolution of global policy discourse, with potential future applications in other policy domains. Abstract: Understanding how policy language evolves over time is critical for assessing global responses to complex challenges such as climate change. Temporal analysis helps stakeholders, including policymakers and researchers, to evaluate past priorities, identify emerging themes, design governance strategies, and develop mitigation measures. Traditional approaches, such as manual thematic coding, are time-consuming and limited in capturing the complex, interconnected nature of global policy discourse. With the increasing relevance of unsupervised machine learning, these limitations can be addressed, particularly under high-volume, complex, and high-dimensional data conditions. In this work, we explore a novel approach that applies the dynamic embedded topic model (DETM), a probabilistic model designed to capture the temporal dynamics of topics, to analyze the evolution of global climate policy discourse. We collected a corpus of United Nations Framework Convention on Climate Change (UNFCCC) policy decisions from 1995 to 2023, excluding 2020 due to the postponement of COP26 as a result of the COVID-19 pandemic.
The model reveals shifts from early emphases on greenhouse gases and international conventions to recent focuses on implementation, technical collaboration, capacity building, finance, and global agreements. Section 3 presents the modeling pipeline, including preprocessing, model training, and visualization of temporal word distributions. Our results show that DETM is a scalable and effective tool for analyzing the evolution of global policy discourse. Section 4 discusses the implications of these findings, and we conclude with future directions and refinements to extend this approach to other policy domains.

[11] Perception-Aware Policy Optimization for Multimodal Reasoning

Zhenhailong Wang,Xuehang Guo,Sofia Stoica,Haiyang Xu,Hongru Wang,Hyeonjeong Ha,Xiusi Chen,Yangyi Chen,Ming Yan,Fei Huang,Heng Ji

Main category: cs.CL

TL;DR: This paper proposes PAPO, an effective RL method that integrates perception-aware supervision to enhance multimodal reasoning in LLMs without extra data or models.

Details Motivation: Current RLVR methods are tailored for textual domains and perform suboptimally in multimodal reasoning due to issues in visual input perception. Method: The paper proposes Perception-Aware Policy Optimization (PAPO), an extension of GRPO, which introduces an Implicit Perception Loss to improve perceptual capabilities in multimodal reasoning tasks. Result: PAPO achieves significant improvements on multimodal benchmarks, with a 4.4% overall improvement and up to 8.0% on vision-dependent tasks, along with a 30.5% reduction in perception errors. Conclusion: The paper concludes that integrating perception-aware supervision into RLVR objectives using PAPO enhances visually grounded reasoning without requiring additional data or models. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. 
We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.

[12] A Semantic Parsing Framework for End-to-End Time Normalization

Xin Su,Sungduk Yu,Phillip Howard,Steven Bethard

Main category: cs.CL

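TL;DR: This paper introduces a new approach to time normalization that uses the SCATE framework and code generation to process temporal information accurately and interpretably.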

Details Motivation: Traditional methods based on ISO-TimeML are limited in handling complex time expressions, so a more flexible and expressive solution is needed. Method: The time normalization task is recast as a code generation problem: an LLM generates executable code within the SCATE framework, and small models are trained with a data augmentation strategy. Result: Experiments show that small models trained on the augmented data perform strongly, even surpassing large language models, which makes time normalization both efficient and interpretable. Conclusion: The paper presents a new SCATE-based approach to time normalization and shows that it outperforms traditional systems in accuracy and practicality. Abstract: Time normalization is the task of converting natural language temporal expressions into machine-readable representations. It underpins many downstream applications in information retrieval, question answering, and clinical decision-making. Traditional systems based on the ISO-TimeML schema limit expressivity and struggle with complex constructs such as compositional, event-relative, and multi-span time expressions. In this work, we introduce a novel formulation of time normalization as a code generation task grounded in the SCATE framework, which defines temporal semantics through symbolic and compositional operators. We implement a fully executable SCATE Python library and demonstrate that large language models (LLMs) can generate executable SCATE code. Leveraging this capability, we develop an automatic data augmentation pipeline using LLMs to synthesize large-scale annotated data with code-level validation. Our experiments show that small, locally deployable models trained on this augmented data can achieve strong performance, outperforming even their LLM parents and enabling practical, accurate, and interpretable time normalization.

[13] A Systematic Analysis of Hybrid Linear Attention

Dustin Wang,Rui-Jie Zhu,Steven Abreu,Yong Shan,Taylor Kergan,Yuqi Pan,Yuhong Chou,Zheng Li,Ge Zhang,Wenhao Huang,Jason Eshraghian

Main category: cs.CL

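TL;DR: This paper systematically evaluates a range of linear attention models in hybrid architectures, finds that adding full attention layers substantially improves recall, and recommends specific architectures with a suitable linear-to-full attention ratio for efficient recall.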

Details Motivation: Transformers face quadratic complexity and memory problems on long sequences, which has motivated linear attention mechanisms with fixed-size hidden states. Linear models, however, usually have limited recall, leading researchers to try hybrid architectures that combine linear and full attention. Although hybrid architectures have been studied extensively, the choice of linear attention component has not been explored in depth. Method: The authors trained and open-sourced 72 models, 36 at 340M parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six linear attention variants and five hybridization ratios, and benchmarked them on language modeling and recall tasks. Result: Language modeling performance stays stable across linear-to-full attention ratios, whereas recall improves significantly as full attention layers are added, particularly below a 3:1 linear-to-full ratio. In addition, the best standalone linear models do not necessarily perform best in hybrids. Conclusion: The study highlights selective gating, hierarchical recurrence, and controlled forgetting as important for effective hybrid models and recommends HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1 to achieve Transformer-level recall efficiently. Abstract: Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms using fixed-size hidden states. However, linear models often suffer from limited recall performance, leading to hybrid architectures that combine linear and full attention layers. Despite extensive hybrid architecture research, the choice of linear attention component has not been deeply explored. We systematically evaluate various linear attention models across generations, from vector recurrences to advanced gating mechanisms, both standalone and hybridized. To enable this comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six linear attention variants across five hybridization ratios. Benchmarking on standard language modeling and recall tasks reveals that superior standalone linear models do not necessarily excel in hybrids. While language modeling remains stable across linear-to-full attention ratios, recall significantly improves with increased full attention layers, particularly below a 3:1 ratio. Our study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. We recommend architectures such as HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1 to achieve Transformer-level recall efficiently.
Our models are open-sourced at https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e.
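To make the notion of a linear-to-full ratio concrete, the following is a minimal sketch of how a 3:1 hybrid layer stack could be laid out. The function name `hybrid_layer_schedule`, the 24-layer depth, and the choice to place each full attention layer after a run of linear layers are all assumptions for illustration; the abstract does not state how the released models interleave their layers.

```python
def hybrid_layer_schedule(num_layers: int, linear_per_full: int) -> list[str]:
    """Return one layer type per position in the stack.

    Hypothetical layout: every run of `linear_per_full` linear-attention
    layers is followed by one full-attention layer.
    """
    if num_layers <= 0 or linear_per_full < 0:
        raise ValueError("num_layers must be positive and linear_per_full non-negative")
    period = linear_per_full + 1  # length of one repeating group
    return [
        "full" if (i + 1) % period == 0 else "linear"
        for i in range(num_layers)
    ]


# A 24-layer stack at a 3:1 linear-to-full ratio (depth chosen for illustration).
schedule = hybrid_layer_schedule(num_layers=24, linear_per_full=3)
num_full = schedule.count("full")
num_linear = schedule.count("linear")
```

Under these assumptions the 24-layer stack contains 18 linear layers and 6 full attention layers; when the depth is not a multiple of the group length, the realized ratio only approximates the target.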

[14] On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Stephen Obadinma,Xiaodan Zhu

Main category: cs.CL

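TL;DR: This paper presents the first comprehensive study of the robustness of verbal confidence under adversarial attacks, exposes the weaknesses of existing methods, and stresses the urgency of improving confidence expression mechanisms.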

Details Motivation: Robust verbal confidence from large language models is essential for transparency, trust, and safety in human-AI interaction in high-stakes applications. Method: A new framework attacks verbal confidence scores through perturbation and jailbreak methods, and the effect of these attacks on verbal confidence estimates is examined. Result: Adversarial attacks significantly jeopardize verbal confidence estimates and cause frequent answer changes; current confidence elicitation methods are vulnerable, and commonly used defence techniques have limited effect or are even counterproductive. Conclusion: The verbal confidence estimates of current large language models are seriously vulnerable to adversarial attacks, and more robust mechanisms for confidence expression in LLMs are urgently needed. Abstract: Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to ensure transparency, trust, and safety in human-AI interactions across many high-stakes applications. In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce a novel framework for attacking verbal confidence scores through both perturbation and jailbreak-based methods, and show that these attacks can significantly jeopardize verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current confidence elicitation methods are vulnerable and that commonly used defence techniques are largely ineffective or counterproductive. Our findings underscore the urgent need to design more robust mechanisms for confidence expression in LLMs, as even subtle semantic-preserving modifications can lead to misleading confidence in responses.

[15] Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings

Russell Taylor,Benjamin Herbert,Michael Sana

Main category: cs.CL

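TL;DR: This study proposes a three-stage approach that combines state-of-the-art large language models with pun generation techniques to translate English puns into French, aiming to capture the linguistic creativity and humor of the source text.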

Details Motivation: Translating puns across languages poses unique challenges that neither traditional machine translation systems nor professional human translators have been able to solve. The researchers aim to bridge the gap between translation studies and computational linguistics with a new approach and to improve understanding of how language models handle semantic ambiguity, phonetic similarity, and implicit cultural and linguistic awareness. Method: The study uses a three-stage approach: first, a baseline is established using frontier large language models with a new contrastive learning dataset; second, a guided chain-of-thought pipeline with combined phonetic-semantic embeddings is implemented; third, a multi-agent generator-discriminator framework evaluates and regenerates puns with feedback. Result: The main objective is to capture the linguistic creativity and humor of the source text rather than merely reproduce its vocabulary. The best runs took first and second place in the CLEF JOKER 2025 Task 2 competition, where they were evaluated manually by expert native French speakers. Conclusion: By implementing linguistically informed techniques for translating puns, the work addresses a gap between translation studies and computational linguistics and advances understanding of how language models can be used to handle complex linguistic phenomena. Abstract: Translating wordplay across languages presents unique challenges that have long confounded both professional human translators and machine translation systems. This research proposes a novel approach for translating puns from English to French by combining state-of-the-art large language models with specialized techniques for wordplay generation. Our methodology employs a three-stage approach. First, we establish a baseline using multiple frontier large language models with feedback based on a new contrastive learning dataset. Second, we implement a guided chain-of-thought pipeline with combined phonetic-semantic embeddings. Third, we implement a multi-agent generator-discriminator framework for evaluating and regenerating puns with feedback. Moving beyond the limitations of literal translation, our methodology's primary objective is to capture the linguistic creativity and humor of the source text wordplay, rather than simply duplicating its vocabulary. Our best runs earned first and second place in the CLEF JOKER 2025 Task 2 competition where they were evaluated manually by expert native French speakers. This research addresses a gap between translation studies and computational linguistics by implementing linguistically-informed techniques for wordplay translation, advancing our understanding of how language models can be leveraged to handle the complex interplay between semantic ambiguity, phonetic similarity, and the implicit cultural and linguistic awareness needed for successful humor.

[16] SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers

Zicong Tang,Shi Luohe,Zuchao Li,Baoyuan Qi,Guoming Liu,Lefei Zhang,Ping Wang

Main category: cs.CL

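TL;DR: This paper proposes SpindleKV, a new method that reduces the KV cache of large language models in a balanced way and performs well in both shallow and deep layers.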

Details Motivation: The memory consumption of the KV cache poses a challenge for inference systems, especially in shallow layers. Method: An attention-weight-based eviction method is proposed for deep layers and a codebook-based replacement method for shallow layers. Result: Experiments show that SpindleKV reduces the KV cache more effectively than baseline methods on two common benchmarks. Conclusion: SpindleKV effectively reduces the KV cache in both shallow and deep layers while maintaining model performance. Abstract: Large Language Models (LLMs) have achieved impressive accomplishments in recent years. However, the increasing memory consumption of the KV cache has posed a significant challenge to the inference system. Eviction methods have revealed the inherent redundancy within the KV cache, demonstrating its potential for reduction, particularly in deeper layers. However, KV cache reduction for shallower layers has been found to be insufficient. We observe that the KV cache exhibits a high degree of similarity. Based on this observation, we propose a novel KV cache reduction method, SpindleKV, which balances both shallow and deep layers. For deep layers, we employ an attention weight based eviction method, while for shallow layers, we apply a codebook based replacement approach which is learnt by similarity and merging policy. Moreover, SpindleKV addresses the Grouped-Query Attention (GQA) dilemma faced by other attention based eviction methods. Experiments on two common benchmarks with three different LLMs show that SpindleKV obtains a better KV cache reduction effect compared to baseline methods, while preserving similar or even better model performance.
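As a rough illustration of what attention-weight-based eviction means in general, the sketch below keeps the cached positions that received the most accumulated attention and drops the rest. It is not SpindleKV's algorithm: the scoring rule (summing weights over queries), the function name `evict_by_attention`, and the toy weights are assumptions, and the codebook-based replacement for shallow layers and the GQA handling are not reproduced.

```python
def evict_by_attention(attn_weights: list[list[float]], keep: int) -> list[int]:
    """Return the sorted indices of cached positions to keep.

    attn_weights: one row per query, one column per cached key/value position.
    Hypothetical scoring rule: a position's score is the attention it
    received summed over all queries; the `keep` highest scores survive.
    """
    if not attn_weights or keep < 0:
        raise ValueError("need at least one query row and a non-negative keep")
    num_positions = len(attn_weights[0])
    scores = [sum(row[j] for row in attn_weights) for j in range(num_positions)]
    ranked = sorted(range(num_positions), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


# Toy example: two queries attending over four cached positions.
attn = [
    [0.5, 0.1, 0.3, 0.1],
    [0.4, 0.1, 0.1, 0.4],
]
kept = evict_by_attention(attn, keep=2)  # positions 0 and 3 score highest
```

Here positions 0 and 3 accumulate the most attention (0.9 and 0.5), so positions 1 and 2 would be evicted from the cache.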

[17] InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior

Huisheng Wang,Zhuoshi Pan,Hangjing Zhang,Mingxiao Liu,Hanqing Gao,H. Vicky Zhao

Main category: cs.CL

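TL;DR: This paper proposes InvestAlign, which generates high-quality SFT data from theoretical solutions to simple problems and effectively improves LLM performance in investment decision-making, especially under herd behavior.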

Details Motivation: Supervised fine-tuning (SFT) requires large amounts of real user data, but collecting such data is costly and carries privacy risks, a fundamental limitation in behavioral finance. Method: The InvestAlign framework builds high-quality SFT datasets from theoretical solutions to similar, simple optimal investment problems rather than from complex scenarios. Result: Training LLMs on InvestAlign-generated data achieves faster parameter convergence than training on real user data, indicating higher learning efficiency; InvestAgent is closer to real user data than pre-SFT models on both simple and complex investment problems. Conclusion: InvestAlign is a promising approach for addressing complex optimal investment problems and aligning large language models with investor decision-making processes under herd behavior. Abstract: Aligning Large Language Models (LLMs) with investor decision-making processes under herd behavior is a critical challenge in behavioral finance, which grapples with a fundamental limitation: the scarcity of real-user data needed for Supervised Fine-Tuning (SFT). While SFT can bridge the gap between LLM outputs and human behavioral patterns, its reliance on massive authentic data imposes substantial collection costs and privacy risks. We propose InvestAlign, a novel framework that constructs high-quality SFT datasets by leveraging theoretical solutions to similar and simple optimal investment problems rather than complex scenarios. Our theoretical analysis demonstrates that training LLMs with InvestAlign-generated data achieves faster parameter convergence than using real-user data, suggesting superior learning efficiency. Furthermore, we develop InvestAgent, an LLM agent fine-tuned with InvestAlign, which demonstrates significantly closer alignment to real-user data than pre-SFT models in both simple and complex investment problems. This highlights our proposed InvestAlign as a promising approach with the potential to address complex optimal investment problems and align LLMs with investor decision-making processes under herd behavior. Our code is publicly available at https://github.com/thu-social-network-research-group/InvestAlign.

[18] Large Language Model for Extracting Complex Contract Information in Industrial Scenes

Yunyang Cao,Yanjun Li,Silong Dai

Main category: cs.CL

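TL;DR: This paper proposes a method for constructing high-quality datasets for complex contract information extraction in industrial scenarios and validates it by fine-tuning a large language model.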

Details Motivation: To address the problem of constructing high-quality datasets for complex contract information extraction in industrial scenarios. Method: First, industrial contract texts are clustered, and GPT-4 and GPT-3.5 extract key information from the original contract data to obtain high-quality annotations; second, the data is augmented by constructing new texts, with GPT-3.5 generating unstructured contract texts from randomly combined keywords to improve model robustness; finally, a large language model is fine-tuned on the high-quality dataset. Result: Experiments show that the model performs well, with high field recall and precision, while taking parsing efficiency into account. LoRA, data balancing, and data augmentation effectively improve model accuracy and robustness. Conclusion: The proposed method offers a novel and efficient solution for industrial contract information extraction. Abstract: This paper proposes a high-quality dataset construction method for complex contract information extraction tasks in industrial scenarios and fine-tunes a large language model based on this dataset. Firstly, cluster analysis is performed on industrial contract texts, and GPT-4 and GPT-3.5 are used to extract key information from the original contract data, obtaining high-quality data annotations. Secondly, data augmentation is achieved by constructing new texts, and GPT-3.5 generates unstructured contract texts from randomly combined keywords, improving model robustness. Finally, the large language model is fine-tuned based on the high-quality dataset. Experimental results show that the model achieves excellent overall performance while ensuring high field recall and precision and considering parsing efficiency. LoRA, data balancing, and data augmentation effectively enhance model accuracy and robustness. The proposed method provides a novel and efficient solution for industrial contract information extraction tasks.

[19] The Flaws of Others: An LLM-driven Framework for Scientific Knowledge Production

Juan B. Gutiérrez

Main category: cs.CL

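TL;DR: This paper studies discursive networks formed by interacting large language models and humans, identifies four hazards that govern errors, and shows how network design can improve reliability.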

Details Motivation: To treat large language models and humans as equal nodes, track how statements circulate, and broaden the focus from isolated hallucinations to a defined notion of invalidation. Method: A discursive-network model is developed and analyzed mathematically to study how error rates change. Peer review is also operationalized and implemented as an open-source algorithm (the FOO algorithm). Result: A network governed only by drift and self-repair stabilizes at a modest error rate; adding fresh fabrication raises the error rate; and giving each false claim a small probability of peer review shifts the system to a truth-dominant state. Conclusion: Reliability in this new medium comes not from perfecting single models but from wiring imperfect ones into networks that keep each other in check. Abstract: Large-language models turn writing into a live exchange between humans and software. We capture this new medium with a discursive-network model that treats people and LLMs as equal nodes and tracks how their statements circulate. Broadening the focus from isolated hallucinations, we define invalidation (any factual, logical, or structural breach) and show it follows four hazards: drift from truth, self-repair, fresh fabrication, and external detection. A general mathematical model of discursive networks is developed to provide valuable insights: A network governed only by drift and self-repair stabilizes at a modest error rate; adding fabrication reproduces the high rates seen in current LLMs. Giving each false claim even a small chance of peer review shifts the system to a truth-dominant state. We operationalize peer review with the open-source Flaws-of-Others (FOO) algorithm: a configurable loop in which any set of agents critique one another while a harmoniser merges their verdicts. The takeaway is practical and cultural: reliability in this new medium comes not from perfecting single models but from wiring imperfect ones into networks that keep each other honest.

[20] Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis

Srihari K B,Pushpak Bhattacharyya

Main category: cs.CL

TL;DR: This paper proposes a food QA framework combining a multimodal knowledge graph with generative AI, achieving significant performance improvements.

Details Motivation: To improve the reliability and diversity in food-domain question answering by leveraging structured knowledge and multimodal generation. Method: A unified food-domain QA framework that integrates a large-scale multimodal knowledge graph (MMKG) with generative AI, using joint fine-tuning of Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large. Result: Improved BERTScore by 16.2%, reduced FID by 37.8%, boosted CLIP alignment by 31.1%, achieved 94.1% accurate image reuse and 85% adequacy in synthesis. Conclusion: The proposed framework successfully enhances reliability and diversity in food QA by combining structured knowledge with multimodal generation. Abstract: We propose a unified food-domain QA framework that combines a large-scale multimodal knowledge graph (MMKG) with generative AI. Our MMKG links 13,000 recipes, 3,000 ingredients, 140,000 relations, and 14,000 images. We generate 40,000 QA pairs using 40 templates and LLaVA/DeepSeek augmentation. Joint fine-tuning of Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large improves BERTScore by 16.2%, reduces FID by 37.8%, and boosts CLIP alignment by 31.1%. Diagnostic analyses, namely CLIP-based mismatch detection (35.2% to 7.3%) and LLaVA-driven hallucination checks, ensure factual and visual fidelity. A hybrid retrieval-generation strategy achieves 94.1% accurate image reuse and 85% adequacy in synthesis. Our results demonstrate that structured knowledge and multimodal generation together enhance reliability and diversity in food QA.

[21] Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation

Liliang Ren,Congcong Chen,Haoran Xu,Young Jin Kim,Adam Atkinson,Zheng Zhan,Jiankai Sun,Baolin Peng,Liyuan Liu,Shuohang Wang,Hao Cheng,Jianfeng Gao,Weizhu Chen,Yelong Shen

Main category: cs.CL

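TL;DR: This paper proposes SambaY, a new architecture for efficient sequence modeling that combines the strengths of State Space Models and Transformers, performs strongly on several reasoning tasks, and improves decoding efficiency.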

Details Motivation: Although hybrid architectures such as Samba and YOCO have improved performance, the potential of representation sharing between SSM layers has not been studied. The paper therefore explores an efficient memory sharing mechanism. Method: A new mechanism called the Gated Memory Unit (GMU) is introduced and used to build the SambaY architecture, which shares memory readout states through the cross-decoder to improve efficiency. Result: SambaY improves long-context performance and shows a lower irreducible loss than the YOCO baseline. The largest model, Phi4-mini-Flash-Reasoning, outperforms Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond while delivering up to 10x higher decoding throughput under the vLLM inference framework. Conclusion: SambaY significantly improves reasoning performance and decoding throughput while eliminating the need for explicit positional encoding. Abstract: Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.

[22] FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation

Boshko Koloski,Senja Pollak,Roberto Navigli,Blaž Škrlj

Main category: cs.CL

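TL;DR: This paper proposes FuDoBa, a method that combines large language model embeddings with domain-specific knowledge to make document representations more effective and efficient.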

Details Motivation: To address the high dimensionality, high computational cost, and excessive generality of LLM embeddings in domain-specific applications. Method: A Bayesian optimisation-based method fuses LLM embeddings with domain-specific structured knowledge. Result: Experiments on six datasets show that the method performs on par with or better than baselines that use only proprietary LLM embeddings. Conclusion: By combining LLM embeddings with domain-specific structured knowledge, FuDoBa produces low-dimensional, task-relevant representations and performs well when paired with robust AutoML classifiers. Abstract: Building on the success of Large Language Models (LLMs), LLM-based representations have dominated the document representation landscape, achieving great performance on the document embedding benchmarks. However, the high-dimensional, computationally expensive embeddings from LLMs tend to be either too generic or inefficient for domain-specific applications. To address these limitations, we introduce FuDoBa, a Bayesian optimisation-based method that integrates LLM-based embeddings with domain-specific structured knowledge, sourced both locally and from external repositories like WikiData. This fusion produces low-dimensional, task-relevant representations while reducing training complexity and yielding interpretable early-fusion weights for enhanced classification performance. We demonstrate the effectiveness of our approach on six datasets in two domains, showing that when paired with robust AutoML-based classifiers, our proposed representation learning approach performs on par with, or surpasses, those produced solely by the proprietary LLM-based embedding baselines.

[23] Expediting data extraction using a large language model (LLM) and scoping review protocol: a methodological study within a complex scoping review

James Stewart-Evans,Emma Wilson,Tessa Langley,Andrew Prayle,Angela Hands,Karen Exley,Jo Leonardi-Bee

Main category: cs.CL

TL;DR: This study tests whether the large language model Claude 3.5 Sonnet can speed up data extraction in scoping reviews using structured protocols. While it performs well on simple tasks, its ability to handle complex data is limited, and it often misses or misattributes information. Researchers are advised to rigorously evaluate LLM performance before relying on them for such tasks.

Details Motivation: The motivation behind this study was to explore whether large language models (LLMs) could be leveraged to make the resource-intensive data extraction phase of scoping reviews more efficient. Traditional methods are time-consuming, and there is growing interest in using LLMs alongside structured review protocols to accelerate this process while maintaining accuracy. Method: The researchers tested two protocol-based prompting approaches using Claude 3.5 Sonnet to extract data from 10 evidence sources within a case study scoping review. They evaluated the model's performance by measuring accuracy, precision, recall, and F1 scores across both simple and complex data items. Additionally, they assessed how feedback from the LLM could improve the extraction protocol and tested the model's ability to detect deliberate errors in a modified dataset. Result: Claude 3.5 Sonnet demonstrated high accuracy (83.3% and 100%) in extracting simple citation details but performed significantly worse on complex, subjective data items (9.6% and 15.8%). Both approaches showed precision above 90%, but recall was below 25%, leading to low F1 scores (<40%). LLM feedback led to minor improvements, but when tested on a dataset with deliberate errors, only 5% of errors were detected. Conclusion: The study concludes that while LLMs like Claude 3.5 Sonnet can expedite data extraction in scoping reviews with high precision, their low recall and F1 scores highlight limitations in handling complex or subjective data items. The research emphasizes the need for robust performance evaluation of LLMs across various review contexts and suggests that researchers should carefully assess and report LLM performance when used for data extraction tasks. Abstract: The data extraction stages of reviews are resource-intensive, and researchers may seek to expedite data extraction using online large language models (LLMs) and review protocols.
Claude 3.5 Sonnet was used to trial two approaches that used a review protocol to prompt data extraction from 10 evidence sources included in a case study scoping review. A protocol-based approach was also used to review extracted data. Limited performance evaluation was undertaken which found high accuracy for the two extraction approaches (83.3% and 100%) when extracting simple, well-defined citation details; accuracy was lower (9.6% and 15.8%) when extracting more complex, subjective data items. Considering all data items, both approaches had precision >90% but low recall (<25%) and F1 scores (<40%). The context of a complex scoping review, open response types and methodological approach likely impacted performance due to missed and misattributed data. LLM feedback considered the baseline extraction accurate and suggested minor amendments: four of 15 (26.7%) to citation details and 8 of 38 (21.1%) to key findings data items were considered to potentially add value. However, when repeating the process with a dataset featuring deliberate errors, only 2 of 39 (5%) errors were detected. Review-protocol-based methods used for expediency require more robust performance evaluation across a range of LLMs and review contexts with comparison to conventional prompt engineering approaches. We recommend researchers evaluate and report LLM performance if using them similarly to conduct data extraction or review extracted data. LLM feedback contributed to protocol adaptation and may assist future review protocol drafting.
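The reported precision, recall, and F1 bounds fit together arithmetically. F1 is the harmonic mean of precision and recall, so a recall below 25% caps F1 below 40% even at perfect precision. The short sketch below checks this; the helper name `f1_score` and the two sample operating points are chosen for illustration and are not figures from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Upper bound on F1 when recall is capped at 25%: perfect precision.
f1_upper_bound = f1_score(precision=1.0, recall=0.25)

# An illustrative point at the reported boundaries (precision 90%, recall 25%).
f1_at_boundary = f1_score(precision=0.90, recall=0.25)
```

With precision at 100% and recall at 25%, F1 is 0.40; at 90% precision it is about 0.39. Any recall strictly below 25% therefore yields an F1 below 40%, consistent with the reported figures.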

[24] Elite Polarization in European Parliamentary Speeches: a Novel Measurement Approach Using Large Language Models

Gennadii Iakovlev

Main category: cs.CL

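TL;DR: The paper develops an index that measures elite polarization by using artificial intelligence to analyze sentiment in parliamentary speeches; the index shows good validity and supports large-scale data analysis.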

Details Motivation: The study seeks to understand how elites evaluate opposing parties in order to build an indicator of mutual out-party hostility, that is, an elite polarization index. Method: Artificial intelligence is used to identify when politicians mention one another in parliamentary speeches, record who is speaking and who is being mentioned, and assess the emotional temperature behind these evaluations. Result: The resulting dataset can be aggregated by party and quarter, and the index responds well to electoral campaigns, country- and party-level crises, and changes of party in power. Conclusion: The paper proposes a new way to measure elite polarization through AI-based actor and subject detection and builds an index that can be aggregated by quarter and party and has good face validity. Abstract: This project introduces a new measure of elite polarization via actor and subject detection using artificial intelligence. I identify when politicians mention one another in parliamentary speeches, note who is speaking and who is being addressed, and assess the emotional temperature behind these evaluations. This maps how elites evaluate their various out-parties, allowing us to create an index of mutual out-party hostility, that is, elite polarization. While I analyzed polarization data over the past four decades for the UK, and two decades for Hungary and Italy, my approach lays the groundwork for a twenty-year, EU-wide time-series dataset on elite polarization. I obtain results that can be aggregated by party and quarter. The resulting index demonstrates good face validity: it reacts to events such as electoral campaigns, country- and party-level crises, and to parties losing and assuming power.

[25] CLI-RAG: A Retrieval-Augmented Framework for Clinically Structured and Context Aware Text Generation with LLMs

Garapati Keerthana,Manik Gupta

Main category: cs.CL

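TL;DR: The paper proposes the CLI-RAG framework to address the difficulties of clinical text generation, using a dual-stage retrieval mechanism to generate structured text with high accuracy.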

Details Motivation: To address the challenges large language models face in clinical text generation, including the unstructured nature of patient data and the length and semantic density of clinical notes. Method: CLI-RAG, a domain-specific framework, combines a hierarchical chunking strategy with a task-specific dual-stage retrieval mechanism. Result: Structured progress notes generated on the MIMIC-III dataset achieve an average alignment score of 87.7%, exceeding the 80.7% baseline from real clinician-authored notes. Conclusion: CLI-RAG outperforms existing methods at generating structured clinical progress notes and shows potential for practical clinical use. Abstract: Large language models (LLMs), including zero-shot and few-shot paradigms, have shown promising capabilities in clinical text generation. However, real-world applications face two key challenges: (1) patient data is highly unstructured, heterogeneous, and scattered across multiple note types and (2) clinical notes are often long and semantically dense, making naive prompting infeasible due to context length constraints and the risk of omitting clinically relevant information. We introduce CLI-RAG (Clinically Informed Retrieval-Augmented Generation), a domain-specific framework for structured and clinically grounded text generation using LLMs. It incorporates a novel hierarchical chunking strategy that respects clinical document structure and introduces a task-specific dual-stage retrieval mechanism. The global stage identifies relevant note types using evidence-based queries, while the local stage extracts high-value content within those notes, creating relevance at both document and section levels. We apply the system to generate structured progress notes for individual hospital visits using 15 clinical note types from the MIMIC-III dataset. Experiments show that it preserves temporal and semantic alignment across visits, achieving an average alignment score of 87.7%, surpassing the 80.7% baseline from real clinician-authored notes. The generated outputs also demonstrate high consistency across LLMs, reinforcing deterministic behavior essential for reproducibility, reliability, and clinical trust.

[26] On the Effect of Uncertainty on Layer-wise Inference Dynamics

Sunwoo Kim,Haneul Yoo,Alice Oh

Main category: cs.CL

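TL;DR: This paper studies the dynamics of certain and uncertain outputs in large language models and finds that their layer-wise probability trajectories are largely aligned, which challenges the assumption that simple methods can detect uncertainty.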

Details Motivation: Understanding how large language models (LLMs) internally represent and process their predictions is essential for detecting uncertainty and preventing hallucinations. Although prior studies show that models encode uncertainty in their hidden states, how that uncertainty affects the processing of those hidden states remains underexplored. Method: The Tuned Lens, a variant of the Logit Lens, is used to analyze the layer-wise probability trajectories of final prediction tokens across 11 datasets and 5 models, in order to explore the dynamics of certain and uncertain outputs. Result: The probability dynamics of certain and uncertain outputs are largely aligned across layers, suggesting that uncertainty does not seem to affect inference dynamics significantly; however, more competent models may learn to process uncertainty differently. Conclusion: The findings challenge the feasibility of detecting uncertainty with simplistic methods at inference time and show how interpretability methods can be applied to investigate the way uncertainty affects inference. Abstract: Understanding how large language models (LLMs) internally represent and process their predictions is central to detecting uncertainty and preventing hallucinations. While several studies have shown that models encode uncertainty in their hidden states, it is underexplored how this affects the way they process such hidden states. In this work, we demonstrate that the dynamics of output token probabilities across layers for certain and uncertain outputs are largely aligned, revealing that uncertainty does not seem to affect inference dynamics. Specifically, we use the Tuned Lens, a variant of the Logit Lens, to analyze the layer-wise probability trajectories of final prediction tokens across 11 datasets and 5 models. Using incorrect predictions as those with higher epistemic uncertainty, our results show aligned trajectories for certain and uncertain predictions that both observe abrupt increases in confidence at similar layers. We balance this finding by showing evidence that more competent models may learn to process uncertainty differently. Our findings challenge the feasibility of leveraging simplistic methods for detecting uncertainty at inference. More broadly, our work demonstrates how interpretability methods may be used to investigate the way uncertainty affects inference.
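A layer-wise probability trajectory can be pictured with a small Logit Lens style sketch: each layer's hidden state is projected through the unembedding matrix, and the softmax probability of the final predicted token is recorded per layer. This is a toy illustration under stated assumptions, not the paper's code. The function names, the 2-dimensional hidden states, and the identity unembedding matrix are hypothetical, and the per-layer learned transformation that the Tuned Lens applies before unembedding is omitted.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_probability_trajectory(
    hidden_states: list[list[float]],
    unembed: list[list[float]],
    token_id: int,
) -> list[float]:
    """Probability assigned to `token_id` when each layer is decoded directly.

    hidden_states: one vector per layer.
    unembed: one row per vocabulary token, each of hidden dimension.
    """
    trajectory = []
    for hidden in hidden_states:
        logits = [sum(w * x for w, x in zip(row, hidden)) for row in unembed]
        trajectory.append(softmax(logits)[token_id])
    return trajectory


# Toy setup: a 2-token vocabulary and three layers (values are made up).
unembed = [[1.0, 0.0], [0.0, 1.0]]
hidden_states = [[0.0, 0.0], [0.0, 1.0], [0.0, 4.0]]
trajectory = token_probability_trajectory(hidden_states, unembed, token_id=1)
```

In this toy case the probability of token 1 starts at 0.5 and rises with depth; comparing such curves for correct and incorrect predictions is the kind of analysis the abstract describes.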

[27] KAConvText: Novel Approach to Burmese Sentence Classification using Kolmogorov-Arnold Convolution

Ye Kyaw Thu,Thura Aung,Thazin Myint Oo,Thepchai Supnithi

Main category: cs.CL

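TL;DR: This paper presents the first application of Kolmogorov-Arnold Convolution for Text (KAConvText) to sentence classification and, by comparing embedding configurations and classification heads, shows strong performance on hate speech detection, news classification, and language identification.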

Details Motivation: To explore the potential of Kolmogorov-Arnold convolution for text processing and to address sentence classification on both imbalanced and balanced datasets. Method: The authors study several embedding configurations, comparing random and fastText embeddings in static and fine-tuned settings, and compare standard CNNs with CNNs augmented with a Kolmogorov-Arnold Network (CNN-KAN). They also evaluate MLP and KAN classification heads. Result: KAConvText-MLP with fine-tuned fastText embeddings performs best: 91.23% accuracy (F1-score 0.9109) on hate speech detection, 92.66% accuracy (F1-score 0.9267) on news classification, and 99.82% accuracy (F1-score 0.9982) on language identification. Conclusion: KAConvText performs strongly across several sentence classification tasks, and using a KAN classification head improves interpretability. Abstract: This paper presents the first application of Kolmogorov-Arnold Convolution for Text (KAConvText) in sentence classification, addressing three tasks: imbalanced binary hate speech detection, balanced multiclass news classification, and imbalanced multiclass ethnic language identification. We investigate various embedding configurations, comparing random to fastText embeddings in both static and fine-tuned settings, with embedding dimensions of 100 and 300 using CBOW and Skip-gram models. Baselines include standard CNNs and CNNs augmented with a Kolmogorov-Arnold Network (CNN-KAN). In addition, we investigated KAConvText with different classification heads - MLP and KAN, where using KAN head supports enhanced interpretability. Results show that KAConvText-MLP with fine-tuned fastText embeddings achieves the best performance of 91.23% accuracy (F1-score = 0.9109) for hate speech detection, 92.66% accuracy (F1-score = 0.9267) for news classification, and 99.82% accuracy (F1-score = 0.9982) for language identification.

[28] Checklist Engineering Empowers Multilingual LLM Judges

Mohammad Ghiasvand Mohammadkhani,Hamid Beigy

Main category: cs.CL

TL;DR: This paper proposes CE-Judge, a training-free framework for multilingual evaluation using checklist intuition with an open-source model.

Details Motivation: The motivation is to address the lack of exploration of LLM-as-a-Judge in multilingual contexts and overcome limitations of existing methods that rely on proprietary models or require extensive training data. Method: The method involves developing Checklist Engineering based LLM-as-a-Judge (CE-Judge), which utilizes checklist intuition for evaluation across multiple languages without requiring any training. Result: Experiments show that CE-Judge generally outperforms baselines and performs comparably to GPT-4o across multiple languages and three benchmark datasets in both pointwise and pairwise settings. Conclusion: The conclusion is that CE-Judge offers an effective, efficient, and training-free solution for multilingual text evaluation within the LLM-as-a-Judge paradigm. Abstract: Automated text evaluation has long been a central issue in Natural Language Processing (NLP). Recently, the field has shifted toward using Large Language Models (LLMs) as evaluators-a trend known as the LLM-as-a-Judge paradigm. While promising and easily adaptable across tasks, this approach has seen limited exploration in multilingual contexts. Existing multilingual studies often rely on proprietary models or require extensive training data for fine-tuning, raising concerns about cost, time, and efficiency. In this paper, we propose Checklist Engineering based LLM-as-a-Judge (CE-Judge), a training-free framework that uses checklist intuition for multilingual evaluation with an open-source model. Experiments across multiple languages and three benchmark datasets, under both pointwise and pairwise settings, show that our method generally surpasses the baselines and performs on par with the GPT-4o model.

[29] Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining: Method, Evaluation and Applications

Seonwu Kim,Yohan Na,Kihun Kim,Hanhee Cho,Geun Lim,Mintae Kim,Seongik Park,Ki Hyun Kim,Youngsub Han,Byoung-Ki Jeon

Main category: cs.CL

TL;DR: This paper shows that applying Domain Adaptive Continual Pretraining (DACP) to small language models enhances domain-specific performance while maintaining overall effectiveness, offering a scalable and cost-efficient solution for enterprises.

Details Motivation: Organizations often lack the infrastructure to deploy and maintain large-scale language models, making small LLMs a practical alternative. However, their performance limitations necessitate effective domain adaptation methods like DACP. Method: The study applied a DACP-based recipe across diverse foundation models and service domains, using extensive experiments and real-world evaluations to validate effectiveness. Result: The application of DACP on small LLMs resulted in significant improvements in target domain performance without compromising general capabilities. Conclusion: DACP-applied sLLMs offer a cost-efficient and scalable solution for enterprise-level deployment by achieving substantial gains in target domain performance while preserving general capabilities. Abstract: The emergence of open-source large language models (LLMs) has expanded opportunities for enterprise applications; however, many organizations still lack the infrastructure to deploy and maintain large-scale models. As a result, small LLMs (sLLMs) have become a practical alternative, despite their inherent performance limitations. While Domain Adaptive Continual Pretraining (DACP) has been previously explored as a method for domain adaptation, its utility in commercial applications remains under-examined. In this study, we validate the effectiveness of applying a DACP-based recipe across diverse foundation models and service domains. Through extensive experiments and real-world evaluations, we demonstrate that DACP-applied sLLMs achieve substantial gains in target domain performance while preserving general capabilities, offering a cost-efficient and scalable solution for enterprise-level deployment.

[30] Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams

Matthew Anderson Hendricks,Alice Cicirello

Main category: cs.CL

TL;DR: This paper proposes an automated strategy using SysML, NLP, and LLMs to generate computational models for engineering dynamical systems, showing faster and more accurate results compared to LLM-only approaches.

Details Motivation: The motivation is to accelerate the design and deployment of engineering dynamical systems by automating the generation of computational models using available domain knowledge and input documents. Method: The method involves five steps: utilizing SysML diagrams to extract detailed component information, applying NLP and LLMs to enhance intermediate outputs, generating SysML diagrams automatically, producing computational models via code generation, and validating results through case studies. Result: The approach successfully generated accurate computational models of complex dynamical systems from SysML diagrams across multiple case studies, including an end-to-end example of a simple pendulum with improved performance over LLM-only methods. Conclusion: The proposed approach effectively automates the generation of computational models for complex dynamical systems by leveraging SysML diagrams, NLP strategies, and LLMs, demonstrating improved performance over using LLMs alone. Abstract: This paper contributes to speeding up the design and deployment of engineering dynamical systems by proposing a strategy for exploiting domain and expert knowledge for the automated generation of a dynamical system computational model starting from a corpus of documents relevant to the dynamical system of interest and an input document describing the specific system. This strategy is implemented in five steps and, crucially, it uses system modeling language diagrams (SysML) to extract accurate information about the dependencies, attributes, and operations of components. Natural Language Processing (NLP) strategies and Large Language Models (LLMs) are employed in specific tasks to improve intermediate outputs of the automated SysML diagram generation, such as: list of key nouns; list of extracted relationships; list of key phrases and key relationships; block attribute values; block relationships; and BDD diagram generation.
The applicability of automated SysML diagram generation is illustrated with different case studies. The computational models of complex dynamical systems from SysML diagrams are then obtained via code generation and computational model generation steps. In the code generation step, NLP strategies are used for summarization, while LLMs are used for validation only. The proposed approach is not limited to a specific system, domain, or computational software. The applicability of the proposed approach is shown via an end-to-end example from text to model of a simple pendulum, showing improved performance compared to results yielded by LLMs only.

[31] Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework

Zenan Xu,Zexuan Qiu,Guanhua Huang,Kun Li,Siheng Li,Chenchen Zhang,Kejiao Li,Qi Yi,Yuhao Jiang,Bo Zhou,Fengzong Lian,Zhanhui Kang

Main category: cs.CL

TL;DR: This paper introduces a collaborative inference framework for large language models that combines sequential and parallel reasoning methods, using semantic entropy to efficiently evaluate and control the reasoning process.

Details Motivation: The motivation behind this research is to overcome the limitations of existing inference-time scaling techniques for large language models (LLMs), such as inefficiency in sequential reasoning due to arbitrary token budgets and lack of coordination in parallel reasoning. Method: The researchers introduced a new intrinsic quality metric called semantic entropy (SE) to assess model responses during collaborative inference. This metric measures the semantic diversity of parallel model responses and helps in dynamically controlling and terminating the reasoning process. Result: The proposed framework utilizing semantic entropy demonstrated its effectiveness as a robust indicator of reasoning quality, showing a strong negative correlation with accuracy and enabling dynamic control over the reasoning process. Conclusion: The study concludes that by combining the strengths of both sequential and parallel reasoning paradigms through the use of semantic entropy, a more efficient and effective collaborative inference framework can be developed. Abstract: Recent advances in large language models (LLMs) have accelerated progress toward artificial general intelligence, with inference-time scaling emerging as a key technique. Contemporary approaches leverage either sequential reasoning (iteratively extending chains of thought) or parallel reasoning (generating multiple solutions simultaneously) to scale inference. However, both paradigms face fundamental limitations: sequential scaling typically relies on arbitrary token budgets for termination, leading to inefficiency or premature cutoff; while parallel scaling often lacks coordination among parallel branches and requires intrusive fine-tuning to perform effectively. In light of these challenges, we aim to design a flexible test-time collaborative inference framework that exploits the complementary strengths of both sequential and parallel reasoning paradigms. 
Towards this goal, the core challenge lies in developing an efficient and accurate intrinsic quality metric to assess model responses during collaborative inference, enabling dynamic control and early termination of the reasoning trace. To address this challenge, we introduce semantic entropy (SE), which quantifies the semantic diversity of parallel model responses and serves as a robust indicator of reasoning quality due to its strong negative correlation with accuracy...
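
The idea of using semantic entropy as a stopping signal can be sketched as below. This is a toy illustration under stated assumptions, not the paper's implementation: answers are grouped by a normalized string match (a real system would need a proper semantic-equivalence check), and the `threshold` value and the function names are hypothetical.

```python
import math
from collections import Counter

def normalize(answer: str) -> str:
    # Stand-in for semantic clustering: case- and whitespace-insensitive match.
    return " ".join(answer.lower().split())

def semantic_entropy(answers):
    """Shannon entropy (in nats) over clusters of equivalent answers."""
    clusters = Counter(normalize(a) for a in answers)
    total = sum(clusters.values())
    return sum(-(n / total) * math.log(n / total) for n in clusters.values())

def should_terminate(answers, threshold=0.5):
    # Low entropy means the parallel branches agree, so reasoning can stop.
    return semantic_entropy(answers) <= threshold

agreeing = ["42", " 42", "42 "]
split = ["42", "17"]
```

Here `semantic_entropy(agreeing)` is zero because all branches fall into one cluster, while `semantic_entropy(split)` equals log 2, the maximum for two equally likely clusters.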

[32] Shifting from Ranking to Set Selection for Retrieval Augmented Generation

Dahyun Lee,Yongrae Jo,Haeju Park,Moontae Lee

Main category: cs.CL

TL;DR: This paper proposes SETR, a new passage selection method that improves RAG performance on complex queries.

Details Motivation: Existing passage reranking methods based on individual relevance struggle to meet the information needs of complex multi-hop question answering. Method: Introduces SETR, a set-wise passage selection approach that explicitly identifies a query's information requirements through Chain-of-Thought reasoning and selects an optimal set of passages that satisfies them. Result: Experiments show that SETR outperforms both open-source and proprietary LLM-based rerankers on multi-hop RAG benchmarks. Conclusion: SETR offers an effective alternative to traditional rerankers in RAG systems; experiments show it outperforms existing methods in answer correctness and retrieval quality. Abstract: Retrieval in Retrieval-Augmented Generation(RAG) must ensure that retrieved passages are not only individually relevant but also collectively form a comprehensive set. Existing approaches primarily rerank top-k passages based on their individual relevance, often failing to meet the information needs of complex queries in multi-hop question answering. In this work, we propose a set-wise passage selection approach and introduce SETR, which explicitly identifies the information requirements of a query through Chain-of-Thought reasoning and selects an optimal set of passages that collectively satisfy those requirements. Experiments on multi-hop RAG benchmarks show that SETR outperforms both proprietary LLM-based rerankers and open-source baselines in terms of answer correctness and retrieval quality, providing an effective and efficient alternative to traditional rerankers in RAG systems. The code is available at https://github.com/LGAI-Research/SetR
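
The difference between ranking by individual relevance and selecting a set that jointly covers a query's requirements can be illustrated with a toy example. This sketch swaps in a greedy coverage heuristic for illustration; SETR itself identifies requirements and selects passages with LLM Chain-of-Thought reasoning. The `score` and `covers` fields are hypothetical annotations, not part of the released code.

```python
def top_k(passages, k):
    # Conventional reranking: keep the k individually most relevant passages.
    return sorted(passages, key=lambda p: p["score"], reverse=True)[:k]

def greedy_set_select(passages, requirements, k):
    """Pick up to k passages that together cover as many requirements
    as possible, breaking ties by individual relevance score."""
    remaining = set(requirements)
    pool = list(passages)
    chosen = []
    while remaining and pool and len(chosen) < k:
        best = max(pool, key=lambda p: (len(remaining & p["covers"]), p["score"]))
        if not remaining & best["covers"]:
            break  # nothing left in the pool adds new information
        chosen.append(best)
        remaining -= best["covers"]
        pool.remove(best)
    return chosen

# A two-hop question needs both facts; two top passages are redundant.
passages = [
    {"id": "p1", "score": 0.95, "covers": {"director"}},
    {"id": "p2", "score": 0.90, "covers": {"director"}},
    {"id": "p3", "score": 0.60, "covers": {"birthplace"}},
]
requirements = {"director", "birthplace"}

ranked_ids = [p["id"] for p in top_k(passages, 2)]
set_ids = [p["id"] for p in greedy_set_select(passages, requirements, 2)]
```

With a budget of two passages, top-k returns `p1` and `p2`, which both answer the first hop, whereas set-wise selection returns `p1` and `p3`, which together cover both hops.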

[33] Developing and Maintaining an Open-Source Repository of AI Evaluations: Challenges and Insights

Alexandra Abbas,Celia Waggoner,Justin Olive

Main category: cs.CL

TL;DR: This paper provides insights on the implementation and maintenance of AI evaluations, highlighting the need for specialized infrastructure, statistical rigor, and community coordination.

Details Motivation: AI evaluations are critical tools for assessing large language model capabilities and safety, but their implementation and maintenance pose unique challenges. Method: The paper presents practical insights from eight months of maintaining an open-source repository of AI evaluations and develops solutions for challenges in implementation and maintenance. Result: The analysis identifies key challenges in implementing and maintaining AI evaluations and develops solutions including a structured cohort management framework, statistical methodologies for optimal resampling and cross-model comparison, and systematic quality control processes. Conclusion: AI evaluation requires specialized infrastructure, statistical rigor, and community coordination beyond traditional software development practices. Abstract: AI evaluations have become critical tools for assessing large language model capabilities and safety. This paper presents practical insights from eight months of maintaining $inspect\_evals$, an open-source repository of 70+ community-contributed AI evaluations. We identify key challenges in implementing and maintaining AI evaluations and develop solutions including: (1) a structured cohort management framework for scaling community contributions, (2) statistical methodologies for optimal resampling and cross-model comparison with uncertainty quantification, and (3) systematic quality control processes for reproducibility. Our analysis reveals that AI evaluation requires specialized infrastructure, statistical rigor, and community coordination beyond traditional software development practices.

[34] SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN

Luca Mariotti,Veronica Guidetti,Federica Mandreoli

Main category: cs.CL

TL;DR: SCoRE is a modular, cost-effective, and adaptable relation extraction system that leverages contrastive learning and a Bayesian kNN classifier to deliver robust performance on noisy datasets while reducing energy consumption.

Details Motivation: The growing demand for efficient knowledge graph enrichment has intensified interest in relation extraction under low-supervision settings. There is a need for adaptable and noise-resilient solutions that integrate seamlessly with pre-trained large language models. Method: SCoRE combines supervised contrastive learning with a Bayesian k-Nearest Neighbors (kNN) classifier for multi-label classification. It also introduces two novel evaluation metrics: Correlation Structure Distance (CSD) and Precision at R (P@R). Result: Experiments show that SCoRE matches or surpasses state-of-the-art methods while significantly reducing energy consumption. Increasing model complexity degrades performance, highlighting the advantages of SCoRE's minimal design. Conclusion: SCoRE stands as an optimal choice for real-world RE applications due to its efficiency, modularity, and scalability. Abstract: The growing demand for efficient knowledge graph (KG) enrichment leveraging external corpora has intensified interest in relation extraction (RE), particularly under low-supervision settings. To address the need for adaptable and noise-resilient RE solutions that integrate seamlessly with pre-trained large language models (PLMs), we introduce SCoRE, a modular and cost-effective sentence-level RE system. SCoRE enables easy PLM switching, requires no finetuning, and adapts smoothly to diverse corpora and KGs. By combining supervised contrastive learning with a Bayesian k-Nearest Neighbors (kNN) classifier for multi-label classification, it delivers robust performance despite the noisy annotations of distantly supervised corpora. To improve RE evaluation, we propose two novel metrics: Correlation Structure Distance (CSD), measuring the alignment between learned relational patterns and KG structures, and Precision at R (P@R), assessing utility as a recommender system. 
We also release Wiki20d, a benchmark dataset replicating real-world RE conditions where only KG-derived annotations are available. Experiments on five benchmarks show that SCoRE matches or surpasses state-of-the-art methods while significantly reducing energy consumption. Further analyses reveal that increasing model complexity, as seen in prior work, degrades performance, highlighting the advantages of SCoRE's minimal design. Combining efficiency, modularity, and scalability, SCoRE stands as an optimal choice for real-world RE applications.

[35] VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation

Ziang Ye,Yang Zhang,Wentao Shi,Xiaoyu You,Fuli Feng,Tat-Seng Chua

Main category: cs.CL

TL;DR: This paper introduces VisualTrap, a backdoor attack targeting GUI agents' visual grounding, demonstrating how subtle poisoned data can hijack their behavior while remaining undetectable to humans.

Details Motivation: GUI agents integrated with personal devices offer human-like automation but introduce significant security concerns, particularly backdoor attacks. This work explores these vulnerabilities to understand and address potential threats. Method: The authors propose a method called VisualTrap, which involves injecting poisoned data during the pre-training of visual grounding to mislead the agent's decision-making process. The attack aims to hijack the mapping of textual plans to GUI elements using stealthy visual triggers. Result: Empirical results show that VisualTrap can effectively hijack visual grounding with minimal poisoned data (as low as 5%) and highly stealthy visual triggers. The attack remains effective across different GUI environments (e.g., mobile/web to desktop) and persists even after clean fine-tuning. Conclusion: The paper concludes that GUI agents powered by LVLMs are vulnerable to backdoor attacks through visual grounding, which can compromise the agent's behavior even with correct task-solving plans. This highlights the urgent need for further research into backdoor attack risks in these systems. Abstract: Graphical User Interface (GUI) agents powered by Large Vision-Language Models (LVLMs) have emerged as a revolutionary approach to automating human-machine interactions, capable of autonomously operating personal devices (e.g., mobile phones) or applications within the device to perform complex real-world tasks in a human-like manner. However, their close integration with personal devices raises significant security concerns, with many threats, including backdoor attacks, remaining largely unexplored. This work reveals that the visual grounding of GUI agent-mapping textual plans to GUI elements-can introduce vulnerabilities, enabling new types of backdoor attacks. With backdoor attack targeting visual grounding, the agent's behavior can be compromised even when given correct task-solving plans. 
To validate this vulnerability, we propose VisualTrap, a method that can hijack the grounding by misleading the agent to locate textual plans to trigger locations instead of the intended targets. VisualTrap uses the common method of injecting poisoned data for attacks, and does so during the pre-training of visual grounding to ensure practical feasibility of attacking. Empirical results show that VisualTrap can effectively hijack visual grounding with as little as 5% poisoned data and highly stealthy visual triggers (invisible to the human eye); and the attack can be generalized to downstream tasks, even after clean fine-tuning. Moreover, the injected trigger can remain effective across different GUI environments, e.g., being trained on mobile/web and generalizing to desktop environments. These findings underscore the urgent need for further research on backdoor attack risks in GUI agents.

[36] MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection

Ziyan Liu,Chunxiao Fan,Haoran Lou,Yuexin Wu,Kaiwei Deng

Main category: cs.CL

TL;DR: MIND is a zero-shot harmful meme detection framework that leverages contextual information and multi-agent debate to effectively identify harmful content without annotated data.

Details Motivation: Traditional data-driven methods struggle with detecting new harmful memes due to their evolving nature and lack of annotated data, necessitating a more adaptive and zero-shot approach. Method: MIND uses a multi-agent framework with three strategies: retrieving similar memes from an unannotated set, a bi-directional insight derivation mechanism, and a multi-agent debate mechanism for decision-making. Result: Experiments on three meme datasets show that MIND outperforms existing zero-shot methods and generalizes well across different model architectures and parameter scales. Conclusion: The proposed MIND framework provides a scalable and effective solution for zero-shot harmful meme detection, outperforming existing approaches and demonstrating robust generalization. Abstract: The rapid expansion of memes on social media has highlighted the urgent need for effective approaches to detect harmful content. However, traditional data-driven approaches struggle to detect new memes due to their evolving nature and the lack of up-to-date annotated data. To address this issue, we propose MIND, a multi-agent framework for zero-shot harmful meme detection that does not rely on annotated data. MIND implements three key strategies: 1) We retrieve similar memes from an unannotated reference set to provide contextual information. 2) We propose a bi-directional insight derivation mechanism to extract a comprehensive understanding of similar memes. 3) We then employ a multi-agent debate mechanism to ensure robust decision-making through reasoned arbitration. Extensive experiments on three meme datasets demonstrate that our proposed framework not only outperforms existing zero-shot approaches but also shows strong generalization across different model architectures and parameter scales, providing a scalable solution for harmful meme detection. The code is available at https://github.com/destroy-lonely/MIND.

Xiao Wang,Jiahuan Pei,Diancheng Shui,Zhiguang Han,Xin Sun,Dawei Zhu,Xiaoyu Shen

Main category: cs.CL

TL;DR: This paper introduces the MPMCP dataset to evaluate how multiple defendants and multiple charges affect legal judgment prediction.

Details Motivation: Whether multiple defendants and charges should be treated separately in legal judgment prediction has not been adequately studied. Method: Introduces the MPMCP dataset and evaluates several prevailing legal large language models on four legal judgment scenarios. Result: Experiments find that the scenario involving multiple defendants and multiple charges (S4) is the most challenging, followed by S2, S3, and S1. Conclusion: The MPMCP dataset reveals the challenges that multiple defendants and charges pose for LJP, and model performance differs significantly across scenarios. Abstract: Legal judgment prediction offers a compelling method to aid legal practitioners and researchers. However, the research question remains relatively under-explored: Should multiple defendants and charges be treated separately in LJP? To address this, we introduce a new dataset namely multi-person multi-charge prediction (MPMCP), and seek the answer by evaluating the performance of several prevailing legal large language models (LLMs) on four practical legal judgment scenarios: (S1) single defendant with a single charge, (S2) single defendant with multiple charges, (S3) multiple defendants with a single charge, and (S4) multiple defendants with multiple charges. We evaluate the dataset across two LJP tasks, i.e., charge prediction and penalty term prediction. We have conducted extensive experiments and found that the scenario involving multiple defendants and multiple charges (S4) poses the greatest challenges, followed by S2, S3, and S1. The impact varies significantly depending on the model. For example, in S4 compared to S1, InternLM2 achieves approximately 4.5% lower F1-score and 2.8% higher LogD, while Lawformer demonstrates around 19.7% lower F1-score and 19.0% higher LogD. Our dataset and code are available at https://github.com/lololo-xiao/MultiJustice-MPMCP.

[38] Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues

Fareya Ikram,Alexander Scarlatos,Andrew Lan

Main category: cs.CL

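TL;DR: This study examines how well large language models such as Llama 3 and GPT-4o predict tutor strategy and student outcomes in math tutoring dialogues, finding that current models still struggle to predict tutor strategy even though tutor strategy is highly indicative of student outcomes.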

Details Motivation: Tutoring dialogues have drawn wide attention in recent years with the spread of online learning and the improving tutoring abilities of AI agents powered by large language models. Prior studies show that the strategies tutors use significantly affect student outcomes, which calls for methods that predict tutor behavior and its impact on students. However, few works have studied predicting tutor strategy in dialogues. Method: The study uses two math tutoring dialogue datasets to evaluate the ability of modern LLMs, particularly Llama 3 and GPT-4o, to predict future tutor moves and student outcomes in dialogues. Result: Even state-of-the-art LLMs struggle to predict future tutor strategy, while tutor strategy is highly indicative of student outcomes. Conclusion: Because state-of-the-art LLMs struggle to predict future tutor strategy, yet tutor strategy strongly relates to student outcomes, more powerful methods are needed for this task. Abstract: Tutoring dialogues have gained significant attention in recent years, given the prominence of online learning and the emerging tutoring abilities of artificial intelligence (AI) agents powered by large language models (LLMs). Recent studies have shown that the strategies used by tutors can have significant effects on student outcomes, necessitating methods to predict how tutors will behave and how their actions impact students. However, few works have studied predicting tutor strategy in dialogues. Therefore, in this work we investigate the ability of modern LLMs, particularly Llama 3 and GPT-4o, to predict both future tutor moves and student outcomes in dialogues, using two math tutoring dialogue datasets. We find that even state-of-the-art LLMs struggle to predict future tutor strategy while tutor strategy is highly indicative of student outcomes, outlining a need for more powerful methods to approach this task.

[39] Rethinking Verification for LLM Code Generation: From Generation to Testing

Zihan Ma,Taolin Zhang,Maosong Cao,Wenwei Zhang,Minnan Luo,Songyang Zhang,Kai Chen

Main category: cs.CL

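TL;DR: This paper proposes SAGA, a human-LLM collaborative method that significantly improves the quality and coverage of generated test cases for code generation.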

Details Motivation: Existing code-generation evaluation suites contain a limited number of test cases, which leads to inaccurate performance measurement and compromised reward estimation. Method: Proposes multi-dimensional metrics to quantify test-suite thoroughness and develops SAGA, a method that combines human programming expertise with LLM reasoning capability. Result: SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench, and the verifier accuracy of the benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. Conclusion: SAGA improves the coverage and quality of generated test cases, provides a scalable foundation for reliable LLM code evaluation, and advances RLVR in code generation and automated adversarial test synthesis. Abstract: Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop a TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.

[40] Investigating the Robustness of Retrieval-Augmented Generation at the Query Level

Sezen Perçin,Xin Su,Qutub Sha Syed,Phillip Howard,Aleksei Kuvshinov,Leo Schwinn,Kay-Ulrich Scholl

Main category: cs.CL

TL;DR: This paper investigates how sensitive retrieval-augmented generation (RAG) systems are to variations in input queries, revealing significant performance degradation in retrievers even under minor changes, and proposes an evaluation framework along with actionable recommendations for improvement.

Details Motivation: RAG has been proposed as a solution to improve factual consistency and reduce hallucinations in large language models, but it faces practical challenges due to its dependence on query quality. This paper aims to systematically evaluate this issue. Method: The authors analyzed the RAG pipeline components' sensitivity to query perturbations through over 1092 experiments, using both general-domain and domain-specific datasets in an end-to-end question answering setting. Result: The performance of commonly used retrievers in RAG systems was found to degrade significantly under minor query variations, highlighting the need for more robust approaches. Conclusion: The study concludes that RAG systems are sensitive to query variations, and offers recommendations for improving their robustness. Abstract: Large language models (LLMs) are very costly and inefficient to update with new information. To address this limitation, retrieval-augmented generation (RAG) has been proposed as a solution that dynamically incorporates external knowledge during inference, improving factual consistency and reducing hallucinations. Despite its promise, RAG systems face practical challenges-most notably, a strong dependence on the quality of the input query for accurate retrieval. In this paper, we investigate the sensitivity of different components in the RAG pipeline to various types of query perturbations. Our analysis reveals that the performance of commonly used retrievers can degrade significantly even under minor query variations. We study each module in isolation as well as their combined effect in an end-to-end question answering setting, using both general-domain and domain-specific datasets. Additionally, we propose an evaluation framework to systematically assess the query-level robustness of RAG pipelines and offer actionable recommendations for practitioners based on the results of more than 1092 experiments we performed.
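
A query-level robustness check of the kind the abstract describes can be sketched with a toy retriever. This is an illustrative setup only: the token-overlap retriever, the single adjacent-character swap used as a perturbation, and the top-1 stability measure are all assumptions, not the paper's evaluation framework.

```python
def tokenize(text):
    return text.lower().split()

def retrieve(query, docs, k=1):
    # Toy lexical retriever: rank documents by shared tokens with the query.
    q = set(tokenize(query))
    ranked = sorted(docs, key=lambda d: len(q & set(tokenize(d))), reverse=True)
    return ranked[:k]

def swap_typo(query):
    # Deterministic perturbation: swap the first two characters
    # of the longest word in the query.
    words = query.split()
    i = max(range(len(words)), key=lambda j: len(words[j]))
    w = words[i]
    if len(w) >= 2:
        words[i] = w[1] + w[0] + w[2:]
    return " ".join(words)

def top1_stable(query, docs):
    # True when the perturbed query still retrieves the same top document.
    return retrieve(query, docs)[0] == retrieve(swap_typo(query), docs)[0]

docs = [
    "insulin regulates blood glucose",
    "paris is the capital of france",
]
```

In this toy setup, `top1_stable("what regulates glucose", docs)` holds because a second query term still matches after the typo, while the one-word query `"capital"` loses its only match and the top result changes. Averaging such a stability flag over many queries and perturbation types gives a simple retriever-level robustness score.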

[41] FRaN-X: FRaming and Narratives-eXplorer

Artur Muratov,Hana Fatima Shaikh,Vanshikaa Jani,Tarek Mahmoud,Zhuohan Xie,Daniil Orel,Aaryamonvikram Singh,Yuxia Wang,Aadi Joshi,Hasan Iqbal,Ming Shan Hee,Dhruv Sahnan,Nikolaos Nikolaidis,Purificação Silvano,Dimitar Dimitrov,Roman Yangarber,Ricardo Campos,Alípio Jorge,Nuno Guimarães,Elisa Sartori,Nicolas Stefanovitch,Giovanni Da San Martino,Jakub Piskorski,Preslav Nakov

Main category: cs.CL

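TL;DR: FRaN-X is an automated system that identifies entities and their narrative roles in text and provides interactive analysis tools across languages and domains.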

Details Motivation: This work addresses the challenge of automatically detecting and labeling how entities are portrayed in raw text, giving media analysts a tool for exploring narrative framing across sources. Method: FRaN-X uses a two-stage system that combines sequence labeling with fine-grained role classification to identify the role each entity plays in a text (protagonist, antagonist, or innocent), refined through 22 nested fine-grained categories. Result: FRaN-X supports five languages (Bulgarian, English, Hindi, Russian, and Portuguese) and two domains (the Russia-Ukraine conflict and climate change), and provides aggregate analysis including graph visualization, search, and a timeline view. Conclusion: FRaN-X is a publicly accessible, versatile system that automatically detects and classifies entity narrative roles and offers an interactive interface for multilingual, multi-domain analysis. Abstract: We present FRaN-X, a Framing and Narratives Explorer that automatically detects entity mentions and classifies their narrative roles directly from raw text. FRaN-X comprises a two-stage system that combines sequence labeling with fine-grained role classification to reveal how entities are portrayed as protagonists, antagonists, or innocents, using a unique taxonomy of 22 fine-grained roles nested under these three main categories. The system supports five languages (Bulgarian, English, Hindi, Russian, and Portuguese) and two domains (the Russia-Ukraine Conflict and Climate Change). It provides an interactive web interface for media analysts to explore and compare framing across different sources, tackling the challenge of automatically detecting and labeling how entities are framed. Our system allows end users to focus on a single article as well as analyze up to four articles simultaneously. We provide aggregate level analysis including an intuitive graph visualization that highlights the narrative a group of articles is pushing. Our system includes a search feature for users to look up entities of interest, along with a timeline view that allows analysts to track an entity's role transitions across different contexts within the article. The FRaN-X system and the trained models are licensed under an MIT License. FRaN-X is publicly accessible at https://fran-x.streamlit.app/ and a video demonstration is available at https://youtu.be/VZVi-1B6yYk.

[42] FlexOlmo: Open Language Models for Flexible Data Use

Weijia Shi,Akshita Bhagia,Kevin Farhat,Niklas Muennighoff,Pete Walsh,Jacob Morrison,Dustin Schwenk,Shayne Longpre,Jake Poznanski,Allyson Ettinger,Daogao Liu,Margaret Li,Dirk Groeneveld,Mike Lewis,Wen-tau Yih,Luca Soldaini,Kyle Lo,Noah A. Smith,Luke Zettlemoyer,Pang Wei Koh,Hannaneh Hajishirzi,Ali Farhadi,Sewon Min

Main category: cs.CL

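TL;DR: FlexOlmo is a language model that supports distributed training and flexible inference, integrating different data sources without sharing data and balancing performance gains with data privacy.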

Details Motivation: Address model training and inference needs where data sharing is restricted, particularly in industries that must protect data privacy and respect data licensing requirements. Method: Uses a mixture-of-experts (MoE) architecture in which each expert is trained independently on a closed dataset and integrated through domain-informed routing without joint training. The model is trained on the FlexMix corpus and evaluated on 31 downstream tasks. Result: Combining a general expert with independently trained experts yields an average 41% relative improvement; FlexOlmo outperforms prior model merging methods by 10.1% on average and surpasses a standard MoE trained with the same training FLOPs, while supporting fine-grained control of data access at inference. Conclusion: FlexOlmo proposes a new language model architecture that addresses privacy and control over sensitive or protected data in training and inference. Through the MoE architecture and domain-informed routing, it achieves distributed training and flexible inference and outperforms existing methods. Abstract: We introduce FlexOlmo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets. We evaluate models with up to 37 billion parameters (20 billion active) on 31 diverse downstream tasks. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners, leading to an average 41% relative improvement while allowing users to opt out of certain data based on data licensing or permission requirements. Our approach also outperforms prior model merging methods by 10.1% on average and surpasses the standard MoE trained without data restrictions using the same training FLOPs. Altogether, this research presents a solution for both data owners and researchers in regulated industries with sensitive or protected data. FlexOlmo enables benefiting from closed data while respecting data owners' preferences by keeping their data local and supporting fine-grained control of data access during inference.
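
The idea of including or excluding experts at inference time can be illustrated with a small sketch. This is not FlexOlmo's domain-informed routing, whose details the abstract does not give; it only assumes a router that produces one score per expert and renormalizes over the experts that are opted in. The names `route`, `moe_output`, and the example scores are hypothetical.

```python
import math

def route(scores, included):
    """Softmax over router scores, restricted to experts that are opted in.

    Assumption: excluded experts simply receive zero weight and the
    remaining weights are renormalized; no retraining is involved.
    """
    active = [i for i in range(len(scores)) if included[i]]
    if not active:
        raise ValueError("at least one expert must be included")
    m = max(scores[i] for i in active)  # subtract max for numerical stability
    exps = {i: math.exp(scores[i] - m) for i in active}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]

def moe_output(expert_outputs, weights):
    """Weighted sum of per-expert output vectors."""
    dim = len(expert_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, expert_outputs))
            for d in range(dim)]

# Hypothetical router scores for a public expert and two closed-data experts.
scores = [2.0, 1.0, 0.5]
all_in = route(scores, [True, True, True])
opt_out = route(scores, [True, False, True])  # second data owner opts out
mixed = moe_output([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], opt_out)
```

With the second expert excluded, its weight is exactly zero and the remaining weights still sum to one, so the excluded expert's parameters play no part in the output.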

[43] UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations

Fengran Mo,Yifan Gao,Chuan Meng,Xin Liu,Zhuofeng Wu,Kelong Mao,Zhengyang Wang,Pei Chen,Zheng Li,Xian Li,Bing Yin,Meng Jiang

Main category: cs.CL

TL;DR: This paper presents a unified approach to combine retrieval and response generation for conversational search, achieving better performance than current methods.

Details Motivation: Existing systems use separate models for retrieval and generation, limiting their effectiveness. Unified models need better approaches to handle context understanding, retrieval, and response generation together. Method: Joint fine-tuning with designed mechanisms to reduce inconsistency and data discrepancy. Result: The unified model outperforms existing baselines on five conversational search datasets. Conclusion: The proposed unified model effectively combines dense retrieval and response generation, leading to mutual improvement in conversational search tasks. Abstract: The rapid advancement of conversational search systems revolutionizes how information is accessed by enabling the multi-turn interaction between the user and the system. Existing conversational search systems are usually built with two different models. This separation restricts the system from leveraging the intrinsic knowledge of the models simultaneously, which cannot ensure the effectiveness of retrieval benefiting the generation. The existing studies for developing unified models cannot fully address the aspects of understanding conversational context, managing retrieval independently, and generating responses. In this paper, we explore how to unify dense retrieval and response generation for large language models in conversation. We conduct joint fine-tuning with different objectives and design two mechanisms to reduce the inconsistency risks while mitigating data discrepancy. The evaluations on five conversational search datasets demonstrate that our unified model can mutually improve both tasks and outperform the existing baselines.

[44] Discrete Diffusion Models for Language Generation

Ashen Weligalle

Main category: cs.CL

TL;DR: This thesis explores the application of discrete diffusion models, specifically D3PM, to natural language generation, comparing its performance with traditional autoregressive models. While autoregressive models perform better in terms of data compression, discrete diffusion models show promise in processing speed, offering potential advantages for parallel generation.

Details Motivation: Diffusion models have shown success in continuous data domains like image and video generation, but applying them to discrete data such as natural language remains challenging. This research investigates whether discrete diffusion models can effectively handle natural language generation tasks. Method: The study evaluates the Discrete Denoising Diffusion Probabilistic Model (D3PM) and compares it with traditional autoregressive (AR) language models using metrics such as Bits Per Token (BPT), Negative Log-Likelihood (NLL), Perplexity (PPL), and Batch Processing Speed. All evaluations were conducted under consistent conditions to ensure a fair comparison. Result: The best-performing D3PM model achieved a BPT of 5.72 with a mean of 8.05, while the AR model showed better compression with a lower mean BPT of 4.59. However, D3PM demonstrated superior processing speed, reaching up to 3.97 batches per second, suggesting its potential for parallel generation. Conclusion: The research highlights the trade-offs between diffusion-based and autoregressive models in natural language generation, showing that while autoregressive models excel in generative quality, discrete diffusion models offer higher processing speeds, indicating potential for future non-autoregressive language generation techniques. Abstract: Diffusion models have emerged as a powerful class of generative models, achieving state-of-the-art results in continuous data domains such as image and video generation. Their core mechanism involves a forward diffusion process that gradually transforms structured data into a Gaussian-like distribution, followed by a learned reverse process to reconstruct the data. While successful in continuous modalities, applying this framework to discrete data, particularly natural language, remains challenging due to token dependency complexities and the lack of a defined generation order. This thesis investigates the feasibility and performance of discrete diffusion models for natural language generation. Specifically, we evaluate the Discrete Denoising Diffusion Probabilistic Model (D3PM) and compare it with traditional autoregressive (AR) language models. To assess generative performance, we use Bits Per Token (BPT), Negative Log-Likelihood (NLL), Perplexity (PPL), and Batch Processing Speed. Results show the best-performing D3PM model achieves a BPT of 5.72, with a mean of 8.05. The AR model outperforms in compression with a lower mean BPT of 4.59, but D3PM achieves higher processing speed, reaching up to 3.97 batches per second, indicating potential for parallel generation. All evaluations were conducted under consistent conditions (generating 100,000 tokens per model with a fixed batch size of four) for fair comparison. This research presents a detailed analysis of diffusion-based vs. autoregressive models, highlighting trade-offs in generative quality and efficiency. Findings emphasize both the promise and limitations of diffusion models for discrete data, supporting future work in non-autoregressive language generation.
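
The three likelihood metrics named above are related by standard identities, which the short sketch below makes explicit. It assumes NLL is reported as a mean per-token value in nats; the thesis may use a different convention, and the function names are illustrative only.

```python
import math

def bits_per_token(nll_nats):
    """Convert mean per-token NLL in nats to bits per token."""
    return nll_nats / math.log(2)

def perplexity(nll_nats):
    """Perplexity is the exponential of the mean per-token NLL in nats."""
    return math.exp(nll_nats)

def perplexity_from_bpt(bpt):
    """Equivalent form: perplexity is 2 raised to the bits per token."""
    return 2.0 ** bpt

# Mean BPT values quoted in the abstract.
d3pm_bpt, ar_bpt = 8.05, 4.59
ppl_ratio = perplexity_from_bpt(d3pm_bpt) / perplexity_from_bpt(ar_bpt)
```

Under these assumptions, and only if both models are scored over the same tokenization, the gap in mean BPT corresponds to a perplexity roughly 11 times higher for D3PM than for the AR model.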

cs.CV [Back]

[45] Unveiling the Underwater World: CLIP Perception Model-Guided Underwater Image Enhancement

Jiangzhong Cao,Zekai Zeng,Xu Zhang,Huan Zhang,Chunling Fan,Gangyi Jiang,Weisi Lin

Main category: cs.CV

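TL;DR: This paper proposes a new underwater image enhancement method that combines a CLIP perception loss module with curriculum contrastive regularization to improve the perceptual quality and content restoration of enhanced images.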

Details Motivation: Conventional deep learning-based underwater image enhancement methods overlook human perception, so enhanced images suffer from diminished perceptual quality or poor content restoration; a method better aligned with human visual perception is needed. Method: The method leverages the visual semantic feature extraction capability of the CLIP model to learn an appropriate prompt pair that maps and evaluates underwater image quality, and embeds this as a perception loss module in the enhancement network. The CLIP perception model is also combined with curriculum contrastive regularization to strengthen the constraints in the CLIP perceptual space. Result: Extensive experiments show that the method outperforms current state-of-the-art underwater image enhancement methods in visual quality and generalization ability. Conclusion: The paper proposes an underwater image enhancement method that combines a CLIP perception loss module with curriculum contrastive regularization, uses the CLIP model to improve the perceptual quality of enhanced images, and demonstrates better visual quality and generalization than existing methods. Abstract: High-quality underwater images are essential for both machine vision tasks and viewers with their aesthetic appeal. However, the quality of underwater images is severely affected by light absorption and scattering. Deep learning-based methods for Underwater Image Enhancement (UIE) have achieved good performance. However, these methods often overlook human perception and lack sufficient constraints within the solution space. Consequently, the enhanced images often suffer from diminished perceptual quality or poor content restoration. To address these issues, we propose a UIE method with a Contrastive Language-Image Pre-Training (CLIP) perception loss module and curriculum contrastive regularization. Above all, to develop a perception model for underwater images that aligns more closely with human visual perception, the visual semantic feature extraction capability of the CLIP model is leveraged to learn an appropriate prompt pair to map and evaluate the quality of underwater images. This CLIP perception model is then incorporated as a perception loss module into the enhancement network to improve the perceptual quality of enhanced images. Furthermore, the CLIP perception model is integrated with the curriculum contrastive regularization to enhance the constraints imposed on the enhanced images within the CLIP perceptual space, mitigating the risk of both under-enhancement and over-enhancement. Specifically, the CLIP perception model is employed to assess and categorize the learning difficulty level of negatives in the regularization process, ensuring comprehensive and nuanced utilization of distorted images and negatives with varied quality levels. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in terms of visual quality and generalization ability.

[46] SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability

Ali Nasiri-Sarvi,Hassan Rivaz,Mahdi S. Hosseini

Main category: cs.CV

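TL;DR: SPARC builds a unified latent space to resolve the inconsistent encoding of high-level concepts across AI models, substantially improving cross-model concept alignment.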

Details Motivation: Existing methods such as Sparse Autoencoders (SAEs) do not yield compatible latent concept spaces across AI models, which limits cross-model interpretability. Method: Introduces a Global TopK sparsity mechanism and a Cross-Reconstruction Loss to build a unified latent space. Result: On Open Images, SPARC reaches a Jaccard similarity of 0.80, more than tripling alignment relative to previous methods. Conclusion: The SPARC framework aligns concepts across models and modalities, providing an effective approach to cross-model interpretability and practical applications. Abstract: Understanding how different AI models encode the same high-level concepts, such as objects or attributes, remains challenging because each model typically produces its own isolated representation. Existing interpretability methods like Sparse Autoencoders (SAEs) produce latent concepts individually for each model, resulting in incompatible concept spaces and limiting cross-model interpretability. To address this, we introduce SPARC (Sparse Autoencoders for Aligned Representation of Concepts), a new framework that learns a single, unified latent space shared across diverse architectures and modalities (e.g., vision models like DINO, and multimodal models like CLIP). SPARC's alignment is enforced through two key innovations: (1) a Global TopK sparsity mechanism, ensuring all input streams activate identical latent dimensions for a given concept; and (2) a Cross-Reconstruction Loss, which explicitly encourages semantic consistency between models. On Open Images, SPARC dramatically improves concept alignment, achieving a Jaccard similarity of 0.80, more than tripling the alignment compared to previous methods. SPARC creates a shared sparse latent space where individual dimensions often correspond to similar high-level concepts across models and modalities, enabling direct comparison of how different architectures represent identical concepts without requiring manual alignment or model-specific analysis. As a consequence of this aligned representation, SPARC also enables practical applications such as text-guided spatial localization in vision-only models and cross-model/cross-modal retrieval. Code and models are available at https://github.com/AtlasAnalyticsLab/SPARC.
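
To make the alignment metric concrete, the sketch below compares per-stream TopK with a shared ("global") TopK and measures the Jaccard similarity of the active latent indices. The aggregation rule (summing activations across streams before selecting the top k) is an assumption made for illustration; the abstract only states that all streams activate identical dimensions. The activation values and names are hypothetical.

```python
def topk_indices(values, k):
    """Indices of the k largest activations."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def global_topk(streams, k):
    """Keep the same k latent dimensions in every stream.

    Assumption: dimensions are ranked by activation summed over streams.
    """
    summed = [sum(col) for col in zip(*streams)]
    keep = topk_indices(summed, k)
    masked = [[v if i in keep else 0.0 for i, v in enumerate(s)] for s in streams]
    return masked, keep

# Hypothetical latent activations from two encoders for the same image.
dino = [0.9, 0.1, 0.8, 0.0, 0.3]
clip = [0.2, 0.7, 0.9, 0.1, 0.6]

independent = jaccard(topk_indices(dino, 2), topk_indices(clip, 2))
shared, keep = global_topk([dino, clip], 2)
active = [{i for i, v in enumerate(s) if v != 0.0} for s in shared]
aligned = jaccard(active[0], active[1])
```

In this toy case the independently selected index sets overlap in only one of three dimensions, while the shared selection makes the active sets identical by construction.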

[47] A Probabilistic Approach to Uncertainty Quantification Leveraging 3D Geometry

Rushil Desai,Frederik Warburg,Trevor Darrell,Marissa Ramirez de Chanlatte

Main category: cs.CV

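TL;DR: BayesSDF is a new uncertainty quantification framework that addresses the geometric consistency and computational efficiency problems of current methods and shows strong performance on multiple datasets.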

Details Motivation: Quantifying uncertainty in neural implicit 3D representations, particularly those using Signed Distance Functions (SDFs), remains a major challenge because of computational inefficiency, scalability issues, and geometric inconsistency. Existing methods typically neglect direct geometric integration, leading to poorly calibrated uncertainty maps. Method: BayesSDF uses a Laplace approximation to quantify local surface instability via Hessian-based metrics, enabling computationally efficient, surface-aware uncertainty estimation. Result: Extensive evaluation on synthetic and real-world datasets demonstrates superior calibration and geometric consistency of the uncertainty estimates. Conclusion: BayesSDF is a novel probabilistic framework for uncertainty quantification in neural implicit SDF models; evaluations on synthetic and real-world datasets show that it outperforms existing methods in geometric consistency and calibration. Abstract: Quantifying uncertainty in neural implicit 3D representations, particularly those utilizing Signed Distance Functions (SDFs), remains a substantial challenge due to computational inefficiencies, scalability issues, and geometric inconsistencies. Existing methods typically neglect direct geometric integration, leading to poorly calibrated uncertainty maps. We introduce BayesSDF, a novel probabilistic framework for uncertainty quantification in neural implicit SDF models, motivated by scientific simulation applications with 3D environments (e.g., forests) such as modeling fluid flow through forests, where precise surface geometry and awareness of fidelity surface geometric uncertainty are essential. Unlike radiance-based models such as NeRF or 3D Gaussian splatting, which lack explicit surface formulations, SDFs define continuous and differentiable geometry, making them better suited for physical modeling and analysis. BayesSDF leverages a Laplace approximation to quantify local surface instability via Hessian-based metrics, enabling computationally efficient, surface-aware uncertainty estimation. Our method shows that uncertainty predictions correspond closely with poorly reconstructed geometry, providing actionable confidence measures for downstream use. Extensive evaluations on synthetic and real-world datasets demonstrate that BayesSDF outperforms existing methods in both calibration and geometric consistency, establishing a strong foundation for uncertainty-aware 3D scene reconstruction, simulation, and robotic decision-making.

[48] LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance

Zhang Li,Biao Yang,Qiang Liu,Shuo Zhang,Zhiyin Ma,Shuo Zhang,Liang Yin,Linger Deng,Yabo Sun,Yuliang Liu,Xiang Bai

Main category: cs.CV

TL;DR: This paper proposes LIRA, a framework that improves segmentation and comprehension in multi-modal models through complementary visual processing techniques.

Details Motivation: Large multi-modal models struggle with inaccurate segmentation and hallucinated comprehension due to weak visual understanding and lack of detailed perception. Method: LIRA introduces two components: SEFE for improved object attribute inference through feature fusion, and ILVC for fine-grained supervision via local description generation. Additionally, AttrEval dataset is introduced for evaluation. Result: LIRA achieves state-of-the-art performance on segmentation and comprehension tasks while revealing a correlation between segmentation precision and semantic token relevance. Conclusion: LIRA provides an effective solution to the limitations of large multi-modal models in segmentation and comprehension by leveraging visual comprehension-segmentation complementarity. Abstract: While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from constraints in weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) Semantic-Enhanced Feature Extractor (SEFE) improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; (2) Interleaved Local Visual Coupling (ILVC) autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the token. To quantify this relationship and the model's potential semantic inferring ability, we introduce the Attributes Evaluation (AttrEval) dataset. 
Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks. Code will be available at https://github.com/echo840/LIRA.

[49] Advancing Offline Handwritten Text Recognition: A Systematic Review of Data Augmentation and Generation Techniques

Yassin Hussein Rassul,Aram M. Ahmed,Polla Fattah,Bryar A. Hassan,Arwaa W. Abdulkareem,Tarik A. Rashid,Joan Lu

Main category: cs.CV

TL;DR: This paper surveys offline handwritten data augmentation and generation techniques aimed at improving the accuracy of Handwritten Text Recognition (HTR) systems, especially for low-resource languages and complex scripts. Using the PRISMA methodology, it reviews 848 studies to assess traditional and deep learning-based approaches, highlighting challenges and proposing future research directions.

Details Motivation: Offline Handwritten Text Recognition (HTR) systems are essential for applications like historical document digitization and biometric authentication. However, their effectiveness is limited by the scarcity of annotated training data, particularly for low-resource languages and complex scripts. This survey aims to explore current augmentation and generation techniques to address these limitations. Method: This study uses the PRISMA methodology to systematically review offline handwritten data augmentation and generation techniques. It analyzes traditional methods alongside recent deep learning approaches such as GANs, diffusion models, and transformers, drawing on a comprehensive selection of 848 studies from major academic databases. Result: A total of 1,302 primary studies were initially considered, reduced to 848 after deduplication. These studies were evaluated to understand existing datasets, assessment metrics, and state-of-the-art methodologies in handwritten text generation, leading to the identification of key challenges and opportunities for future research. Conclusion: The paper concludes that despite the progress in data augmentation and generation techniques, challenges remain in generating diverse and realistic handwriting samples, especially for low-resource languages. The survey identifies research gaps and suggests future directions to enhance HTR systems' performance across various linguistic contexts. Abstract: Offline Handwritten Text Recognition (HTR) systems play a crucial role in applications such as historical document digitization, automatic form processing, and biometric authentication. However, their performance is often hindered by the limited availability of annotated training data, particularly for low-resource languages and complex scripts. This paper presents a comprehensive survey of offline handwritten data augmentation and generation techniques designed to improve the accuracy and robustness of HTR systems. 
We systematically examine traditional augmentation methods alongside recent advances in deep learning, including Generative Adversarial Networks (GANs), diffusion models, and transformer-based approaches. Furthermore, we explore the challenges associated with generating diverse and realistic handwriting samples, particularly in preserving script authenticity and addressing data scarcity. This survey follows the PRISMA methodology, ensuring a structured and rigorous selection process. Our analysis began with 1,302 primary studies, which were filtered down to 848 after removing duplicates, drawing from key academic sources such as IEEE Digital Library, Springer Link, Science Direct, and ACM Digital Library. By evaluating existing datasets, assessment metrics, and state-of-the-art methodologies, this survey identifies key research gaps and proposes future directions to advance the field of handwritten text generation across diverse linguistic and stylistic landscapes.

[50] Centralized Copy-Paste: Enhanced Data Augmentation Strategy for Wildland Fire Semantic Segmentation

Joon Tai Kim,Tianle Chen,Ziyu Dong,Nishanth Kunchala,Alexander Guller,Daniel Ospina Acero,Roger Williams,Mrinal Kumar

Main category: cs.CV

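TL;DR: This paper proposes an augmentation method called CCPDA that increases dataset diversity while preserving the key characteristics of the fire class, with the aim of improving the training of deep-learning multiclass segmentation models, particularly in wildland fire science.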

Details Motivation: To address the prohibitive cost of collecting and annotating images for training segmentation models in wildland fire science, and the scarcity of reliable public annotated datasets. Method: CCPDA has three main steps: (i) identify fire clusters in the source image, (ii) apply a centralization technique to focus on the core of the fire area, and (iii) paste the refined fire clusters onto a target image. Result: Numerical performance assessment validates the effectiveness of CCPDA in improving fire-class segmentation and shows that it outperforms other augmentation strategies in the application scenario considered. Conclusion: CCPDA outperforms other augmentation strategies at improving fire-class segmentation and is effective at alleviating the difficulties associated with small, manually labeled training datasets. Abstract: Collecting and annotating images for the purpose of training segmentation models is often cost prohibitive. In the domain of wildland fire science, this challenge is further compounded by the scarcity of reliable public datasets with labeled ground truth. This paper presents the Centralized Copy-Paste Data Augmentation (CCPDA) method, for the purpose of assisting with the training of deep-learning multiclass segmentation models, with special focus on improving segmentation outcomes for the fire-class. CCPDA has three main steps: (i) identify fire clusters in the source image, (ii) apply a centralization technique to focus on the core of the fire area, and (iii) paste the refined fire clusters onto a target image. This method increases dataset diversity while preserving the essential characteristics of the fire class. The effectiveness of this augmentation technique is demonstrated via numerical analysis and comparison against various other augmentation methods using a weighted sum-based multi-objective optimization approach. This approach helps elevate segmentation performance metrics specific to the fire class, which carries significantly more operational significance than other classes (fuel, ash, or background). Numerical performance assessment validates the efficacy of the presented CCPDA method in alleviating the difficulties associated with small, manually labeled training datasets. It also illustrates that CCPDA outperforms other augmentation strategies in the application scenario considered, particularly in improving fire-class segmentation performance.
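
The three steps can be sketched on toy label masks as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define the centralization technique, so the sketch assumes it keeps the fire pixels closest to the cluster centroid, treats all fire pixels as one cluster, and assumes source and target images have the same size. The label id `FIRE` and the function names are hypothetical.

```python
FIRE = 1  # hypothetical label id for the fire class

def fire_pixels(mask):
    """Step (i): collect coordinates labeled as fire (one cluster assumed)."""
    return [(r, c) for r, row in enumerate(mask)
            for c, v in enumerate(row) if v == FIRE]

def centralize(pixels, keep_fraction=0.5):
    """Step (ii), assumed form: keep the pixels nearest the centroid."""
    if not pixels:
        return []
    cr = sum(r for r, _ in pixels) / len(pixels)
    cc = sum(c for _, c in pixels) / len(pixels)
    ranked = sorted(pixels, key=lambda p: (p[0] - cr) ** 2 + (p[1] - cc) ** 2)
    return ranked[:max(1, int(len(ranked) * keep_fraction))]

def paste(src_img, pixels, dst_img, dst_mask):
    """Step (iii): copy the kept pixels and labels onto copies of the target."""
    out_img = [row[:] for row in dst_img]
    out_mask = [row[:] for row in dst_mask]
    for r, c in pixels:
        out_img[r][c] = src_img[r][c]
        out_mask[r][c] = FIRE
    return out_img, out_mask

# Toy 4x4 example: a 3x3 fire block in the source, an empty target.
src_mask = [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 0]]
src_img = [[9] * 4 for _ in range(4)]
dst_img = [[0] * 4 for _ in range(4)]
dst_mask = [[0] * 4 for _ in range(4)]

core = centralize(fire_pixels(src_mask), keep_fraction=0.5)
aug_img, aug_mask = paste(src_img, core, dst_img, dst_mask)
```

A practical version would also separate connected components and choose a paste location, both of which are left out here.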

[51] AR2: Attention-Guided Repair for the Robustness of CNNs Against Common Corruptions

Fuyuan Zhang,Qichen Wang,Jianjun Zhao

Main category: cs.CV

TL;DR: This paper proposes AR2, an effective method to enhance the corruption robustness of pretrained CNNs by aligning class activation maps between clean and corrupted images.

Details Motivation: Deep neural networks suffer from significant performance degradation when exposed to common corruptions such as noise, blur, weather, and digital distortions, limiting their reliability in real-world applications. Method: AR2 operates by explicitly aligning the class activation maps (CAMs) between clean and corrupted images, following an iterative repair strategy that alternates between CAM-guided refinement and standard fine-tuning. Result: AR2 consistently outperforms existing state-of-the-art methods in restoring robustness on standard corruption benchmarks (CIFAR-10-C, CIFAR-100-C and ImageNet-C), achieving a favorable balance between accuracy on clean data and corruption robustness. Conclusion: AR2 provides a robust and scalable solution for enhancing model reliability in real-world environments with diverse corruptions. Abstract: Deep neural networks suffer from significant performance degradation when exposed to common corruptions such as noise, blur, weather, and digital distortions, limiting their reliability in real-world applications. In this paper, we propose AR2 (Attention-Guided Repair for Robustness), a simple yet effective method to enhance the corruption robustness of pretrained CNNs. AR2 operates by explicitly aligning the class activation maps (CAMs) between clean and corrupted images, encouraging the model to maintain consistent attention even under input perturbations. Our approach follows an iterative repair strategy that alternates between CAM-guided refinement and standard fine-tuning, without requiring architectural changes. Extensive experiments show that AR2 consistently outperforms existing state-of-the-art methods in restoring robustness on standard corruption benchmarks (CIFAR-10-C, CIFAR-100-C and ImageNet-C), achieving a favorable balance between accuracy on clean data and corruption robustness. 
These results demonstrate that AR2 provides a robust and scalable solution for enhancing model reliability in real-world environments with diverse corruptions.
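
As a rough illustration of aligning class activation maps between clean and corrupted inputs, the sketch below computes a classic CAM as a class-weighted sum of final-layer feature maps and compares two CAMs with a mean squared difference after min-max normalization. The choice of normalization and of a squared-error alignment loss are assumptions; the abstract does not state which CAM variant or loss AR2 uses. All names and values are hypothetical.

```python
def class_activation_map(feature_maps, class_weights):
    """Classic CAM: weighted sum of feature maps using one class's weights."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(wk * fm[y][x] for wk, fm in zip(class_weights, feature_maps))
             for x in range(w)] for y in range(h)]

def normalize(cam):
    """Min-max normalize a CAM to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in cam for v in row]
    lo, span = min(flat), max(flat) - min(flat)
    return [[(v - lo) / span if span else 0.0 for v in row] for row in cam]

def cam_alignment_loss(cam_clean, cam_corrupt):
    """Assumed alignment loss: mean squared difference of normalized CAMs."""
    a, b = normalize(cam_clean), normalize(cam_corrupt)
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Two 2x2 feature maps and hypothetical classifier weights for one class.
weights = [1.0, 0.5]
clean_feats = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
corrupt_feats = [[[0.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]

cam_clean = class_activation_map(clean_feats, weights)
cam_corrupt = class_activation_map(corrupt_feats, weights)
loss_same = cam_alignment_loss(cam_clean, cam_clean)
loss_shifted = cam_alignment_loss(cam_clean, cam_corrupt)
```

In a repair loop of the kind the abstract describes, a term like `loss_shifted` would be minimized during the CAM-guided phase and alternated with ordinary fine-tuning; how the two phases are scheduled is not specified in the source.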

[52] When Trackers Date Fish: A Benchmark and Framework for Underwater Multiple Fish Tracking

Weiran Li,Yeqiang Liu,Qiannan Guo,Yijie Wei,Hwa Liang Leo,Zhenbo Li

Main category: cs.CV

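TL;DR: This paper presents MFT25, the first dataset designed specifically for underwater multiple fish tracking, and a new tracking framework, SU-T, whose state-of-the-art HOTA and IDF1 results are validated experimentally, advancing research on underwater tracking systems.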

Details Motivation: Multiple object tracking has made significant progress in terrestrial applications, but underwater tracking scenarios remain underexplored despite their importance to marine ecology and aquaculture. Method: Introduces a specialized tracking framework, SU-T, comprising an optimized Unscented Kalman Filter (UKF) and a novel Fish-Intersection-over-Union (FishIoU) matching method. Result: Creates MFT25, the first comprehensive dataset designed specifically for underwater multiple fish tracking, and shows that the SU-T baseline achieves state-of-the-art HOTA and IDF1. Conclusion: MFT25 and SU-T lay a solid foundation for underwater fish tracking research, with applications in marine biology, aquaculture monitoring, and ecological conservation. Abstract: Multiple object tracking (MOT) technology has made significant progress in terrestrial applications, but underwater tracking scenarios remain underexplored despite their importance to marine ecology and aquaculture. We present Multiple Fish Tracking Dataset 2025 (MFT25), the first comprehensive dataset specifically designed for underwater multiple fish tracking, featuring 15 diverse video sequences with 408,578 meticulously annotated bounding boxes across 48,066 frames. Our dataset captures various underwater environments, fish species, and challenging conditions including occlusions, similar appearances, and erratic motion patterns. Additionally, we introduce Scale-aware and Unscented Tracker (SU-T), a specialized tracking framework featuring an Unscented Kalman Filter (UKF) optimized for non-linear fish swimming patterns and a novel Fish-Intersection-over-Union (FishIoU) matching that accounts for the unique morphological characteristics of aquatic species. Extensive experiments demonstrate that our SU-T baseline achieves state-of-the-art performance on MFT25, with 34.1 HOTA and 44.6 IDF1, while revealing fundamental differences between fish tracking and terrestrial object tracking scenarios. MFT25 establishes a robust foundation for advancing research in underwater tracking systems with important applications in marine biology, aquaculture monitoring, and ecological conservation. The dataset and codes are released at https://vranlee.github.io/SU-T/.
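
For readers unfamiliar with the quantities involved, the sketch below shows the plain box IoU that matching costs of this kind usually start from, and the standard IDF1 definition. FishIoU itself is not defined in the abstract and is not reproduced here; the box format `(x1, y1, x2, y2)` and the example counts are assumptions.

```python
def box_iou(a, b):
    """Standard intersection-over-union for boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN), the identity-level F1 score."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0

overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
score = idf1(idtp=8, idfp=2, idfn=2)
```

A tracker would typically build a cost matrix of `1 - IoU` between predicted and detected boxes and solve an assignment problem over it; SU-T replaces the plain IoU term with its FishIoU variant.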

[53] SImpHAR: Advancing impedance-based human activity recognition using 3D simulation and text-to-motion models

Lala Shakti Swarup Ray,Mengxi Liu,Deepika Gurung,Bo Zhou,Sungho Suh,Paul Lukowicz

Main category: cs.CV

TL;DR: This paper introduces SImpHAR, a framework for Human Activity Recognition using simulated bio-impedance signals and modular training, achieving significant performance improvements.

Details Motivation: Bio-impedance sensing is underutilized due to limited labeled data, prompting the need for improved data augmentation techniques. Method: The paper proposes a simulation pipeline for generating bio-impedance signals and a two-stage training strategy to improve HAR performance. Result: SImpHAR demonstrated consistent improvements over existing methods, achieving gains of up to 22.3% in accuracy and 21.8% in macro F1 score. Conclusion: The study concludes that simulation-driven augmentation and modular training show promising results for impedance-based HAR. Abstract: Human Activity Recognition (HAR) with wearable sensors is essential for applications in healthcare, fitness, and human-computer interaction. Bio-impedance sensing offers unique advantages for fine-grained motion capture but remains underutilized due to the scarcity of labeled data. We introduce SImpHAR, a novel framework addressing this limitation through two core contributions. First, we propose a simulation pipeline that generates realistic bio-impedance signals from 3D human meshes using shortest-path estimation, soft-body physics, and text-to-motion generation serving as a digital twin for data augmentation. Second, we design a two-stage training strategy with decoupled approach that enables broader activity coverage without requiring label-aligned synthetic data. We evaluate SImpHAR on our collected ImpAct dataset and two public benchmarks, showing consistent improvements over state-of-the-art methods, with gains of up to 22.3% and 21.8%, in terms of accuracy and macro F1 score, respectively. Our results highlight the promise of simulation-driven augmentation and modular training for impedance-based HAR.

[54] Hierarchical Multi-Stage Transformer Architecture for Context-Aware Temporal Action Localization

Hayat Ullah,Arslan Munir,Oliver Nina

Main category: cs.CV

TL;DR: This paper proposes PCL-Former, a multi-stage transformer architecture for temporal action localization that performs strongly on multiple datasets.

Details Motivation: Inspired by the success of transformers and multi-stage architectures in video recognition, the work explores their spatio-temporal properties for the TAL task. Method: Proposes PCL-Former, a staged transformer architecture comprising three modules: Proposal-Former, Classification-Former, and Localization-Former. Result: Improves performance by 2.8%, 1.2%, and 4.8% on the THUMOS14, ActivityNet-1.3, and HACS datasets, respectively. Conclusion: PCL-Former effectively improves TAL performance and outperforms existing techniques. Abstract: Inspired by the recent success of transformers and multi-stage architectures in video recognition and object detection domains, we thoroughly explore the rich spatio-temporal properties of transformers within a multi-stage architecture paradigm for the temporal action localization (TAL) task. This exploration led to the development of a hierarchical multi-stage transformer architecture called PCL-Former, where each subtask is handled by a dedicated transformer module with a specialized loss function. Specifically, the Proposal-Former identifies candidate segments in an untrimmed video that may contain actions, the Classification-Former classifies the action categories within those segments, and the Localization-Former precisely predicts the temporal boundaries (i.e., start and end) of the action instances. To evaluate the performance of our method, we have conducted extensive experiments on three challenging benchmark datasets: THUMOS-14, ActivityNet-1.3, and HACS Segments. We also conducted detailed ablation experiments to assess the impact of each individual module of our PCL-Former. The obtained quantitative results validate the effectiveness of the proposed PCL-Former, outperforming state-of-the-art TAL approaches by 2.8%, 1.2%, and 4.8% on THUMOS14, ActivityNet-1.3, and HACS datasets, respectively.

[55] THOR: Thermal-guided Hand-Object Reasoning via Adaptive Vision Sampling

Soroush Shahi,Farzad Shahabi,Rama Nabulsi,Glenn Fernandes,Aggelos Katsaggelos,Nabil Alshurafa

Main category: cs.CV

TL;DR: This work proposes THOR, which uses thermal-sensing-driven adaptive RGB frame sampling for efficient, low-power hand activity recognition.

Details Motivation: Continuous processing of RGB images from conventional wearable cameras entails high power consumption, large data volumes, privacy concerns, and heavy computational demands, so a more efficient method for real-time hand activity recognition is needed. Method: Proposes THOR, a thermal-sensing-based spatio-temporal adaptive RGB frame sampling method that uses low-resolution thermal imaging data to detect changes in hand activity and adjust the RGB frame sampling rate, and uses thermal cues to localize the hand interaction region in each RGB image for cropping and real-time analysis. Result: Using only 3% of the original RGB video data, THOR captured all activity segments and achieved a 95% F1 score for hand-related activity recognition, comparable to the result with the full data (94%). Conclusion: THOR enables efficient, low-power hand activity recognition and offers a feasible path for the long-term use of wearable cameras in health monitoring. Abstract: Wearable cameras are increasingly used as an observational and interventional tool for human behaviors by providing detailed visual data of hand-related activities. This data can be leveraged to facilitate memory recall for logging of behavior or timely interventions aimed at improving health. However, continuous processing of RGB images from these cameras consumes significant power impacting battery lifetime, generates a large volume of unnecessary video data for post-processing, raises privacy concerns, and requires substantial computational resources for real-time analysis. We introduce THOR, a real-time adaptive spatio-temporal RGB frame sampling method that leverages thermal sensing to capture hand-object patches and classify them in real-time. We use low-resolution thermal camera data to identify moments when a person switches from one hand-related activity to another, and adjust the RGB frame sampling rate by increasing it during activity transitions and reducing it during periods of sustained activity. Additionally, we use the thermal cues from the hand to localize the region of interest (i.e., the hand-object interaction) in each RGB frame, allowing the system to crop and process only the necessary part of the image for activity recognition. We develop a wearable device to validate our method through an in-the-wild study with 14 participants and over 30 activities, and further evaluate it on Ego4D (923 participants across 9 countries, totaling 3,670 hours of video).
Our results show that using only 3% of the original RGB video data, our method captures all the activity segments, and achieves hand-related activity recognition F1-score (95%) comparable to using the entire RGB video (94%). Our work provides a more practical path for the longitudinal use of wearable cameras to monitor hand-related activities and health-risk behaviors in real time.
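The adaptive sampling idea in the THOR abstract (sample RGB densely during activity transitions, sparsely during sustained activity) can be illustrated with a minimal sketch. The per-frame thermal change score, the threshold, and the two strides below are hypothetical placeholders chosen for illustration; the abstract does not specify how THOR computes or thresholds these quantities.

```python
from typing import List, Sequence


def select_rgb_frames(
    thermal_change: Sequence[float],
    threshold: float = 0.5,   # hypothetical cutoff for "activity transition"
    high_every: int = 1,      # hypothetical stride during transitions
    low_every: int = 10,      # hypothetical stride during sustained activity
) -> List[int]:
    """Return indices of RGB frames to keep, given a per-frame thermal change score.

    A frame is kept when enough frames have elapsed since the last kept frame,
    where "enough" is a short stride if the thermal score signals a transition
    and a long stride otherwise.
    """
    selected: List[int] = []
    last = None
    for i, score in enumerate(thermal_change):
        stride = high_every if score >= threshold else low_every
        if last is None or i - last >= stride:
            selected.append(i)
            last = i
    return selected


if __name__ == "__main__":
    scores = [0.0] * 20
    scores[5:8] = [1.0, 1.0, 1.0]  # a short burst of thermal change
    print(select_rgb_frames(scores))
```

In this toy setting the sampler keeps a frame at the start, every frame in the burst, and then falls back to the sparse stride, which is the qualitative behavior the abstract describes.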

[56] EA: An Event Autoencoder for High-Speed Vision Sensing

Riadul Islam,Joey Mulé,Dhandeep Challagundla,Shahmir Rizvi,Sean Carson

Main category: cs.CV

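TL;DR: This paper proposes a lightweight event autoencoder architecture for event cameras that addresses the limitations of conventional vision systems and suits low-power, real-time edge computing.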

Details Motivation: Conventional frame-based vision systems suffer from motion blur, high latency, and redundant data processing, while event cameras make object detection challenging because their event streams are sparse and noisy, so a more efficient solution is needed. Method: The paper proposes an event autoencoder architecture based on convolutional encoding, combined with adaptive threshold selection and a lightweight classifier, to improve recognition accuracy and reduce computational complexity. Result: Experiments show that the method achieves accuracy comparable to the YOLO-v4 model on the Smart Event Face Dataset (SEFD) with up to 35.5× fewer parameters; implementations on embedded platforms such as Raspberry Pi 4B and NVIDIA Jetson Nano run at 8 FPS to 44.8 FPS, and the classifier's frame rate is up to 87.84× higher than the state of the art. Conclusion: The proposed event autoencoder compresses and reconstructs event data effectively while preserving key spatio-temporal features, significantly improving event-camera performance in low-power, high-speed, real-time edge computing scenarios. Abstract: High-speed vision sensing is essential for real-time perception in applications such as robotics, autonomous vehicles, and industrial automation. Traditional frame-based vision systems suffer from motion blur, high latency, and redundant data processing, limiting their performance in dynamic environments. Event cameras, which capture asynchronous brightness changes at the pixel level, offer a promising alternative but pose challenges in object detection due to sparse and noisy event streams. To address this, we propose an event autoencoder architecture that efficiently compresses and reconstructs event data while preserving critical spatial and temporal features. The proposed model employs convolutional encoding and incorporates adaptive threshold selection and a lightweight classifier to enhance recognition accuracy while reducing computational complexity. Experimental results on the existing Smart Event Face Dataset (SEFD) demonstrate that our approach achieves comparable accuracy to the YOLO-v4 model while utilizing up to $35.5\times$ fewer parameters. Implementations on embedded platforms, including Raspberry Pi 4B and NVIDIA Jetson Nano, show high frame rates ranging from 8 FPS up to 44.8 FPS. The proposed classifier exhibits up to 87.84x better FPS than the state-of-the-art and significantly improves event-based vision performance, making it ideal for low-power, high-speed applications in real-time edge computing.

[57] Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

Ziyang Wang,Jaehong Yoon,Shoubin Yu,Md Mohaiminul Islam,Gedas Bertasius,Mohit Bansal

Main category: cs.CV

TL;DR: Video-RTS improves video reasoning efficiency and performance using a novel combination of data-efficient reinforcement learning and adaptive test-time scaling.

Details Motivation: Current RL-based video reasoning methods require costly SFT steps involving large-scale annotated datasets, making them hard to scale. Method: Efficient pure-RL training combined with sparse-to-dense video TTS strategy, using output-based rewards and iterative frame addition based on consistency. Result: Video-RTS achieves an average accuracy improvement of 2.4% over existing models using only 3.6% of the training samples, including a 4.2% gain on Video-Holmes and a 2.6% gain on MMVU. Conclusion: Video-RTS offers a more data-efficient and scalable solution for video reasoning with LLMs by combining RL training and adaptive TTS strategy. Abstract: Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and finetuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Based on observations about the data scaling of RL samples, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by an average of 2.4% in accuracy using only 3.6% training samples. 
For example, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark, and a 2.6% improvement on MMVU. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
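The sparse-to-dense test-time scaling loop described in the Video-RTS abstract (start from few frames, add frames until the sampled outputs agree) can be sketched as follows. This is only an illustration under stated assumptions: `answer_fn` is a hypothetical stand-in for a video reasoning model that returns several sampled answers, and the uniform subsampling, doubling schedule, and unanimity criterion are choices made here, not details confirmed by the abstract.

```python
from collections import Counter
from typing import Callable, List, Sequence


def sparse_to_dense_answer(
    frames: Sequence[int],
    answer_fn: Callable[[List[int]], List[str]],  # hypothetical model wrapper
    start_frames: int = 4,                         # hypothetical initial budget
    max_frames: int = 32,                          # hypothetical frame cap
) -> str:
    """Query with few frames first; densify only while sampled answers disagree."""
    limit = min(max_frames, len(frames))
    k = min(start_frames, limit)
    while True:
        # Uniformly subsample k frames from the video.
        step = len(frames) / k
        subset = [frames[int(i * step)] for i in range(k)]
        answers = answer_fn(subset)  # must return at least one answer
        top, freq = Counter(answers).most_common(1)[0]
        # Stop when all sampled answers agree or the frame budget is exhausted.
        if freq == len(answers) or k >= limit:
            return top
        k = min(k * 2, limit)
```

When the budget runs out without agreement, the sketch falls back to the majority answer; other fallbacks are equally plausible.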

[58] Mask6D: Masked Pose Priors For 6D Object Pose Estimation

Yuechen Xie,Haobo Jiang,Jin Xie

Main category: cs.CV

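TL;DR: This paper proposes Mask6D, a new pre-training method for pose estimation that combines 2D-3D correspondence maps and visible mask maps to improve 6D object pose estimation from monocular RGB images in complex scenes.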

Details Motivation: Current pose estimation methods built on 2D feature backbones struggle to extract effective pose-aware features when occlusion in cluttered scenes limits the available RGB information, so a new pre-training strategy is needed to make pose estimation more robust. Method: The paper introduces pose-aware 2D-3D correspondence maps and visible mask maps as additional modalities, combined with RGB images for reconstruction-based pre-training; it designs an object-focused pre-training loss to reduce background interference, and fine-tunes the network with a conventional pose training strategy to predict poses. Result: Experiments show that the method outperforms previous end-to-end pose estimation methods on 6D object pose estimation, with greater robustness in cluttered or occluded scenes. Conclusion: The paper proposes Mask6D, a pose estimation pre-training strategy for 6D object pose estimation from monocular RGB images under cluttered or occluded conditions; by combining 2D-3D correspondence maps and visible mask maps, it outperforms previous end-to-end methods. Abstract: Robust 6D object pose estimation in cluttered or occluded conditions using monocular RGB images remains a challenging task. One reason is that current pose estimation networks struggle to extract discriminative, pose-aware features using 2D feature backbones, especially when the available RGB information is limited due to target occlusion in cluttered scenes. To mitigate this, we propose a novel pose estimation-specific pre-training strategy named Mask6D. Our approach incorporates pose-aware 2D-3D correspondence maps and visible mask maps as additional modal information, which is combined with RGB images for the reconstruction-based model pre-training. Essentially, this 2D-3D correspondence maps a transformed 3D object model to 2D pixels, reflecting the pose information of the target in the camera coordinate system. Meanwhile, the integrated visible mask map can effectively guide our model to disregard cluttered background information. In addition, an object-focused pre-training loss function is designed to further help our network remove the background interference. Finally, we fine-tune our pre-trained pose prior-aware network via a conventional pose training strategy to realize reliable pose prediction. Extensive experiments verify that our method outperforms previous end-to-end pose estimation methods.

[59] Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection

Yupeng Hu,Changxing Ding,Chang Sun,Shaoli Huang,Xiangmin Xu

Main category: cs.CV

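TL;DR: This paper proposes the BC-HOI framework for open vocabulary human-object interaction detection, which combines attention bias guidance with large language model supervision and achieves strong results on two benchmark datasets.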

Details Motivation: Existing methods rely on holistic, coarse-grained features produced by vision-language models, which do not match the needs of detection tasks. Method: Proposes a bilateral collaboration framework (BC-HOI) comprising Attention Bias Guidance (ABG) and LLM-based Supervision Guidance (LSG). Result: Experiments on the HICO-DET and V-COCO benchmarks show results superior to existing methods on both. Conclusion: The BC-HOI framework performs well on open vocabulary HOI detection, with the ABG and LSG components producing fine-grained features that improve model performance. Abstract: Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all triplets of interest in an image, even those that are not pre-defined in the training set. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs) to enhance the generalization ability of interaction representations. However, the visual features produced by VLMs are holistic and coarse-grained, which contradicts the nature of detection tasks. To address this issue, we propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI). This framework includes an Attention Bias Guidance (ABG) component, which guides the VLM to produce fine-grained instance-level interaction features according to the attention bias provided by the HOI detector. It also includes a Large Language Model (LLM)-based Supervision Guidance (LSG) component, which provides fine-grained token-level supervision for the HOI detector by the LLM component of the VLM. LSG enhances the ability of ABG to generate high-quality attention bias. We conduct extensive experiments on two popular benchmarks: HICO-DET and V-COCO, consistently achieving superior performance in the open vocabulary and closed settings. The code will be released on GitHub.

[60] What Demands Attention in Urban Street Scenes? From Scene Understanding towards Road Safety: A Survey of Vision-driven Datasets and Studies

Yaoqi Huang,Julie Stephany Berrio,Mao Shan,Stewart Worrall

Main category: cs.CV

TL;DR: This survey introduces a comprehensive taxonomy and analytical framework for vision-based traffic scenario analysis, covering 35 tasks and 73 datasets to address existing weaknesses and guide future research in road safety.

Details Motivation: Advances in vision-based sensors and computer vision algorithms have improved traffic scenario understanding, but there is a need for systematic categorization of critical elements and comprehensive analysis of available tasks and datasets to enhance road safety. Method: The survey systematically categorizes attention-worthy traffic entities into two main groups: anomalies and normal but critical entities. It analyzes 35 vision-driven tasks and visualizes 73 datasets, providing a unified analytical framework and cross-domain investigation. Result: A new taxonomy integrating ten categories and twenty subclasses was developed, along with a unified analytical framework. The survey analyzed 35 tasks and examined 73 datasets, highlighting their pros and cons for standards unification and resource optimization. Conclusion: The survey provides an integrated taxonomy, comprehensive analysis, and recapitulatory tables as valuable contributions to the field of vision-based traffic scenario analysis, aiming to guide researchers in strategic resource selection and highlight critical research gaps. Abstract: Advances in vision-based sensors and computer vision algorithms have significantly improved the analysis and understanding of traffic scenarios. To facilitate the use of these improvements for road safety, this survey systematically categorizes the critical elements that demand attention in traffic scenarios and comprehensively analyzes available vision-driven tasks and datasets. Compared to existing surveys that focus on isolated domains, our taxonomy categorizes attention-worthy traffic entities into two main groups that are anomalies and normal but critical entities, integrating ten categories and twenty subclasses. It establishes connections between inherently related fields and provides a unified analytical framework. 
Our survey highlights the analysis of 35 vision-driven tasks and comprehensive examinations and visualizations of 73 available datasets based on the proposed taxonomy. The cross-domain investigation covers the pros and cons of each benchmark with the aim of providing information on standards unification and resource optimization. Our article concludes with a systematic discussion of the existing weaknesses, underlining the potential effects and promising solutions from various perspectives. The integrated taxonomy, comprehensive analysis, and recapitulatory tables serve as valuable contributions to this rapidly evolving field by providing researchers with a holistic overview, guiding strategic resource selection, and highlighting critical research gaps.

[61] FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation

Liqiang Jing,Viet Lai,Seunghyun Yoon,Trung Bui,Xinya Du

Main category: cs.CV

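TL;DR: This paper proposes the FIFA evaluation framework and the Post-Correction tool for video multimodal large language models, significantly improving the factual consistency of generated content.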

Details Motivation: Existing evaluation methods are limited to a single task and cannot assess hallucinations in open-ended responses, so a unified evaluation and correction framework is needed. Method: Proposes the FIFA framework, which extracts descriptive facts, builds a Spatio-Temporal Semantic Dependency Graph, and verifies the facts with VideoQA models; also introduces Post-Correction, a tool-based correction framework. Result: FIFA aligns with human judgment better than existing evaluation methods, and Post-Correction effectively improves the factual consistency of generated content. Conclusion: The FIFA evaluation framework and the Post-Correction tool effectively improve the factual consistency of video multimodal large language models in text and video generation. Abstract: Video Multimodal Large Language Models (VideoMLLMs) have achieved remarkable progress in both Video-to-Text and Text-to-Video tasks. However, they often suffer from hallucinations, generating content that contradicts the visual input. Existing evaluation methods are limited to one task (e.g., V2T) and also fail to assess hallucinations in open-ended, free-form responses. To address this gap, we propose FIFA, a unified FaIthFulness evAluation framework that extracts comprehensive descriptive facts, models their semantic dependencies via a Spatio-Temporal Semantic Dependency Graph, and verifies them using VideoQA models. We further introduce Post-Correction, a tool-based correction framework that revises hallucinated content. Extensive experiments demonstrate that FIFA aligns more closely with human judgment than existing evaluation methods, and that Post-Correction effectively improves factual consistency in both text and video generation.

[62] Concept Unlearning by Modeling Key Steps of Diffusion Process

Chaoshuo Zhang,Chenhao Lin,Zhengyu Zhao,Le Yang,Qian Wang,Chao Shen

Main category: cs.CV

TL;DR: This paper proposes KSCU, a targeted concept unlearning approach for text-to-image diffusion models that improves security without compromising image generation quality.

Details Motivation: To address security risks posed by misuse of text-to-image diffusion models while balancing unlearning effectiveness and generative retainability. Method: Key Step Concept Unlearning (KSCU) targets key steps in the diffusion model's stepwise sampling process for fine-tuning rather than uniformly treating all denoising steps. Result: KSCU effectively prevents undesirable image generation while retaining the model’s ability to generate high-quality images. Conclusion: KSCU is an effective concept unlearning method that strategically focuses on pivotal steps during the image generation process in T2I DMs, reducing parameter updates while preserving generative capabilities. Abstract: Text-to-image diffusion models (T2I DMs), represented by Stable Diffusion, which generate highly realistic images based on textual input, have been widely used. However, their misuse poses serious security risks. While existing concept unlearning methods aim to mitigate these risks, they struggle to balance unlearning effectiveness with generative retainability. To overcome this limitation, we innovatively propose the Key Step Concept Unlearning (KSCU) method, which ingeniously capitalizes on the unique stepwise sampling characteristic inherent in diffusion models during the image generation process. Unlike conventional approaches that treat all denoising steps equally, KSCU strategically focuses on pivotal steps with the most influence over the final outcome by dividing key steps for different concept unlearning tasks and fine-tuning the model only at those steps. This targeted approach reduces the number of parameter updates needed for effective unlearning, while maximizing the retention of the model's generative capabilities. Through extensive benchmark experiments, we demonstrate that KSCU effectively prevents T2I DMs from generating undesirable images while better retaining the model's generative capabilities. Our code will be released.

[63] Speak2Sign3D: A Multi-modal Pipeline for English Speech to American Sign Language Animation

Kazi Mahathir Rahman,Naveed Imtiaz Nafis,Md. Farhan Sadik,Mohammad Al Rafi,Mehedi Hasan Shahed

Main category: cs.CV

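TL;DR: This work develops a new end-to-end system that converts English speech into realistic 3D sign language animation, filling a gap in existing research on speech-to-sign translation.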

Details Motivation: The goal is to help deaf and hard-of-hearing people communicate more easily; translating spoken language into sign language has long been overlooked and involves several complex steps, such as understanding speech, translating it into sign-friendly grammar, and generating natural human motion. Method: Uses the Whisper model to convert English speech to text, a MarianMT machine translation model to translate the text into American Sign Language (ASL) gloss, and word embeddings such as Word2Vec and FastText to improve translation accuracy; finally, a 3D keypoint-based motion system animates the translated gloss. Result: Achieves end-to-end conversion from English speech to natural, continuous 3D sign language animation, with BLEU scores of 0.7714 and 0.8923, and creates two new datasets, Sign3D-WLASL and BookGlossCorpus-CG, to support the system. Conclusion: The study builds a complete pipeline from English speech to smooth, realistic 3D sign language animation, addressing the previously overlooked problem of translating spoken language into sign language. Abstract: Helping deaf and hard-of-hearing people communicate more easily is the main goal of Automatic Sign Language Translation. Although most past research has focused on turning sign language into text, doing the reverse, turning spoken English into sign language animations, has been largely overlooked. That's because it involves multiple steps, such as understanding speech, translating it into sign-friendly grammar, and generating natural human motion. In this work, we introduce a complete pipeline that converts English speech into smooth, realistic 3D sign language animations. Our system starts with Whisper to transcribe spoken English into text. Then, we use a MarianMT machine translation model to translate that text into American Sign Language (ASL) gloss, a simplified version of sign language that captures meaning without grammar. This model performs well, reaching BLEU scores of 0.7714 and 0.8923. To make the gloss translation more accurate, we also use word embeddings such as Word2Vec and FastText to understand word meanings. Finally, we animate the translated gloss using a 3D keypoint-based motion system trained on Sign3D-WLASL, a dataset we created by extracting body, hand, and face key points from real ASL videos in the WLASL dataset. To support the gloss translation stage, we also built a new dataset called BookGlossCorpus-CG, which turns everyday English sentences from the BookCorpus dataset into ASL gloss using grammar rules. Our system stitches everything together by smoothly interpolating between signs to create natural, continuous animations.
Unlike previous works like How2Sign and Phoenix-2014T that focus on recognition or use only one type of data, our pipeline brings together audio, text, and motion in a single framework that goes all the way from spoken English to lifelike 3D sign language animation.

[64] ILNet: Trajectory Prediction with Inverse Learning Attention for Enhancing Intention Capture

Mingjin Zeng,Nan Ouyang,Wenkang Wan,Lei Ao,Qing Cai,Kai Sheng

Main category: cs.CV

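TL;DR: This paper proposes ILNet, which uses Inverse Learning attention and a Dynamic Anchor Selection module to significantly improve the accuracy and adaptability of multi-agent trajectory prediction.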

Details Motivation: Existing methods lack explicit modeling of spatio-temporally coordinated interactions and adapt poorly to different future environments; inspired by human driving behavior, the work argues for better intention hypotheses and a mechanism for dynamically adjusting decisions. Method: Proposes ILNet, a new multi-agent trajectory prediction method comprising IL attention and a DAS module. Result: ILNet achieves state-of-the-art performance on the INTERACTION and Argoverse datasets, with higher accuracy and more multimodal distributions while using fewer parameters. Conclusion: With Inverse Learning attention and the Dynamic Anchor Selection module, ILNet achieves advanced performance in multi-agent trajectory prediction, particularly in challenging interaction scenarios. Abstract: Trajectory prediction for multi-agent interaction scenarios is a crucial challenge. Most advanced methods model agent interactions by efficiently factorized attention based on the temporal and agent axes. However, this static and forward modeling lacks explicit interactive spatio-temporal coordination, capturing only obvious and immediate behavioral intentions. Alternatively, the modern trajectory prediction framework refines the successive predictions by a fixed-anchor selection strategy, which is difficult to adapt to different future environments. It is acknowledged that human drivers dynamically adjust initial driving decisions based on further assumptions about the intentions of surrounding vehicles. Motivated by human driving behaviors, this paper proposes ILNet, a multi-agent trajectory prediction method with Inverse Learning (IL) attention and Dynamic Anchor Selection (DAS) module. IL Attention employs an inverse learning paradigm to model interactions at neighboring moments, introducing proposed intentions to dynamically encode the spatio-temporal coordination of interactions, thereby enhancing the model's ability to capture complex interaction patterns. Then, the learnable DAS module is proposed to extract multiple trajectory change keypoints as anchors in parallel with almost no increase in parameters. Experimental results show that the ILNet achieves state-of-the-art performance on the INTERACTION and Argoverse motion forecasting datasets. Particularly, in challenging interaction scenarios, ILNet achieves higher accuracy and more multimodal distributions of trajectories with fewer parameters. Our code is available at https://github.com/mjZeng11/ILNet.

[65] A model-agnostic active learning approach for animal detection from camera traps

Thi Thu Thuy Nguyen,Duc Thanh Nguyen

Main category: cs.CV

TL;DR: This paper proposes a model-agnostic active learning method for camera trap wildlife data that significantly reduces required labeled data while maintaining or improving model performance.

Details Motivation: To reduce data labeling effort and improve animal detection model training efficiency in wildlife monitoring using camera trap data. Method: The method integrates uncertainty and diversity measures at both object-based and image-based levels into the sample selection process for active learning. Result: Using only 30% of training data selected by the proposed method, a state-of-the-art detector achieved performance equal to or better than when trained on the full dataset. Conclusion: The proposed model-agnostic active learning approach effectively optimizes labelled data usage, achieving equal or better performance with only 30% of training data. Abstract: Smart data selection is becoming increasingly important in data-driven machine learning. Active learning offers a promising solution by allowing machine learning models to be effectively trained with optimal data including the most informative samples from large datasets. Wildlife data captured by camera traps are excessive in volume, requiring tremendous effort in data labelling and animal detection models training. Therefore, applying active learning to optimise the amount of labelled data would be a great aid in enabling automated wildlife monitoring and conservation. However, existing active learning techniques require that a machine learning model (i.e., an object detector) be fully accessible, limiting the applicability of the techniques. In this paper, we propose a model-agnostic active learning approach for detection of animals captured by camera traps. Our approach integrates uncertainty and diversity quantities of samples at both the object-based and image-based levels into the active learning sample selection process. We validate our approach in a benchmark animal dataset. 
Experimental results demonstrate that, using only 30% of the training data selected by our approach, a state-of-the-art animal detector can achieve performance equal to or better than that obtained with the complete training dataset.
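The idea of combining uncertainty and diversity in active learning sample selection can be illustrated with a small greedy sketch. It is not the authors' method: it works only at the image level (the paper also uses object-level quantities), and the additive score, the Euclidean distance, and the `diversity_weight` parameter are assumptions made for illustration. The inputs are treated as precomputed numbers, so no access to the detector's internals is needed.

```python
import math
from typing import List, Sequence


def select_samples(
    uncertainty: Sequence[float],          # one precomputed score per image
    features: Sequence[Sequence[float]],   # one feature vector per image
    budget: int,
    diversity_weight: float = 1.0,         # hypothetical trade-off parameter
) -> List[int]:
    """Greedily pick images that are uncertain and far from those already picked."""
    selected: List[int] = []
    remaining = set(range(len(uncertainty)))

    def score(i: int) -> float:
        if not selected:
            return uncertainty[i]
        # Diversity term: distance to the closest already-selected image.
        nearest = min(math.dist(features[i], features[j]) for j in selected)
        return uncertainty[i] + diversity_weight * nearest

    while remaining and len(selected) < budget:
        best = max(sorted(remaining), key=score)  # ties go to the lowest index
        selected.append(best)
        remaining.remove(best)
    return selected
```

Setting `diversity_weight` to zero reduces the sketch to plain uncertainty sampling, which makes the contribution of the diversity term easy to inspect.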

[66] Token Bottleneck: One Token to Remember Dynamics

Taekyung Kim,Dongyoon Han,Byeongho Heo,Jeongeun Park,Sangdoo Yun

Main category: cs.CV

TL;DR: ToBo is a novel self-supervised pipeline that learns effective sequential visual representations, excelling in dynamic scene understanding tasks and real-world robotic applications.

Details Motivation: The motivation is to derive compact and temporally aware visual representations from dynamic scenes to improve sequential scene understanding tasks like visual tracking and robotic manipulation. Method: ToBo uses a squeeze-and-expansion approach, encoding scenes into a compact bottleneck token and predicting subsequent scenes using minimal patches as hints. Result: Experiments showed that ToBo outperforms baselines in sequential tasks such as video label propagation and robot manipulation, demonstrating robustness and scalability in simulated and real-world environments. Conclusion: ToBo is a self-supervised learning pipeline that effectively learns sequential scene representations, showing superior performance in various sequential tasks and real-world applications. Abstract: Deriving compact and temporally aware visual representations from dynamic scenes is essential for successful execution of sequential scene understanding tasks such as visual tracking and robotic manipulation. In this paper, we introduce Token Bottleneck (ToBo), a simple yet intuitive self-supervised learning pipeline that squeezes a scene into a bottleneck token and predicts the subsequent scene using minimal patches as hints. The ToBo pipeline facilitates the learning of sequential scene representations by conservatively encoding the reference scene into a compact bottleneck token during the squeeze step. In the expansion step, we guide the model to capture temporal dynamics by predicting the target scene using the bottleneck token along with few target patches as hints. This design encourages the vision backbone to embed temporal dependencies, thereby enabling understanding of dynamic transitions across scenes. Extensive experiments in diverse sequential tasks, including video label propagation and robot manipulation in simulated environments demonstrate the superiority of ToBo over baselines. 
Moreover, deploying our pre-trained model on physical robots confirms its robustness and effectiveness in real-world environments. We further validate the scalability of ToBo across different model scales.

[67] Concept-TRAK: Understanding how diffusion models learn concepts through concept-level attribution

Yonghyun Park,Chieh-Hsin Lai,Satoshi Hayakawa,Yuhta Takida,Naoki Murata,Wei-Hsiang Liao,Woosung Choi,Kin Wai Cheuk,Junghyun Koo,Yuki Mitsufuji

Main category: cs.CV

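TL;DR: This paper proposes Concept-TRAK, a new method that addresses the inability of existing attribution methods to isolate specific elements, yielding more useful insights for copyright issues and model transparency.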

Details Motivation: Existing attribution methods cannot isolate contributions to specific elements such as styles or objects, which matters as concerns about copyright and model transparency grow. Method: Performs concept-level attribution with a new method called Concept-TRAK, which comprises a reformulated diffusion training loss based on diffusion posterior sampling and a concept-aware reward function. Result: Evaluation on the AbC benchmark shows substantial improvements over prior methods, and diverse case studies demonstrate applications such as identifying IP-protected and unsafe content. Conclusion: Concept-TRAK offers a new concept-level attribution method that supports responsible generative AI development and governance. Abstract: While diffusion models excel at image generation, their growing adoption raises critical concerns around copyright issues and model transparency. Existing attribution methods identify training examples influencing an entire image, but fall short in isolating contributions to specific elements, such as styles or objects, that matter most to stakeholders. To bridge this gap, we introduce concept-level attribution via a novel method called Concept-TRAK. Concept-TRAK extends influence functions with two key innovations: (1) a reformulated diffusion training loss based on diffusion posterior sampling, enabling robust, sample-specific attribution; and (2) a concept-aware reward function that emphasizes semantic relevance. We evaluate Concept-TRAK on the AbC benchmark, showing substantial improvements over prior methods. Through diverse case studies--ranging from identifying IP-protected and unsafe content to analyzing prompt engineering and compositional learning--we demonstrate how concept-level attribution yields actionable insights for responsible generative AI development and governance.

[68] Divergence-Based Similarity Function for Multi-View Contrastive Learning

Jae Hyoung Jeon,Cheolsu Lim,Myungjoo Kang

Main category: cs.CV

TL;DR: This paper proposes DSF, a novel method for contrastive learning that captures joint structure among multiple views, leading to better performance and efficiency without needing a temperature hyperparameter.

Details Motivation: Prior methods only capture pairwise relationships and fail to model the joint structure across all views in contrastive learning. Method: Proposed a divergence-based similarity function (DSF) that captures joint structure across all views by representing each set of augmented views as a distribution and measuring similarity through divergence between distributions. Result: Extensive experiments show DSF improves performance on tasks like kNN classification and linear evaluation while being more efficient than other multi-view methods. Also, DSF theoretically connects with cosine similarity but operates effectively without requiring a temperature parameter. Conclusion: DSF is a more effective and efficient method for modeling multiple views in contrastive learning, without the need for a temperature hyperparameter. Abstract: Recent success in contrastive learning has sparked growing interest in more effectively leveraging multiple augmented views of an instance. While prior methods incorporate multiple views at the loss or feature level, they primarily capture pairwise relationships and fail to model the joint structure across all views. In this work, we propose a divergence-based similarity function (DSF) that explicitly captures the joint structure by representing each set of augmented views as a distribution and measuring similarity as the divergence between distributions. Extensive experiments demonstrate that DSF consistently improves performance across various tasks, including kNN classification and linear evaluation, while also offering greater efficiency compared to other multi-view methods. Furthermore, we establish a theoretical connection between DSF and cosine similarity, and show that, unlike cosine similarity, DSF operates effectively without requiring a temperature hyperparameter.

[69] Edge-Boundary-Texture Loss: A Tri-Class Generalization of Weighted Binary Cross-Entropy for Enhanced Edge Detection

Hao Shu

Main category: cs.CV

TL;DR: This paper introduces the Edge-Boundary-Texture (EBT) loss for improved edge detection, offering better precision, contextual understanding, and practical usability compared to existing approaches.

Details Motivation: Edge detection performance is hindered by ambiguous non-edge pixels near object boundaries, which current methods like WBCE fail to address effectively. Method: Proposed Edge-Boundary-Texture (EBT) loss that categorizes pixels into edge, boundary, and texture classes with distinct supervisory weights. Result: Experiments show that the EBT loss outperforms WBCE both quantitatively and perceptually across multiple benchmarks, with robustness to hyperparameter variations. Conclusion: The EBT loss is a more structured and effective alternative to the WBCE loss for edge detection, offering minimal fine-tuning and practical deployment advantages. Abstract: Edge detection (ED) remains a fundamental task in computer vision, yet its performance is often hindered by the ambiguous nature of non-edge pixels near object boundaries. The widely adopted Weighted Binary Cross-Entropy (WBCE) loss treats all non-edge pixels uniformly, overlooking the structural nuances around edges and often resulting in blurred predictions. In this paper, we propose the Edge-Boundary-Texture (EBT) loss, a novel objective that explicitly divides pixels into three categories, edge, boundary, and texture, and assigns each a distinct supervisory weight. This tri-class formulation enables more structured learning by guiding the model to focus on both edge precision and contextual boundary localization. We theoretically show that the EBT loss generalizes the WBCE loss, with the latter becoming a limit case. Extensive experiments across multiple benchmarks demonstrate the superiority of the EBT loss both quantitatively and perceptually. Furthermore, the consistent use of unified hyperparameters across all models and datasets, along with robustness to their moderate variations, indicates that the EBT loss requires minimal fine-tuning and is easily deployable in practice.
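The tri-class weighting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the neighbourhood rule (a non-edge pixel within `radius` of an edge counts as "boundary"), the weight values, and the function names `ebt_weights` and `weighted_bce` are all assumptions made for the example.

```python
import math

def ebt_weights(edge_map, radius=1, w_edge=1.0, w_boundary=0.8, w_texture=0.2):
    """Per-pixel weights for three classes: edge, boundary, texture.

    Assumption: "boundary" means a non-edge pixel with an edge pixel inside
    its (2*radius+1)^2 neighbourhood; every other non-edge pixel is "texture".
    """
    h, w = len(edge_map), len(edge_map[0])
    weights = [[w_texture] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if edge_map[y][x] == 1:
                weights[y][x] = w_edge
                continue
            near_edge = any(
                edge_map[yy][xx] == 1
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            )
            if near_edge:
                weights[y][x] = w_boundary
    return weights

def weighted_bce(pred, target, weights, eps=1e-7):
    """Mean binary cross-entropy with one weight per pixel."""
    total, count = 0.0, 0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, wt in zip(p_row, t_row, w_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -wt * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count
```

In this sketch, setting `w_boundary` equal to `w_texture` gives every non-edge pixel the same weight, which is one way to read the abstract's statement that WBCE is a limit case of the tri-class loss.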

[70] MOST: Motion Diffusion Model for Rare Text via Temporal Clip Banzhaf Interaction

Yin Wang,Mu li,Zhiying Leng,Frederick W. B. Li,Xiaohui Liang

Main category: cs.CV

TL;DR: MOST introduces a novel motion diffusion model using temporal clip Banzhaf interaction to effectively generate human motion from rare language prompts.

Details Motivation: The motivation is to overcome challenges in text-to-motion generation, particularly coarse-grained matching and redundancy issues, especially for rare language prompts. Method: The method involves two stages: (1) Retrieval through temporal clip Banzhaf interaction to quantify textual-motion coherence at the clip level, and (2) Generation using a motion prompt module that utilizes retrieved clips for producing movements. Result: MOST achieves state-of-the-art performance in text-to-motion retrieval and generation, as shown by quantitative and qualitative results, particularly excelling with rare prompts. Conclusion: MOST provides a solution for generating human motion from rare language prompts by using a retrieval and generation stage, which together improve semantic consistency and address redundancy. Abstract: We introduce MOST, a novel motion diffusion model via temporal clip Banzhaf interaction, aimed at addressing the persistent challenge of generating human motion from rare language prompts. While previous approaches struggle with coarse-grained matching and overlook important semantic cues due to motion redundancy, our key insight lies in leveraging fine-grained clip relationships to mitigate these issues. MOST's retrieval stage presents the first formulation of its kind - temporal clip Banzhaf interaction - which precisely quantifies textual-motion coherence at the clip level. This facilitates direct, fine-grained text-to-motion clip matching and eliminates prevalent redundancy. In the generation stage, a motion prompt module effectively utilizes retrieved motion clips to produce semantically consistent movements. Extensive evaluations confirm that MOST achieves state-of-the-art text-to-motion retrieval and generation performance by comprehensively addressing previous challenges, as demonstrated through quantitative and qualitative results highlighting its effectiveness, especially for rare prompts.
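For readers unfamiliar with the term, the standard Banzhaf interaction index from cooperative game theory can be computed as below. This is a generic sketch: the players, the toy `value` function, and the name `banzhaf_interaction` are illustrative assumptions, and the abstract does not specify how MOST defines its value function over text tokens and motion clips.

```python
from itertools import combinations

def banzhaf_interaction(players, i, j, value):
    """Banzhaf interaction index of players i and j.

    Averages, over every coalition S that excludes i and j, the joint gain
    v(S + {i, j}) - v(S + {i}) - v(S + {j}) + v(S).
    """
    others = [p for p in players if p not in (i, j)]
    total, count = 0.0, 0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            count += 1
    return total / count  # count == 2 ** (len(players) - 2)

# Toy game (hypothetical): the coalition scores 1 only when the text token
# "t0" and the motion clip "m0" are both present.
def toy_value(coalition):
    return 1.0 if {"t0", "m0"} <= coalition else 0.0

players = ["t0", "m0", "m1"]
print(banzhaf_interaction(players, "t0", "m0", toy_value))  # strong interaction
print(banzhaf_interaction(players, "t0", "m1", toy_value))  # no interaction
```

Exact enumeration is exponential in the number of players, so a practical system would need sampling or an approximation; the abstract does not say which approach MOST takes.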

[71] Ambiguity-aware Point Cloud Segmentation by Adaptive Margin Contrastive Learning

Yang Chen,Yueqi Duan,Haowen Sun,Jiwen Lu,Yap-Peng Tan

Main category: cs.CV

TL;DR: This paper introduces AMContrast3D and AMContrast3D++ for improved 3D semantic segmentation on point clouds by addressing point ambiguity issues through adaptive contrastive learning and masked refinement.

Details Motivation: Existing methods ignore point ambiguities in 3D semantic segmentation, leading to sub-optimal models due to unreliable labels in transition regions. Method: AMContrast3D and AMContrast3D++ methods with contrastive learning, ambiguity estimation framework, and masked refinement mechanism. Result: The proposed method achieves effective segmentation performance on S3DIS and ScanNet datasets, demonstrating enhanced robustness and reliability through ambiguity-aware training. Conclusion: AMContrast3D++ improves 3D semantic segmentation by incorporating ambiguity prediction and masked refinement, leading to better performance and robustness on indoor scene datasets. Abstract: This paper proposes an adaptive margin contrastive learning method for 3D semantic segmentation on point clouds. Most existing methods use equally penalized objectives, which ignore the per-point ambiguities and less discriminated features stemming from transition regions. However, as highly ambiguous points may be indistinguishable even for humans, their manually annotated labels are less reliable, and hard constraints over these points would lead to sub-optimal models. To address this, we first design AMContrast3D, a method comprising contrastive learning into an ambiguity estimation framework, tailored to adaptive objectives for individual points based on ambiguity levels. As a result, our method promotes model training, which ensures the correctness of low-ambiguity points while allowing mistakes for high-ambiguity points. As ambiguities are formulated based on position discrepancies across labels, optimization during inference is constrained by the assumption that all unlabeled points are uniformly unambiguous, lacking ambiguity awareness. Inspired by the insight of joint training, we further propose AMContrast3D++ integrating with two branches trained in parallel, where a novel ambiguity prediction module concurrently learns point ambiguities from generated embeddings. 
To this end, we design a masked refinement mechanism that leverages predicted ambiguities to enable the ambiguous embeddings to be more reliable, thereby boosting segmentation performance and enhancing robustness. Experimental results on 3D indoor scene datasets, S3DIS and ScanNet, demonstrate the effectiveness of the proposed method. Code is available at https://github.com/YangChenApril/AMContrast3D.

[72] Capturing Stable HDR Videos Using a Dual-Camera System

Qianyu Zhang,Bolun Zheng,Hangjia Pan,Lingyu Zhu,Zunjie Zhu,Zongpeng Li,Shiqi Wang

Main category: cs.CV

TL;DR: This paper proposes a dual-camera system and an exposure-adaptive fusion network to improve HDR video reconstruction by reducing flickering and ghosting artifacts, achieving state-of-the-art results.

Details Motivation: Exposure fluctuations in reference images from alternating exposure methods often result in flickering in HDR video reconstruction. This work aims to address this issue by proposing a more stable and robust method. Method: A dual-camera system is used for HDR video acquisition, with one camera capturing consistent reference sequences and the other capturing non-reference sequences. An exposure-adaptive fusion network (EAFNet) is introduced, consisting of a pre-alignment subnetwork, an asymmetric cross-feature fusion subnetwork, and a DWT-based multiscale reconstruction subnetwork. Result: Extensive experimental evaluations demonstrate that the proposed method achieves state-of-the-art performance on different datasets, showing its effectiveness in reducing ghosting artifacts and improving HDR video reconstruction. Conclusion: The proposed dual-camera system (DCS) and exposure-adaptive fusion network (EAFNet) achieve state-of-the-art performance in HDR video reconstruction, effectively addressing flickering caused by exposure fluctuations. Abstract: In HDR video reconstruction, exposure fluctuations in reference images from alternating exposure methods often result in flickering. To address this issue, we propose a dual-camera system (DCS) for HDR video acquisition, where one camera is assigned to capture consistent reference sequences, while the other is assigned to capture non-reference sequences for information supplementation. To tackle the challenges posed by video data, we introduce an exposure-adaptive fusion network (EAFNet) to achieve more robust results. EAFNet introduced a pre-alignment subnetwork to explore the influence of exposure, selectively emphasizing the valuable features across different exposure levels. 
Then, the enhanced features are fused by the asymmetric cross-feature fusion subnetwork, which explores reference-dominated attention maps to improve image fusion by aligning cross-scale features and performing cross-feature fusion. Finally, the reconstruction subnetwork adopts a DWT-based multiscale architecture to reduce ghosting artifacts and refine features at different resolutions. Extensive experimental evaluations demonstrate that the proposed method achieves state-of-the-art performance on different datasets, validating the great potential of the DCS in HDR video reconstruction. The codes and data captured by DCS will be available at https://github.com/zqqqyu/DCS.

[73] Cross-Modal Dual-Causal Learning for Long-Term Action Recognition

Xu Shaowu,Jia Xibin,Gao Junyu,Sun Qianmei,Chang Jing,Fan Chao

Main category: cs.CV

TL;DR: This paper proposes CMDCL, a new method for long-term action recognition that uses a dual-causal intervention mechanism to handle cross-modal biases and visual confounders, and performs well on multiple benchmarks.

Details Motivation: Although vision-language models show promise for long-term action recognition, they often rely on statistical correlations rather than causal mechanisms, and existing causality-based methods lack cross-modal causal modeling. Method: CMDCL applies dual-causal interventions: a textual causal intervention and a visual causal intervention guided by the debiased text. Result: The method is shown to be effective on three benchmarks: Charades, Breakfast, and COIN. Conclusion: By introducing a structural causal model that uncovers causal relationships between videos and label texts, CMDCL effectively addresses cross-modal biases and visual confounders in long-term action recognition. Abstract: Long-term action recognition (LTAR) is challenging due to extended temporal spans with complex atomic action correlations and visual confounders. Although vision-language models (VLMs) have shown promise, they often rely on statistical correlations instead of causal mechanisms. Moreover, existing causality-based methods address modal-specific biases but lack cross-modal causal modeling, limiting their utility in VLM-based LTAR. This paper proposes Cross-Modal Dual-Causal Learning (CMDCL), which introduces a structural causal model to uncover causal relationships between videos and label texts. CMDCL addresses cross-modal biases in text embeddings via textual causal intervention and removes confounders inherent in the visual modality through visual causal intervention guided by the debiased text. These dual-causal interventions enable robust action representations to address LTAR challenges. Experimental results on three benchmarks, Charades, Breakfast, and COIN, demonstrate the effectiveness of the proposed model. Our code is available at https://github.com/xushaowu/CMDCL.

[74] Omni-Fusion of Spatial and Spectral for Hyperspectral Image Segmentation

Qing Zhang,Guoquan Pei,Yan Wang

Main category: cs.CV

TL;DR: This paper proposes Omni-Fuse, a novel hyperspectral image segmentation network that improves the segmentation of medical hyperspectral images through several cross-dimensional feature fusion mechanisms.

Details Motivation: Medical hyperspectral imaging (MHSI) can help identify subtle biochemical properties of tissues, but its high dimensionality and spectral redundancy make it difficult to fuse spatial and spectral information effectively. Method: A new spatial-spectral omni-fusion network, Omni-Fuse, comprising a cross-dimensional enhancement module, spectral-guided spatial query selection, and a two-stage cross-dimensional decoder. Result: Experiments on two microscopic hyperspectral image datasets show an improvement of over 5.73 percent in DSC compared with state-of-the-art methods. Conclusion: Omni-Fuse is an efficient hyperspectral image segmentation network that significantly improves segmentation performance through cross-dimensional feature fusion operations. Abstract: Medical Hyperspectral Imaging (MHSI) has emerged as a promising tool for enhanced disease diagnosis, particularly in computational pathology, offering rich spectral information that aids in identifying subtle biochemical properties of tissues. Despite these advantages, effectively fusing both spatial-dimensional and spectral-dimensional information from MHSIs remains challenging due to their inherent high dimensionality and spectral redundancy. To solve the above challenges, we propose a novel spatial-spectral omni-fusion network for hyperspectral image segmentation, named Omni-Fuse. Here, we introduce abundant cross-dimensional feature fusion operations, including a cross-dimensional enhancement module that refines both spatial and spectral features through bidirectional attention mechanisms, a spectral-guided spatial query selection to select the most spectral-related spatial feature as the query, and a two-stage cross-dimensional decoder that dynamically guides the model to focus on the selected spatial query. Despite its numerous attention blocks, Omni-Fuse remains efficient in execution. Experiments on two microscopic hyperspectral image datasets show that our approach can significantly improve the segmentation performance compared with the state-of-the-art methods, with over 5.73 percent improvement in DSC. Code available at: https://github.com/DeepMed-Lab-ECNU/Omni-Fuse.

[75] PointVDP: Learning View-Dependent Projection by Fireworks Rays for 3D Point Cloud Segmentation

Yang Chen,Yueqi Duan,Haowen Sun,Ziwei Wang,Jiwen Lu,Yap-Peng Tan

Main category: cs.CV

TL;DR: This paper proposes PointVDP, a novel view-dependent projection method for point cloud segmentation that addresses the limited projection diversity and computational inefficiency of conventional methods.

Details Motivation: Existing projection-based methods use pre-defined, view-independent projection parameters, which fail to capture sufficient view diversity in complex scenes, while multi-projection strategies incur excessive computational overhead. Method: A data-driven view-dependent projection (VDP) framework that combines a fireworks-inspired ray prediction method with a color regularization strategy to optimize feature representation in the 2D image and reduce computational redundancy. Result: Experiments show that PointVDP achieves efficient semantic understanding at marginal computational cost, offering a resource-friendly solution for point cloud segmentation. Conclusion: With a view-dependent projection framework that dynamically adapts to variations in spatial geometry, PointVDP handles point cloud segmentation efficiently and achieves competitive results on the S3DIS and ScanNet benchmarks. Abstract: In this paper, we propose view-dependent projection (VDP) to facilitate point cloud segmentation, designing efficient 3D-to-2D mapping that dynamically adapts to the spatial geometry from view variations. Existing projection-based methods leverage view-independent projection in complex scenes, relying on straight lines to generate direct rays or upward curves to reduce occlusions. However, their view independence provides projection rays that are limited to pre-defined parameters by human settings, restricting point awareness and failing to capture sufficient projection diversity across different view planes. Although multiple projections per view plane are commonly used to enhance spatial variety, the projected redundancy leads to excessive computational overhead and inefficiency in image processing. To address these limitations, we design a framework of VDP to generate data-driven projections from 3D point distributions, producing highly informative single-image inputs by predicting rays inspired by the adaptive behavior of fireworks. In addition, we construct color regularization to optimize the framework, which emphasizes essential features within semantic pixels and suppresses the non-semantic features within black pixels, thereby maximizing 2D space utilization in a projected image. As a result, our approach, PointVDP, develops lightweight projections at marginal computation costs. Experiments on S3DIS and ScanNet benchmarks show that our approach achieves competitive results, offering a resource-efficient solution for semantic understanding.

[76] EXAONE Path 2.0: Pathology Foundation Model with End-to-End Supervision

Myungjang Pyeon,Janghyeon Lee,Minsoo Lee,Juseung Yun,Hwanil Choi,Jonghyun Kim,Jiwon Kim,Yi Hu,Jongseong Jang,Soonyoung Lee

Main category: cs.CV

TL;DR: EXAONE Path 2.0 improves patch-level representation learning via direct slide-level supervision, outperforming SSL methods in biomarker prediction while being highly data-efficient.

Details Motivation: Patch-level SSL methods may overlook complex domain-specific features essential for biomarker prediction due to reliance on basic augmentations and small patch-level areas. Additionally, SSL approaches are less data-efficient compared to supervised methods, requiring extensive resources to achieve competitive performance. Method: The proposed method involves training patch-level representations directly under slide-level supervision to overcome the limitations of patch-level self-supervised learning (SSL). Result: Using only 37k WSIs for training, EXAONE Path 2.0 achieves state-of-the-art average performance across 10 biomarker prediction tasks, demonstrating remarkable data efficiency. Conclusion: EXAONE Path 2.0 is a pathology foundation model that learns patch-level representations under direct slide-level supervision, achieving state-of-the-art performance in biomarker prediction tasks with high data efficiency. Abstract: In digital pathology, whole-slide images (WSIs) are often difficult to handle due to their gigapixel scale, so most approaches train patch encoders via self-supervised learning (SSL) and then aggregate the patch-level embeddings via multiple instance learning (MIL) or slide encoders for downstream tasks. However, patch-level SSL may overlook complex domain-specific features that are essential for biomarker prediction, such as mutation status and molecular characteristics, as SSL methods rely only on basic augmentations selected for natural image domains on small patch-level area. Moreover, SSL methods remain less data efficient than fully supervised approaches, requiring extensive computational resources and datasets to achieve competitive performance. To address these limitations, we present EXAONE Path 2.0, a pathology foundation model that learns patch-level representations under direct slide-level supervision. 
Using only 37k WSIs for training, EXAONE Path 2.0 achieves state-of-the-art average performance across 10 biomarker prediction tasks, demonstrating remarkable data efficiency.

[77] Learning from Sparse Point Labels for Dense Carcinosis Localization in Advanced Ovarian Cancer Assessment

Farahdiba Zarin,Riccardo Oliva,Vinkle Srivastav,Armine Vardazaryan,Andrea Rosati,Alice Zampolini Faustini,Giovanni Scambia,Anna Fagotti,Pietro Mascagni,Nicolas Padoy

Main category: cs.CV

TL;DR: This paper proposes a novel loss function, Crag and Tail loss, to enable efficient sparse heatmap regression for carcinosis keypoint localization from laparoscopic video frames, addressing the challenge of learning from limited annotations.

Details Motivation: The motivation stems from the challenge of obtaining dense pixel-level annotations in medical tasks due to high annotation costs, particularly for newly introduced tasks like 2D carcinosis keypoint localization in laparoscopic video frames. Method: The authors formulated the task as a sparse heatmap regression problem using only a few point annotations per image. They introduced a new loss function called Crag and Tail loss to address challenges in learning from sparse data while minimizing the impact of false negatives. Result: Through an extensive ablation study, the approach demonstrated effectiveness in achieving accurate dense localization of carcinosis keypoints, showing potential for advancing research where dense annotations are difficult to obtain. Conclusion: The paper concludes that the proposed Crag and Tail loss function effectively enables sparse heatmap regression for accurate dense localization of carcinosis keypoints, despite limited pixel-level annotations. Abstract: Learning from sparse labels is a challenge commonplace in the medical domain. This is due to numerous factors, such as annotation cost, and is especially true for newly introduced tasks. When dense pixel-level annotations are needed, this becomes even more unfeasible. However, being able to learn from just a few annotations at the pixel-level, while extremely difficult and underutilized, can drive progress in studies where perfect annotations are not immediately available. This work tackles the challenge of learning the dense prediction task of keypoint localization from a few point annotations in the context of 2d carcinosis keypoint localization from laparoscopic video frames for diagnostic planning of advanced ovarian cancer patients. To enable this, we formulate the problem as a sparse heatmap regression from a few point annotations per image and propose a new loss function, called Crag and Tail loss, for efficient learning. 
Our proposed loss function effectively leverages positive sparse labels while minimizing the impact of false negatives or missed annotations. Through an extensive ablation study, we demonstrate the effectiveness of our approach in achieving accurate dense localization of carcinosis keypoints, highlighting its potential to advance research in scenarios where dense annotations are challenging to obtain.

[78] ClipGS: Clippable Gaussian Splatting for Interactive Cinematic Visualization of Volumetric Medical Data

Chengkun Li,Yuqi Tong,Kai Chen,Zhenya Yang,Ruiyang Li,Shi Qiu,Jason Ying-Kuen Chan,Pheng-Ann Heng,Qi Dou

Main category: cs.CV

TL;DR: This paper proposes ClipGS, a Gaussian splatting-based framework for interactive visualization of volumetric medical data, which significantly improves rendering quality and efficiency through a learnable truncation scheme and an adaptive adjustment model.

Details Motivation: Visualizing volumetric medical data is essential for diagnosis, surgical planning, and education, but existing cinematic rendering techniques are too computationally expensive and too slow to meet interactive requirements. Method: ClipGS, a Gaussian splatting framework that supports clipping planes, with a learnable truncation scheme and an adaptive adjustment model to optimize rendering. Result: Experiments on five volumetric medical datasets show that ClipGS reaches an average rendering quality of 36.635 PSNR at 156 FPS with a 16.1 MB model size, outperforming existing methods in rendering quality and efficiency. Conclusion: ClipGS enables efficient visualization of volumetric medical data; its learnable truncation scheme and adaptive adjustment model improve the interactive experience while maintaining high rendering quality. Abstract: The visualization of volumetric medical data is crucial for enhancing diagnostic accuracy and improving surgical planning and education. Cinematic rendering techniques significantly enrich this process by providing high-quality visualizations that convey intricate anatomical details, thereby facilitating better understanding and decision-making in medical contexts. However, the high computing cost and low rendering speed make it difficult to meet the requirement of interactive visualization in practical applications. In this paper, we introduce ClipGS, an innovative Gaussian splatting framework with clipping plane support, for interactive cinematic visualization of volumetric medical data. To address the challenges posed by dynamic interactions, we propose a learnable truncation scheme that automatically adjusts the visibility of Gaussian primitives in response to the clipping plane. We also design an adaptive adjustment model to dynamically adjust the deformation of Gaussians and refine the rendering performance. We validate our method on five volumetric medical datasets (including CT and anatomical slice data), and reach an average 36.635 PSNR rendering quality with 156 FPS and 16.1 MB model size, outperforming state-of-the-art methods in rendering quality and efficiency.
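To make the clipping-plane idea concrete, here is a minimal sketch of how Gaussian primitives could be hidden or faded relative to a plane. It is a hand-written baseline, not ClipGS itself: the paper learns its truncation scheme, whereas the hard test and the sigmoid fade below (with the hypothetical `sharpness` parameter) are fixed rules chosen for illustration. The plane is assumed to be given as a unit normal and an offset.

```python
import math

def signed_distance(point, normal, offset):
    """Signed distance from a point to the plane normal . x + offset = 0.

    Assumes `normal` has unit length.
    """
    return sum(p * n for p, n in zip(point, normal)) + offset

def clip_gaussians(centers, normal, offset):
    """Indices of Gaussians whose centres lie on the visible side of the plane."""
    return [
        i for i, c in enumerate(centers)
        if signed_distance(c, normal, offset) >= 0.0
    ]

def soft_visibility(distance, sharpness=10.0):
    """Smooth visibility in (0, 1): a hand-picked sigmoid standing in for a learned rule."""
    return 1.0 / (1.0 + math.exp(-sharpness * distance))

centers = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0), (0.0, 0.0, 0.5)]
visible = clip_gaussians(centers, normal=(0.0, 0.0, 1.0), offset=-0.25)
print(visible)
```

A hard test on centres ignores each Gaussian's spatial extent, so primitives that straddle the plane are either fully kept or fully dropped; that is presumably the kind of artifact a learned truncation is meant to avoid.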

[79] Diff$^2$I2P: Differentiable Image-to-Point Cloud Registration with Diffusion Prior

Juncheng Mu,Chengwei Ren,Weixiang Zhang,Liang Pan,Xiao-Ping Zhang,Yue Gao

Main category: cs.CV

TL;DR: This paper proposes Diff$^2$I2P, which uses a diffusion model as a prior and, through its CSD and DCT techniques, addresses the modality gap in image-to-point cloud registration and significantly improves registration performance.

Details Motivation: Existing methods fail to address the modality gap in image-to-point cloud registration. Inspired by the success of diffusion models in cross-modal generation, the authors propose a new approach to improve the accuracy of cross-modal correspondences. Method: A Control-Side Score Distillation (CSD) technique and a Deformable Correspondence Tuning (DCT) module, combined with a differentiable PnP solver, enable end-to-end optimization of cross-modal feature learning and transformation estimation. Result: Diff$^2$I2P consistently outperforms state-of-the-art image-to-point cloud registration methods in experiments, with over 7% improvement in registration recall on the 7-Scenes benchmark. Conclusion: By introducing a diffusion model as a prior, Diff$^2$I2P significantly improves image-to-point cloud registration, achieving over 7% improvement in registration recall on the 7-Scenes benchmark. Abstract: Learning cross-modal correspondences is essential for image-to-point cloud (I2P) registration. Existing methods achieve this mostly by utilizing metric learning to enforce feature alignment across modalities, disregarding the inherent modality gap between image and point data. Consequently, this paradigm struggles to ensure accurate cross-modal correspondences. To this end, inspired by the cross-modal generation success of recent large diffusion models, we propose Diff$^2$I2P, a fully Differentiable I2P registration framework, leveraging a novel and effective Diffusion prior for bridging the modality gap. Specifically, we propose a Control-Side Score Distillation (CSD) technique to distill knowledge from a depth-conditioned diffusion model to directly optimize the predicted transformation. However, the gradients on the transformation fail to backpropagate onto the cross-modal features due to the non-differentiability of correspondence retrieval and PnP solver. To this end, we further propose a Deformable Correspondence Tuning (DCT) module to estimate the correspondences in a differentiable way, followed by the transformation estimation using a differentiable PnP solver. With these two designs, the Diffusion model serves as a strong prior to guide the cross-modal feature learning of image and point cloud for forming robust correspondences, which significantly improves the registration. Extensive experimental results demonstrate that Diff$^2$I2P consistently outperforms SoTA I2P registration methods, achieving over 7% improvement in registration recall on the 7-Scenes benchmark.

[80] MS-DPPs: Multi-Source Determinantal Point Processes for Contextual Diversity Refinement of Composite Attributes in Text to Image Retrieval

Naoya Sogi,Takashi Shibata,Makoto Terao,Masanori Suganuma,Takayuki Okatani

Main category: cs.CV

TL;DR: This paper proposes a new approach called Multi-Source DPPs to enhance the efficiency of Text-to-Image Retrieval by refining the diversities of multiple attributes according to the application's context.

Details Motivation: The motivation is to overcome the limitations of conventional methods in Result diversification (RD) for Text-to-Image Retrieval, which focus solely on increasing the diversity metric of image appearances without considering varying diversity metrics and their desired values across different applications. Method: The authors propose Multi-Source DPPs which extend the Determinantal Point Process to multi-sources. They model MS-DPP as a single DPP model with a unified similarity matrix based on a manifold representation and introduce Tangent Normalization to reflect contexts. Result: Extensive experiments demonstrate the effectiveness of the proposed Multi-Source DPPs method in refining the diversities of multiple attributes according to the application's context. Conclusion: The paper concludes that the proposed Multi-Source DPPs method effectively addresses the task of Contextual Diversity Refinement of Composite Attributes (CDR-CA) by extending Determinantal Point Process (DPP) to multi-sources and modeling it as a single DPP with a unified similarity matrix based on a manifold representation, along with the introduction of Tangent Normalization to reflect contexts. Abstract: Result diversification (RD) is a crucial technique in Text-to-Image Retrieval for enhancing the efficiency of a practical application. Conventional methods focus solely on increasing the diversity metric of image appearances. However, the diversity metric and its desired value vary depending on the application, which limits the applications of RD. This paper proposes a novel task called CDR-CA (Contextual Diversity Refinement of Composite Attributes). CDR-CA aims to refine the diversities of multiple attributes, according to the application's context. To address this task, we propose Multi-Source DPPs, a simple yet strong baseline that extends the Determinantal Point Process (DPP) to multi-sources. 
We model MS-DPP as a single DPP model with a unified similarity matrix based on a manifold representation. We also introduce Tangent Normalization to reflect contexts. Extensive experiments demonstrate the effectiveness of the proposed method. Our code is publicly available at https://github.com/NEC-N-SOGI/msdpp.
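As background for the DPP-based selection, the sketch below runs a naive greedy search for a diverse subset under a similarity kernel. Two caveats: the weighted sum in `combine_kernels` is only a stand-in for the paper's unified, manifold-based similarity matrix, and Tangent Normalization is not modelled. The function names and the toy kernels are hypothetical.

```python
def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, d = len(a), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def combine_kernels(kernels, weights):
    """Weighted sum of per-attribute similarity matrices (illustrative assumption)."""
    n = len(kernels[0])
    return [
        [sum(w * k[i][j] for k, w in zip(kernels, weights)) for j in range(n)]
        for i in range(n)
    ]

def greedy_dpp(kernel, k):
    """Greedily add the item that maximizes det of the selected sub-kernel."""
    selected = []
    for _ in range(k):
        best, best_det = None, -1.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            value = det([[kernel[a][b] for b in idx] for a in idx])
            if value > best_det:
                best, best_det = i, value
        selected.append(best)
    return selected

# Toy kernels: items 0 and 1 look alike in "appearance"; items 0 and 2 in "color".
appearance = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
color = [[1.0, 0.1, 0.9], [0.1, 1.0, 0.1], [0.9, 0.1, 1.0]]
print(greedy_dpp(combine_kernels([appearance, color], [1.0, 0.0]), 2))
print(greedy_dpp(combine_kernels([appearance, color], [0.0, 1.0]), 2))
```

Changing the weights changes which attribute's diversity drives the selection, which mirrors the abstract's goal of refining diversity per attribute according to context.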

[81] Enhancing Diffusion Model Stability for Image Restoration via Gradient Management

Hongjie Wu,Mingqin Zhang,Linchao He,Ji-Zhe Zhou,Jiancheng Lv

Main category: cs.CV

TL;DR: This paper introduces SPGD, a novel gradient management technique for image restoration that addresses instabilities in the generative process, enhancing generation stability and achieving superior results.

Details Motivation: The authors aim to explore the interactions between denoising and likelihood guidance steps in Bayesian inference for image restoration, addressing identified instabilities. Method: The paper proposes Stabilized Progressive Gradient Diffusion (SPGD), which includes a progressive likelihood warm-up strategy and adaptive directional momentum (ADM) smoothing. Result: The experiments show that SPGD significantly enhances generation stability across diverse restoration tasks. Conclusion: SPGD enhances generation stability and achieves state-of-the-art performance in quantitative metrics and visually superior results. Abstract: Diffusion models have shown remarkable promise for image restoration by leveraging powerful priors. Prominent methods typically frame the restoration problem within a Bayesian inference framework, which iteratively combines a denoising step with a likelihood guidance step. However, the interactions between these two components in the generation process remain underexplored. In this paper, we analyze the underlying gradient dynamics of these components and identify significant instabilities. Specifically, we demonstrate conflicts between the prior and likelihood gradient directions, alongside temporal fluctuations in the likelihood gradient itself. We show that these instabilities disrupt the generative process and compromise restoration performance. To address these issues, we propose Stabilized Progressive Gradient Diffusion (SPGD), a novel gradient management technique. SPGD integrates two synergistic components: (1) a progressive likelihood warm-up strategy to mitigate gradient conflicts; and (2) adaptive directional momentum (ADM) smoothing to reduce fluctuations in the likelihood gradient. Extensive experiments across diverse restoration tasks demonstrate that SPGD significantly enhances generation stability, leading to state-of-the-art performance in quantitative metrics and visually superior results. 
Code is available at https://github.com/74587887/SPGD.

[82] MK-Pose: Category-Level Object Pose Estimation via Multimodal-Based Keypoint Learning

Yifan Yang,Peili Song,Enfan Lan,Dong Liu,Jingtai Liu

Main category: cs.CV

TL;DR: This paper introduces MK-Pose, a novel framework for category-level object pose estimation that combines RGB images, point clouds, and textual descriptions, achieving better accuracy than existing methods.

Details Motivation: Category-level object pose estimation is crucial for applications such as warehouse automation and manufacturing, but current methods struggle with occlusion and generalization across instances and categories. This motivates the need for a more robust solution like MK-Pose. Method: The paper proposes MK-Pose, a multimodal-based keypoint learning framework that integrates RGB images, point clouds, and textual descriptions. It includes a self-supervised keypoint detection module with attention-based query generation, soft heatmap matching, and graph-based relational modeling, as well as a graph-enhanced feature fusion module. Result: MK-Pose was evaluated on CAMERA25 and REAL275 datasets and tested for cross-dataset capability on HouseCat6D dataset, showing superior performance compared to existing methods in terms of IoU and average precision. Conclusion: MK-Pose is able to outperform existing state-of-the-art methods in both IoU and average precision without shape priors, demonstrating its effectiveness for category-level object pose estimation. Abstract: Category-level object pose estimation, which predicts the pose of objects within a known category without prior knowledge of individual instances, is essential in applications like warehouse automation and manufacturing. Existing methods relying on RGB images or point cloud data often struggle with object occlusion and generalization across different instances and categories. This paper proposes a multimodal-based keypoint learning framework (MK-Pose) that integrates RGB images, point clouds, and category-level textual descriptions. The model uses a self-supervised keypoint detection module enhanced with attention-based query generation, soft heatmap matching and graph-based relational modeling. Additionally, a graph-enhanced feature fusion module is designed to integrate local geometric information and global context. 
MK-Pose is evaluated on the CAMERA25 and REAL275 datasets, and is further tested for cross-dataset capability on the HouseCat6D dataset. The results demonstrate that MK-Pose outperforms existing state-of-the-art methods in both IoU and average precision without shape priors. Code will be released at https://github.com/yangyifanYYF/MK-Pose.

[83] FlexGaussian: Flexible and Cost-Effective Training-Free Compression for 3D Gaussian Splatting

Boyuan Tian,Qizhe Gao,Siran Xianyu,Xiaotong Cui,Minjia Zhang

Main category: cs.CV

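TL;DR: This paper proposes FlexGaussian, a compression method for 3D Gaussian splatting that achieves efficient compression without retraining and adapts to different compression targets.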

Details Motivation: Existing compression methods effectively reduce 3D Gaussian parameters, but they usually require extensive retraining or fine-tuning and lack flexibility under varying compression constraints. Method: The approach combines mixed-precision quantization with attribute-discriminative pruning to achieve training-free 3D Gaussian compression. Result: FlexGaussian achieves up to 96.4% compression while maintaining high rendering quality, is faster than existing methods, and is deployable on mobile devices. Conclusion: FlexGaussian is a flexible and efficient method that addresses the compression of 3D Gaussian scene representations and is suited to deployment on resource-constrained devices. Abstract: 3D Gaussian splatting has become a prominent technique for representing and rendering complex 3D scenes, due to its high fidelity and speed advantages. However, the growing demand for large-scale models calls for effective compression to reduce memory and computation costs, especially on mobile and edge devices with limited resources. Existing compression methods effectively reduce 3D Gaussian parameters but often require extensive retraining or fine-tuning, lacking flexibility under varying compression constraints. In this paper, we introduce FlexGaussian, a flexible and cost-effective method that combines mixed-precision quantization with attribute-discriminative pruning for training-free 3D Gaussian compression. FlexGaussian eliminates the need for retraining and adapts easily to diverse compression targets. Evaluation results show that FlexGaussian achieves up to 96.4% compression while maintaining high rendering quality (<1 dB drop in PSNR), and is deployable on mobile devices. FlexGaussian delivers high compression ratios within seconds, being 1.7-2.1x faster than state-of-the-art training-free methods and 10-100x faster than training-involved approaches. The code is being prepared and will be released soon at: https://github.com/Supercomputing-System-AI-Lab/FlexGaussian
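To make the two ingredients named in the abstract concrete, here is a minimal Python sketch of importance-based pruning followed by uniform quantization at a chosen bit width. It is an illustration under stated assumptions, not FlexGaussian's implementation: the function names, the importance scores, and the min-max quantization scheme are all hypothetical, and the abstract does not say how attributes are scored or which precisions are assigned.

```python
from typing import List, Tuple


def prune_by_importance(values: List[float], importance: List[float],
                        keep_ratio: float) -> List[float]:
    """Keep the most important fraction of entries, preserving their order.

    `importance` is a hypothetical per-entry score; the real criterion
    used by the paper is not given in the abstract.
    """
    n_keep = max(1, round(len(values) * keep_ratio))
    ranked = sorted(range(len(values)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [values[i] for i in kept]


def quantize_uniform(values: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Map floats to integer codes in [0, 2**bits - 1] using min-max scaling."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes: List[int], lo: float, scale: float) -> List[float]:
    """Recover approximate floats from integer codes."""
    return [lo + c * scale for c in codes]
```

In a mixed-precision setting one would call `quantize_uniform` with a different `bits` value per attribute (for example, more bits for positions than for color coefficients); which attributes get which precision is exactly the kind of choice the abstract leaves unspecified.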

[84] Text-promptable Object Counting via Quantity Awareness Enhancement

Miaojing Shi,Xiaowen Zhang,Zijie Yue,Yong Luo,Cairong Zhao,Li Li

Main category: cs.CV

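TL;DR: This paper proposes QUANet, a quantity-aware model for text-promptable object counting. It introduces quantity-oriented text prompts and a vision-text quantity alignment loss, combines them with a dual-stream adaptive counting decoder, and validates zero-shot class-agnostic counting generalization on several standard datasets.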

Details Motivation: Existing methods address object counting with text prompts that specify the object category in the image, but this is insufficient for training a model to accurately distinguish object quantities. A more effective model is needed to improve quantity awareness. Method: The paper proposes QUANet, which includes quantity-oriented text prompts with a vision-text quantity alignment loss and a dual-stream adaptive counting decoder (a Transformer stream, a CNN stream, and Transformer-to-CNN enhancement adapters). A cross-stream quantity ranking loss is applied at the end to optimize the predictions. Result: Extensive experiments on standard datasets such as FSC-147, CARPK, PUCPR+, and ShanghaiTech show that the model generalizes strongly in zero-shot class-agnostic counting. Conclusion: By introducing quantity-oriented text prompts and a dual-stream architecture, QUANet improves performance on object counting and generalizes well in the zero-shot setting. Abstract: Recent advances in large vision-language models (VLMs) have shown remarkable progress in solving the text-promptable object counting problem. Representative methods typically specify text prompts with object category information in images. This, however, is insufficient for training the model to accurately distinguish the number of objects in the counting task. To this end, we propose QUANet, which introduces novel quantity-oriented text prompts with a vision-text quantity alignment loss to enhance the model's quantity awareness. Moreover, we propose a dual-stream adaptive counting decoder consisting of a Transformer stream, a CNN stream, and a number of Transformer-to-CNN enhancement adapters (T2C-adapters) for density map prediction. The T2C-adapters facilitate the effective knowledge communication and aggregation between the Transformer and CNN streams. A cross-stream quantity ranking loss is proposed in the end to optimize the ranking orders of predictions from the two streams. Extensive experiments on standard benchmarks such as FSC-147, CARPK, PUCPR+, and ShanghaiTech demonstrate our model's strong generalizability for zero-shot class-agnostic counting. Code is available at https://github.com/viscom-tongji/QUANet

[85] StixelNExT++: Lightweight Monocular Scene Segmentation and Representation for Collective Perception

Marcel Vosshans,Omar Ait-Aider,Youcef Mezouar,Markus Enzweiler

Main category: cs.CV

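TL;DR: This paper proposes StixelNExT++, a new approach to scene representation for monocular perception systems that offers efficient scene compression, real-time performance, and potential applications in autonomous systems.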

Details Motivation: An efficient scene representation is needed that both compresses scene information and adapts to point cloud and bird's-eye-view representations, supporting real-time, high-accuracy perception tasks. Method: Building on the established Stixel representation, the method infers 3D Stixels and enhances object segmentation by clustering smaller 3D Stixel units. It uses a lightweight neural network trained on automatically generated LiDAR-based labels. Result: Experiments show that StixelNExT++ achieves competitive performance within a 30-meter range on the Waymo dataset, with real-time computation times as low as 10 ms per frame. Conclusion: StixelNExT++ shows potential for collective perception in autonomous systems and provides an efficient, real-time solution for scene representation in monocular perception systems. Abstract: This paper presents StixelNExT++, a novel approach to scene representation for monocular perception systems. Building on the established Stixel representation, our method infers 3D Stixels and enhances object segmentation by clustering smaller 3D Stixel units. The approach achieves high compression of scene information while remaining adaptable to point cloud and bird's-eye-view representations. Our lightweight neural network, trained on automatically generated LiDAR-based ground truth, achieves real-time performance with computation times as low as 10 ms per frame. Experimental results on the Waymo dataset demonstrate competitive performance within a 30-meter range, highlighting the potential of StixelNExT++ for collective perception in autonomous systems.

[86] Spatial-Temporal Graph Mamba for Music-Guided Dance Video Synthesis

Hao Tang,Ling Shao,Zhenyu Zhang,Luc Van Gool,Nicu Sebe

Main category: cs.CV

TL;DR: This paper proposes a novel spatial-temporal graph Mamba (STG-Mamba) model for the music-guided dance video synthesis task.

Details Motivation: Generating a dance video from input music requires a method that effectively captures dependencies between joints in both the spatial and temporal dimensions. Method: STG-Mamba consists of two translation mappings: music-to-skeleton translation and skeleton-to-video translation. For music-to-skeleton translation, a new spatial-temporal graph Mamba (STGM) block constructs skeleton sequences. For skeleton-to-video translation, a new self-supervised regularization network translates the generated skeletons and a conditional image into a dance video. Result: The authors collected a new skeleton-to-video translation dataset containing 54,944 video clips and ran extensive experiments, which show that STG-Mamba significantly outperforms existing methods. Conclusion: The proposed STG-Mamba model performs well on music-guided dance video synthesis and achieves better results than existing methods. Abstract: We propose a novel spatial-temporal graph Mamba (STG-Mamba) for the music-guided dance video synthesis task, i.e., to translate the input music to a dance video. STG-Mamba consists of two translation mappings: music-to-skeleton translation and skeleton-to-video translation. In the music-to-skeleton translation, we introduce a novel spatial-temporal graph Mamba (STGM) block to effectively construct skeleton sequences from the input music, capturing dependencies between joints in both the spatial and temporal dimensions. For the skeleton-to-video translation, we propose a novel self-supervised regularization network to translate the generated skeletons, along with a conditional image, into a dance video. Lastly, we collect a new skeleton-to-video translation dataset from the Internet, containing 54,944 video clips. Extensive experiments demonstrate that STG-Mamba achieves significantly better results than existing methods.

[87] A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding

Zhenyang Liu,Sixiao Zheng,Siyu Chen,Cairong Zhao,Longfei Liang,Xiangyang Xue,Yanwei Fu

Main category: cs.CV

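TL;DR: This paper proposes SpatialReasoner, a new framework that combines a large language model with a visual properties-enhanced hierarchical feature field to improve spatial reasoning in open-vocabulary 3D visual grounding.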

Details Motivation: Existing language field methods struggle to accurately localize instances using spatial relations in language queries (such as "the book on the chair"), mainly because of inadequate reasoning about spatial relations in both language queries and 3D scenes. Method: The method fine-tunes a large language model to capture spatial relations and explicitly infer instructions for the target, anchor, and spatial relation. It also incorporates visual properties (opacity and color) to construct a hierarchical feature field, represents language and instance features using CLIP features and masks extracted with SAM, and queries the field hierarchically to localize the target 3D instance. Result: Experiments show that the framework can be seamlessly integrated into different neural representations, outperforming baseline models in 3D visual grounding while strengthening their spatial reasoning capability. Conclusion: SpatialReasoner is a new neural representation-based framework that uses LLM-driven spatial reasoning and a visual properties-enhanced hierarchical feature field to achieve open-vocabulary 3D visual grounding. Abstract: Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries, which is crucial for embodied AI applications such as autonomous navigation, robotics, and augmented reality. Learning 3D language fields through neural representations enables accurate understanding of 3D scenes from limited viewpoints and facilitates the localization of target objects in complex environments. However, existing language field methods struggle to accurately localize instances using spatial relations in language queries, such as "the book on the chair." This limitation mainly arises from inadequate reasoning about spatial relations in both language queries and 3D scenes. In this work, we propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning that constructs a visual properties-enhanced hierarchical feature field for open-vocabulary 3D visual grounding. To enable spatial reasoning in language queries, SpatialReasoner fine-tunes an LLM to capture spatial relations and explicitly infer instructions for the target, anchor, and spatial relation. To enable spatial reasoning in 3D scenes, SpatialReasoner incorporates visual properties (opacity and color) to construct a hierarchical feature field. This field represents language and instance features using distilled CLIP features and masks extracted via the Segment Anything Model (SAM). The field is then queried using the inferred instructions in a hierarchical manner to localize the target 3D instance based on the spatial relation in the language query.
Extensive experiments show that our framework can be seamlessly integrated into different neural representations, outperforming baseline models in 3D visual grounding while empowering their spatial reasoning capability.

[88] Hierarchical Feature Alignment for Gloss-Free Sign Language Translation

Sobhan Asasi,Mohamed Ilyes Lakhal,Richard Bowden

Main category: cs.CV

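TL;DR: This paper proposes a hierarchical pre-training strategy based on pseudo-glosses and contrastive video-language alignment, which improves gloss-free sign language translation.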

Details Motivation: Existing end-to-end SLT methods suffer from a disparity between visual and textual representations, while advances in large language models (LLMs) have made gloss-free approaches possible. Method: The paper introduces a hierarchical pre-training strategy inspired by the structure of sign language, combining pseudo-glosses with contrastive video-language alignment. Result: The method extracts features hierarchically and aligns them at the frame, segment, and video levels, improving translation quality. Conclusion: Experiments show that the proposed method improves BLEU-4 and ROUGE scores for SLT while maintaining efficiency. Abstract: Sign Language Translation (SLT) attempts to convert sign language videos into spoken sentences. However, many existing methods struggle with the disparity between visual and textual representations during end-to-end learning. Gloss-based approaches help to bridge this gap by leveraging structured linguistic information. While gloss-free methods offer greater flexibility and remove the burden of annotation, they require effective alignment strategies. Recent advances in Large Language Models (LLMs) have enabled gloss-free SLT by generating text-like representations from sign videos. In this work, we introduce a novel hierarchical pre-training strategy inspired by the structure of sign language, incorporating pseudo-glosses and contrastive video-language alignment. Our method hierarchically extracts features at frame, segment, and video levels, aligning them with pseudo-glosses and the spoken sentence to enhance translation quality. Experiments demonstrate that our approach improves BLEU-4 and ROUGE scores while maintaining efficiency.
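As a rough illustration of what contrastive video-language alignment usually means, the sketch below computes an InfoNCE-style loss in the video-to-text direction over a batch of paired embeddings. This is an assumption about the general family of losses, not the paper's objective: the abstract does not give the loss form, the temperature, or how the frame, segment, and video levels are combined, and the function names here are hypothetical.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(video_embs: List[List[float]], text_embs: List[List[float]],
             temperature: float = 0.07) -> float:
    """Mean video-to-text InfoNCE loss; pair i in each list is the positive pair."""
    v = [normalize(e) for e in video_embs]
    t = [normalize(e) for e in text_embs]
    total = 0.0
    for i, vi in enumerate(v):
        logits = [dot(vi, tj) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(v)
```

The loss is small when each video embedding is closest to its own sentence (or pseudo-gloss) embedding and large when the pairing is scrambled, which is the property an alignment stage relies on.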

[89] MADPOT: Medical Anomaly Detection with CLIP Adaptation and Partial Optimal Transport

Mahshid Shiri,Cigdem Beyan,Vittorio Murino

Main category: cs.CV

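TL;DR: This paper proposes an efficient medical image anomaly detection method that combines visual adapters and multi-prompt learning with partial optimal transport and contrastive learning, achieving strong results without synthetic data or memory banks.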

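Details Motivation: Medical anomaly detection is challenged by diverse imaging modalities, large anatomical variations, and limited labeled data, so a more efficient method that does not rely on synthetic data or memory banks is needed to improve detection. Method: The method uses multi-prompt learning and aligns the prompts with local features via POT to capture subtle abnormalities, while contrastive learning strengthens intra-class cohesion and inter-class separation. Result: The method achieves state-of-the-art performance in few-shot, zero-shot, and cross-dataset scenarios. Conclusion: The paper proposes a new approach combining visual adapters and prompt learning that uses Partial Optimal Transport (POT) and contrastive learning (CL) to significantly improve CLIP's adaptability to medical image anomaly detection (AD). Abstract: Medical anomaly detection (AD) is challenging due to diverse imaging modalities, anatomical variations, and limited labeled data. We propose a novel approach combining visual adapters and prompt learning with Partial Optimal Transport (POT) and contrastive learning (CL) to improve CLIP's adaptability to medical images, particularly for AD. Unlike standard prompt learning, which often yields a single representation, our method employs multiple prompts aligned with local features via POT to capture subtle abnormalities. CL further enforces intra-class cohesion and inter-class separation. Our method achieves state-of-the-art results in few-shot, zero-shot, and cross-dataset scenarios without synthetic data or memory banks. The code is available at https://github.com/mahshid1998/MADPOT.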

[90] Residual Prior-driven Frequency-aware Network for Image Fusion

Guan Zheng,Xue Wang,Wenhua Qian,Peng Liu,Runzhuo Ma

Main category: cs.CV

TL;DR: This paper proposes RPFNet for image fusion, which addresses computational costs and captures complementary features effectively through a dual-branch framework and enhanced training strategy.

Details Motivation: Image fusion aims to integrate complementary information across modalities for high-quality fused images. However, constructing long-range feature dependencies in the spatial domain incurs substantial computational costs, and the absence of ground-truth exacerbates capturing complementary features effectively. Method: The proposed RPFNet includes a dual-branch feature extraction framework with Residual Prior Module (RPM) and Frequency Domain Fusion Module (FDFM), enhanced by a Cross Promotion Module (CPM). It also incorporates an auxiliary decoder, saliency structure loss, adaptive weight-based frequency contrastive loss, and SSIM loss during training. Result: Extensive experiments validate that RPFNet effectively integrates discriminative features, enhances texture details and salient objects, and improves the performance of high-level vision tasks. Conclusion: RPFNet can effectively integrate discriminative features, enhance texture details and salient objects, and facilitate the deployment of high-level vision tasks. Abstract: Image fusion aims to integrate complementary information across modalities to generate high-quality fused images, thereby enhancing the performance of high-level vision tasks. While global spatial modeling mechanisms show promising results, constructing long-range feature dependencies in the spatial domain incurs substantial computational costs. Additionally, the absence of ground-truth exacerbates the difficulty of capturing complementary features effectively. To tackle these challenges, we propose a Residual Prior-driven Frequency-aware Network, termed as RPFNet. 
Specifically, RPFNet employs a dual-branch feature extraction framework: the Residual Prior Module (RPM) extracts modality-specific difference information from residual maps, thereby providing complementary priors for fusion; the Frequency Domain Fusion Module (FDFM) achieves efficient global feature modeling and integration through frequency-domain convolution. Additionally, the Cross Promotion Module (CPM) enhances the synergistic perception of local details and global structures through bidirectional feature interaction. During training, we incorporate an auxiliary decoder and saliency structure loss to strengthen the model's sensitivity to modality-specific differences. Furthermore, a combination of adaptive weight-based frequency contrastive loss and SSIM loss effectively constrains the solution space, facilitating the joint capture of local details and global features while ensuring the retention of complementary information. Extensive experiments validate the fusion performance of RPFNet, which effectively integrates discriminative features, enhances texture details and salient objects, and can effectively facilitate the deployment of the high-level vision task.

[91] DIFFUMA: High-Fidelity Spatio-Temporal Video Prediction via Dual-Path Mamba and Diffusion Enhancement

Xinyu Xie,Weifeng Cao,Jun Shi,Yangyang Hu,Hui Liang,Wanyong Liang,Xiaoliang Qian

Main category: cs.CV

TL;DR: This paper introduces CHDL, the first public dataset for semiconductor wafer dicing processes, and proposes DIFFUMA, an innovative model that significantly improves spatio-temporal video prediction accuracy for industrial applications.

Details Motivation: The motivation stems from the lack of specialized benchmark datasets for high-precision industrial scenarios like semiconductor manufacturing, which hinders progress in modeling and predicting complex processes. Method: The authors propose DIFFUMA, a dual-path prediction architecture combining a parallel Mamba module for capturing global temporal context and a diffusion module for restoring spatial details, evaluated through experiments on the CHDL dataset. Result: DIFFUMA achieves a 39% reduction in Mean Squared Error (MSE) and improves Structural Similarity (SSIM) from 0.926 to 0.988 on the CHDL dataset, demonstrating superior performance compared to existing methods. Conclusion: The paper concludes that DIFFUMA outperforms existing methods in spatio-temporal video prediction for industrial scenarios, particularly on the CHDL dataset, and emphasizes its potential to drive future research in industrial AI. Abstract: Spatio-temporal video prediction plays a pivotal role in critical domains, ranging from weather forecasting to industrial automation. However, in high-precision industrial scenarios such as semiconductor manufacturing, the absence of specialized benchmark datasets severely hampers research on modeling and predicting complex processes. To address this challenge, we make a twofold contribution. First, we construct and release the Chip Dicing Lane Dataset (CHDL), the first public temporal image dataset dedicated to the semiconductor wafer dicing process. Captured via an industrial-grade vision system, CHDL provides a much-needed and challenging benchmark for high-fidelity process modeling, defect detection, and digital twin development. Second, we propose DIFFUMA, an innovative dual-path prediction architecture specifically designed for such fine-grained dynamics.
The model captures global long-range temporal context through a parallel Mamba module, while simultaneously leveraging a diffusion module, guided by temporal features, to restore and enhance fine-grained spatial details, effectively combating feature degradation. Experiments demonstrate that on our CHDL benchmark, DIFFUMA significantly outperforms existing methods, reducing the Mean Squared Error (MSE) by 39% and improving the Structural Similarity (SSIM) from 0.926 to a near-perfect 0.988. This superior performance also generalizes to natural phenomena datasets. Our work not only delivers a new state-of-the-art (SOTA) model but, more importantly, provides the community with an invaluable data resource to drive future research in industrial AI.

[92] PromptTea: Let Prompts Tell TeaCache the Optimal Threshold

Zishen Huang,Chunyu Yang,Mengyuan Ren

Main category: cs.CV

TL;DR: This paper proposes PCA caching and DynCFGCache to accelerate video generation inference while preserving quality by adaptively adjusting reuse strategies based on prompt-derived complexity and output variation estimation.

Details Motivation: Existing caching mechanisms, such as fixed-frequency reuse and TeaCache, either degrade quality in complex scenes or suffer from poor input-output modeling. Manual tuning is inefficient and lacks robustness, necessitating a more adaptive solution. Method: The authors proposed Prompt-Complexity-Aware (PCA) caching to adjust reuse thresholds based on scene complexity from input prompts. They enhanced TeaCache by decoupling noisy inputs and applying multivariate polynomial feature expansion. They also introduced DynCFGCache, a dynamic mechanism for selectively reusing CFG outputs. Result: Experiments showed significant acceleration, including a 2.79x speedup on the Wan2.1 model, while maintaining high visual fidelity across various scenes. Conclusion: The paper concludes that PCA caching and DynCFGCache significantly improve video generation inference speed while maintaining high visual fidelity, overcoming limitations of prior methods like TeaCache and CFGCache. Abstract: Despite recent progress in video generation, inference speed remains a major bottleneck. A common acceleration strategy involves reusing model outputs via caching mechanisms at fixed intervals. However, we find that such fixed-frequency reuse significantly degrades quality in complex scenes, while manually tuning reuse thresholds is inefficient and lacks robustness. To address this, we propose Prompt-Complexity-Aware (PCA) caching, a method that automatically adjusts reuse thresholds based on scene complexity estimated directly from the input prompt. By incorporating prompt-derived semantic cues, PCA enables more adaptive and informed reuse decisions than conventional caching methods. We also revisit the assumptions behind TeaCache and identify a key limitation: it suffers from poor input-output relationship modeling due to an oversimplified prior. 
To overcome this, we decouple the noisy input, enhance the contribution of meaningful textual information, and improve the model's predictive accuracy through multivariate polynomial feature expansion. To further reduce computational cost, we replace the static CFGCache with DynCFGCache, a dynamic mechanism that selectively reuses classifier-free guidance (CFG) outputs based on estimated output variations. This allows for more flexible reuse without compromising output quality. Extensive experiments demonstrate that our approach achieves significant acceleration (for example, a 2.79x speedup on the Wan2.1 model) while maintaining high visual fidelity across a range of scenes.
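The reuse decision described above can be pictured with a small Python sketch: accumulate an estimate of how much the output would change from step to step, reuse the cached output while the accumulated estimate stays under a threshold, and recompute once it crosses it. Everything specific here is an assumption made for illustration: the function names, the linear mapping from prompt complexity to threshold, the constants, and even the direction of that mapping (lower threshold for more complex prompts) are hypothetical and are not taken from the paper.

```python
from typing import List


def threshold_for_prompt(complexity: float, base: float = 0.5,
                         floor: float = 0.25) -> float:
    """Hypothetical mapping from a prompt-complexity score in [0, 1] to a
    reuse threshold: more complex scenes get a lower threshold (less reuse)."""
    c = min(max(complexity, 0.0), 1.0)
    return floor + (base - floor) * (1.0 - c)


def plan_reuse(rel_changes: List[float], threshold: float) -> List[bool]:
    """Return one decision per denoising step: True = recompute the model,
    False = reuse the cached output.

    `rel_changes[i]` is an estimated relative output change at step i;
    how that estimate is produced is outside this sketch.
    """
    decisions = []
    accumulated = 0.0
    for step, change in enumerate(rel_changes):
        accumulated += change
        if step == 0 or accumulated >= threshold:
            decisions.append(True)   # recompute and reset the accumulator
            accumulated = 0.0
        else:
            decisions.append(False)  # change is still small: reuse the cache
    return decisions
```

With the same per-step change estimates, a simple prompt (higher threshold) skips more steps than a complex one, which is the trade-off between speed and fidelity that an adaptive threshold is meant to manage.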

[93] Dual-Granularity Cross-Modal Identity Association for Weakly-Supervised Text-to-Person Image Matching

Yafei Zhang,Yongle Shang,Huafeng Li

Main category: cs.CV

TL;DR: This paper proposes a local-and-global dual-granularity identity association mechanism to improve weakly supervised text-to-person image matching by better capturing subtle differences, handling weak associations, and enhancing robustness through dynamic adjustments and consistency learning.

Details Motivation: Weakly supervised text-to-person image matching is crucial for reducing reliance on manually labeled samples. However, existing methods struggle with predicting complex one-to-many identity relationships, limiting performance improvements. Method: A local-and-global dual-granularity identity association mechanism is introduced. At the local level, cross-modal identity relationships are established within a batch to reinforce constraints across modalities. At the global level, a dynamic cross-modal identity association network with a confidence-based adjustment mechanism is introduced. Additionally, an information-asymmetric sample pair construction method combined with consistency learning is proposed. Result: Experimental results demonstrate that the proposed method significantly enhances cross-modal matching accuracy, improves the model's ability to identify weakly associated samples, and enhances overall sensitivity and robustness. Conclusion: The proposed method substantially boosts cross-modal matching accuracy, providing an efficient and practical solution for text-to-person image matching. Abstract: Weakly supervised text-to-person image matching, as a crucial approach to reducing models' reliance on large-scale manually labeled samples, holds significant research value. However, existing methods struggle to predict complex one-to-many identity relationships, severely limiting performance improvements. To address this challenge, we propose a local-and-global dual-granularity identity association mechanism. Specifically, at the local level, we explicitly establish cross-modal identity relationships within a batch, reinforcing identity constraints across different modalities and enabling the model to better capture subtle differences and correlations. 
At the global level, we construct a dynamic cross-modal identity association network with the visual modality as the anchor and introduce a confidence-based dynamic adjustment mechanism, effectively enhancing the model's ability to identify weakly associated samples while improving overall sensitivity. Additionally, we propose an information-asymmetric sample pair construction method combined with consistency learning to tackle hard sample mining and enhance model robustness. Experimental results demonstrate that the proposed method substantially boosts cross-modal matching accuracy, providing an efficient and practical solution for text-to-person image matching.

[94] Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu

Yan Hon Michael Chung,Donghyeok Choi

Main category: cs.CV

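TL;DR: This paper presents an efficient Manchu OCR system built by fine-tuning three open-source vision-language models (LLaMA-3.2-11B, Qwen2.5-VL-7B, Qwen2.5-VL-3B). It performs well on both synthetic data and real-world handwritten documents, demonstrating effective synthetic-to-real domain transfer.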

Details Motivation: Manchu is an endangered language essential for understanding early modern Eastern Eurasian history, but it lacks effective OCR systems that can handle real-world historical documents. Method: Three vision-language models were fine-tuned on 60,000 synthetic Manchu word images using parameter-efficient training and then evaluated comparatively. Result: LLaMA-3.2-11B reached 98.3% word accuracy and a 0.0024 character error rate on synthetic data and maintained 93.1% accuracy on real-world handwritten documents, substantially outperforming traditional approaches (a CRNN baseline reached only 72.5% accuracy on real documents). Conclusion: The study establishes a transferable framework for endangered language OCR that lowers technical and financial barriers in digital humanities, enabling historians and linguists to process historical archives without specialized computing resources. Abstract: Manchu, a critically endangered language essential for understanding early modern Eastern Eurasian history, lacks effective OCR systems that can handle real-world historical documents. This study develops high-performing OCR systems by fine-tuning three open-source vision-language models (LLaMA-3.2-11B, Qwen2.5-VL-7B, Qwen2.5-VL-3B) on 60,000 synthetic Manchu word images using parameter-efficient training. LLaMA-3.2-11B achieved exceptional performance with 98.3% word accuracy and 0.0024 character error rate on synthetic data, while crucially maintaining 93.1% accuracy on real-world handwritten documents. Comparative evaluation reveals substantial advantages over traditional approaches: while a CRNN baseline achieved 99.8% synthetic accuracy, it suffered severe degradation to 72.5% on real documents. Our approach demonstrates effective synthetic-to-real domain transfer, providing a cost-effective solution deployable on accessible infrastructure. This work establishes a transferable framework for endangered language OCR that removes technical and financial barriers in digital humanities, enabling historians and linguists to process historical archives without specialized computing resources. Code and model weights are available at https://github.com/mic7ch1/ManchuAI-OCR.
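The two metrics quoted above, word accuracy and character error rate (CER), can be computed as in the following Python sketch. It uses the common corpus-level definition of CER (total Levenshtein edits divided by total reference characters); whether the paper aggregates in exactly this way is an assumption, and the romanized strings in the usage note are made-up placeholders rather than data from the study.

```python
from typing import List


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(refs: List[str], hyps: List[str]) -> float:
    """Total edits over total reference characters (corpus-level CER)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


def word_accuracy(refs: List[str], hyps: List[str]) -> float:
    """Fraction of word images whose prediction matches the reference exactly."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```

For example, with placeholder references `["manju", "gisun"]` and predictions `["manju", "gisan"]`, one substitution over ten reference characters gives a CER of 0.1, while word accuracy is 0.5 because only one of the two words matches exactly. This shows why the two numbers can diverge sharply on the same predictions.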

[95] Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs

Yahan Yu,Yuyang Dong,Masafumi Oyamada

Main category: cs.CV

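TL;DR: This paper proposes D2I, a new reasoning framework for multimodal large language models. It strengthens modality alignment during training using only a rule-based format reward, without extra annotations or complex rewards, and switches to an intuitive reasoning style at evaluation time. D2I performs well across multiple benchmarks.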

Details Motivation: Multimodal reasoning research still needs further exploration of modality alignment and training costs. Many existing methods rely on additional data annotation and associated rule-based rewards, which significantly increases training costs and limits scalability. Method: The paper proposes the Deliberate-to-Intuitive reasoning framework (D2I), which sets deliberate reasoning strategies that strengthen modality alignment using only a rule-based format reward during training, and switches to an intuitive reasoning style at evaluation time. Result: D2I outperforms baselines on both in-domain and out-of-domain benchmarks, highlighting the role of the format reward in fostering transferable reasoning skills in MLLMs and suggesting directions for decoupling training-time reasoning depth from test-time response flexibility. Conclusion: By using a rule-based format reward during training to strengthen modality alignment, without extra annotations or complex rewards, D2I improves the understanding and reasoning ability of multimodal LLMs beyond baseline methods. The method also switches to an intuitive reasoning style at evaluation time, so the response implicitly reflects the abilities the model has acquired. Abstract: Reasoning is a key capability for large language models (LLMs), particularly when applied to complex tasks such as mathematical problem solving. However, multimodal reasoning research still requires further exploration of modality alignment and training costs. Many of these approaches rely on additional data annotation and relevant rule-based rewards to enhance the understanding and reasoning ability, which significantly increases training costs and limits scalability. To address these challenges, we propose the Deliberate-to-Intuitive reasoning framework (D2I) that improves the understanding and reasoning ability of multimodal LLMs (MLLMs) without extra annotations and complex rewards. Specifically, our method sets deliberate reasoning strategies to enhance modality alignment only through the rule-based format reward during training. While evaluating, the reasoning style shifts to intuitive, which removes deliberate reasoning strategies during training and implicitly reflects the model's acquired abilities in the response. D2I outperforms baselines across both in-domain and out-of-domain benchmarks. Our findings highlight the role of format reward in fostering transferable reasoning skills in MLLMs, and inspire directions for decoupling training-time reasoning depth from test-time response flexibility.
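A rule-based format reward of the kind mentioned above typically checks only the structure of a response, not its correctness. The Python sketch below shows one way such a check could look. The `<think>` and `<answer>` tags, the binary 1.0/0.0 reward, and the function name are all assumptions made for illustration; the abstract does not state which format D2I enforces or how the reward is scaled.

```python
import re

# Hypothetical required layout: a non-empty reasoning block followed by a
# non-empty answer block, with nothing but optional whitespace around them.
_FORMAT = re.compile(
    r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*",
    re.DOTALL,
)


def format_reward(response: str) -> float:
    """Return 1.0 if the whole response follows the assumed tag layout, else 0.0.

    The check is purely structural: it never inspects whether the answer
    is right, so no ground-truth labels are required.
    """
    return 1.0 if _FORMAT.fullmatch(response) else 0.0
```

Because the reward depends only on the response text, it can be computed without annotated answers, which is consistent with the abstract's claim that no extra annotations are needed.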

[96] FOLC-Net: A Federated-Optimized Lightweight Architecture for Enhanced MRI Disease Diagnosis across Axial, Coronal, and Sagittal Views

Saif Ur Rehman Khan,Muhammad Nabeel Asim,Sebastian Vollmer,Andreas Dengel

Main category: cs.CV

TL;DR: FOLC-Net is a new framework designed to improve MRI disease diagnosis by addressing performance degradation in current models when processing different anatomical planes.

Details Motivation: It specifically addresses the performance degradation observed in state-of-the-art (SOTA) models, particularly when processing axial, coronal, and sagittal anatomical planes. Method: The paper introduces FOLC-Net, which incorporates a novel federated-optimized lightweight architecture with approximately 1.217 million parameters and a storage requirement of only 0.9 MB. FOLC-Net integrates Manta-ray foraging optimization (MRFO) mechanisms for efficient model structure generation, global model cloning for scalable training, and ConvNeXt for enhanced client adaptability. Result: The results show that FOLC-Net outperforms existing models, particularly in the challenging sagittal view. For instance, FOLC-Net achieved an accuracy of 92.44% on the sagittal view, significantly higher than the 88.37% accuracy of study method (DL + Residual Learning) and 88.95% of DL models. Additionally, FOLC-Net demonstrated improved accuracy across all individual views. Conclusion: FOLC-Net addresses the limitations of existing SOTA models by providing a framework that ensures better adaptability to individual views while maintaining strong performance in multi-view settings. Abstract: The framework is designed to improve performance in the analysis of combined as well as single anatomical perspectives for MRI disease diagnosis. It specifically addresses the performance degradation observed in state-of-the-art (SOTA) models, particularly when processing axial, coronal, and sagittal anatomical planes. The paper introduces the FOLC-Net framework, which incorporates a novel federated-optimized lightweight architecture with approximately 1.217 million parameters and a storage requirement of only 0.9 MB. FOLC-Net integrates Manta-ray foraging optimization (MRFO) mechanisms for efficient model structure generation, global model cloning for scalable training, and ConvNeXt for enhanced client adaptability. 
The model was evaluated on combined multi-view data as well as individual views, such as axial, coronal, and sagittal, to assess its robustness in various medical imaging scenarios. Moreover, FOLC-Net tests a ShallowFed model on different data to evaluate its ability to generalize beyond the training dataset. The results show that FOLC-Net outperforms existing models, particularly in the challenging sagittal view. For instance, FOLC-Net achieved an accuracy of 92.44% on the sagittal view, significantly higher than the 88.37% accuracy of study method (DL + Residual Learning) and 88.95% of DL models. Additionally, FOLC-Net demonstrated improved accuracy across all individual views, providing a more reliable and robust solution for medical image analysis in decentralized environments. FOLC-Net addresses the limitations of existing SOTA models by providing a framework that ensures better adaptability to individual views while maintaining strong performance in multi-view settings. The incorporation of MRFO, global model cloning, and ConvNeXt ensures that FOLC-Net performs better in real-world medical applications.

[97] Unlocking Thermal Aerial Imaging: Synthetic Enhancement of UAV Datasets

Antonella Barisic Kulas,Andreja Jurasovic,Stjepan Bogdan

Main category: cs.CV

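TL;DR: This paper introduces a new method for generating synthetic thermal images to address the shortage of aerial-view data and validates its effectiveness experimentally.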

Details Motivation: The scarcity of large-scale, diverse thermal aerial datasets limits the development of deep learning models, so an efficient way to generate such data is needed. Method: The paper proposes a new procedural pipeline for generating synthetic thermal images from an aerial perspective, which integrates arbitrary object classes into existing thermal backgrounds with control over position, scale, and orientation. Result: New object categories (drones and animals) were added to the HIT-UAV and MONET datasets, and strong performance was shown on the object detection task. Conclusion: The synthetic thermal data method extends the range of applications of existing datasets, and the experiments confirm that thermal detectors outperform models trained on visible-light imagery. Abstract: Thermal imaging from unmanned aerial vehicles (UAVs) holds significant potential for applications in search and rescue, wildlife monitoring, and emergency response, especially under low-light or obscured conditions. However, the scarcity of large-scale, diverse thermal aerial datasets limits the advancement of deep learning models in this domain, primarily due to the high cost and logistical challenges of collecting thermal data. In this work, we introduce a novel procedural pipeline for generating synthetic thermal images from an aerial perspective. Our method integrates arbitrary object classes into existing thermal backgrounds by providing control over the position, scale, and orientation of the new objects, while aligning them with the viewpoints of the background. We enhance existing thermal datasets by introducing new object categories, specifically adding a drone class in urban environments to the HIT-UAV dataset and an animal category to the MONET dataset. In evaluating these datasets for the object detection task, we showcase strong performance across both new and existing classes, validating the successful expansion into new applications. Through comparative analysis, we show that thermal detectors outperform their visible-light-trained counterparts and highlight the importance of replicating aerial viewing angles. Project page: https://github.com/larics/thermal_aerial_synthetic.

[98] GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction

Eya Cherif,Arthur Ouaknine,Luke A. Brown,Phuong D. Dao,Kyle R. Kovach,Bing Lu,Daniel Mederer,Hannes Feilhauer,Teja Kattenborn,David Rolnick

Main category: cs.CV

TL;DR: This paper introduces GreenHyperSpectra, a novel dataset and framework that enhances plant trait prediction using semi- and self-supervised machine learning techniques.

Details Motivation: Conventional field sampling is insufficient for capturing plant trait variation at large spatial scales, and machine learning using hyperspectral data offers a promising alternative. Method: The authors created GreenHyperSpectra, a pretraining dataset with cross-sensor and cross-ecosystem samples, and evaluated semi- and self-supervised models against supervised baselines. Result: Label-efficient multi-output regression models pretrained on GreenHyperSpectra outperformed state-of-the-art supervised methods in both in-distribution and out-of-distribution scenarios. Conclusion: GreenHyperSpectra successfully improves plant trait prediction through semi- and self-supervised learning, providing a robust framework for future research. Abstract: Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing. Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (e.g., across sensors, ecological distributions), requiring robust cross-domain methods. Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. 
Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. All code and data are available at: https://github.com/echerif18/HyspectraSSL.

[99] Democratizing High-Fidelity Co-Speech Gesture Video Generation

Xu Yang,Shaoli Huang,Shenbo Xie,Xuelin Chen,Yifei Liu,Changxing Ding

Main category: cs.CV

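TL;DR: This work proposes a lightweight framework based on 2D skeletons and a diffusion model for generating high-quality, audio-synchronized speaker videos, and releases CSG-405, the first large-scale public dataset for the task.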

Details Motivation: The task is challenging because of the significant one-to-many mapping between audio and visual content, compounded by the scarcity of large-scale public datasets and high computational demands. Method: A lightweight framework uses 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals and visual outputs. It introduces a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker's reference image, predicting skeletal motion through skeleton-audio feature fusion to ensure strict audio coordination and body shape consistency. The generated skeletons are then fed, together with the speaker's reference image, into an off-the-shelf human video generation model to synthesize high-fidelity videos. Result: To democratize research, the authors release CSG-405, the first public dataset with 405 hours of high-resolution video across 71 speech types, annotated with 2D skeletons and covering diverse speaker demographics. Conclusion: Experiments show that the method exceeds existing approaches in visual quality and synchronization and generalizes across speakers and contexts. Abstract: Co-speech gesture video generation aims to synthesize realistic, audio-aligned videos of speakers, complete with synchronized facial expressions and body gestures. This task presents challenges due to the significant one-to-many mapping between audio and visual content, further complicated by the scarcity of large-scale public datasets and high computational demands. We propose a lightweight framework that utilizes 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals with visual outputs. Our approach introduces a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker's reference image, predicting skeletal motions through skeleton-audio feature fusion to ensure strict audio coordination and body shape consistency. The generated skeletons are then fed into an off-the-shelf human video generation model with the speaker's reference image to synthesize high-fidelity videos. To democratize research, we present CSG-405, the first public dataset with 405 hours of high-resolution videos across 71 speech types, annotated with 2D skeletons and diverse speaker demographics. Experiments show that our method exceeds state-of-the-art approaches in visual quality and synchronization while generalizing across speakers and contexts.

[100] HVI-CIDNet+: Beyond Extreme Darkness for Low-Light Image Enhancement

Qingsen Yan,Kangbiao Shi,Yixu Feng,Tao Hu,Peng Wu,Guansong Pang,Yanning Zhang

Main category: cs.CV

TL;DR: This paper proposes a new color space and network for enhancing low-light images, effectively addressing color bias, brightness artifacts, and noise while delivering improved performance over existing approaches.

Details Motivation: Existing methods in standard RGB and HSV color spaces produce color bias, brightness artifacts, red and black noise. This research aims to overcome these issues by proposing a new color space and network architecture. Method: A new color space, Horizontal/Vertical-Intensity (HVI), was introduced along with the Color and Intensity Decoupling Network+ (HVI-CIDNet+) which utilizes a Prior-guided Attention Block and a Region Refinement Block to enhance low-light images. Result: The benchmark experiments showed that HVI-CIDNet+ delivers superior performance over current state-of-the-art techniques across multiple datasets. Conclusion: The proposed HVI-CIDNet+ outperforms state-of-the-art methods on 10 datasets for Low-Light Image Enhancement. Abstract: Low-Light Image Enhancement (LLIE) aims to restore vivid content and details from corrupted low-light images. However, existing standard RGB (sRGB) color space-based LLIE methods often produce color bias and brightness artifacts due to the inherent high color sensitivity. While Hue, Saturation, and Value (HSV) color space can decouple brightness and color, it introduces significant red and black noise artifacts. To address this problem, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by the HV color map and learnable intensity. The HV color map enforces small distances for the red coordinates to remove red noise artifacts, while the learnable intensity compresses the low-light regions to remove black noise artifacts. Additionally, we introduce the Color and Intensity Decoupling Network+ (HVI-CIDNet+), built upon the HVI color space, to restore damaged content and mitigate color distortion in extremely dark regions. Specifically, HVI-CIDNet+ leverages abundant contextual and degraded knowledge extracted from low-light images using pre-trained vision-language models, integrated via a novel Prior-guided Attention Block (PAB). 
Within the PAB, latent semantic priors can promote content restoration, while degraded representations guide precise color correction, both particularly in extremely dark regions through the meticulously designed cross-attention fusion mechanism. Furthermore, we construct a Region Refinement Block that employs convolution for information-rich regions and self-attention for information-scarce regions, ensuring accurate brightness adjustments. Comprehensive results from benchmark experiments demonstrate that the proposed HVI-CIDNet+ outperforms the state-of-the-art methods on 10 datasets.

[101] Physics-Grounded Motion Forecasting via Equation Discovery for Trajectory-Guided Image-to-Video Generation

Tao Feng,Xianbing Zhao,Zhenhua Chen,Tien Tsin Wong,Hamid Rezatofighi,Gholamreza Haffari,Lizhen Qu

Main category: cs.CV

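TL;DR: Proposes a framework that combines symbolic regression with a trajectory-guided generation model to improve the physical alignment of generated videos.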

Details Motivation: Existing diffusion-based and autoregressive video generation models rely on statistical correlations rather than physical laws and therefore fail to reproduce real-world dynamics accurately. Method: A new framework combines symbolic regression (SR) with trajectory-guided image-to-video (I2V) models, using a retrieval-based pre-training mechanism to enhance symbolic regression, discover equations of motion, and forecast physically accurate trajectories. Result: Physically aligned video prediction is achieved on spring-mass, pendulum, and projectile motion scenarios without fine-tuning existing models. Conclusion: The framework recovers ground-truth analytical equations and improves the physical alignment of generated videos in classical mechanics scenarios. Abstract: Recent advances in diffusion-based and autoregressive video generation models have achieved remarkable visual realism. However, these models typically lack accurate physical alignment, failing to replicate real-world dynamics in object motion. This limitation arises primarily from their reliance on learned statistical correlations rather than capturing mechanisms adhering to physical laws. To address this issue, we introduce a novel framework that integrates symbolic regression (SR) and trajectory-guided image-to-video (I2V) models for physics-grounded video forecasting. Our approach extracts motion trajectories from input videos, uses a retrieval-based pre-training mechanism to enhance symbolic regression, and discovers equations of motion to forecast physically accurate future trajectories. These trajectories then guide video generation without requiring fine-tuning of existing models. Evaluated on scenarios in Classical Mechanics, including spring-mass, pendulums, and projectile motions, our method successfully recovers ground-truth analytical equations and improves the physical alignment of generated videos over baseline methods.
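To make the idea of forecasting a trajectory from a recovered equation of motion concrete, here is a deliberately simplified sketch. It is not symbolic regression: the functional form is fixed in advance to a quadratic in time (the projectile case), and only the coefficients are fitted by ordinary least squares. The function names `fit_quadratic` and `forecast` and the sample numbers are assumptions for illustration.

```python
def fit_quadratic(ts, ys):
    """Least-squares fit of y = c0 + c1*t + c2*t**2; returns [c0, c1, c2]."""
    # Normal equations A c = b for the basis (1, t, t^2).
    A = [[sum(t ** (i + j) for t in ts) for j in range(3)] for i in range(3)]
    b = [sum(y * t ** i for t, y in zip(ts, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = sum(A[i][j] * coeffs[j] for j in range(i + 1, 3))
        coeffs[i] = (b[i] - s) / A[i][i]
    return coeffs


def forecast(coeffs, t):
    """Evaluate the fitted equation of motion at a (possibly future) time t."""
    return coeffs[0] + coeffs[1] * t + coeffs[2] * t * t


if __name__ == "__main__":
    # Observed heights of a projectile: y0 = 1 m, v0 = 4 m/s, g = 9.81 m/s^2.
    ts = [0.1 * i for i in range(6)]
    ys = [1.0 + 4.0 * t - 4.905 * t * t for t in ts]
    c = fit_quadratic(ts, ys)
    print("estimated g:", -2.0 * c[2])
    print("forecast height at t = 0.8 s:", forecast(c, 0.8))
```

A symbolic regression method would additionally search over the form of the equation itself, which is what allows it to cover spring-mass and pendulum dynamics as well.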

[102] Know Your Attention Maps: Class-specific Token Masking for Weakly Supervised Semantic Segmentation

Joelle Hanna,Damian Borth

Main category: cs.CV

TL;DR: This paper proposes an end-to-end method for Weakly Supervised Semantic Segmentation using Vision Transformers, eliminating reliance on external modules and generating high-quality pseudo masks through attention map aggregation.

Details Motivation: Traditional WSSS methods rely on external modules like Class Activation Maps, which can be limiting. This work aims to directly utilize ViT's attention maps in an end-to-end framework for better interpretability and performance. Method: Training a sparse Vision Transformer (ViT) with multiple [CLS] tokens and random masking strategy to generate pseudo segmentation masks by aggregating self-attention maps during inference. Result: The method outperforms related works on two standard benchmarks and three specialized datasets, generating accurate pseudo-masks that enable training segmentation models with performance close to fully supervised approaches. Conclusion: The proposed end-to-end method using sparse ViT with multiple [CLS] tokens improves the generation of pseudo segmentation masks, achieving results comparable to fully-supervised models while reducing the need for labeled data. Abstract: Weakly Supervised Semantic Segmentation (WSSS) is a challenging problem that has been extensively studied in recent years. Traditional approaches often rely on external modules like Class Activation Maps to highlight regions of interest and generate pseudo segmentation masks. In this work, we propose an end-to-end method that directly utilizes the attention maps learned by a Vision Transformer (ViT) for WSSS. We propose training a sparse ViT with multiple [CLS] tokens (one for each class), using a random masking strategy to promote [CLS] token - class assignment. At inference time, we aggregate the different self-attention maps of each [CLS] token corresponding to the predicted labels to generate pseudo segmentation masks. Our proposed approach enhances the interpretability of self-attention maps and ensures accurate class assignments. Extensive experiments on two standard benchmarks and three specialized datasets demonstrate that our method generates accurate pseudo-masks, outperforming related works. 
Those pseudo-masks can be used to train a segmentation model which achieves results comparable to fully-supervised models, significantly reducing the need for fine-grained labeled data.
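The inference step described above (aggregating the self-attention maps of the [CLS] tokens that correspond to predicted labels) can be sketched as follows. The details are assumptions rather than the authors' procedure: attention maps are averaged over heads, min-max normalised per class, and a hypothetical `bg_threshold` decides which patches stay background (label 0).

```python
def pseudo_mask(attn, predicted, bg_threshold=0.5):
    """Build a patch-level pseudo segmentation mask.

    attn[c] is a list of 2D attention maps (one per head) for the class
    token of class c (class ids start at 1). Each patch gets the predicted
    class whose normalised attention is highest, or 0 (background) if that
    value is below bg_threshold. Illustrative sketch only.
    """
    first = next(iter(attn.values()))[0]
    h, w = len(first), len(first[0])
    fused = {}
    for c in predicted:
        heads = attn[c]
        # Aggregate: average the attention maps over heads.
        mean = [[sum(m[i][j] for m in heads) / len(heads) for j in range(w)]
                for i in range(h)]
        lo = min(min(r) for r in mean)
        hi = max(max(r) for r in mean)
        span = (hi - lo) or 1.0                   # avoid division by zero
        fused[c] = [[(v - lo) / span for v in r] for r in mean]
    mask = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if not predicted:
                continue
            best = max(predicted, key=lambda c: fused[c][i][j])
            if fused[best][i][j] >= bg_threshold:
                mask[i][j] = best
    return mask


if __name__ == "__main__":
    attn = {
        1: [[[1.0, 0.0], [0.5, 0.0]], [[1.0, 0.25], [1.0, 0.0]]],
        2: [[[0.0, 0.5], [0.125, 0.125]]],
    }
    print(pseudo_mask(attn, predicted=[1, 2]))
```

Restricting the argmax to predicted classes mirrors the abstract's description; how heads and layers are actually weighted is not stated there.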

[103] IAP: Invisible Adversarial Patch Attack through Perceptibility-Aware Localization and Perturbation Optimization

Subrat Kishore Dutta,Xiao Zhang

Main category: cs.CV

TL;DR: This paper proposes IAP, an improved framework for generating stealthier and more contextually coherent adversarial patches that evade detection by both humans and modern defenses.

Details Motivation: Prior adversarial patch methods fail to balance attack effectiveness with contextual coherence, making them noticeable by humans or detectable by automatic defenses. This work aims to create more stealthy and targeted adversarial patches. Method: IAP uses perceptibility-aware localization and perturbation optimization. It leverages classwise localization and sensitivity maps to find optimal patch locations and employs a perceptibility-regularized adversarial loss with a gradient update rule prioritizing color constancy. Result: IAP achieves competitive attack success rates while significantly improving patch invisibility across various benchmarks and model architectures. Conclusion: IAP is effective in generating highly invisible adversarial patches that are both imperceptible to humans and capable of rendering state-of-the-art patch defenses ineffective. Abstract: Despite modifying only a small localized input region, adversarial patches can drastically change the prediction of computer vision models. However, prior methods either cannot perform satisfactorily under targeted attack scenarios or fail to produce contextually coherent adversarial patches, causing them to be easily noticeable by human examiners and insufficiently stealthy against automatic patch defenses. In this paper, we introduce IAP, a novel attack framework that generates highly invisible adversarial patches based on perceptibility-aware localization and perturbation optimization schemes. Specifically, IAP first searches for a proper location to place the patch by leveraging classwise localization and sensitivity maps, balancing the susceptibility of patch location to both victim model prediction and human visual system, then employs a perceptibility-regularized adversarial loss and a gradient update rule that prioritizes color constancy for optimizing invisible perturbations. 
Comprehensive experiments across various image benchmarks and model architectures demonstrate that IAP consistently achieves competitive attack success rates in targeted settings with significantly improved patch invisibility compared to existing baselines. In addition to being highly imperceptible to humans, IAP is shown to be stealthy enough to render several state-of-the-art patch defenses ineffective.

[104] Longitudinal Study of Facial Biometrics at the BEZ: Temporal Variance Analysis

Mathias Schulz,Alexander Spenke,Pia Funk,Florian Blümel,Markus Rohde,Ralph Breithaupt,Gerd Nolden,Norbert Jung,Robert Lange

Main category: cs.CV

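TL;DR: Based on two and a half years of biometric evaluations, this study finds that face recognition scores fluctuate considerably from day to day and underlines the importance of long-term testing of the same subjects in a controlled environment.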

Details Motivation: To better understand how biometric characteristics vary across individuals and over time, and how these variations affect the accuracy and reliability of biometric recognition. Method: Using the General Data Protection Regulation (GDPR)-compliant local bez database, which contains more than 238,000 biometric data sets, the authors analyze long-term comparison scores with state-of-the-art face recognition algorithms. Result: Face recognition scores fluctuate more significantly between individual days than over the entire measurement period. Conclusion: The study highlights the importance of long-term biometric testing of the same subjects in a controlled environment and lays the groundwork for future advances in biometric data analysis. Abstract: This study presents findings from long-term biometric evaluations conducted at the Biometric Evaluation Center (bez). Over the course of two and a half years, more than 400 participants representing diverse ethnicities, genders, and age groups were regularly assessed in our ongoing research, using a variety of biometric tools and techniques at the controlled testing facilities. Our findings are based on the General Data Protection Regulation-compliant local bez database with more than 238,000 biometric data sets categorized into multiple biometric modalities such as face and finger. We used state-of-the-art face recognition algorithms to analyze long-term comparison scores. Our results show that these scores fluctuate more significantly between individual days than over the entire measurement period. These findings highlight the importance of testing biometric characteristics of the same individuals over a longer period of time in a controlled measurement environment and lay the groundwork for future advancements in biometric data analysis.

[105] SemRaFiner: Panoptic Segmentation in Sparse and Noisy Radar Point Clouds

Matthias Zeller,Daniel Casado Herraez,Bengisu Ayan,Jens Behley,Michael Heidingsfeld,Cyrill Stachniss

Main category: cs.CV

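TL;DR: This paper proposes SemRaFiner, which performs panoptic segmentation on sparse radar point clouds and improves the scene understanding of autonomous vehicles in adverse weather.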

Details Motivation: Existing sensors such as cameras and LiDARs are limited in adverse weather and do not directly provide motion information, while radar point clouds are sparse and noisy; more effective semantic scene understanding methods are therefore needed. Method: SemRaFiner optimizes feature extraction to account for changing density in sparse radar point clouds and introduces an improved training procedure with a dedicated data augmentation. Result: Experiments indicate that SemRaFiner outperforms state-of-the-art methods for panoptic segmentation of radar point clouds. Conclusion: SemRaFiner outperforms existing methods for radar-based panoptic segmentation, improving scene understanding for autonomous driving. Abstract: Semantic scene understanding, including the perception and classification of moving agents, is essential to enabling safe and robust driving behaviours of autonomous vehicles. Cameras and LiDARs are commonly used for semantic scene understanding. However, both sensor modalities face limitations in adverse weather and usually do not provide motion information. Radar sensors overcome these limitations and directly offer information about moving agents by measuring the Doppler velocity, but the measurements are comparably sparse and noisy. In this paper, we address the problem of panoptic segmentation in sparse radar point clouds to enhance scene understanding. Our approach, called SemRaFiner, accounts for changing density in sparse radar point clouds and optimizes the feature extraction to improve accuracy. Furthermore, we propose an optimized training procedure to refine instance assignments by incorporating a dedicated data augmentation. Our experiments suggest that our approach outperforms state-of-the-art methods for radar-based panoptic segmentation.

[106] Adaptive Part Learning for Fine-Grained Generalized Category Discovery: A Plug-and-Play Enhancement

Qiyuan Dai,Hanzhuo Huang,Yu Wu,Sibei Yang

Main category: cs.CV

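TL;DR: This paper proposes APL, a new method for generalized category discovery that improves discriminability while preserving generalization through adaptive object part modeling and a novel contrastive loss.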

Details Motivation: Existing methods based on the DINO CLS token face an inherent trade-off between discriminability and generalization and cannot handle fine-grained category distinction and knowledge transfer effectively. Method: Shared learnable part queries and DINO part priors are used to generate consistent object parts and their correspondences, and an all-min contrastive loss is proposed to balance discriminability and generalization. Result: APL yields significant performance gains across multiple GCD frameworks, particularly on fine-grained datasets. Conclusion: By combining adaptive part discovery and learning with the all-min contrastive loss, APL delivers significant improvements on GCD tasks over fine-grained datasets. Abstract: Generalized Category Discovery (GCD) aims to recognize unlabeled images from known and novel classes by distinguishing novel classes from known ones, while also transferring knowledge from another set of labeled images with known classes. Existing GCD methods rely on self-supervised vision transformers such as DINO for representation learning. However, focusing solely on the global representation of the DINO CLS token introduces an inherent trade-off between discriminability and generalization. In this paper, we introduce an adaptive part discovery and learning method, called APL, which generates consistent object parts and their correspondences across different similar images using a set of shared learnable part queries and DINO part priors, without requiring any additional annotations. More importantly, we propose a novel all-min contrastive loss to learn discriminative yet generalizable part representation, which adaptively highlights discriminative object parts to distinguish similar categories for enhanced discriminability while simultaneously sharing other parts to facilitate knowledge transfer for improved generalization. Our APL can easily be incorporated into different GCD frameworks by replacing their CLS token feature with our part representations, showing significant enhancements on fine-grained datasets.

[107] MCCD: A Multi-Attribute Chinese Calligraphy Character Dataset Annotated with Script Styles, Dynasties, and Calligraphers

Yixin Zhao,Yuyi Zhang,Lianwen Jin

Main category: cs.CV

TL;DR: This paper introduces the MCCD dataset, which addresses the lack of detailed attribute information in existing Chinese calligraphy datasets, enabling more in-depth research.

Details Motivation: Research on Chinese calligraphy attributes is culturally and historically valuable, but existing datasets are scarce and lack detailed attribute information. Method: The researchers created the Multi-Attribute Chinese Calligraphy Character Dataset (MCCD) with rich attribute annotations and conducted single-task and multi-task recognition experiments. Result: The MCCD dataset contains 7,765 categories with 329,715 image samples and three subsets based on script styles, dynasties, and calligraphers, making it suitable for various research tasks. Conclusion: The paper concludes that the MCCD dataset fills a significant gap in Chinese calligraphy research and offers valuable resources for future advancements. Abstract: Research on the attribute information of calligraphy, such as styles, dynasties, and calligraphers, holds significant cultural and historical value. However, the styles of Chinese calligraphy characters have evolved dramatically through different dynasties and the unique touches of calligraphers, making it highly challenging to accurately recognize these different characters and their attributes. Furthermore, existing calligraphic datasets are extremely scarce, and most provide only character-level annotations without additional attribute information. This limitation has significantly hindered the in-depth study of Chinese calligraphy. To fill this gap, we present a novel Multi-Attribute Chinese Calligraphy Character Dataset (MCCD). The dataset encompasses 7,765 categories with a total of 329,715 isolated image samples of Chinese calligraphy characters, and three additional subsets were extracted based on the attribute labeling of the three types of script styles (10 types), dynasties (15 periods) and calligraphers (142 individuals). The rich multi-attribute annotations render MCCD well-suited to diverse research tasks, including calligraphic character recognition, writer identification, and evolutionary studies of Chinese characters. 
We establish benchmark performance through single-task and multi-task recognition experiments across MCCD and all of its subsets. The experimental results demonstrate that the complexity of the stroke structure of the calligraphic characters, and the interplay between their different attributes, lead to a substantial increase in the difficulty of accurate recognition. MCCD not only fills a void in the availability of detailed calligraphy datasets but also provides valuable resources for advancing research in Chinese calligraphy and fostering advancements in multiple fields. The dataset is available at https://github.com/SCUT-DLVCLab/MCCD.

[108] Pre-Columbian Settlements Shaped Palm Clusters in the Sierra Nevada de Santa Marta, Colombia

Sebastian Fajardo,Sina Mohammadi,Jonas Gregorio de Souza,César Ardila,Alan Tapscott Baltar,Shaddai Heidgen,Maria Isabel Mayorga Hernández,Sylvia Mota de Oliveira,Fernando Montejo,Marco Moderato,Vinicius Peripato,Katy Puche,Carlos Reina,Juan Carlos Vargas,Frank W. Takes,Marco Madella

Main category: cs.CV

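TL;DR: By combining artificial intelligence with ecological and archaeological data, this study shows how ancient populations left an ecological footprint by altering vegetation patterns, and how this change may have affected the logistical costs of building infrastructure at the time.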

Details Motivation: The study was motivated by the need to understand the long-term effects of ancient human management on tropical forests, particularly at high-resolution scales. Method: The study proposes a new approach to investigate archaeological areas of influence based on vegetation signatures: a deep learning model is trained to identify palm trees in satellite imagery, a clustering algorithm then identifies palm clusters, and these clusters are used to estimate ancient management areas. Result: Palms were significantly more abundant near archaeological sites with large infrastructure investment. The extent of the largest palm cluster indicates that ancient human-managed areas linked to major infrastructure sites may be up to two orders of magnitude larger than indicated by archaeological evidence alone. Conclusion: Pre-Columbian populations influenced local vegetation and fostered conditions conducive to palm proliferation, leaving a lasting ecological footprint. This influence may have lowered the logistical costs of establishing infrastructure-heavy settlements in otherwise hard-to-reach locations. Abstract: Ancient populations markedly transformed Neotropical forests, yet understanding the long-term effects of ancient human management, particularly at high-resolution scales, remains challenging. In this work, we propose a new approach to investigate archaeological areas of influence based on vegetation signatures. It consists of a deep learning model trained on satellite imagery to identify palm trees, followed by a clustering algorithm to identify palm clusters, which are then used to estimate ancient management areas. To assess the palm distribution in relation to past human activity, we applied the proposed approach to unique high-resolution satellite imagery data covering 765 km² of the Sierra Nevada de Santa Marta, Colombia. With this work, we also release a manually annotated palm tree dataset along with estimated locations of archaeological sites from ground-surveys and legacy records. Results demonstrate that palms were significantly more abundant near archaeological sites showing large infrastructure investment. The extent of the largest palm cluster indicates that ancient human-managed areas linked to major infrastructure sites may be up to two orders of magnitude bigger than indicated by archaeological evidence alone. Our findings suggest that pre-Columbian populations influenced local vegetation fostering conditions conducive to palm proliferation, leaving a lasting ecological footprint. This may have lowered the logistical costs of establishing infrastructure-heavy settlements in otherwise less accessible locations. 
Overall, this study demonstrates the potential of integrating artificial intelligence approaches with new ecological and archaeological data to identify archaeological areas of interest through vegetation patterns, revealing fine-scale human-environment interactions.
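The second stage of the approach (grouping detected palms into clusters) can be sketched as below. The abstract does not name the clustering algorithm, so this example substitutes a simple distance-threshold grouping (single-linkage connected components via union-find). The function name `cluster_points`, the `max_dist` parameter, and the coordinates are hypothetical.

```python
def cluster_points(points, max_dist):
    """Group 2D points whose chains of neighbours lie within max_dist.

    Returns a list of clusters (lists of points), largest first.
    Stand-in for the unspecified clustering algorithm in the abstract.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the cluster representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    limit_sq = max_dist * max_dist
    for i in range(n):
        for j in range(i + 1, n):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            if dx * dx + dy * dy <= limit_sq:
                parent[find(i)] = find(j)         # merge the two clusters

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return sorted(groups.values(), key=len, reverse=True)


if __name__ == "__main__":
    # Hypothetical palm detections (metres in a local grid).
    palms = [(0, 0), (1, 0), (2, 0), (10, 10), (11, 10), (50, 50)]
    clusters = cluster_points(palms, max_dist=1.5)
    print([len(c) for c in clusters])
```

The pairwise loop is quadratic in the number of detections; for a survey on the scale of hundreds of square kilometres, a spatial index would be needed in practice.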

[109] CheXPO: Preference Optimization for Chest X-ray VLMs with Counterfactual Rationale

Xiao Liang,Jiawei Hu,Di Wang,Zhi Ma,Lin Zhao,Ronghan Li,Bo Wan,Quan Wang

Main category: cs.CV

TL;DR: This paper proposes CheXPO, a new method for addressing hallucinations of vision-language models in medical applications.

Details Motivation: To address the hallucinations that vision-language models are prone to in medical applications, as well as the challenges of implementing preference optimization, such as clinically irrelevant training samples, imbalanced data distributions, and prohibitive expert annotation costs. Method: Introduces CheXPO, a chest X-ray preference optimization strategy that combines confidence-similarity joint mining with counterfactual rationale. Result: Experiments show that CheXPO reaches state-of-the-art performance across diverse clinical tasks. Conclusion: CheXPO offers a scalable, interpretable solution for real-world radiology applications and achieves an 8.93% relative performance gain using only 5% of SFT samples. Abstract: Vision-language models (VLMs) are prone to hallucinations that critically compromise reliability in medical applications. While preference optimization can mitigate these hallucinations through clinical feedback, its implementation faces challenges such as clinically irrelevant training samples, imbalanced data distributions, and prohibitive expert annotation costs. To address these challenges, we introduce CheXPO, a Chest X-ray Preference Optimization strategy that combines confidence-similarity joint mining with counterfactual rationale. Our approach begins by synthesizing a unified, fine-grained multi-task chest X-ray visual instruction dataset across different question types for supervised fine-tuning (SFT). We then identify hard examples through token-level confidence analysis of SFT failures and use similarity-based retrieval to expand hard examples for balancing preference sample distributions, while synthetic counterfactual rationales provide fine-grained clinical preferences, eliminating the need for additional expert input. Experiments show that CheXPO achieves 8.93% relative performance gain using only 5% of SFT samples, reaching state-of-the-art performance across diverse clinical tasks and providing a scalable, interpretable solution for real-world radiology applications.

[110] Segmentation Regularized Training for Multi-Domain Deep Learning Registration applied to MR-Guided Prostate Cancer Radiotherapy

Sudharsan Madhavan,Chengcheng Gui,Lando Bosma,Josiah Simeth,Jue Jiang,Nicolas Cote,Nima Hassan Rezaeian,Himanshu Nagar,Victoria Brennan,Neelam Tyagi,Harini Veeraraghavan

Main category: cs.CV

TL;DR: This study developed a deep learning DIR method called ProRSeg for domain invariant MR-MR registration in MR-guided adaptive radiotherapy. It showed promising results for contour propagation and dose accumulation in prostate cancer patients.

Details Motivation: Accurate deformable image registration (DIR) is required for contour propagation and dose accumulation in MR-guided adaptive radiotherapy (MRgART). Method: A progressively refined registration and segmentation (ProRSeg) method was trained with 262 pairs of 3T MR simulation scans from prostate cancer patients using weighted segmentation consistency loss. Result: ProRSeg demonstrated generalization for bladder with similar Dice Similarity Coefficients across domains (0.88, 0.87, 0.86). For rectum and CTV, performance was domain-dependent with higher accuracy on cross-domain MRL dataset (DSCs 0.89) versus same-domain data. Dose accumulation showed 83.3% of patients met CTV coverage (D95 >= 40.0 Gy) and bladder sparing (D50 <= 20.0 Gy) constraints. Conclusion: ProRSeg showed reasonable multi-domain MR-MR registration performance for prostate cancer patients with preliminary feasibility for evaluating treatment compliance to clinical constraints. Abstract: Background: Accurate deformable image registration (DIR) is required for contour propagation and dose accumulation in MR-guided adaptive radiotherapy (MRgART). This study trained and evaluated a deep learning DIR method for domain invariant MR-MR registration. Methods: A progressively refined registration and segmentation (ProRSeg) method was trained with 262 pairs of 3T MR simulation scans from prostate cancer patients using weighted segmentation consistency loss. ProRSeg was tested on same- (58 pairs), cross- (72 1.5T MR Linac pairs), and mixed-domain (42 MRSim-MRL pairs) datasets for contour propagation accuracy of clinical target volume (CTV), bladder, and rectum. Dose accumulation was performed for 42 patients undergoing 5-fraction MRgART. Results: ProRSeg demonstrated generalization for bladder with similar Dice Similarity Coefficients across domains (0.88, 0.87, 0.86). 
For rectum and CTV, performance was domain-dependent with higher accuracy on cross-domain MRL dataset (DSCs 0.89) versus same-domain data. The model's strong cross-domain performance prompted us to study the feasibility of using it for dose accumulation. Dose accumulation showed 83.3% of patients met CTV coverage (D95 >= 40.0 Gy) and bladder sparing (D50 <= 20.0 Gy) constraints. All patients achieved minimum mean target dose (>40.4 Gy), but only 9.5% remained under upper limit (<42.0 Gy). Conclusions: ProRSeg showed reasonable multi-domain MR-MR registration performance for prostate cancer patients with preliminary feasibility for evaluating treatment compliance to clinical constraints.
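The two kinds of metric reported in this entry can be made concrete with a short sketch. The Dice Similarity Coefficient follows its standard definition, 2|A∩B| / (|A| + |B|). For the dose metric, D95 is taken here as the minimum dose received by the hottest 95% of voxels, computed with a nearest-rank rule; conventions for dose-volume metrics vary, so this rule and the function names `dice` and `dose_at_volume` are assumptions, not the study's implementation.

```python
def dice(mask_a, mask_b):
    """Dice Similarity Coefficient between two flat binary masks."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    return 1.0 if total == 0 else 2.0 * inter / total


def dose_at_volume(doses, volume_pct):
    """Dose D_x: minimum dose covering the hottest volume_pct % of voxels.

    Uses a nearest-rank rule with integer arithmetic (an assumption;
    treatment planning systems may interpolate instead).
    """
    ranked = sorted(doses, reverse=True)
    k = (volume_pct * len(ranked) + 99) // 100    # ceil(pct * n / 100)
    return ranked[max(k, 1) - 1]


def meets_constraints(ctv_doses, bladder_doses):
    """Check the two constraints quoted above: CTV D95 >= 40.0 Gy and
    bladder D50 <= 20.0 Gy."""
    return (dose_at_volume(ctv_doses, 95) >= 40.0
            and dose_at_volume(bladder_doses, 50) <= 20.0)


if __name__ == "__main__":
    propagated = [1, 1, 1, 0]
    reference = [1, 1, 0, 0]
    print("DSC:", dice(propagated, reference))
    voxel_doses = [float(d) for d in range(1, 101)]   # toy doses in Gy
    print("D95:", dose_at_volume(voxel_doses, 95))
```

In the toy dose list, 95 of the 100 voxels receive at least 6 Gy, so D95 is 6.0 Gy under this rule.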

[111] Hallucinating 360°: Panoramic Street-View Generation via Local Scenes Diffusion and Probabilistic Prompting

Fei Teng,Kai Luo,Sheng Wu,Siyu Li,Pujun Guo,Jiale Wei,Kunyu Peng,Jiaming Zhang,Kailun Yang

Main category: cs.CV

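TL;DR: This paper proposes Percep360, a panoramic generation method that addresses the complexity of panoramic data acquisition for autonomous driving and the inability of existing models to achieve high-quality, controllable generation.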

Details Motivation: Existing street-view generation models can only learn from the fixed data distribution of existing datasets and cannot achieve high-quality, controllable panoramic generation. Method: The Local Scenes Diffusion Method (LSDM) and the Probabilistic Prompting Method (PPM) are proposed to address information loss and to enable controllable generation. Result: The generated data outperforms the original stitched images on no-reference quality metrics and improves the performance of downstream perception models. Conclusion: Percep360 is a panoramic generation method for autonomous driving that achieves high-quality, controllable panoramic data generation and performs well in image quality, controllability, and utility. Abstract: Panoramic perception holds significant potential for autonomous driving, enabling vehicles to acquire a comprehensive 360° surround view in a single shot. However, autonomous driving is a data-driven task. Complete panoramic data acquisition requires complex sampling systems and annotation pipelines, which are time-consuming and labor-intensive. Although existing street view generation models have demonstrated strong data regeneration capabilities, they can only learn from the fixed data distribution of existing datasets and cannot achieve high-quality, controllable panoramic generation. In this paper, we propose the first panoramic generation method Percep360 for autonomous driving. Percep360 enables coherent generation of panoramic data with control signals based on the stitched panoramic data. Percep360 focuses on two key aspects: coherence and controllability. Specifically, to overcome the inherent information loss caused by the pinhole sampling process, we propose the Local Scenes Diffusion Method (LSDM). LSDM reformulates the panorama generation as a spatially continuous diffusion process, bridging the gaps between different data distributions. Additionally, to achieve the controllable generation of panoramic images, we propose a Probabilistic Prompting Method (PPM). PPM dynamically selects the most relevant control cues, enabling controllable panoramic image generation. We evaluate the effectiveness of the generated images from three perspectives: image quality assessment (i.e., no-reference and with reference), controllability, and their utility in real-world Bird's Eye View (BEV) segmentation. 
Notably, the generated data consistently outperforms the original stitched images in no-reference quality metrics and enhances downstream perception models. The source code will be publicly available at https://github.com/Bryant-Teng/Percep360.

[112] A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level

Johanna Orsholm,John Quinto,Hannu Autto,Gaia Banelyte,Nicolas Chazot,Jeremy deWaard,Stephanie deWaard,Arielle Farrell,Brendan Furneaux,Bess Hardwick,Nao Ito,Amlan Kar,Oula Kalttopää,Deirdre Kerdraon,Erik Kristensen,Jaclyn McKeown,Tommi Mononen,Ellen Nein,Hanna Rogers,Tomas Roslin,Paula Schmitz,Jayme Sones,Maija Sujala,Amy Thompson,Evgeny V. Zakharov,Iuliia Zarubiieva,Akshita Gupta,Scott C. Lowe,Graham W. Taylor

Main category: cs.CV

TL;DR: This paper presents a new dataset, MassID45, for training automatic classifiers of bulk insect samples, combining molecular and imaging data.

Details Motivation: Many insect species are experiencing severe population declines under environmental and habitat changes. High-throughput approaches such as DNA barcoding and high-resolution imaging have potential for automatic taxonomic classification. However, most image-based approaches rely on individual specimen data, unlike the unsorted bulk samples collected in large-scale ecological surveys. Method: The paper introduces the Mixed Arthropod Sample Segmentation and Identification (MassID45) dataset. Human annotators, supported by an AI-assisted tool, performed two tasks on bulk images: creating segmentation masks around each individual arthropod and assigning taxonomic labels to over 17,000 specimens. Result: A unique dataset combining molecular and imaging data at both the unsorted sample level and the full set of individual specimens was created. It enables training automatic classifiers of bulk insect samples. Conclusion: The paper concludes that combining DNA barcodes' taxonomic resolution with bulk images' precise abundance estimates can enable rapid, large-scale characterization of insect communities. The dataset introduced pushes the boundaries of tiny object detection and instance segmentation. Abstract: Insects comprise millions of species, many experiencing severe population declines under environmental and habitat changes. High-throughput approaches are crucial for accelerating our understanding of insect diversity, with DNA barcoding and high-resolution imaging showing strong potential for automatic taxonomic classification. However, most image-based approaches rely on individual specimen data, unlike the unsorted bulk samples collected in large-scale ecological surveys. We present the Mixed Arthropod Sample Segmentation and Identification (MassID45) dataset for training automatic classifiers of bulk insect samples. It uniquely combines molecular and imaging data at both the unsorted sample level and the full set of individual specimens. 
Human annotators, supported by an AI-assisted tool, performed two tasks on bulk images: creating segmentation masks around each individual arthropod and assigning taxonomic labels to over 17 000 specimens. Combining the taxonomic resolution of DNA barcodes with precise abundance estimates of bulk images holds great potential for rapid, large-scale characterization of insect communities. This dataset pushes the boundaries of tiny object detection and instance segmentation, fostering innovation in both ecological and machine learning research.

[113] Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM

Qiyuan Dai,Sibei Yang

Main category: cs.CV

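TL;DR: FreeTTA is a training-free test-time adaptation method for vision-language models that uses an online EM algorithm, with zero-shot predictions as priors, to model the test data distribution explicitly.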

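Details Motivation: Vision-Language Models (VLMs) generalize well in open-world image recognition, but domain shifts and distributional changes at test time reduce their effectiveness, and traditional test-time adaptation (TTA) methods often rely on costly training or optimization or on unrealistic assumptions about accessing or storing historical data. Method: FreeTTA is a training-free method that explicitly models the test data distribution with an online EM algorithm, using zero-shot VLM predictions as priors to iteratively compute the posterior probabilities of each online test sample and update parameters. Result: FreeTTA achieves stable and significant improvements over state-of-the-art methods across 15 datasets in both cross-domain and out-of-distribution settings. Conclusion: Explicitly modeling the test data distribution lets a training-free method exploit relationships among test samples without simultaneous access to them, making TTA more flexible. Abstract: Vision-Language Models (VLMs) have become prominent in open-world image recognition for their strong generalization abilities. Yet, their effectiveness in practical applications is compromised by domain shifts and distributional changes, especially when test data distributions diverge from training data. Therefore, the paradigm of test-time adaptation (TTA) has emerged, enabling the use of online off-the-shelf data at test time, supporting independent sample predictions, and eliminating reliance on test annotations. Traditional TTA methods, however, often rely on costly training or optimization processes, or make unrealistic assumptions about accessing or storing historical training and test data. Instead, this study proposes FreeTTA, a training-free and universally available method that makes no assumptions, to enhance the flexibility of TTA. More importantly, FreeTTA is the first to explicitly model the test data distribution, enabling the use of intrinsic relationships among test samples to enhance predictions of individual samples without simultaneous access--a direction not previously explored. FreeTTA achieves these advantages by introducing an online EM algorithm that utilizes zero-shot predictions from VLMs as priors to iteratively compute the posterior probabilities of each online test sample and update parameters. Experiments demonstrate that FreeTTA achieves stable and significant improvements compared to state-of-the-art methods across 15 datasets in both cross-domain and out-of-distribution settings.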
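
The online EM idea in the abstract above can be illustrated with a generic sketch. This is not FreeTTA's implementation: it assumes one-dimensional features, one Gaussian per class with a fixed shared variance, and a hypothetical `OnlineEM` class whose `prior` argument stands in for the zero-shot class probabilities. Each sample is seen once, its posterior is computed (E-step), and the class means are updated incrementally (M-step) without storing past samples.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


class OnlineEM:
    """Hypothetical online EM over class-conditional 1-D Gaussians (illustrative only)."""

    def __init__(self, init_means, var=1.0):
        self.means = list(init_means)
        # Pseudo-counts of 1 so the first sample does not overwrite the initial mean.
        self.counts = [1.0] * len(init_means)
        self.var = var  # assumed fixed and shared across classes

    def step(self, x, prior):
        # E-step: posterior is proportional to prior times likelihood.
        weights = [p * gaussian_pdf(x, m, self.var) for p, m in zip(prior, self.means)]
        total = sum(weights)
        posterior = [w / total for w in weights]
        # M-step: incremental soft-count update of each class mean.
        for k, r in enumerate(posterior):
            self.counts[k] += r
            self.means[k] += (r / self.counts[k]) * (x - self.means[k])
        return posterior


# Made-up stream of (feature, zero-shot prior) pairs.
em = OnlineEM(init_means=[0.0, 4.0])
stream = [(-1.0, [0.6, 0.4]), (5.0, [0.5, 0.5])]
posteriors = [em.step(x, prior) for x, prior in stream]
```

Because only running means and soft counts are kept, memory use does not grow with the number of test samples, which is the property that makes an online formulation attractive for streaming test data.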

[114] DenoiseCP-Net: Efficient Collective Perception in Adverse Weather via Joint LiDAR-Based 3D Object Detection and Denoising

Sven Teufel,Dominique Mayer,Jörg Gamerdinger,Oliver Bringmann

Main category: cs.CV

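TL;DR: This paper proposes DenoiseCP-Net, a new method for LiDAR-based collective perception in adverse weather that reduces communication bandwidth requirements and inference latency while maintaining detection accuracy.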

Details Motivation: The perception systems of automated vehicles are vulnerable to sensor degradation in adverse weather, and collective perception offers a promising way to overcome these limitations. However, collective perception in adverse weather has so far been little studied. Method: Voxel-level noise filtering and object detection are integrated into a unified sparse convolution backbone, and the OPV2V dataset is extended to simulate rain, snow, and fog. Result: DenoiseCP-Net achieves near-perfect denoising accuracy, reduces bandwidth requirements by up to 23.6%, and lowers inference latency for cooperative vehicles. Conclusion: DenoiseCP-Net is a new multi-task architecture that performs LiDAR-based collective perception effectively in adverse weather, reducing bandwidth requirements and inference latency while maintaining detection accuracy. Abstract: While automated vehicles hold the potential to significantly reduce traffic accidents, their perception systems remain vulnerable to sensor degradation caused by adverse weather and environmental occlusions. Collective perception, which enables vehicles to share information, offers a promising approach to overcoming these limitations. However, to this date collective perception in adverse weather is mostly unstudied. Therefore, we conduct the first study of LiDAR-based collective perception under diverse weather conditions and present a novel multi-task architecture for LiDAR-based collective perception under adverse weather. Adverse weather conditions can not only degrade perception capabilities, but also negatively affect bandwidth requirements and latency due to the introduced noise that is also transmitted and processed. Denoising prior to communication can effectively mitigate these issues. Therefore, we propose DenoiseCP-Net, a novel multi-task architecture for LiDAR-based collective perception under adverse weather conditions. DenoiseCP-Net integrates voxel-level noise filtering and object detection into a unified sparse convolution backbone, eliminating redundant computations associated with two-stage pipelines. This design not only reduces inference latency and computational cost but also minimizes communication overhead by removing non-informative noise. We extended the well-known OPV2V dataset by simulating rain, snow, and fog using our realistic weather simulation models.
We demonstrate that DenoiseCP-Net achieves near-perfect denoising accuracy in adverse weather, reduces the bandwidth requirements by up to 23.6% while maintaining the same detection accuracy and reducing the inference latency for cooperative vehicles.

[115] MCA-RG: Enhancing LLMs with Medical Concept Alignment for Radiology Report Generation

Qilong Xing,Zikai Song,Youjia Zhang,Na Feng,Junqing Yu,Wei Yang

Main category: cs.CV

TL;DR: MCA-RG improves radiology report generation by aligning visual features with medical concepts, outperforming existing methods on benchmark datasets.

Details Motivation: The motivation stems from challenges in clinical adoption of LLMs for radiology due to inaccurate mapping of pathological/anatomical features to text descriptions and semantic agnostic feature extraction. Method: The method involves aligning visual features with medical concepts using curated pathology and anatomy banks, contrastive learning, matching loss, and a feature gating mechanism. Result: Experiments demonstrated that MCA-RG achieves better performance on two public datasets (MIMIC-CXR and CheXpert Plus). Conclusion: MCA-RG is an effective framework for radiology report generation, showing superior performance on public benchmarks. Abstract: Despite significant advancements in adapting Large Language Models (LLMs) for radiology report generation (RRG), clinical adoption remains challenging due to difficulties in accurately mapping pathological and anatomical features to their corresponding text descriptions. Additionally, semantic agnostic feature extraction further hampers the generation of accurate diagnostic reports. To address these challenges, we introduce Medical Concept Aligned Radiology Report Generation (MCA-RG), a knowledge-driven framework that explicitly aligns visual features with distinct medical concepts to enhance the report generation process. MCA-RG utilizes two curated concept banks: a pathology bank containing lesion-related knowledge, and an anatomy bank with anatomical descriptions. The visual features are aligned with these medical concepts and undergo tailored enhancement. We further propose an anatomy-based contrastive learning procedure to improve the generalization of anatomical features, coupled with a matching loss for pathological features to prioritize clinically relevant regions. Additionally, a feature gating mechanism is employed to filter out low-quality concept features. 
Finally, the visual features correspond to individual medical concepts and are leveraged to guide the report generation process. Experiments on two public benchmarks (MIMIC-CXR and CheXpert Plus) demonstrate that MCA-RG achieves superior performance, highlighting its effectiveness in radiology report generation.

[116] Cross-Modality Masked Learning for Survival Prediction in ICI Treated NSCLC Patients

Qilong Xing,Zikai Song,Bingxin Gong,Lian Yang,Junqing Yu,Wei Yang

Main category: cs.CV

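TL;DR: This paper introduces a novel multi-modal feature fusion framework for predicting the survival of non-small cell lung cancer patients after immunotherapy, together with a large-scale dataset, and improves prediction accuracy.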

Details Motivation: Personalized treatment planning requires accurate prognosis for non-small cell lung cancer (NSCLC) patients receiving immunotherapy, but this is currently hindered by the lack of large, relevant datasets and of effective multi-modal feature fusion strategies. Method: The method uses a cross-modality masked learning strategy with two branches: a Slice-Depth Transformer that extracts 3D features from CT images, and a graph-based Transformer that learns node features and relationships among clinical variables. The fusion process is guided by using the intact modality to reconstruct missing components, which strengthens the integration of features across modalities. Result: The method outperforms existing approaches in multi-modal integration for NSCLC survival prediction and improves the accuracy of survival prediction. Conclusion: The study proposes a new multi-modal feature fusion framework and builds a large-scale dataset for survival prediction in NSCLC patients after immunotherapy. The approach performs well in multi-modal integration and sets a new benchmark for prognostic models in this area. Abstract: Accurate prognosis of non-small cell lung cancer (NSCLC) patients undergoing immunotherapy is essential for personalized treatment planning, enabling informed patient decisions, and improving both treatment outcomes and quality of life. However, the lack of large, relevant datasets and effective multi-modal feature fusion strategies pose significant challenges in this domain. To address these challenges, we present a large-scale dataset and introduce a novel framework for multi-modal feature fusion aimed at enhancing the accuracy of survival prediction. The dataset comprises 3D CT images and corresponding clinical records from NSCLC patients treated with immune checkpoint inhibitors (ICI), along with progression-free survival (PFS) and overall survival (OS) data. We further propose a cross-modality masked learning approach for medical feature fusion, consisting of two distinct branches, each tailored to its respective modality: a Slice-Depth Transformer for extracting 3D features from CT images and a graph-based Transformer for learning node features and relationships among clinical variables in tabular data. The fusion process is guided by a masked modality learning strategy, wherein the model utilizes the intact modality to reconstruct missing components. This mechanism improves the integration of modality-specific features, fostering more effective inter-modality relationships and feature interactions.
Our approach demonstrates superior performance in multi-modal integration for NSCLC survival prediction, surpassing existing methods and setting a new benchmark for prognostic models in this context.

[117] GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning

S M Taslim Uddin Raju,Md. Milon Islam,Md Rezwanul Haque,Hamdi Altaheri,Fakhri Karray

Main category: cs.CV

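TL;DR: This study proposes GNN-ViTCap, a new framework for efficient and accurate classification and caption generation from histopathological microscopic images, addressing redundant patches, unknown patch positions, and automatic pathology caption generation.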

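Details Motivation: Microscopic whole slide images (WSI) suffer from redundant patches and unknown patch positions caused by subjective pathologist captures, and automatic pathology caption generation remains a significant challenge, which motivates a method for WSI classification and captioning. Method: The paper introduces GNN-ViTCap, a new framework for classification and caption generation from histopathological microscopic images. First, a visual feature extractor generates patch embeddings. Redundant patches are then removed by dynamically clustering these embeddings with deep embedded clustering, and representative patches are selected via a scalar dot attention mechanism. A graph is built by connecting each node to its nearest neighbors in the similarity matrix, and a graph neural network is applied to capture local and global context. The aggregated image embeddings are projected into the language model's input space through a linear layer and combined with caption tokens to fine-tune a large language model. Result: GNN-ViTCap is validated on the BreakHis and PatchGastric datasets. It achieves an F1 score of 0.934 and an AUC of 0.963 for classification, and a BLEU-4 score of 0.811 and a METEOR score of 0.569 for captioning. Conclusion: GNN-ViTCap outperforms existing approaches in classification and caption generation for histopathological microscopic images, offering a reliable and efficient solution for microscopy-based patient diagnosis. Abstract: Microscopic assessment of histopathology images is vital for accurate cancer diagnosis and treatment. Whole Slide Image (WSI) classification and captioning have become crucial tasks in computer-aided pathology. However, microscopic WSI face challenges such as redundant patches and unknown patch positions due to subjective pathologist captures. Moreover, generating automatic pathology captions remains a significant challenge. To address these issues, we introduce a novel GNN-ViTCap framework for classification and caption generation from histopathological microscopic images. First, a visual feature extractor generates patch embeddings. Redundant patches are then removed by dynamically clustering these embeddings using deep embedded clustering and selecting representative patches via a scalar dot attention mechanism. We build a graph by connecting each node to its nearest neighbors in the similarity matrix and apply a graph neural network to capture both local and global context. The aggregated image embeddings are projected into the language model's input space through a linear layer and combined with caption tokens to fine-tune a large language model. We validate our method on the BreakHis and PatchGastric datasets. GNN-ViTCap achieves an F1 score of 0.934 and an AUC of 0.963 for classification, along with a BLEU-4 score of 0.811 and a METEOR score of 0.569 for captioning.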
Experimental results demonstrate that GNN-ViTCap outperforms state of the art approaches, offering a reliable and efficient solution for microscopy based patient diagnosis.
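
The graph-construction step in the abstract above (connecting each patch to its nearest neighbors in a similarity matrix) can be sketched as follows. The abstract does not specify the similarity measure, the number of neighbors, or whether edges are directed, so cosine similarity, the parameter `k`, the undirected edges, and the toy embeddings are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_graph(embeddings, k):
    """Return sorted undirected edges linking each node to its k most similar nodes."""
    n = len(embeddings)
    edges = set()
    for i in range(n):
        sims = [
            (cosine_similarity(embeddings[i], embeddings[j]), j)
            for j in range(n)
            if j != i
        ]
        # Highest similarity first; ties broken by node index for determinism.
        sims.sort(key=lambda s: (-s[0], s[1]))
        for _, j in sims[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Toy 2-D patch embeddings forming two obvious groups.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = knn_graph(patches, k=1)
```

With `k=1` the toy example yields one edge inside each group. A graph neural network would then pass messages along such edges, which is how neighboring patches can share context even when their positions on the slide are unknown.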

[118] Integrating Pathology Foundation Models and Spatial Transcriptomics for Cellular Decomposition from Histology Images

Yutong Sun,Sichen Zhu,Peng Qiu

Main category: cs.CV

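TL;DR: This study develops a lightweight method built on pre-trained pathology foundation models that efficiently and accurately predicts cell-type composition from H&E-stained images without costly spatial transcriptomics experiments.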

Details Motivation: Advances in digital pathology and modern deep learning have led to pathology foundation models, while spatial transcriptomics has opened new opportunities for gene expression analysis. The study aims to combine the strengths of both to predict cellular composition efficiently. Method: A lightweight multi-layer perceptron (MLP) regressor is trained on information-enriched feature embeddings extracted from pre-trained pathology foundation models to predict cell-type abundances. Result: The method is competitive with existing methods such as Hist2Cell while significantly reducing computational complexity. Conclusion: The paper proposes a lightweight and training-efficient method that predicts cellular composition directly from H&E-stained histology images. Abstract: The rapid development of digital pathology and modern deep learning has facilitated the emergence of pathology foundation models that are expected to solve general pathology problems under various disease conditions in one unified model, with or without fine-tuning. In parallel, spatial transcriptomics has emerged as a transformative technology that enables the profiling of gene expression on hematoxylin and eosin (H&E) stained histology images. Spatial transcriptomics unlocks the unprecedented opportunity to dive into existing histology images at a more granular, cellular level. In this work, we propose a lightweight and training-efficient approach to predict cellular composition directly from H&E-stained histology images by leveraging information-enriched feature embeddings extracted from pre-trained pathology foundation models. By training a lightweight multi-layer perceptron (MLP) regressor on cell-type abundances derived via cell2location, our method efficiently distills knowledge from pathology foundation models and demonstrates the ability to accurately predict cell-type compositions from histology images, without physically performing the costly spatial transcriptomics. Our method demonstrates competitive performance compared to existing methods such as Hist2Cell, while significantly reducing computational complexity.

[119] MST-Distill: Mixture of Specialized Teachers for Cross-Modal Knowledge Distillation

Hui Li,Pengfei Yang,Juanyang Chen,Le Dong,Yanxin Chen,Quan Wang

Main category: cs.CV

TL;DR: This paper proposes MST-Distill, a novel cross-modal knowledge distillation framework that addresses distillation path selection and knowledge drift issues, significantly outperforming current methods.

Details Motivation: Conventional knowledge distillation methods face challenges in cross-modal settings due to data and statistical heterogeneities, which prevent the effective transfer of complementary prior knowledge embedded in cross-modal teacher models. Method: MST-Distill uses a mixture of specialized teachers across cross-modal and multimodal configurations with an instance-level routing network for adaptive distillation. It also incorporates a plug-in masking module to suppress modality-specific discrepancies and reconstruct teacher representations. Result: Experiments on five multimodal datasets show that MST-Distill significantly outperforms existing state-of-the-art methods in cross-modal distillation tasks. Conclusion: The proposed MST-Distill framework significantly outperforms existing state-of-the-art knowledge distillation methods in cross-modal settings, as demonstrated by experiments across five diverse multimodal datasets. Abstract: Knowledge distillation, as an efficient knowledge transfer technique, has achieved remarkable success in unimodal scenarios. However, in cross-modal settings, conventional distillation methods encounter significant challenges due to data and statistical heterogeneities, failing to leverage the complementary prior knowledge embedded in cross-modal teacher models. This paper empirically reveals two critical issues in existing approaches: distillation path selection and knowledge drift. To address these limitations, we propose MST-Distill, a novel cross-modal knowledge distillation framework featuring a mixture of specialized teachers. Our approach employs a diverse ensemble of teacher models across both cross-modal and multimodal configurations, integrated with an instance-level routing network that facilitates adaptive and dynamic distillation. This architecture effectively transcends the constraints of traditional methods that rely on monotonous and static teacher models.
Additionally, we introduce a plug-in masking module, independently trained to suppress modality-specific discrepancies and reconstruct teacher representations, thereby mitigating knowledge drift and enhancing transfer effectiveness. Extensive experiments across five diverse multimodal datasets, spanning visual, audio, and text, demonstrate that our method significantly outperforms existing state-of-the-art knowledge distillation methods in cross-modal distillation tasks. The source code is available at https://github.com/Gray-OREO/MST-Distill.

[120] Design and Implementation of an OCR-Powered Pipeline for Table Extraction from Invoices

Parshva Dhilankumar Patel

Main category: cs.CV

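TL;DR: This paper designs and develops an OCR-based system for extracting tables from invoices. It combines Tesseract OCR with custom post-processing to handle noisy and non-standard invoice formats, improving the accuracy and consistency of data extraction.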

Details Motivation: The goal is to accurately detect, align, and extract structured tabular data from scanned invoice documents, in support of real-world applications such as automated financial workflows and digital archiving. Method: Tesseract OCR performs text recognition and is combined with custom post-processing logic that includes dynamic preprocessing, table boundary detection, and row-column mapping. Result: The resulting pipeline significantly improves data extraction accuracy and consistency. Conclusion: The paper presents an efficient OCR-based pipeline for invoice table extraction that is well suited to noisy and non-standard formats. Abstract: This paper presents the design and development of an OCR-powered pipeline for efficient table extraction from invoices. The system leverages Tesseract OCR for text recognition and custom post-processing logic to detect, align, and extract structured tabular data from scanned invoice documents. Our approach includes dynamic preprocessing, table boundary detection, and row-column mapping, optimized for noisy and non-standard invoice formats. The resulting pipeline significantly improves data extraction accuracy and consistency, supporting real-world use cases such as automated financial workflows and digital archiving.
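
The row-column mapping step named in the abstract above can be illustrated with a minimal sketch. The abstract does not describe the author's post-processing logic, so the approach here is an assumption: OCR output is taken to be a list of hypothetical `(text, x, y)` tuples with top-left pixel coordinates, words are grouped into rows when their vertical positions fall within a tolerance, and each row is ordered left to right. The function name `group_into_rows` and the `y_tolerance` value are illustrative.

```python
def group_into_rows(words, y_tolerance=10):
    """Group OCR words into table rows, then order each row left to right.

    words: list of (text, x, y) tuples with top-left pixel coordinates
           (a hypothetical format, not a specific OCR library's output).
    """
    rows = []
    # Visit words top to bottom so each row is completed before the next starts.
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        # A word joins the current row if it is close to that row's first word.
        if rows and abs(y - rows[-1]["y"]) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Sort the cells of each row by x coordinate to recover column order.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Made-up word boxes with slight vertical jitter, as a scanned page might produce.
words = [
    ("Qty", 200, 52), ("Item", 20, 50), ("Price", 320, 51),
    ("Widget", 20, 90), ("2", 200, 92), ("9.99", 320, 91),
]
table = group_into_rows(words)
```

A fixed tolerance is the simplest choice. Skewed scans or varying font sizes would likely need a tolerance derived from the measured text height instead.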

[121] Evaluating Large Multimodal Models for Nutrition Analysis: A Benchmark Enriched with Contextual Metadata

Bruce Coburn,Jiangpeng He,Megan E. Rollo,Satvinder S. Dhaliwal,Deborah A. Kerr,Fengqing Zhu

Main category: cs.CV

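TL;DR: This paper examines how integrating contextual metadata (such as GPS coordinates, timestamps, and food items) can improve the performance of Large Multimodal Models (LMMs) in nutrition analysis. It also introduces ACETADA, a new public food-image dataset, and shows that integrating metadata intelligently can significantly reduce the error of predicted nutritional values.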

Details Motivation: Existing work primarily evaluates proprietary models such as GPT-4 and leaves the broad range of LLMs underexplored, and the effect of integrating contextual metadata and its interaction with various reasoning modifiers has not been studied. The authors aim to use contextual metadata to improve LMM performance in estimating key nutritional values. Method: The paper evaluates eight LMMs (four open-weight and four closed-weight) and introduces a new food-image dataset, ACETADA. Metadata integration is applied through straightforward prompting strategies to reduce the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) of predicted nutritional values. Result: Experiments show that integrating metadata intelligently through straightforward prompting strategies can significantly reduce the MAE and MAPE of predicted nutritional values. Adding this contextual information also enhances the efficacy of reasoning modifiers such as Chain-of-Thought, Multimodal Chain-of-Thought, Scale Hint, Few-Shot, and Expert Persona. Conclusion: Incorporating contextual metadata (such as GPS coordinates, timestamps, and food items) can improve the performance of LMMs in nutrition analysis. The new public food-image dataset ACETADA is also an important contribution to the field. Abstract: Large Multimodal Models (LMMs) are increasingly applied to meal images for nutrition analysis. However, existing work primarily evaluates proprietary models, such as GPT-4. This leaves the broad range of LLMs underexplored. Additionally, the influence of integrating contextual metadata and its interaction with various reasoning modifiers remains largely uncharted. This work investigates how interpreting contextual metadata derived from GPS coordinates (converted to location/venue type), timestamps (transformed into meal/day type), and the food items present can enhance LMM performance in estimating key nutritional values. These values include calories, macronutrients (protein, carbohydrates, fat), and portion sizes. We also introduce ACETADA, a new food-image dataset slated for public release. This open dataset provides nutrition information verified by the dietitian and serves as the foundation for our analysis. Our evaluation across eight LMMs (four open-weight and four closed-weight) first establishes the benefit of contextual metadata integration over straightforward prompting with images alone. We then demonstrate how this incorporation of contextual information enhances the efficacy of reasoning modifiers, such as Chain-of-Thought, Multimodal Chain-of-Thought, Scale Hint, Few-Shot, and Expert Persona.
Empirical results show that integrating metadata intelligently, when applied through straightforward prompting strategies, can significantly reduce the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) in predicted nutritional values. This work highlights the potential of context-aware LMMs for improved nutrition analysis.
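
For readers unfamiliar with the two error metrics named above, the short sketch below computes MAE and MAPE using their standard definitions. The calorie values are made up for illustration and are not taken from the paper or from ACETADA.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute difference, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    """MAPE: average absolute error relative to the true value, in percent.

    Undefined when any true value is zero.
    """
    return 100.0 * sum(
        abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)
    ) / len(y_true)


# Hypothetical ground-truth and predicted calories for three meals.
true_kcal = [500.0, 250.0, 800.0]
pred_kcal = [450.0, 300.0, 800.0]

mae = mean_absolute_error(true_kcal, pred_kcal)              # (50 + 50 + 0) / 3
mape = mean_absolute_percentage_error(true_kcal, pred_kcal)  # (10% + 20% + 0%) / 3
```

The two metrics complement each other: the same 50 kcal error counts equally in MAE for both meals but weighs twice as much in MAPE for the 250 kcal meal, so MAPE emphasizes errors on small portions.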

[122] An AI Approach for Learning the Spectrum of the Laplace-Beltrami Operator

Yulin An,Enrique del Castillo

Main category: cs.CV

TL;DR: This paper proposes a geometric deep learning framework to efficiently predict the LB spectrum of CAD meshes, achieving significant computational savings without sacrificing accuracy compared to traditional FEM methods.

Details Motivation: The spectrum of the Laplace-Beltrami (LB) operator is central in geometric deep learning tasks as it captures intrinsic properties of shapes. However, the established method for its estimation, based on the Finite Element Method (FEM), is inefficient when applied to large databases of CAD mechanical parts or in quality control applications where quick and frequent decisions are needed. Method: A Graph Neural Network architecture that uses a rich set of part mesh features - including Gaussian curvature, mean curvature, and principal curvatures - is proposed to predict the LB spectrum of CAD meshes. Result: Experimental results show that the proposed method reduces computation time of the LB spectrum by approximately 5 times over linear FEM while delivering competitive accuracy. A large curated dataset of real-world mechanical CAD models is also made available for repeatability. Conclusion: The proposed geometric deep learning framework can efficiently predict the LB spectrum of CAD meshes, achieving significant computational savings without sacrificing accuracy, compared to the traditional FEM method. Abstract: The spectrum of the Laplace-Beltrami (LB) operator is central in geometric deep learning tasks, capturing intrinsic properties of the shape of the object under consideration. The best established method for its estimation, from a triangulated mesh of the object, is based on the Finite Element Method (FEM), and computes the top k LB eigenvalues with a complexity of O(Nk), where N is the number of points. This can render the FEM method inefficient when repeatedly applied to databases of CAD mechanical parts, or in quality control applications where part metrology is acquired as large meshes and decisions about the quality of each part are needed quickly and frequently. 
As a solution to this problem, we present a geometric deep learning framework to predict the LB spectrum efficiently given the CAD mesh of a part, achieving significant computational savings without sacrificing accuracy, demonstrating that the LB spectrum is learnable. The proposed Graph Neural Network architecture uses a rich set of part mesh features - including Gaussian curvature, mean curvature, and principal curvatures. In addition to our trained network, we make available, for repeatability, a large curated dataset of real-world mechanical CAD models derived from the publicly available ABC dataset used for training and testing. Experimental results show that our method reduces computation time of the LB spectrum by approximately 5 times over linear FEM while delivering competitive accuracy.

[123] Reading a Ruler in the Wild

Yimu Pan,Manas Mehta,Gwen Sincerbeaux,Jeffery A. Goldstein,Alison D. Gernand,James Z. Wang

Main category: cs.CV

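TL;DR: This study proposes RulerNet and DeepGP to address the conversion from pixels to real-world dimensions in computer vision, achieving efficient and general scale estimation through a unified keypoint-detection formulation and synthetic data augmentation.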

Details Motivation: Accurately converting pixel measurements into absolute real-world dimensions remains a fundamental challenge in computer vision and limits progress in key applications such as biomedicine, forensics, nutritional analysis, and e-commerce. Method: The study proposes RulerNet, a deep learning framework, combined with the DeepGP network for real-time scale estimation. It also designs a scalable synthetic-data pipeline to increase training diversity. Result: Experiments show that RulerNet delivers accurate, consistent, and efficient scale estimates under challenging real-world conditions. Conclusion: As a general and efficient measurement tool, RulerNet has the potential to be integrated with other vision components in high-impact domains for automated, scale-aware analysis. Abstract: Accurately converting pixel measurements into absolute real-world dimensions remains a fundamental challenge in computer vision and limits progress in key applications such as biomedicine, forensics, nutritional analysis, and e-commerce. We introduce RulerNet, a deep learning framework that robustly infers scale "in the wild" by reformulating ruler reading as a unified keypoint-detection problem and by representing the ruler with geometric-progression parameters that are invariant to perspective transformations. Unlike traditional methods that rely on handcrafted thresholds or rigid, ruler-specific pipelines, RulerNet directly localizes centimeter marks using a distortion-invariant annotation and training strategy, enabling strong generalization across diverse ruler types and imaging conditions while mitigating data scarcity. We also present a scalable synthetic-data pipeline that combines graphics-based ruler generation with ControlNet to add photorealistic context, greatly increasing training diversity and improving performance. To further enhance robustness and efficiency, we propose DeepGP, a lightweight feed-forward network that regresses geometric-progression parameters from noisy marks and eliminates iterative optimization, enabling real-time scale estimation on mobile or edge devices. Experiments show that RulerNet delivers accurate, consistent, and efficient scale estimates under challenging real-world conditions. These results underscore its utility as a generalizable measurement tool and its potential for integration with other vision components for automated, scale-aware analysis in high-impact domains.
A live demo is available at https://huggingface.co/spaces/ymp5078/RulerNet-Demo.

[124] Evaluating Attribute Confusion in Fashion Text-to-Image Generation

Ziyue Liu,Federico Girella,Yiming Wang,Davide Talon

Main category: cs.CV

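TL;DR: This paper proposes L-VQAScore, a new evaluation metric for text-to-image generation models that combines visual localization with visual question answering probing to assess entity-attribute associations in generated images more accurately.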

Details Motivation: Existing evaluation methods for text-to-image generation models are limited in assessing rich entity-attribute semantics and perform poorly in cases of attribute confusion. Method: Building on a Visual Question Answering (VQA) localization strategy, the paper proposes a localized human evaluation protocol and a new automatic metric, L-VQAScore, which combines visual localization with VQA probing of both correctly localized and mislocalized attribute generation. Result: L-VQAScore outperforms state-of-the-art text-to-image evaluation methods in correlation with human judgments. Conclusion: L-VQAScore is a reliable and scalable alternative to subjective evaluation that better captures fine-grained entity-attribute associations. Abstract: Despite the rapid advances in Text-to-Image (T2I) generation models, their evaluation remains challenging in domains like fashion, involving complex compositional generation. Recent automated T2I evaluation methods leverage pre-trained vision-language models to measure cross-modal alignment. However, our preliminary study reveals that they are still limited in assessing rich entity-attribute semantics, facing challenges in attribute confusion, i.e., when attributes are correctly depicted but associated to the wrong entities. To address this, we build on a Visual Question Answering (VQA) localization strategy targeting one single entity at a time across both visual and textual modalities. We propose a localized human evaluation protocol and introduce a novel automatic metric, Localized VQAScore (L-VQAScore), that combines visual localization with VQA probing both correct (reflection) and miss-localized (leakage) attribute generation. On a newly curated dataset featuring challenging compositional alignment scenarios, L-VQAScore outperforms state-of-the-art T2I evaluation methods in terms of correlation with human judgments, demonstrating its strength in capturing fine-grained entity-attribute associations. We believe L-VQAScore can be a reliable and scalable alternative to subjective evaluations.

[125] Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data

Ke Fan,Shunlin Lu,Minyue Dai,Runyi Yu,Lixing Xiao,Zhiyang Dou,Junting Dong,Lizhuang Ma,Jingbo Wang

Main category: cs.CV

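TL;DR: This paper presents MotionMillion, the largest human motion dataset to date, and MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Using a scalable architecture, it trains a 7B-parameter model and makes significant progress on zero-shot human motion generation.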

Details Motivation: The zero-shot generalization of existing methods is limited by the size of training datasets, and the lack of a comprehensive evaluation framework makes it hard to advance the task. Method: The authors develop an efficient annotation pipeline, build the large-scale MotionMillion dataset, propose the comprehensive MotionMillion-Eval benchmark, and train a 7B-parameter model with a scalable architecture for validation. Result: The method generalizes strongly to out-of-domain and complex compositional motions, significantly advancing zero-shot human motion generation. Conclusion: MotionMillion and MotionMillion-Eval mark a significant step toward zero-shot generalization in text-to-motion generation. Abstract: Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era, that is, to achieve the generalization ability of zero-shot. To this end, firstly, we develop an efficient annotation pipeline and introduce MotionMillion, the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at https://github.com/VankouF/MotionMillion-Codes.

[126] Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models

Tiezheng Zhang,Yitong Li,Yu-cheng Chou,Jieneng Chen,Alan Yuille,Chen Wei,Junfei Xiao

Main category: cs.CV

TL;DR: This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework that efficiently builds high-quality vision-language models at a low cost by leveraging existing pretrained models and minimizing reliance on large-scale image-text datasets.

Details Motivation: To reduce the high costs and resource requirements typically associated with training state-of-the-art Vision-Language Models (VLMs), which usually require billions of image-text pairs and millions of GPU hours. Method: The authors build a Vision-Language-Vision (VLV) auto-encoder from a vision encoder, the frozen decoder of a pretrained Text-to-Image (T2I) diffusion model, and a Large Language Model (LLM). Freezing the T2I diffusion decoder regularizes the language representation space and creates an information bottleneck. A pretrained LLM is then fine-tuned to decode the intermediate representations into detailed descriptions. Result: A state-of-the-art captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash, achieved with minimal training costs and without the need for large paired image-text datasets. Conclusion: The VLV framework provides a cost-effective and data-efficient method for building state-of-the-art vision-language models with strong captioning capabilities by leveraging pretrained components. Abstract: Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions.
Furthermore, by fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and significantly reduces data requirements; by primarily utilizing single-modal images for training and maximizing the utility of existing pretrained models (image encoder, T2I diffusion model, and LLM), it circumvents the need for massive paired image-text datasets, keeping the total training expenditure under $1,000 USD.

[127] 4KAgent: Agentic Any Image to 4K Super-Resolution

Yushen Zuo,Qi Zheng,Mingyang Wu,Xinrui Jiang,Renjie Li,Jian Wang,Yide Zhang,Gengchen Mai,Lihong V. Wang,James Zou,Xiaoyu Wang,Ming-Hsuan Yang,Zhengzhong Tu

Main category: cs.CV

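TL;DR: 4KAgent is a unified agentic super-resolution system that can upscale any image to 4K resolution and performs strongly across multiple imaging domains.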

Details Motivation: To develop a unified agentic super-resolution system that universally upscales any image to 4K resolution, producing clear, photorealistic 4K outputs from extremely low-resolution, severely degraded inputs. Method: 4KAgent comprises three core components: (1) a Profiling module that customizes the 4KAgent pipeline for specific use cases; (2) a Perception Agent that uses vision-language models and image quality assessment experts to analyze the input image and draw up a tailored restoration plan; and (3) a Restoration Agent that executes the plan under a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-expert policy that selects the best output at each step. In addition, 4KAgent embeds a specialized face restoration pipeline. Result: 4KAgent is rigorously evaluated across 11 task categories covering a total of 26 benchmarks and sets a new state of the art across a broad range of imaging domains. The evaluation covers natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging, with superior performance on both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. Conclusion: 4KAgent is a unified agentic super-resolution generalist system that can upscale any image to 4K resolution and delivers superior performance across multiple imaging domains. Abstract: We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at 256x256, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-expert policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics.
By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We will release all the code, models, and results at: https://4kagent.github.io.
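
The Restoration Agent's execute-then-reflect behavior described in this entry can be sketched roughly as follows. This is a simplified, iterative illustration under stated assumptions, not 4KAgent's code: `run_plan`, the toy experts, and `toy_quality` are hypothetical names, an image is reduced to a list of floats, and "reflection" is reduced to accepting the best-scoring expert output only when it improves on the current image.

```python
from typing import Callable, Dict, List, Tuple

Image = List[float]                      # stand-in for real image data (assumption)
Expert = Callable[[Image], Image]        # one restoration tool for a given task
QualityFn = Callable[[Image], float]     # higher means better (assumption)


def run_plan(
    image: Image,
    plan: List[str],
    experts: Dict[str, List[Expert]],
    quality: QualityFn,
) -> Tuple[Image, List[str]]:
    """For each planned task, run every expert, keep the best output if it helps."""
    current = image
    accepted: List[str] = []
    for task in plan:
        # Execution: every expert registered for this task proposes a candidate.
        candidates = [expert(current) for expert in experts[task]]
        # Reflection: score the candidates and compare the best with the current image.
        best = max(candidates, key=quality)
        if quality(best) > quality(current):
            current = best
            accepted.append(task)
        # Otherwise the step is rolled back and the current image is kept.
    return current, accepted


# Toy experts and quality score, purely illustrative.
def toward_mid(img: Image) -> Image:
    return [p + (0.5 - p) * 0.5 for p in img]


def identity(img: Image) -> Image:
    return list(img)


def away_from_mid(img: Image) -> Image:
    return [p + (p - 0.5) for p in img]


def toy_quality(img: Image) -> float:
    return -sum(abs(p - 0.5) for p in img)


restored, accepted = run_plan(
    [0.0, 1.0, 0.2],
    ["denoise", "sharpen"],
    {"denoise": [toward_mid, identity], "sharpen": [away_from_mid]},
    toy_quality,
)
```

Here the "denoise" step is accepted because one expert raises the toy quality score, while the "sharpen" step is rolled back because its only candidate lowers it. A real system would plug in actual restoration models and image quality assessment metrics in place of these stand-ins.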

[128] Towards Multimodal Understanding via Stable Diffusion as a Task-Aware Feature Extractor

Vatsal Agarwal,Matthew Gwilliam,Gefen Kohavi,Eshan Verma,Daniel Ulbricht,Abhinav Shrivastava

Main category: cs.CV

TL;DR: This paper explores using diffusion models as visual encoders in MLLMs, offering better detail capture and improved performance on complex visual tasks.

Details Motivation: To overcome the limitations of CLIP as a visual encoder in MLLMs by exploring whether diffusion models can provide more detailed and aligned visual representations. Method: The researchers analyzed internal representations of diffusion models, explored their alignment with LLMs, identified a leakage phenomenon, and proposed a mitigation strategy along with a fusion strategy combining CLIP and diffusion features. Result: The approach demonstrated promising results on general VQA and specialized MLLM benchmarks, showing the potential of diffusion models in visual understanding tasks. Conclusion: The study concludes that pre-trained text-to-image diffusion models can serve as effective visual encoders for MLLMs, especially enhancing vision-centric tasks requiring spatial and compositional reasoning. Abstract: Recent advances in multimodal large language models (MLLMs) have enabled image-based question-answering capabilities. However, a key limitation is the use of CLIP as the visual encoder; while it can capture coarse global information, it often can miss fine-grained details that are relevant to the input query. To address these shortcomings, this work studies whether pre-trained text-to-image diffusion models can serve as instruction-aware visual encoders. Through an analysis of their internal representations, we find diffusion features are both rich in semantics and can encode strong image-text alignment. Moreover, we find that we can leverage text conditioning to focus the model on regions relevant to the input question. We then investigate how to align these features with large language models and uncover a leakage phenomenon, where the LLM can inadvertently recover information from the original diffusion prompt. We analyze the causes of this leakage and propose a mitigation strategy. Based on these insights, we explore a simple fusion strategy that utilizes both CLIP and conditional diffusion features. 
We evaluate our approach on both general VQA and specialized MLLM benchmarks, demonstrating the promise of diffusion models for visual understanding, particularly in vision-centric tasks that require spatial and compositional reasoning. Our project page can be found at https://vatsalag99.github.io/mustafar/.