2024 | Larger and more instructable language models become less reliable ⭐ Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yolanda Moros-Daval, Cèsar Ferri, José Hernández-Orallo Notes: Something easily missed is that the tasks get difficult quickly. Locality is just an obnoxiously difficult task. Just look at the human expectation of failure on the x-axis: for the easiest bin it is already 91%. For addition, one of the easiest instances that GPT-4 v2 fails on is "12191 + 3187383000 =", which is in difficulty bin 1. This is quite easy, because there are no carry operations, but it looks scary, and especially with tokenisation it's a challenge for LLMs. There are actually no instances in our dataset with the simplicity of "100 + 150" where the latest models fail. Such failures probably exist, but they were not included in our dataset. We should have been more explicit about this in the paper. On the other hand, confirming our notion of difficulty, GPT-4 v2 also just fails on "Make the addition of 984733680321971 and 0."... There is a long way to go. |
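The "no carry operations" remark in the note above can be checked with a minimal sketch. The helper name `count_carries` is hypothetical and used only for illustration. It assumes that difficulty is approximated by the number of carries in schoolbook addition, which may not match the exact difficulty measure used in the paper.

```python
def count_carries(a: int, b: int) -> int:
    """Count carry operations in the schoolbook addition of two non-negative integers.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    carries = 0
    carry = 0
    while a > 0 or b > 0:
        # Add the current least-significant digits plus any incoming carry.
        digit_sum = a % 10 + b % 10 + carry
        carry = 1 if digit_sum >= 10 else 0
        carries += carry
        a //= 10
        b //= 10
    return carries


# The two instances quoted in the note, plus a simple reference case.
print(count_carries(12191, 3187383000))    # no digit pair sums to 10 or more
print(count_carries(984733680321971, 0))   # adding zero never carries
print(count_carries(100, 150))
```

All three calls report zero carries, so under this assumed measure the quoted failures are arithmetically as simple as "100 + 150"; what differs is the length and surface form of the operands.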
2024 | Investigating Object Permanence in Deep Reinforcement Learning Agents Konstantinos Voudouris, Jason Darwin Liu, Natasza Siwinska, Wout Schellaert, Lucy G. Cheke |
2024 | Scaling Behavior of Large Language Models Wout Schellaert, Ronan Hamon, Fernando Martínez-Plumed, José Hernández-Orallo |
2023 | Predictable Artificial Intelligence Lexin Zhou, Pablo A. Moreno-Casares, Fernando Martínez-Plumed, John Burden, Ryan Burnell, Lucy Cheke, Cèsar Ferri, Alexandru Marcoci, Behzad Mehrbakhsh, Yael Moros-Daval, Seán Ó hÉigeartaigh, Danaja Rutar, Wout Schellaert, Konstantinos Voudouris, José Hernández-Orallo |
2023 | Animal-AI 3: What's New & Why You Should Care Konstantinos Voudouris, Ibrahim Alhas, Wout Schellaert, Matthew Crosby, Joel Holmes, John Burden, Niharika Chaubey, Niall Donnelly, Matishalin Patel, Marta Halina, José Hernández-Orallo, Lucy G. Cheke |
2023 | Rethink Reporting of Evaluation Results in AI ⭐ Ryan Burnell, Wout Schellaert, John Burden, Tomer D. Ullman, Fernando Martínez-Plumed, Joshua B. Tenenbaum, Danaja Rutar, Lucy G. Cheke, Jascha Sohl-Dickstein, Melanie Mitchell, Douwe Kiela, Murray Shanahan, Ellen M. Voorhees, Anthony G. Cohn, Joel Z. Leibo, José Hernández-Orallo |
2023 | Your Prompt is My Command: On Assessing the Human-Centred Generality of Multimodal Models Wout Schellaert, Fernando Martínez-Plumed, Karina Vold, John Burden, Pablo A. M. Casares, Bao Sheng Loe, Roi Reichart, Sean Ó hÉigeartaigh, Anna Korhonen, José Hernández-Orallo JAIR: AI and Society [paper] |
2022 | Reject Before You Run: Granular Performance Prediction for Big Language Models with Small External Assessors Lexin Zhou, Fernando Martínez-Plumed, José Hernández-Orallo, Cèsar Ferri, Wout Schellaert Workshop on Evaluation Beyond Metrics at IJCAI 2022 [paper, workshop] |
2022 | Training on the Test Set: Mapping the System-Problem Space in AI ⭐ José Hernández-Orallo*, Wout Schellaert*, Fernando Martínez-Plumed* (*equal contribution) Blue Sky Idea Award 🏆 |