Research -- Wout Schellaert

📨 wschell@vrain.upv.es 📜 Google Scholar 🧑‍💻 GitHub 🟢 ORCiD 🗺️ Valencian Research Institute for Artificial Intelligence (VRAIN) and 🗺️ Leverhulme Centre for the Future of Intelligence, Cambridge University (LCFI)

Supervised by: 🧙🏼‍♂️ José Hernández-Orallo and 🧙🏼 Fernando Martínez-Plumed

Interests

My PhD research centres on modelling AI evaluation as a prediction problem.

In general I am interested in AI evaluation & everything that is related: testing, auditing, metrics, environment & benchmark design, capability measurement, etc.

Other concepts that spark my imagination include causality, embodied cognition, web agents, knowledge representation, grounding, philosophy of cognition, artificial life.

Most things really.

Papers

Highlights marked by ⭐.

2025	The Evaluation of Artificial Intelligence as a Prediction Problem ⭐ Wout Schellaert PhD Thesis [thesis] Notes What I recommend reading / skipping: Chapter 1: Introduction - Mostly motivation, read the first half. Chapter 2: Background - Read 2.4 - 2.6, and the rest only if unfamiliar with GPAI or calibration. Chapter 3: Evaluation as a Prediction Problem - Read fully, most important part. Chapter 4: Applications & Methods - Read or skim, but do pay attention to 4.3, especially on types of generalisation. Chapter 5: Language Model Score Predictors - Skip or skim. Chapter 6: Human Score Predictors - Skim. Chapter 7: Conclusion - Read.
2025	PredictaBoard: Benchmarking LLM Score Predictability Lorenzo Pacchiardi, Konstantinos Voudouris, Ben Slater, Fernando Martínez-Plumed, José Hernández-Orallo, Lexin Zhou, Wout Schellaert arXiv [paper] Notes Apart from the idea and some work on metrics at the beginning of the paper, all credits go to collaborators.
2025	The Animal-AI Environment: A virtual laboratory for comparative cognition and artificial intelligence research Konstantinos Voudouris, Ben Slater, Lucy G. Cheke, Wout Schellaert, José Hernández-Orallo, Marta Halina, Matishalin Patel, Ibrahim Alhas, Matteo G. Mecattaf, John Burden, Joel Holmes, Niharika Chaubey, Niall Donnelly, Matthew Crosby Behavior Research Methods [paper]
2025	Analysing the Predictability of Language Model Performance Wout Schellaert, Fernando Martínez-Plumed, José Hernández-Orallo ACM Transactions on Intelligent Systems and Technology [paper] Notes I do not recommend that anyone read this paper, which is why I do not have a free copy online. If you really want to read it, it is Chapter 4 of my PhD thesis.
2024	Larger and more instructable language models become less reliable ⭐ Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yael Moros-Daval, Cèsar Ferri, José Hernández-Orallo Nature [paper] Notes Something easily missed is that the tasks get difficult quickly. Locality is just an obnoxiously difficult task. Just look at the human expectations of failure on the x-axis: for the easiest bin it is already 91%. For addition, one of the easiest instances that GPT-4 v2 fails on is "12191 + 3187383000 =", which is in difficulty bin 1. This is quite easy, because there are no carry operations, but it looks scary and, especially with tokenisation, it's a challenge for LLMs. There are actually no instances with the simplicity of "100 + 150" where the latest models fail. Such instances probably exist, but they were not included in our dataset. We should have been more explicit about this in the paper. On the other hand, confirming our notion of difficulty, GPT-4 v2 also just fails on "Make the addition of 984733680321971 and 0."... There is a long way to go.
2024	Investigating Object Permanence in Deep Reinforcement Learning Agents Konstantinos Voudouris, Jason Darwin Liu, Natasza Siwinska, Wout Schellaert, Lucy G. Cheke COGSCI [paper]
2024	Scaling Behavior of Large Language Models Wout Schellaert, Ronan Hamon, Fernando Martínez-Plumed, José Hernández-Orallo SCALE-LLM [paper]
2023	Predictable Artificial Intelligence Lexin Zhou, Pablo A. Moreno-Casares, Fernando Martínez-Plumed, John Burden, Ryan Burnell, Lucy Cheke, Cèsar Ferri, Alexandru Marcoci, Behzad Mehrbakhsh, Yael Moros-Daval, Seán Ó hÉigeartaigh, Danaja Rutar, Wout Schellaert, Konstantinos Voudouris, José Hernández-Orallo arXiv [paper]
2023	Rethink Reporting of Evaluation Results in AI ⭐ Ryan Burnell, Wout Schellaert, John Burden, Tomer D. Ullman, Fernando Martínez-Plumed, Joshua B. Tenenbaum, Danaja Rutar, Lucy G. Cheke, Jascha Sohl-Dickstein, Melanie Mitchell, Douwe Kiela, Murray Shanahan, Ellen M. Voorhees, Anthony G. Cohn, Joel Z. Leibo, José Hernández-Orallo Science [paper, preprint]
2023	Your Prompt is My Command: On Assessing the Human-Centred Generality of Multimodal Models Wout Schellaert, Fernando Martínez-Plumed, Karina Vold, John Burden, Pablo A. M. Casares, Bao Sheng Loe, Roi Reichart, Sean Ó hÉigeartaigh, Anna Korhonen, José Hernández-Orallo JAIR: AI and Society [paper]
2022	Reject Before You Run: Granular Performance Prediction for Big Language Models with Small External Assessors Lexin Zhou, Fernando Martínez-Plumed, José Hernández-Orallo, Cèsar Ferri, Wout Schellaert Workshop on Evaluation Beyond Metrics at IJCAI 2022 [paper, workshop]
2022	Training on the Test Set: Mapping the System-Problem Space in AI ⭐ José Hernández-Orallo, Wout Schellaert, Fernando Martínez-Plumed (equal contribution) Blue Sky Idea Award 🏆 AAAI 2022 [paper, award]
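
The note on the Nature paper above says that "12191 + 3187383000" involves no carry operations. A minimal sketch for checking that kind of claim follows. `count_carries` is a hypothetical helper written for this page, not code from the paper, and it assumes non-negative integers and plain schoolbook (digit-by-digit) addition.

```python
def count_carries(a: int, b: int) -> int:
    """Count the carry operations in the schoolbook addition of a and b.

    Hypothetical helper for illustration; assumes a >= 0 and b >= 0.
    """
    carries = 0
    carry = 0
    # Walk both numbers from the least significant digit upwards.
    while a > 0 or b > 0:
        digit_sum = a % 10 + b % 10 + carry
        carry = 1 if digit_sum >= 10 else 0
        carries += carry
        a //= 10
        b //= 10
    return carries


print(count_carries(12191, 3187383000))    # 0
print(count_carries(984733680321971, 0))   # 0
print(count_carries(999, 1))               # 3
```

Both instances quoted in the note come out at zero carries, which matches the point that they are easy by this measure even though the operands look intimidating.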

Other

2023

Co-organising the “Predictable AI” kick-off event in Valencia

A singular event consisting of invited talks, panels and short lightning talks. It discussed “Predictable AI Futures” dealing with topics such as scaling laws, control, liability and future risks; as well as “Predictable AI Systems”, covering cognitive and robust evaluation, assessors, co-operative conditions, uncertainty estimation, and much more. (site)

Committee: José Hernández-Orallo, Ana Cidad and many others from ValGRAI in Valencia and the LCFI and CSER in Cambridge.

2022

Co-organising the “Evaluation Beyond Metrics” workshop at IJCAI22

A workshop that aims to challenge the widespread approach of evaluating intelligent systems with aggregated metrics over a benchmark or distribution of tasks. (site)

Committee: Wout Schellaert, Joshua Tenenbaum, Lucy Cheke, Tomer Ullman, José Hernández-Orallo, Danaja Rutar, John Burden and Ryan Burnell.