📨 wschell@vrain.upv.es 📜 Google Scholar 🧑‍💻 GitHub 🟢 ORCiD 🗺️ Valencian Research Institute for Artificial Intelligence (VRAIN) and 🗺️ Leverhulme Centre for the Future of Intelligence, Cambridge University (LCFI)

Supervised by: 🧙🏼‍♂️ José Hernández-Orallo and 🧙🏼 Fernando Martínez-Plumed

Interests

My PhD research focuses on modelling AI evaluation as a prediction problem.

In general, I am interested in AI evaluation and everything related to it: testing, auditing, metrics, environment and benchmark design, capability measurement, etc.

Other concepts that spark my imagination include causality, embodied cognition, web agents, knowledge representation, grounding, philosophy of cognition, and artificial life.

Most things really.

Papers

Highlights marked by ⭐.

2024 Larger and more instructable language models become less reliable
Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yael Moros-Daval, Cèsar Ferri, José Hernández-Orallo
Nature [paper]
Notes

Something easily missed is that the tasks get difficult quickly. Locality is just an obnoxiously difficult task. Just look at the human expectations of failure on the x-axis: for the easiest bin it is already 91%.

For addition, one of the easiest instances that GPT-4 v2 fails on is "12191 + 3187383000 =", which is in difficulty bin 1. This is quite easy, because there are no carry operations, but it looks scary and, especially with tokenisation, it's a challenge for LLMs.

There are actually no instances with the simplicity of "100 + 150" where the latest models fail. There probably are such instances, but they were not included in our dataset. We should have been more explicit about this in the paper.

On the other hand, confirming our notion of difficulty, GPT-4 v2 also just fails on "Make the addition of 984733680321971 and 0."... There is a long way to go.

2024 Investigating Object Permanence in Deep Reinforcement Learning Agents
Konstantinos Voudouris, Jason Darwin Liu, Natasza Siwinska, Wout Schellaert, Lucy G. Cheke
COGSCI [paper]
2024 Scaling Behavior of Large Language Models
Wout Schellaert, Ronan Hamon, Fernando Martínez-Plumed, José Hernández-Orallo
SCALE-LLM [paper]
2023 Predictable Artificial Intelligence
Lexin Zhou, Pablo A. Moreno-Casares, Fernando Martínez-Plumed, John Burden, Ryan Burnell, Lucy Cheke, Cèsar Ferri, Alexandru Marcoci, Behzad Mehrbakhsh, Yael Moros-Daval, Seán Ó hÉigeartaigh, Danaja Rutar, Wout Schellaert, Konstantinos Voudouris, José Hernández-Orallo
arXiv [paper]
2023 Animal-AI 3: What's New & Why You Should Care
Konstantinos Voudouris, Ibrahim Alhas, Wout Schellaert, Matthew Crosby, Joel Holmes, John Burden, Niharika Chaubey, Niall Donnelly, Matishalin Patel, Marta Halina, José Hernández-Orallo, Lucy G. Cheke
arXiv [paper]
2023 Rethink Reporting of Evaluation Results in AI
Ryan Burnell, Wout Schellaert, John Burden, Tomer D. Ullman, Fernando Martínez-Plumed, Joshua B. Tenenbaum, Danaja Rutar, Lucy G. Cheke, Jascha Sohl-Dickstein, Melanie Mitchell, Douwe Kiela, Murray Shanahan, Ellen M. Voorhees, Anthony G. Cohn, Joel Z. Leibo, José Hernández-Orallo
Science [paper, preprint]
2023 Your Prompt is My Command: On Assessing the Human-Centred Generality of Multimodal Models
Wout Schellaert, Fernando Martínez-Plumed, Karina Vold, John Burden, Pablo A. M. Casares, Bao Sheng Loe, Roi Reichart, Sean Ó hÉigeartaigh, Anna Korhonen, José Hernández-Orallo
JAIR: AI and Society [paper]
2022 Reject Before You Run: Granular Performance Prediction for Big Language Models with Small External Assessors
Lexin Zhou, Fernando Martínez-Plumed, José Hernández-Orallo, Cèsar Ferri, Wout Schellaert
Workshop on Evaluation Beyond Metrics at IJCAI 2022 [paper, workshop]
2022 Training on the Test Set: Mapping the System-Problem Space in AI
José Hernández-Orallo*, Wout Schellaert*, Fernando Martínez-Plumed* (*equal contribution)
Blue Sky Idea Award 🏆
AAAI 2022 [paper, award]

Other

2023
Co-organising the “Predictable AI” kick-off event in Valencia

A singular event consisting of invited talks, panels and short lightning talks. It discussed “Predictable AI Futures” dealing with topics such as scaling laws, control, liability and future risks; as well as “Predictable AI Systems”, covering cognitive and robust evaluation, assessors, co-operative conditions, uncertainty estimation, and much more. (site)

Committee: José Hernández-Orallo, Ana Cidad and many others from ValGRAI in Valencia and the LCFI and CSER in Cambridge.

2022
Co-organising the “Evaluation Beyond Metrics” workshop at IJCAI 2022

A workshop that aims to challenge the widespread approach of evaluating intelligent systems with aggregated metrics over a benchmark or distribution of tasks. (site)

Committee: Wout Schellaert, Joshua Tenenbaum, Lucy Cheke, Tomer Ullman, José Hernández-Orallo, Danaja Rutar, John Burden and Ryan Burnell.