Presented by: Aida Nematzadeh from DeepMind
Duration: 2 hours
On: 06 Apr, 2022
Location: Gothenburg and Online
Abstract: There has been increased interest in developing general-purpose foundation models across different domains, such as language, vision, and multimodal learning. The appeal of this approach is pre-training models on large datasets once, and then adapting them to various tasks using smaller supervised datasets. Moreover, these models achieve impressive results on a range of benchmarks, often performing better than task-specific models. In this talk, I will argue that we need better evaluation pipelines to better understand the shortcomings and strengths of pre-trained models. In particular, I will talk about: (1) the necessity of directly measuring real-world performance (as opposed to relying on benchmark performance), (2) the importance of strong baselines, and (3) how to design probing datasets to measure certain capabilities of our models. I will focus on commonsense reasoning and verb understanding as two challenging domains for our existing pre-trained models.