
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Authors :
Srivastava, Aarohi
Rastogi, Abhinav
Rao, Abhishek
Shoeb, Abu Awal Md
Abid, Abubakar
Fisch, Adam
Brown, Adam R.
Santoro, Adam
Gupta, Aditya
Garriga-Alonso, Adrià
Kluska, Agnieszka
Lewkowycz, Aitor
Agarwal, Akshat
Power, Alethea
Ray, Alex
Warstadt, Alex
Kocurek, Alexander W.
Safaya, Ali
Tazarv, Ali
Xiang, Alice
Parrish, Alicia
Nie, Allen
Hussain, Aman
Askell, Amanda
Dsouza, Amanda
Slone, Ambrose
Rahane, Ameet
Iyer, Anantharaman S.
Andreassen, Anders
Madotto, Andrea
Santilli, Andrea
Stuhlmüller, Andreas
Dai, Andrew
La, Andrew
Lampinen, Andrew
Zou, Andy
Jiang, Angela
Chen, Angelica
Vuong, Anh
Gupta, Animesh
Gottardi, Anna
Norelli, Antonio
Venkatesh, Anu
Gholamidavoodi, Arash
Tabassum, Arfa
Menezes, Arul
Kirubarajan, Arun
Mullokandov, Asher
Sabharwal, Ashish
Herrick, Austin
Efrat, Avia
Erdem, Aykut
Karakaş, Ayla
Roberts, B. Ryan
Loe, Bao Sheng
Zoph, Barret
Bojanowski, Bartłomiej
Özyurt, Batuhan
Hedayatnia, Behnam
Neyshabur, Behnam
Inden, Benjamin
Stein, Benno
Ekmekci, Berk
Lin, Bill Yuchen
Howald, Blake
Orinion, Bryan
Diao, Cameron
Dour, Cameron
Stinson, Catherine
Argueta, Cedrick
Ramírez, César Ferri
Singh, Chandan
Rathkopf, Charles
Meng, Chenlin
Baral, Chitta
Wu, Chiyu
Callison-Burch, Chris
Waites, Chris
Voigt, Christian
Manning, Christopher D.
Potts, Christopher
Ramirez, Cindy
Rivera, Clara E.
Siro, Clemencia
Raffel, Colin
Ashcraft, Courtney
Garbacea, Cristina
Sileo, Damien
Garrette, Dan
Hendrycks, Dan
Kilman, Dan
Roth, Dan
Freeman, Daniel
Khashabi, Daniel
Levy, Daniel
González, Daniel Moseguí
Perszyk, Danielle
Hernandez, Danny
Chen, Danqi
Ippolito, Daphne
Gilboa, Dar
Dohan, David
Drakard, David
Jurgens, David
Datta, Debajyoti
Ganguli, Deep
Emelin, Denis
Kleyko, Denis
Yuret, Deniz
Chen, Derek
Tam, Derek
Hupkes, Dieuwke
Misra, Diganta
Buzan, Dilyar
Mollo, Dimitri Coelho
Yang, Diyi
Lee, Dong-Ho
Schrader, Dylan
Shutova, Ekaterina
Cubuk, Ekin Dogus
Segal, Elad
Hagerman, Eleanor
Barnes, Elizabeth
Donoway, Elizabeth
Pavlick, Ellie
Rodola, Emanuele
Lam, Emma
Chu, Eric
Tang, Eric
Erdem, Erkut
Chang, Ernie
Chi, Ethan A.
Dyer, Ethan
Jerzak, Ethan
Kim, Ethan
Manyasi, Eunice Engefu
Zheltonozhskii, Evgenii
Xia, Fanyue
Siar, Fatemeh
Martínez-Plumed, Fernando
Happé, Francesca
Chollet, Francois
Rong, Frieda
Mishra, Gaurav
Winata, Genta Indra
de Melo, Gerard
Kruszewski, Germán
Parascandolo, Giambattista
Mariani, Giorgio
Wang, Gloria
Jaimovitch-López, Gonzalo
Betz, Gregor
Gur-Ari, Guy
Galijasevic, Hana
Kim, Hannah
Rashkin, Hannah
Hajishirzi, Hannaneh
Mehta, Harsh
Bogar, Hayden
Shevlin, Henry
Schütze, Hinrich
Yakura, Hiromu
Zhang, Hongming
Wong, Hugh Mee
Ng, Ian
Noble, Isaac
Jumelet, Jaap
Geissinger, Jack
Kernion, Jackson
Hilton, Jacob
Lee, Jaehoon
Fisac, Jaime Fernández
Simon, James B.
Koppel, James
Zheng, James
Zou, James
Kocoń, Jan
Thompson, Jana
Wingfield, Janelle
Kaplan, Jared
Radom, Jarema
Sohl-Dickstein, Jascha
Phang, Jason
Wei, Jason
Yosinski, Jason
Novikova, Jekaterina
Bosscher, Jelle
Marsh, Jennifer
Kim, Jeremy
Taal, Jeroen
Engel, Jesse
Alabi, Jesujoba
Xu, Jiacheng
Song, Jiaming
Tang, Jillian
Waweru, Joan
Burden, John
Miller, John
Balis, John U.
Batchelder, Jonathan
Berant, Jonathan
Frohberg, Jörg
Rozen, Jos
Hernandez-Orallo, Jose
Boudeman, Joseph
Guerr, Joseph
Jones, Joseph
Tenenbaum, Joshua B.
Rule, Joshua S.
Chua, Joyce
Kanclerz, Kamil
Livescu, Karen
Krauth, Karl
Gopalakrishnan, Karthik
Ignatyeva, Katerina
Markert, Katja
Dhole, Kaustubh D.
Gimpel, Kevin
Omondi, Kevin
Mathewson, Kory
Chiafullo, Kristen
Shkaruta, Ksenia
Shridhar, Kumar
McDonell, Kyle
Richardson, Kyle
Reynolds, Laria
Gao, Leo
Zhang, Li
Dugan, Liam
Qin, Lianhui
Contreras-Ochando, Lidia
Morency, Louis-Philippe
Moschella, Luca
Lam, Lucas
Noble, Lucy
Schmidt, Ludwig
He, Luheng
Colón, Luis Oliveros
Metz, Luke
Şenel, Lütfi Kerem
Bosma, Maarten
Sap, Maarten
ter Hoeve, Maartje
Farooqi, Maheen
Faruqui, Manaal
Mazeika, Mantas
Baturan, Marco
Marelli, Marco
Maru, Marco
Quintana, Maria Jose Ramírez
Tolkiehn, Marie
Giulianelli, Mario
Lewis, Martha
Potthast, Martin
Leavitt, Matthew L.
Hagen, Matthias
Schubert, Mátyás
Baitemirova, Medina Orduna
Arnaud, Melody
McElrath, Melvin
Yee, Michael A.
Cohen, Michael
Gu, Michael
Ivanitskiy, Michael
Starritt, Michael
Strube, Michael
Swędrowski, Michał
Bevilacqua, Michele
Yasunaga, Michihiro
Kale, Mihir
Cain, Mike
Xu, Mimee
Suzgun, Mirac
Walker, Mitch
Tiwari, Mo
Bansal, Mohit
Aminnaseri, Moin
Geva, Mor
Gheini, Mozhdeh
T, Mukund Varma
Peng, Nanyun
Chi, Nathan A.
Lee, Nayeon
Krakover, Neta Gur-Ari
Cameron, Nicholas
Roberts, Nicholas
Doiron, Nick
Martinez, Nicole
Nangia, Nikita
Deckers, Niklas
Muennighoff, Niklas
Keskar, Nitish Shirish
Iyer, Niveditha S.
Constant, Noah
Fiedel, Noah
Wen, Nuan
Zhang, Oliver
Agha, Omar
Elbaghdadi, Omar
Levy, Omer
Evans, Owain
Casares, Pablo Antonio Moreno
Doshi, Parth
Fung, Pascale
Liang, Paul Pu
Vicol, Paul
Alipoormolabashi, Pegah
Liao, Peiyuan
Liang, Percy
Chang, Peter
Eckersley, Peter
Htut, Phu Mon
Hwang, Pinyu
Miłkowski, Piotr
Patil, Piyush
Pezeshkpour, Pouya
Oli, Priti
Mei, Qiaozhu
Lyu, Qing
Chen, Qinlang
Banjade, Rabin
Rudolph, Rachel Etta
Gabriel, Raefer
Habacker, Rahel
Risco, Ramon
Millière, Raphaël
Garg, Rhythm
Barnes, Richard
Saurous, Rif A.
Arakawa, Riku
Raymaekers, Robbe
Frank, Robert
Sikand, Rohan
Novak, Roman
Sitelew, Roman
LeBras, Ronan
Liu, Rosanne
Jacobs, Rowan
Zhang, Rui
Salakhutdinov, Ruslan
Chi, Ryan
Lee, Ryan
Stovall, Ryan
Teehan, Ryan
Yang, Rylan
Singh, Sahib
Mohammad, Saif M.
Anand, Sajant
Dillavou, Sam
Shleifer, Sam
Wiseman, Sam
Gruetter, Samuel
Bowman, Samuel R.
Schoenholz, Samuel S.
Han, Sanghyun
Kwatra, Sanjeev
Rous, Sarah A.
Ghazarian, Sarik
Ghosh, Sayan
Casey, Sean
Bischoff, Sebastian
Gehrmann, Sebastian
Schuster, Sebastian
Sadeghi, Sepideh
Hamdan, Shadi
Zhou, Sharon
Srivastava, Shashank
Shi, Sherry
Singh, Shikhar
Asaadi, Shima
Gu, Shixiang Shane
Pachchigar, Shubh
Toshniwal, Shubham
Upadhyay, Shyam
Debnath, Shyamolima
Shakeri, Siamak
Thormeyer, Simon
Melzi, Simone
Reddy, Siva
Makini, Sneha Priscilla
Lee, Soo-Hwan
Torene, Spencer
Hatwar, Sriharsha
Dehaene, Stanislas
Divic, Stefan
Ermon, Stefano
Biderman, Stella
Lin, Stephanie
Prasad, Stephen
Piantadosi, Steven T.
Shieber, Stuart M.
Misherghi, Summer
Kiritchenko, Svetlana
Mishra, Swaroop
Linzen, Tal
Schuster, Tal
Li, Tao
Yu, Tao
Ali, Tariq
Hashimoto, Tatsu
Wu, Te-Lin
Desbordes, Théo
Rothschild, Theodore
Phan, Thomas
Wang, Tianle
Nkinyili, Tiberius
Schick, Timo
Kornev, Timofei
Tunduny, Titus
Gerstenberg, Tobias
Chang, Trenton
Neeraj, Trishala
Khot, Tushar
Shultz, Tyler
Shaham, Uri
Misra, Vedant
Demberg, Vera
Nyamai, Victoria
Raunak, Vikas
Ramasesh, Vinay
Prabhu, Vinay Uday
Padmakumar, Vishakh
Srikumar, Vivek
Fedus, William
Saunders, William
Zhang, William
Vossen, Wout
Ren, Xiang
Tong, Xiaoyu
Zhao, Xinran
Wu, Xinyi
Shen, Xudong
Yaghoobzadeh, Yadollah
Lakretz, Yair
Song, Yangqiu
Bahri, Yasaman
Choi, Yejin
Yang, Yichi
Hao, Yiding
Chen, Yifu
Belinkov, Yonatan
Hou, Yu
Hou, Yufang
Bai, Yuntao
Seid, Zachary
Zhao, Zhuoye
Wang, Zijian
Wang, Zijie J.
Wang, Zirui
Wu, Ziyi
Source :
Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj
Publication Year :
2022

Abstract

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

Comment: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench
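The record points to the BIG-bench repository (https://github.com/google/BIG-bench), where most tasks are distributed as JSON files of prompt/target examples. The Python sketch below illustrates one way such a task could be scored with exact-match accuracy; it is a minimal sketch, and the field names ("examples", "input", "target") and the `query_model` callable are assumptions for illustration rather than the repository's actual evaluation API.

```python
"""Minimal sketch: exact-match scoring of one BIG-bench-style JSON task.

Assumptions (not taken from the paper): the task file has an "examples"
list whose entries carry an "input" prompt and a "target" string (or list
of acceptable strings); `query_model` is a hypothetical stand-in for the
language model under evaluation.
"""
import json
from typing import Callable


def exact_match_score(task_path: str, query_model: Callable[[str], str]) -> float:
    """Return the fraction of examples whose model output matches a target."""
    with open(task_path) as f:
        task = json.load(f)

    examples = task["examples"]  # assumed field name in the JSON task layout
    hits = 0
    for ex in examples:
        prediction = query_model(ex["input"]).strip()
        target = ex["target"]
        # Some tasks allow several acceptable targets; accept any of them.
        targets = target if isinstance(target, list) else [target]
        hits += any(prediction == t.strip() for t in targets)
    return hits / len(examples)


if __name__ == "__main__":
    # Trivial "model" that always answers "42", just to exercise the loop.
    print(exact_match_score("task.json", lambda prompt: "42"))
```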

Details

Database :
arXiv
Journal :
Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj
Publication Type :
Report
Accession number :
edsarx.2206.04615
Document Type :
Working Paper