
First-Generation Inference Accelerator Deployment at Facebook

Authors:
Anderson, Michael
Chen, Benny
Chen, Stephen
Deng, Summer
Fix, Jordan
Gschwind, Michael
Kalaiah, Aravind
Kim, Changkyu
Lee, Jaewon
Liang, Jason
Liu, Haixin
Lu, Yinghai
Montgomery, Jack
Moorthy, Arun
Nadathur, Satish
Naghshineh, Sam
Nayak, Avinash
Park, Jongsoo
Petersen, Chris
Schatz, Martin
Sundaram, Narayanan
Tang, Bangsheng
Tang, Peter
Yang, Amy
Yu, Jiecao
Yuen, Hector
Zhang, Ying
Anbudurai, Aravind
Balan, Vandana
Bojja, Harsha
Boyd, Joe
Breitbach, Matthew
Caldato, Claudio
Calvo, Anna
Catron, Garret
Chandwani, Sneh
Christeas, Panos
Cottel, Brad
Coutinho, Brian
Dalli, Arun
Dhanotia, Abhishek
Duncan, Oniel
Dzhabarov, Roman
Elmir, Simon
Fu, Chunli
Fu, Wenyin
Fulthorp, Michael
Gangidi, Adi
Gibson, Nick
Gordon, Sean
Hernandez, Beatriz Padilla
Ho, Daniel
Huang, Yu-Cheng
Johansson, Olof
Juluri, Shishir
Kanaujia, Shobhit
Kesarkar, Manali
Killinger, Jonathan
Kim, Ben
Kulkarni, Rohan
Lele, Meghan
Li, Huayu
Li, Huamin
Li, Yueming
Liu, Cynthia
Liu, Jerry
Maher, Bert
Mallipedi, Chandra
Mangla, Seema
Matam, Kiran Kumar
Mehta, Jubin
Mehta, Shobhit
Mitchell, Christopher
Muthiah, Bharath
Nagarkatte, Nitin
Narasimha, Ashwin
Nguyen, Bernard
Ortiz, Thiara
Padmanabha, Soumya
Pan, Deng
Poojary, Ashwin
Qi, Ye
Raginel, Olivier
Rajagopal, Dwarak
Rice, Tristan
Ross, Craig
Rotem, Nadav
Russ, Scott
Shah, Kushal
Shan, Baohua
Shen, Hao
Shetty, Pavan
Skandakumaran, Krish
Srinivasan, Kutta
Sumbaly, Roshan
Tauberg, Michael
Tzur, Mor
Verma, Sidharth
Wang, Hao
Wang, Man
Wei, Ben
Xia, Alex
Xu, Chenyu
Yang, Martin
Zhang, Kai
Zhang, Ruoxi
Zhao, Ming
Zhao, Whitney
Zhu, Rui
Mathews, Ajit
Qiao, Lin
Smelyanskiy, Misha
Jia, Bill
Rao, Vijay
Publication Year:
2021

Abstract

In this paper, we provide a deep dive into the deployment of inference accelerators at Facebook. Many of our ML workloads have unique characteristics, such as sparse memory accesses, large model sizes, and high compute, memory, and network bandwidth requirements. We co-designed a high-performance, energy-efficient inference accelerator platform based on these requirements. We describe the inference accelerator platform ecosystem we developed and deployed at Facebook: both the hardware, through the Open Compute Project (OCP), and the software framework and tooling, through PyTorch/Caffe2/Glow. A defining characteristic of this ecosystem from the start has been its openness to AI accelerators from a variety of vendors. This platform, with six low-power accelerator cards alongside a single-socket host CPU, allows us to serve models of high complexity that cannot be easily or efficiently run on CPUs. We describe various performance optimizations, at both the platform and accelerator level, that enable this platform to serve production traffic at Facebook. We also share deployment challenges and lessons learned during performance optimization, and provide guidance for future inference hardware co-design.
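To make the workload profile described in the abstract concrete, below is a minimal, illustrative PyTorch sketch (not code from the paper) of a recommendation-style model that pairs sparse embedding lookups with a dense MLP. All table sizes, layer widths, and names here are hypothetical; the point is the mix of memory-bandwidth-bound sparse access and compute-bound dense layers that the abstract cites as the accelerator's design requirements.

    import torch
    import torch.nn as nn

    class SparseDenseModel(nn.Module):
        """Illustrative recommendation-style model (hypothetical sizes):
        pooled sparse embedding lookups (irregular, memory-bandwidth-bound)
        feeding a dense MLP (compute-bound)."""
        def __init__(self, num_embeddings=1_000_000, dim=64):
            super().__init__()
            # Sparse side: pooled lookups over a large embedding table
            # produce the sparse memory access pattern noted in the abstract.
            self.table = nn.EmbeddingBag(num_embeddings, dim, mode="sum")
            # Dense side: MLP over concatenated dense and pooled features.
            self.mlp = nn.Sequential(
                nn.Linear(dim + 16, 128), nn.ReLU(),
                nn.Linear(128, 1),
            )

        def forward(self, dense_feats, ids, offsets):
            pooled = self.table(ids, offsets)           # sparse lookup + pooling
            x = torch.cat([dense_feats, pooled], dim=1)
            return torch.sigmoid(self.mlp(x))

    model = SparseDenseModel()
    dense = torch.randn(2, 16)                          # batch of 2 dense feature rows
    ids = torch.tensor([10, 20, 30, 40, 50])            # flattened sparse feature ids
    offsets = torch.tensor([0, 2])                      # per-sample boundaries in ids
    print(model(dense, ids, offsets).shape)             # torch.Size([2, 1])

In a deployment like the one the paper describes, the heavy portions of such a graph would be compiled and partitioned across the accelerator cards via the Glow toolchain, while the host CPU handles orchestration.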

Details

Database:
arXiv
Publication Type:
Report
Accession Number:
edsarx.2107.04140
Document Type:
Working Paper