Encoding Multi-Domain Scientific Papers by Ensembling Multiple CLS Tokens
- Publication Year :
- 2023
Abstract
- Many useful tasks on scientific documents, such as topic classification and citation prediction, involve corpora that span multiple scientific domains. Typically, such tasks are accomplished by representing the text with a vector embedding obtained from a Transformer's single CLS token. In this paper, we argue that using multiple CLS tokens could help a Transformer specialize better to multiple scientific domains. We present Multi2SPE, which encourages each of multiple CLS tokens to learn a different way of aggregating token embeddings, then sums them to create a single vector representation. We also propose a new multi-domain benchmark, Multi-SciDocs, to test scientific paper vector encoders in multi-domain settings. We show that Multi2SPE reduces error by up to 25 percent in multi-domain citation prediction, while requiring only a negligible amount of computation in addition to one BERT forward pass.
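
The aggregation idea in the abstract (several CLS tokens, each pooling the token embeddings in its own way, then summed into one vector) can be illustrated with a toy sketch. This is not the paper's implementation: the query vectors standing in for CLS tokens, the dot-product attention pooling, and the function names `attention_pool` and `encode_with_multiple_cls` are all assumptions made for illustration, and the diversity-encouraging training objective is omitted entirely.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(query, tokens):
    """Aggregate token embeddings with weights from one CLS-like query (hypothetical helper)."""
    dim = len(query)
    # Scaled dot-product score between the query and each token embedding.
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim) for tok in tokens]
    weights = softmax(scores)
    # Weighted average of the token embeddings, one value per dimension.
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


def encode_with_multiple_cls(queries, tokens):
    """Pool once per query, then sum the pooled vectors into a single embedding."""
    pooled = [attention_pool(q, tokens) for q in queries]
    return [sum(vec[d] for vec in pooled) for d in range(len(tokens[0]))]


# Toy data: random token embeddings and random stand-ins for learned CLS queries.
rng = random.Random(0)
dim, num_tokens, num_cls = 8, 5, 3
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_cls)]

embedding = encode_with_multiple_cls(queries, tokens)
print(len(embedding))  # one vector with the same dimensionality as a token embedding
```

In this sketch the output has the same size as a single-CLS embedding no matter how many queries are used, which is consistent with the abstract's point that the extra cost beyond one encoder forward pass is small.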
Details
- Database :
- arXiv
- Publication Type :
- Report
- Accession number :
- edsarx.2309.04333
- Document Type :
- Working Paper