1. A Bigram-based Inference Model for Retrieving Abbreviated Phrases in Source Code
- Author
- Dianxiang Xu, Abdulrahman Alatawi, and Weifeng Xu
- Subjects
Source code; Phrase; business.industry; Computer science; media_common.quotation_subject; Bigram; InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL; Inference; 020207 software engineering; 02 engineering and technology; Software maintenance; computer.software_genre; ComputingMethodologies_PATTERNRECOGNITION; Software; 020204 information systems; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; 0202 electrical engineering, electronic engineering, information engineering; Artificial intelligence; Language model; business; computer; Natural language processing; Word (computer architecture); media_common
- Abstract
Expanding abbreviations in source code to their full meanings helps software maintainers comprehend the code. Existing approaches, however, focus on expanding an abbreviation to a single word, i.e., a unigram, and do not perform well on abbreviations of phrases that consist of multiple unigrams. This paper proposes a bigram-based approach for retrieving abbreviated phrases automatically. Key to this approach is a bigram-based inference model for choosing the best phrase from all candidates. It uses the statistical properties of unigrams and bigrams as prior knowledge and a bigram language model to estimate the likelihood of each candidate phrase for a given abbreviation. We applied the bigram-based approach to 100 phrase abbreviations randomly selected from eight open source projects. The experimental results show that it correctly retrieved 78% of the abbreviations by using the unigram and bigram properties of a source code repository. This is 9% more accurate than the unigram-based approach and much better than other existing approaches. The bigram-based approach is also less biased towards specific phrase sizes than the unigram-based approach.
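The general idea of scoring candidate phrases with a bigram language model can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy corpus, the candidate rule (each piece of the abbreviation must be an in-order subsequence of a word and share its first letter), and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical toy corpus: phrases already split into words.
TOY_CORPUS = [["file", "name"]] * 4 + [["file", "size"], ["first", "node"], ["font", "name"]]

def train_counts(phrases):
    """Count unigrams and adjacent-word bigrams over tokenized phrases."""
    unigrams, bigrams = Counter(), Counter()
    for words in phrases:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def is_subsequence(piece, word):
    """Assumed matching rule: same first letter, letters of piece appear in order in word."""
    if not piece or not word or piece[0] != word[0]:
        return False
    letters = iter(word)
    return all(ch in letters for ch in piece)

def splits(abbr, max_parts):
    """Yield every way to cut abbr into 1..max_parts contiguous, non-empty pieces."""
    if max_parts >= 1:
        yield (abbr,)
    if max_parts > 1:
        for i in range(1, len(abbr)):
            for rest in splits(abbr[i:], max_parts - 1):
                yield (abbr[:i],) + rest

def log_prob(phrase, unigrams, bigrams):
    """Bigram language model score with add-one smoothing (an assumption of this sketch)."""
    total, vocab_size = sum(unigrams.values()), len(unigrams)
    score = math.log((unigrams[phrase[0]] + 1) / (total + vocab_size))
    for prev, word in zip(phrase, phrase[1:]):
        score += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return score

def expand(abbr, unigrams, bigrams, max_parts=3):
    """Return the highest-scoring candidate phrase as a tuple, or None if there is none."""
    vocab = list(unigrams)
    best, best_score = None, -math.inf
    for parts in splits(abbr, max_parts):
        options = [[w for w in vocab if is_subsequence(p, w)] for p in parts]
        for phrase in product(*options):  # yields nothing if any piece has no match
            score = log_prob(phrase, unigrams, bigrams)
            if score > best_score:
                best, best_score = phrase, score
    return best

unigrams, bigrams = train_counts(TOY_CORPUS)
print(expand("fn", unigrams, bigrams))  # ('file', 'name')
```

On this toy corpus, "fn" resolves to "file name" rather than the single word "font" because the bigram (file, name) is frequent. A raw product of probabilities like this one tends to favor shorter candidates, so a practical model would need some way to balance phrase sizes, which this sketch does not attempt.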
- Published
- 2020