1. CWE Pattern Identification using Semantical Clustering of Programming Language Keywords
- Author
- Traian Rebedea, Stefan Trausan-Matu, and Sergiu Zaharia
- Subjects
Source code, Syntax (programming languages), Programming language, Computer science, Identification (information), Application security, Code (cryptography), Cluster analysis, Word (computer architecture), Agile software development
- Abstract
Applications are one of the most commonly used attack surfaces, so they must be secured at the source code level, early in the development phase. The programming culture that developers inherit preserves code-writing patterns within large organizations and developer communities, which opens an opportunity to use solutions complementary to SAST (Static Application Security Testing) to identify insecure code early in development. We propose an Intermediate Representation that is strict enough to preserve the security vulnerability patterns defined by MITRE in the Common Weakness Enumeration (CWE), yet agile enough not to depend strongly on the lexical and syntactic structure of the programming language, following instead the way programmers write code. The current research phase applies semantical clustering, based on Word Embeddings, to the instructions (keywords) found in C/C++ programs; the clustered keywords are carried through the resulting numerical Intermediate Representation to various classifiers that detect security vulnerability patterns. We show that security patterns are well preserved despite the generalization of keywords through semantical clustering. This opens an opportunity for innovation in identifying security vulnerability patterns in a way that depends more on programmers' code-writing behavior than on language-specific structure.
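The idea in the abstract, replacing each keyword with the identifier of its semantic cluster so that similar code maps to the same numerical sequence, can be illustrated with a minimal sketch. This is not the authors' implementation: the two-dimensional vectors in `TOY_EMBEDDINGS` are invented for illustration (real embeddings would be learned from a code corpus), the greedy cosine-similarity grouping stands in for whatever clustering method the paper uses, and the names `cluster_keywords` and `to_intermediate_representation` are hypothetical.

```python
import math
import re

# Hypothetical 2-D "embeddings" for a few C keywords and library calls.
# These values are made up; real vectors would be learned from source code.
TOY_EMBEDDINGS = {
    "strcpy": (0.90, 0.10),
    "strcat": (0.85, 0.15),
    "memcpy": (0.80, 0.20),
    "malloc": (0.10, 0.90),
    "calloc": (0.15, 0.85),
    "free": (0.20, 0.80),
    "if": (-0.90, 0.10),
    "while": (-0.80, 0.20),
    "for": (-0.85, 0.15),
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def cluster_keywords(embeddings, threshold=0.9):
    """Greedy clustering (an assumption, not the paper's method).

    A keyword joins the first cluster whose seed vector is at least
    `threshold` similar to it; otherwise it seeds a new cluster.
    """
    seeds = []        # seed vector of each cluster, indexed by cluster id
    cluster_of = {}   # keyword -> cluster id
    for word, vec in embeddings.items():
        for cluster_id, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                cluster_of[word] = cluster_id
                break
        else:
            seeds.append(vec)
            cluster_of[word] = len(seeds) - 1
    return cluster_of


def to_intermediate_representation(source, cluster_of):
    """Map known keywords to cluster ids; other identifiers are dropped."""
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    return [cluster_of[tok] for tok in tokens if tok in cluster_of]


clusters = cluster_keywords(TOY_EMBEDDINGS)
snippet_a = "if (n) { buf = malloc(n); strcpy(buf, src); }"
snippet_b = "while (n) { buf = calloc(n, 1); strcat(buf, src); }"
ir_a = to_intermediate_representation(snippet_a, clusters)
ir_b = to_intermediate_representation(snippet_b, clusters)
print(ir_a)  # [2, 1, 0]
print(ir_b)  # [2, 1, 0]
```

Although the two snippets use different keywords, they share one numerical representation, which is the kind of generalization the abstract describes; a downstream classifier would be trained on such sequences rather than on the raw tokens.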
- Published
- 2021