Automated Data-Processing Function Identification Using Deep Neural Network
- Source :
- IEEE Access, Vol 8, Pp 55411-55423 (2020)
- Publication Year :
- 2020
- Publisher :
- IEEE, 2020.
Abstract
- The number of software vulnerabilities is increasing year by year. In the era of big data, data-processing software with many users attracts more attention from attackers, so it is essential to improve the efficiency of discovering vulnerabilities in such software. We noticed that several problems of existing vulnerability-discovery techniques, such as fuzzing, symbolic execution, and taint analysis, are related to some degree to data-processing functions. In fuzzing, there are two types of sanity checks on the target program: NCC (non-critical check) and CC (critical check). Such sanity checks are usually hard to bypass, which leads to low code coverage during fuzzing. In symbolic execution, constraint solvers still struggle with the constraints produced by complex algorithms. In taint analysis, over-tainting and under-tainting are the key factors affecting the accuracy of the results. To address these problems, it is necessary to identify data-processing functions. Once they are identified, we can locate the sanity checks, ease the solving of complex constraints, and understand how taint propagates, all of which assist software vulnerability discovery and analysis. This paper proposed a method called DPFI (data-processing function identification) for identifying data-processing functions with deep neural networks. We collected 37,000 functions from GitHub and implemented the method on this data set with several neural networks, among which the CNN performed best, with an $F_{1}$-score of 0.90. We then applied the trained model to CGC (Cyber Grand Challenge) data and real-world software for testing. For CGC, we obtained 448 functions from 20 programs, of which 35 were identified as data-processing functions. For real-world software such as FFmpeg, 7zip, and jpeg, the precision reached 0.90 and the $F_{1}$-score was above 0.87 in every case.
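The abstract reports precision and $F_{1}$-score but not recall. As a reading aid only, the sketch below shows how the three quantities relate through the standard definition $F_{1} = 2PR/(P+R)$. The function names are hypothetical, and the sample inputs (0.90 and 0.87) are simply the rounded figures quoted in the abstract; any recall derived from them is illustrative, not a result reported by the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recall_from(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1.

    Hypothetical helper for illustration; it is not part of DPFI.
    """
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("no valid recall for this precision/F1 pair")
    return f1 * precision / denom


if __name__ == "__main__":
    # Rounded figures from the abstract, used as illustrative inputs.
    r = recall_from(precision=0.90, f1=0.87)
    print(f"implied recall: {r:.3f}")
    print(f"round-trip F1:  {f1_score(0.90, r):.3f}")
```

Because the published figures are rounded and stated as bounds ("reached", "above"), the implied recall should be read as a rough indication rather than an exact value.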
- Subjects :
- Source code
General Computer Science
Data-processing
Computer science
Automated data processing
vulnerability
Code coverage
Symbolic execution
Machine learning
Taint checking
Software
function identification
General Materials Science
Vulnerability (computing)
General Engineering
deep neural network
Fuzz testing
Identification (information)
Artificial intelligence
lcsh:Electrical engineering. Electronics. Nuclear engineering
lcsh:TK1-9971
Details
- Language :
- English
- ISSN :
- 2169-3536
- Volume :
- 8
- Database :
- OpenAIRE
- Journal :
- IEEE Access
- Accession number :
- edsair.doi.dedup.....09e5ab3055bbe730bc11ecf2032f6152