At Praetorian, we streamline routine and mundane tasks so we can focus on tackling novel problems. In our zeal for both, we have identified a peculiar challenge that has brought us to the cutting edge of Machine Learning (ML): detecting vulnerabilities in source code. Typically, identifying a vulnerability that transcends functions and classes within a project is a highly esoteric task. It requires a seasoned computer security engineer (CSE) to delineate a vulnerable function from all the safe ones in potentially millions of lines of code per project. Consider that popular software applications such as Google Chrome and the Android operating system have over 6.5 million and about 15 million lines of code, respectively. A CSE peruses every line in order to find vulnerabilities, and often needs multiple passes to ensure they have identified all the threats. We thought this highly effective but tedious process could be streamlined by the application of ML. However, the task is far more complicated than teaching a computer to find a needle in a haystack; rather, we must teach a computer to find a certain needle in a pile of a million needles.
This is where Praetorian’s ML team enters the scene. One of the first things we realized was that Machine Learning on Code (MLoC) is a unique problem class with some similarities to existing ML and deep learning approaches in Natural Language Processing (NLP), but with enough differences to demand innovation.
Looking at similarities, both natural language and code are sequences of tokens, and both convey a certain intent of the author. So it is only intuitive that current state-of-the-art, NLP-specific deep learning approaches such as Sequence-to-Sequence (S2S) models should work well on code. A number of S2S experiments have been reported to perform well on code, achieving decent results on tasks such as bug prediction, code completion, and automatic comment generation [1], [2], [3]. Interestingly, all of the mentioned applications treat code purely as a sequence of tokens. Vulnerability detection, however, is largely under-explored, and this was exactly the type of problem we hoped to tackle! From a vulnerability perspective, the standard S2S approaches do not span the distance or the depth needed to solve this problem. This is because a typical vulnerability can span several functions or even classes. For example, in a project with 1,500 lines of code, a tainted variable could first appear on line 71 but only become an actual issue on line 1,257. This is an extremely long sequence to keep in memory when using traditional S2S approaches. However, these approaches have established powerful representations of tokens, which can be useful in modeling code-specific attributes.
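To make that distance problem concrete, here is a small, purely hypothetical Python sketch (the function names and layout are our own invention, not taken from any real project): a tainted value enters the program in one function and only becomes dangerous when it reaches a SQL sink in another, with arbitrarily many lines in between.

```python
# Purely illustrative: a tainted value introduced in one function only becomes
# dangerous at a sink that may live hundreds of lines (or several files) away.
import sqlite3


def read_username(raw_request: dict) -> str:
    # "Line 71": attacker-controlled input enters the program here.
    return raw_request["username"]          # tainted, but harmless so far


def build_report_query(username: str) -> str:
    # Intermediate layers pass the value along; a pure token-sequence model
    # must keep this dependency "in memory" across everything in between.
    return f"SELECT * FROM reports WHERE owner = '{username}'"


def run_report(conn: sqlite3.Connection, raw_request: dict) -> list:
    # "Line 1257": the tainted value finally reaches a SQL sink, creating a
    # classic injection vulnerability far from where the taint originated.
    query = build_report_query(read_username(raw_request))
    return conn.execute(query).fetchall()   # vulnerable sink
```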
Another similarity between NLP data and code is that both can be expressed as trees: Context-Free Grammar (CFG)-based parse trees and Abstract Syntax Trees (ASTs), respectively. However, the edges in the two types of trees differ because they are bound by different grammars: CFG parse trees branch on linguistic features such as noun phrases, whereas ASTs are structural representations of code. So it is unlikely that any one particular graphical or tree-based modeling approach will be effective on both domains.
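As a quick illustration of the token-sequence view versus the tree view, here is a minimal sketch using Python's standard library (Python rather than Java purely for brevity); the snippet being parsed is a toy example of our own.

```python
# The same snippet seen as a flat token stream versus as an AST.
import ast
import io
import tokenize

snippet = "total = price * (1 + tax_rate)"

# Sequence view: the flat stream of tokens that S2S models consume.
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(snippet).readline)
          if tok.string.strip()]
print(tokens)
# ['total', '=', 'price', '*', '(', '1', '+', 'tax_rate', ')']

# Structural view: the AST makes the nesting of the expression explicit.
print(ast.dump(ast.parse(snippet), indent=2))
# Module -> Assign -> BinOp(price * BinOp(1 + tax_rate))
```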
As for differences, while both are sequences of tokens used to express a certain intent, natural language is bound by grammatical semantics, whereas code is bound by the logic and semantics of the programming language in which it is written. The true stumbling block for the S2S model, at least for our purposes, is this inherent difference between NLP data and code. NLP-specific approaches can predict that the definite article, “the,” will precede a noun or proper noun, or that the conjunction, “and,” will not follow an adjective. This predictability does not transfer to code, though, because a vulnerability in code does not equate to a discrete word in natural language. Rather, analyzing code requires an understanding of the layers, bespoke programming-language security features, and logic that affect the execution of a code segment. Continuing our previous example, the vulnerability introduced on line 71 may be mitigated entirely by logic resolved far later in the code, or it may be activated by a flaw inserted there.
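Continuing the hypothetical snippet from above, the same tainted value is either a real vulnerability or a non-issue depending entirely on how it is handled later; a model that has only seen the early lines cannot tell the two apart.

```python
# Hypothetical continuation: whether "line 71" is a real vulnerability depends
# entirely on logic that appears much later in the code.
import sqlite3


def fetch_reports_vulnerable(conn: sqlite3.Connection, username: str) -> list:
    # Later logic activates the flaw: the tainted value is spliced into SQL.
    return conn.execute(
        f"SELECT * FROM reports WHERE owner = '{username}'"
    ).fetchall()


def fetch_reports_mitigated(conn: sqlite3.Connection, username: str) -> list:
    # Later logic mitigates it entirely: the same tainted value is passed as a
    # bound parameter, so the early "vulnerability" never materializes.
    return conn.execute(
        "SELECT * FROM reports WHERE owner = ?", (username,)
    ).fetchall()
```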
Another major difference between NLP and code is the hierarchical dependency between different aspects of code. This dependency can take the form of control flow, data flow, or even a simple conditional relationship. In most programming languages, the order in which methods appear has no bearing on their execution and thus no bearing on detecting a vulnerability. The same cannot be said of ordinary text. Imagine starting Romeo & Juliet with “Come, bitter conduct, come, unsavory guide!” and only then introducing the characters with “Two households, both alike in dignity.” In Java, on the other hand, one can define methods in any order within a class and use them anywhere, in any sequence, either in the same class or in any other class that imports them. Of course, Java has some restrictions too, but they are governed by the logic of control flow. So, effectively, code is a fine interwoven mesh of inter-procedural calls, control-flow dependencies, and data-flow graphs, and all of these aspects are expressed as a sequence of tokens. Any modeling of code for vulnerabilities must therefore consider this fine network of interconnections. Consequently, to detect vulnerabilities, MLoC needs to leverage NLP-based S2S approaches along with approaches that model data flow, control flow, and hierarchical dependencies.
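One way to picture that mesh is as a graph whose nodes are functions or statements and whose edges carry different relation types; graph-based models of code typically consume something of this shape. The sketch below is a minimal illustration, not Praetorian's actual representation, and its node names refer back to the hypothetical snippet above.

```python
# A minimal sketch of code as a graph with several typed edges.
from collections import defaultdict

# Nodes are functions/statements from the earlier snippet; each edge carries a
# relation type (data flow, inter-procedural call, control flow).
edges = [
    ("read_username",      "build_report_query", "data_flow"),     # tainted value flows on
    ("build_report_query", "run_report",         "data_flow"),
    ("run_report",         "read_username",      "call"),          # inter-procedural calls
    ("run_report",         "build_report_query", "call"),
    ("run_report",         "execute_sink",       "control_flow"),  # reaches the SQL sink
]

# Group neighbors by relation type, roughly the form a graph model would consume.
graph = defaultdict(lambda: defaultdict(list))
for src, dst, relation in edges:
    graph[src][relation].append(dst)

for node, relations in graph.items():
    for relation, neighbors in relations.items():
        print(f"{node} --{relation}--> {neighbors}")
```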
Soon after realizing that our unique problem of MLoC was even more unique than we had anticipated, we embraced the fact that we would also need to develop innovative ways to train, test, and evaluate whatever models we developed. We are solving a novel problem, and it requires methods that go beyond typical image classification or NLP approaches. In fact, the critical challenge was the lack of a relevant dataset for supervised training of our modeling approaches. There are no Linguistic Data Consortium (LDC) corpora with a fully annotated dataset that would let us train an MLoC model in an hour and report precision, recall, and F-score. Nor are there existing GitHub repositories we could check out and get running right away. We needed a dataset that went beyond bug detection to capture true vulnerabilities, one rich enough to allow unsupervised pre-training and supervised fine-tuning. We are going beyond detecting accidental division by zero; we want instead to identify where a bad actor could surreptitiously creep into and exploit a system. Since that type of dataset did not exist, we created what we needed over the course of several weeks. Having solved our unique dataset sub-problem, we could move forward to other related questions — such as testing our model’s performance — and continue innovating.
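To give a flavor of why both kinds of training matter, here is a purely hypothetical record format for such a dataset (none of these field names or values come from Praetorian's actual data): the raw code alone supports unsupervised pre-training, while the label fields support supervised fine-tuning.

```python
# Hypothetical record format for a vulnerability dataset.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CodeSample:
    project: str                      # repository or project identifier
    path: str                         # file containing the snippet
    code: str                         # raw source, usable for pre-training
    # Fields below are only present in the labeled (fine-tuning) subset.
    vulnerable: Optional[bool] = None
    cwe_id: Optional[str] = None      # e.g. "CWE-89" for SQL injection
    trace: list[str] = field(default_factory=list)  # source-to-sink spans


example = CodeSample(
    project="demo-app",
    path="reports/dao.py",
    code="query = f\"SELECT * FROM reports WHERE owner = '{username}'\"",
    vulnerable=True,
    cwe_id="CWE-89",
    trace=["line 71 -> line 1257"],
)
```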
To this end, we are building mathematical models that use ground-breaking methods to capture the esoteric nature of detecting vulnerabilities in code. Furthermore, we have access to Praetorian’s world-class CSEs, who have worked closely with us to ensure our models capture the expertise and experience required for successful vulnerability detection. Praetorian, as a nimble start-up, has also granted the ML team the flexibility and freedom to identify and solve our MLoC problem and its corollaries. This latitude and unwavering support from our colleagues have enabled us to define a new realm of machine learning and trail-blaze the state of the art.
References:
[1] Feng, Z., et al., “CodeBERT: A Pre-Trained Model for Programming and Natural Languages,” CoRR, 2020.
[2] Kim, S., Zhao, J., Tian, Y., and Chandra, S., “Code Prediction by Feeding Trees to Transformers,” ACM, 2020.
[3] Alon, U., Brody, S., Levy, O., and Yahav, E., “code2seq: Generating Sequences from Structured Representations of Code,” International Conference on Learning Representations (ICLR), 2019.