What is Joern?
Joern is a static analysis tool for C / C++ code. It builds a graph that models the syntax of the code using its fuzzy parser, which lets Joern parse code that is not necessarily in a working state (i.e., the code does not have to compile). The graph carries multiple useful properties that allow users to define meaningful traversals. These traversals can be used to identify potentially vulnerable code with a low false-positive rate.
Joern is easy to set up, and importing code into it is straightforward. The graph traversals, which are written using a graph database query language called Gremlin, are simple to write and easy to understand.
Why use Joern?
Joern builds a Code Property Graph out of the imported source code. Code Property Graphs combine the properties of Abstract Syntax Trees, Control Flow Graphs, and Program Dependence Graphs. By leveraging various properties from each of these three source code representations, Code Property Graphs can model many different types of vulnerabilities. Code Property Graphs are explained in much greater detail in the whitepaper on the subject. Example queries can be found in a presentation on Joern’s capabilities. While the presentation does an excellent job of demonstrating the impact of running Joern on the source code for the Linux kernel (running two queries led to seven 0-days out of the 11 total results!), we will be running a slightly more general query on a simple code snippet. By following the query outlined in the presentation, we can write similar queries for other potentially dangerous methods.
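To make the idea of one graph carrying three representations concrete, here is a minimal Python sketch. It is not Joern's data model: the class name PropertyGraph, the node ids, and the edge labels "AST", "CFG", and "PDG" are assumptions chosen for illustration. It stores a handful of nodes around a statement such as strcpy(buf, s) and labels each edge with the representation it would come from.

```python
from collections import defaultdict


class PropertyGraph(object):
    """Toy property graph: nodes carry key/value properties, edges carry a label."""

    def __init__(self):
        self.nodes = {}                 # node id -> dict of properties
        self.edges = defaultdict(list)  # source id -> list of (label, target id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def out(self, node_id, label):
        """Follow outgoing edges with the given label (a one-step traversal)."""
        return [dst for (lbl, dst) in self.edges[node_id] if lbl == label]


g = PropertyGraph()
g.add_node("param_s", type="Parameter", code="char * s")
g.add_node("call", type="CallExpression", code="strcpy(buf, s)")
g.add_node("arg0", type="Argument", code="buf")
g.add_node("arg1", type="Argument", code="s")
g.add_node("exit", type="FunctionExit", code="")

# Syntax (AST-style) edges: the call has two argument children.
g.add_edge("call", "AST", "arg0")
g.add_edge("call", "AST", "arg1")
# Control flow (CFG-style) edge: after the call, control reaches the function exit.
g.add_edge("call", "CFG", "exit")
# Data dependence (PDG-style) edge: the value of parameter s is used by the call.
g.add_edge("param_s", "PDG", "call")

ast_children = [g.nodes[n]["code"] for n in g.out("call", "AST")]
data_uses = [g.nodes[n]["code"] for n in g.out("param_s", "PDG")]
print(ast_children)
print(data_uses)
```

Because every edge carries a label, a single traversal can mix the three views, for example starting from a parameter, following a data dependence edge to a call, and then inspecting that call's syntax children.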
Running Joern on Toy Code
Installing Joern and importing code are both simple, and the official documentation on this process is easy to follow. I suggest just following the Installation and Importing Code sections of the official documentation. For simplicity’s sake, we imported a simple “vulnerable” piece of code that demonstrates Joern’s capabilities:
vulnerable.c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void not_vulnerable(char *s) {
    char buf[1024];
    /* strlen measures the string; sizeof s would only give the size of the pointer */
    size_t size = strlen(s);
    if (size >= 1024) {
        exit(1);
    }
    else {
        strcpy(buf, s);
    }
}

void vulnerable(char *s) {
    char buf[1024];
    size_t size = strlen(s);
    /* No check that s fits in the buffer */
    strcpy(buf, s);
}
Once you have code imported into your Neo4j database, writing and running queries is simple. Here is an example Python script that uses the Python-Joern interface to run Gremlin queries on an imported code base:
joern.py
from joern.all import JoernSteps

# Connect to the Neo4j database that holds the imported code
j = JoernSteps()
j.setGraphDbURL('http://localhost:7474/db/data/')
j.connectToDatabase()

# Gremlin query: find strcpy calls whose second argument is never checked
query = """
getArguments('strcpy', '1')
.sideEffect{ argument = it.code; }
.unsanitized({
    it._().or(
        _().isCheck('.*' + argument + '.*'),
        _().codeContains('.*min.*')
    )
})
.locations()
"""

print("[+] Running query!")
results = j.runGremlinQuery(query)

print("[+] Number of results: " + str(len(results)))
for r in results:
    print(r)
The results of running the Python script:
[+] Running query!
[+] Number of results: 1
[u'char * s', u'vulnerable', u'../code-bases/buffer_overflow/buffer.c']
As you can see, the vulnerable method is correctly returned in the results, while the not_vulnerable method is absent. The reason becomes clear when we break down the query.
Learn more: If you are new to graph database queries, I recommend reviewing Marko Rodriguez’s post On the Nature of Pipes.
Breaking Down the Query
getArguments('strcpy', '1')
Retrieve all nodes that represent the second argument being passed to the method “strcpy” (NOTE: the index is zero-based, so '1' selects the second argument). For our Toy Code, this will emit two nodes. One represents the argument “s” being passed to “strcpy” in the “not_vulnerable” method and the other represents the argument “s” being passed to “strcpy” in the “vulnerable” method.
.sideEffect{ argument = it.code;}
Save the code representation of the argument. For both nodes this is just the string “s.”
.unsanitized({it._().or(_().isCheck('.*' + argument + '.*'),_().codeContains('.*min.*'))})
The unsanitized closure determines if a control flow path was “sanitized.” Because we are hunting for vulnerabilities, sanitized paths are not what we are looking for; we want the paths that reach the call without any sanitization. The unsanitized closure takes two arguments: a sanitization closure (i.e., a function that will return True if a node is considered “sanitized”) and a source closure. The source closure, which is not used in this example query, could filter out paths that do not start from a particular source. As an example, let’s say we wanted to find all unsanitized paths to a “strcpy” argument that must have been defined as the type “int.” We could pass a closure in the source parameter which filters out paths that do not define the variable as an “int.” In our example, we only define the sanitization closure. Our sanitization closure considers a path “sanitized” if the argument is checked against some other variable or value, or if the path uses the argument in a min method.
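The idea behind unsanitized can be sketched outside of Joern. The Python below is a simplified model, not Joern's implementation: the control flow graph is a hand-written dictionary with statement labels loosely modeled on the toy functions, and all_paths, unsanitized_paths, and is_check_on are hypothetical helper names. It enumerates the paths from a function's entry to its strcpy call and keeps only those on which no statement satisfies the sanitization predicate.

```python
import re


def all_paths(cfg, start, sink, path=None):
    """Yield every acyclic path from start to sink in a CFG given as {node: [successors]}."""
    path = (path or []) + [start]
    if start == sink:
        yield path
        return
    for nxt in cfg.get(start, []):
        if nxt not in path:  # skip nodes already on this path to avoid looping
            for p in all_paths(cfg, nxt, sink, path):
                yield p


def unsanitized_paths(cfg, code, entry, sink, is_sanitizer):
    """Return the paths to sink on which no statement before the sink is a sanitizer."""
    return [p for p in all_paths(cfg, entry, sink)
            if not any(is_sanitizer(code[n]) for n in p[:-1])]


def is_check_on(argument):
    """Build a predicate: a condition whose text contains the argument, or any use of min."""
    check = re.compile(r"^if \(.*" + re.escape(argument) + r".*\)$")
    return lambda stmt: bool(check.match(stmt)) or "min" in stmt


# Hand-written statement labels and control flow for the two toy functions (assumed).
code = {
    "nv_entry": "ENTRY", "nv_if": "if (size >= 1024)", "nv_exit": "exit(1)",
    "nv_copy": "strcpy(buf, s)",
    "v_entry": "ENTRY", "v_copy": "strcpy(buf, s)",
}
cfg = {
    "nv_entry": ["nv_if"], "nv_if": ["nv_exit", "nv_copy"],
    "v_entry": ["v_copy"],
}

report = {}
for name, entry, sink in [("not_vulnerable", "nv_entry", "nv_copy"),
                          ("vulnerable", "v_entry", "v_copy")]:
    report[name] = unsanitized_paths(cfg, code, entry, sink, is_check_on("s"))
    print("%s: %d unsanitized path(s)" % (name, len(report[name])))
```

In this model the only path to the call in not_vulnerable passes through a condition, so it is dropped, while the single path in vulnerable survives. Note that the predicate, like the regular expression in the query, matches on text only: any condition whose code contains the argument string counts as a check.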
.locations()
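This changes the output of the query so that each result is reported as a location. As the output above shows, each result lists a code string, the name of the enclosing function, and the path of the source file. This closure is optional.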
Significantly Reduce False Positives
Often, source code audits involve finding “candidate points” for vulnerabilities. One way to identify candidate points is to use grep to find potentially dangerous methods. This can lead to an incredibly high false positive rate due to grep’s lack of context awareness. Just by using very general Joern queries (“general” meaning they can be further modified to fit a code base for better results), one can reduce the false positive rate significantly. In the tables below, I have demonstrated these capabilities by comparing grep results to the results from using Joern queries:
Potentially dangerous method: memcpy
Potentially dangerous method: strcpy
Potentially dangerous method: sprintf
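For a sense of why a plain text search over-reports, the following Python sketch mimics a grep-style scan over a simplified variant of the toy code from earlier. The helper name grep_candidates is hypothetical, and the comparison applies only to this toy example, not to the audit figures.

```python
import re

# Simplified variant of the toy code; only the lines relevant to strcpy are kept.
SOURCE = """\
void not_vulnerable(char *s) {
    char buf[1024];
    if (strlen(s) >= sizeof buf) { exit(1); }
    else { strcpy(buf, s); }
}
void vulnerable(char *s) {
    char buf[1024];
    strcpy(buf, s);
}
"""


def grep_candidates(source, function_name):
    """Return (line number, text) for every line that calls function_name, like grep -n."""
    pattern = re.compile(r"\b" + re.escape(function_name) + r"\s*\(")
    return [(number, line.strip())
            for number, line in enumerate(source.splitlines(), 1)
            if pattern.search(line)]


candidates = grep_candidates(SOURCE, "strcpy")
for number, text in candidates:
    print("%d: %s" % (number, text))
```

The text search reports both calls, including the one guarded by a length check, whereas the Joern query on the toy code returned only the call in vulnerable.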
The Results Speak for Themselves
By adding Joern to your source code audit toolkit, finding potentially vulnerable candidate points becomes simple and effective. Joern may have a steeper learning curve than grep, but the results speak for themselves. In some cases, Joern reduced the number of candidate points by approximately 95%. More importantly, the queries run for this demonstration would work on any C / C++ codebase. By tailoring the queries to the codebase, one can improve these results even further.
To learn more about other features found in Joern, please see Joern’s official documentation and Joern’s public GitHub repo.