By Roy Xu and Sai Vegasena
Fuzzing consists of randomized input generation to automate the process of finding crashes in programs. The probability of finding a crash increases as code coverage in the application increases. In a well formed fuzzing campaign, test cases are generated in an automatic, systematic way such that every iteration generates a greater likelihood of crashing. The structure of input for an application must also be taken into account during this process.
The premise of our research was to fuzz the JavaScript engine V8 by generating valid inputs. Specifically our target was to apply grammars to generate code that would trigger the most optimizations in TurboFan, the JIT in V8. The feedback metric used to assess coverage in the JIT was the input’s change over time through all the optimization passes. Using this metric grammar that favors heavy optimization can be selected and mutated in future iterations.
TurboFan design consists of bitcode being JITed to quality machine code through phases of optimization passes. It makes use of static type information to accurately convert bytecode to machine code. During this process the bitcode can be translated into an intermediate representation where traits like evaluation order, code elimination, allocations, and bounds checks can be analyzed. By parsing the nodes in the IR, conclusions can be drawn about how actively the JIT optimized our test cases, and allows for updates to our grammars accordingly. Building V8 with sanitizers like ASAN, and SANCOV allows for methodology to be tested and verified.
The goal of our research was to develop a feedback metric to measure code coverage in the V8 JIT, and leverage a custom method of grammar generation that accurately handled this feedback. Furthermore, the grammar generation component also had to preserve the interactions between different concrete grammar test cases. Our solution to this problem seeks to provide insight into fuzzing large applications when symbolic execution is not valid for dynamic test case generation, developing performant code generation systems with grammars, and uncovering new properties about JIT compiler bugs. A system such as this is able to start with versatile JavaScript and nurture specific test cases that force the JIT to do more work at each optimization pass.
At high level a scoring system was developed based off percent changes after a test case passed through every phase of optimization. Test cases are initially generated from a corpus of grammars. Grammars are scored from the concrete test cases generated from them, and grammars with high scores are more likely to get selected in the next iteration of test case generation. A system for mutating, and preserving concrete interactions between different grammars was developed through the idea of grouping which will be elaborated in the Design and Implementation section.
At the heart of our design is a coverage metric based on changes in number of optimization passes. We divide our fuzzing into a series of rounds focused on a different subset of grammar rules. These rules are progressively combined and scored at each stage. We distinguish between three variations of grammar rules: actions, variables, and controls.
Actions are generic snippets of JavaScript that can be applied to a variable. Variables are the various ways that JavaScript objects can be initialized. Controls are different methods of changing the control flow of a generated grammar. Our current generation defines a subset of rules as five actions and a control. These are picked randomly from unique grammar files that contain all actions and controls. Since actions are applied to variables, the set of variables used in a round depends on the actions that we pick. For every action, we pick at least one corresponding variable. However, in cases where there are multiple actions relying on the same variable type, we pick a random number between one and the total number of reliant actions. In some cases, actions will share a variable and in others, each will have their own variable. To bring our variables together, we have the concept of interactions. Currently, these consist of addition, subtraction, multiplication, and division. Pairs of variables are assigned a random interaction to allow a resulting combination of all variables which can be returned and printed by V8.
Fuzzing rounds start with a single action and control. A variable is generated for that action and the result is combined into valid Javascript. We cycle through our five actions before expanding our action depth to two. Our grammar generation now contains a combination of two actions. At each stage we keep track of the parent from which this generation originated from. For example, given a set of actions: {0, 1, 2}, the parent of this set would be the set {1, 2}.
Initially, our fuzzer took advantage of the open source grammar generation library, Dharma. However, our unique requirements and procedure meant that writing our own grammar generation would result in much faster performance. To save computational time over the course of running the fuzzer, we initially “unravel” our grammars. We look into the master lists of actions, variables, and controls and expand them out as far as we can while still being deterministic on startup. Usually this means we avoid extending to a point where we believe we are generating some end value. For example, if an action requires a random numerical value, we stop before generating it. When creating a fuzzing round, we pick a set of actions, variables, and controls from the unraveled grammar. This set must pass through another processing round where we fully expand out by choosing a random end value if necessary. As a result of our caching, we were able to vastly improve on generation time over dharma.
Our framework itself consists of three parts: grammar generation, coverage, and our fuzzing instrumentation. Generated grammars are placed into a thread safe queue where our instrumentation running on different threads can ask for a grammar. A single thread handles generating the grammar, which has managed to keep up so far. The coverage portion of the framework coordinates between the grammar generation and the fuzzing itself. Our coverage information is stored in a redis db. When a grammar is generated we increment a counter in redis and when the grammar is processed by the fuzzing instrumentation we increment a separate counter. Before proceeding to a new round, we ensure that these counters match.
When the fuzzing instrumentation finishes running a test case, we distinguish between results by exit codes. Timeouts and exit codes of 1 (V8 catches an error in the Javascript) are treated as invalid test cases and given a score of 0. All other exit codes are treated as a crash and given a very large score. Normal runs are passed along to the coverage portion along with the parent id of the generated grammar. A score is calculated as the sum of the changes between each optimization pass in TurboFan. If the parents score exists, that score is added to create a cumulative score.
At the end of a fuzzing round we take the highest scoring set of rules (currently 10) and create a group for each. We call a set of actions, variables, controls, and interactions a group. These are currently stored as JSON and contain all the information necessary to recreate a test case. When our initial pool of groups reaches a point we determine to be impactful (~10,000), we move away from our progressive generation to a mutation based method.
Each group allows for several mutations to be applied to it. For example, we may choose to change some interactions, replace an action, or regenerate an action if it relies on an end value. We compare each mutation to the base group and work towards eliciting more overall optimization passes.
Our initial goal for fuzzing was to find ways to find ways to make TurboFan do more work. We hope that by eliciting more optimization passes, we can explore more of the TurboFan codebase as well as finding unique optimizations not normally created by fuzzers. By generating our grammars progressively, we focus on a growth in optimization passes instead of Javascript that generally produces more optimizations. We believe that this is a better method because we avoid being pushed towards working on only the same few rules over and over. In order to utilize our scoring, we decided to create groupings from our best results and mutate on them.
There are certainly improvements to be made that would complement the design and scalability of vasilisk. With a larger corpus of initial grammars, vasilisk would be more robust with the ability to trigger more types of JIT optimizations with complicated language features. Furthermore, we could find a way to combine sets of grammars we already have to produce more interesting code after observing our first few trials with the fuzzer. Our current mechanism for re-scaling grammar weights was simply adding to the weight after each iteration. Implementing Q-learning would more aptly fit our use case, and adjust weights in a systematic way with less bias. This method would allow vasilisk to learn the effectiveness of grammars in a more meaningful, long term capacity. The mutations could also be improved within the groups. Perhaps developing more interactions and mechanics for relating variables to actions would complicate overall mutations positively. Furthermore, our current method of storing interesting test cases and crashes is writing to /dev/shm
. An actual database where test cases could be properly deleted or mutated would promote less memory usage and greater scalability. A minor issue is that our variables grammar does not support variables within variables. Adding this would make mutations slightly more diverse. An interesting addition would also be to cache fuzzer runs and weights. Vasilisk could stop learning and store test cases and weights in a db. If it could then be restarted and pick off where it left off, it could continue to learn and update grammar weights instead of needing to start from scratch every time. We must also fix the bugs that were discovered during experimentation. Enhancing vasilisk to make better use of cpu resources and use as many cores as possible to their fullest, but as Vasilisk is written in python it makes working with threads at a more granular level more difficult.