A high-performance VM for the emerging language, brainf*ck. The goal is to have a (fast) interpreter which then JIT compiles hot loops.
Neither the interpreter nor the JIT understands BF source; instead they work with a bytecode based IR. The bc_from_source function in bytecode.c parses BF source and generates a stream of bytecodes representing it. The term bytecode is somewhat misleading since a bytecode is four bytes long, usually followed by one or more four-byte payloads.
Translating to bytecode is important for performance. Individual BF instructions do very little and sequences like +++ can be quickly executed at one go (by increasing the current cell’s value by 3, in this case) than by three separate bytecode dispatches. To this end, bc_from_source attempts to do some basic peephole optimizations by recognizing common BF idioms. For instance, one way to add the contents of the current cell to a cell three indices away (and clear the current cell in the process) is [->>>+<<<]. bc_from_source recognizes this pattern and emits a specialized bytecode, BC_MOVE_VALUE, which is specially handled by the interpreter and the JIT.
The interpreter is quite simplistic. It uses GCC's labels as values feature to provide for a quick dispatch mechanism. Every loop has a heat counter associated with it, whose numerical value is stored as payload following the BC_LOOP_BEGIN bytecode. The interpreter checks this value and calls in the JIT once a method is hot. The JIT does its thing, and stores away a pointer to the generated code in the program_t structure. The interpreter then calls into this generated code. The compiler also patches the BC_LOOP_BEGIN to a BC_COMPILED_LOOP so that the next iteration directly executes the JITed version of the loop.
The compiler uses dynasm to generate a code-generator; and as a consequence most of the compiler really lives in codegen.dasc. This might change if I later add some analysis or optimization passes which run only when compiling the bytecode, but for now the codegen practically is the compiler. The only compiler specific optimization is that the compiler tries to not emit a redundant cmp instruction if the last instruction makes it unnecessary (by virtue of being an add instruction with a negative constant as one operand).
Running mandlebrot.bf takes 1.25 seconds on average, which is at par with compiling BF to C and then compiling the resultant C file with gcc -O3. While I’d call this pretty good, I think there is still some room for improvement, perhaps studying the machine code generated for typical BF programs will yield some insight.
There are two improvements I can think of, believe will improve performance, but haven’t had time to implement:
- A register allocator should improve performance, but probably not as much as in normal programming languages -- BF is a very cache friendly language. :)
- Some basic auto-vectorization. Maybe we can say "this loop simply adds these two blocks of cells together, one by one" and emit some SSE instructions to that effect.