Roger edited this page Dec 19, 2016 · 3 revisions

How to ...

So you're interested in contributing, but are unsure how to get started? Here are a few primers on how to extend Reko's functionality. First, though, you should familiarize yourself with the Design of the solution.

...implement a new processor architecture

To implement a processor architecture, be it a real physical processor or a virtual machine, you need to implement the IProcessorArchitecture and related interfaces.

Naturally, this presupposes that you are familiar with the processor and have access to the manufacturer's manuals describing its architecture, its instruction repertoire, and how its machine code is encoded. In the following discussion, we assume you are implementing support for the (fictitious) processor MicroFoo. As you follow these instructions, it will be helpful to consult the source code of some of the other implemented processor architectures as a guideline.

The first thing to do is to create a new project in the Reko solution, under the ~/src/Arch directory. Your source code should use a namespace like Reko.Arch.MicroFoo, and your assembly should be named Reko.Arch.MicroFoo.

The first class in your project is MicroFooArchitecture, which must implement the IProcessorArchitecture interface. Your implementation must describe how many registers the processor has, how large pointers are, what endianness is used to interpret words stored in memory, and other such machine specific details. Pay special attention to the GetRegister() methods, which return a RegisterStorage instance representing a specific processor register.

Consult the processor manual and obtain a list of all the opcodes. Create an enum called Opcode and put the opcodes inside. It is a good idea to make the first Opcode be invalid:

enum Opcode {
    invalid = 0,   // return this when disassembling bytes that aren't valid machine code.
    load,
    store,
    addi,    // etc.
}

Most processor instructions will have one or more operands. The Reko.Core.Machine namespace defines the abstract class MachineOperand and two concrete subclasses, which implement the common cases of a RegisterOperand and an ImmediateOperand, containing a RegisterStorage and a Constant, respectively. MicroFoo may have strange memory addressing modes (see the M68k processor architecture for a particularly complex set of addressing modes), so you will likely need to derive your own subclasses from MachineOperand to support each of MicroFoo's addressing modes.

To represent disassembled machine instructions you need to create a class MicroFooInstruction, deriving from Reko.Core.Machine.MachineInstruction, that models the instructions of the MicroFoo processor. You will want to keep track of an opcode and allocate enough member variables to store all possible operands for any machine instruction the processor can execute. You should be familiar enough with your architecture to know the maximum number of operands an instruction can have, and allocate space accordingly. Suppose MicroFoo instructions can have at most 3 operands. You can implement this either as three discrete fields or properties:

class MicroFooInstruction : MachineInstruction {
    public Opcode opcode;
    public MachineOperand op1, op2, op3;
}

or as an array, which may vary in length depending on the instruction:

class MicroFooInstruction : MachineInstruction {
    public Opcode opcode;
    public MachineOperand [] ops;
}

Importantly, every class deriving from MachineInstruction should override the Render() method. The user interface uses this method to render machine instructions into text, possibly colorized. You can use the different methods of the supplied MachineInstructionWriter reference to write opcodes and addresses. The user interface components will then render the opcodes and addresses in a different color, and make addresses hyperlinks to their destinations.
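As a rough illustration of this idea, the sketch below renders an instruction through a writer abstraction, so that a plain-text writer and a colorizing writer could share the same Render() logic. ISketchWriter, PlainTextWriter and RenderableInstruction are hypothetical stand-ins invented for this example; they are not Reko's MachineInstructionWriter or MachineInstruction, whose actual members should be looked up in the Reko source.

```csharp
using System.Text;

// Hypothetical stand-in for a writer; NOT Reko's MachineInstructionWriter.
interface ISketchWriter
{
    void WriteOpcode(string opcode);   // a GUI writer could colorize this
    void WriteText(string text);       // separators and operand text
}

// Simplest possible writer: accumulates plain text.
sealed class PlainTextWriter : ISketchWriter
{
    private readonly StringBuilder sb = new StringBuilder();
    public void WriteOpcode(string opcode) { sb.Append(opcode); }
    public void WriteText(string text) { sb.Append(text); }
    public override string ToString() { return sb.ToString(); }
}

// Hypothetical stand-in for an instruction class with a Render method.
sealed class RenderableInstruction
{
    public string Mnemonic = "invalid";
    public string[] Operands = new string[0];

    public void Render(ISketchWriter writer)
    {
        writer.WriteOpcode(Mnemonic);
        for (int i = 0; i < Operands.Length; i++)
        {
            // A tab separates the opcode from the first operand;
            // commas separate the remaining operands.
            writer.WriteText(i == 0 ? "\t" : ",");
            writer.WriteText(Operands[i]);
        }
    }
}
```

Because the instruction only talks to the writer interface, the same Render() call can feed a text dump or a user interface control without knowing which one it is writing to.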

Once you've implemented the representation of MicroFoo's instructions, you are ready to create the MicroFooDisassembler. Disassemblers can be viewed as filters that take an input stream of bytes provided by an ImageReader and return a stream of disassembled machine instructions. To do this, MicroFooDisassembler implements the IEnumerable<MicroFooInstruction> interface. The constructor of the disassembler will need to accept at least one parameter, the ImageReader. The implementation of IEnumerable<MicroFooInstruction>.GetEnumerator() returns an enumerator whose MoveNext method is responsible for reading one or more bytes using the ImageReader, interpreting the machine code represented by those bytes, and returning a MicroFooInstruction.
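The "disassembler as a filter" idea can be sketched with a C# iterator. Everything here is an assumption made for illustration: the sketch reads from a plain byte array instead of Reko's ImageReader, the types SketchOpcode, SketchInstruction and SketchDisassembler are hypothetical, and the encoding (one opcode byte followed by two operand bytes) is invented for the fictitious MicroFoo.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

enum SketchOpcode { invalid = 0, load, store, addi }

sealed class SketchInstruction
{
    public SketchOpcode Opcode;
    public int Length;                          // bytes consumed
    public byte[] OperandBytes = new byte[0];   // raw, undecoded operands
}

sealed class SketchDisassembler : IEnumerable<SketchInstruction>
{
    // Invented encoding: first byte selects the opcode.
    private static readonly Dictionary<byte, SketchOpcode> opcodes =
        new Dictionary<byte, SketchOpcode>
        {
            { 0x01, SketchOpcode.load },
            { 0x02, SketchOpcode.store },
            { 0x03, SketchOpcode.addi },
        };

    private readonly byte[] bytes;

    public SketchDisassembler(byte[] bytes) { this.bytes = bytes; }

    public IEnumerator<SketchInstruction> GetEnumerator()
    {
        int pos = 0;
        while (pos < bytes.Length)
        {
            SketchOpcode opcode;
            if (!opcodes.TryGetValue(bytes[pos], out opcode))
                opcode = SketchOpcode.invalid;
            // Valid instructions are 3 bytes long; invalid ones consume 1 byte.
            int length = opcode == SketchOpcode.invalid ? 1 : 3;
            if (pos + length > bytes.Length)
            {
                // Truncated instruction: report the remainder as invalid.
                opcode = SketchOpcode.invalid;
                length = bytes.Length - pos;
            }
            var operands = new byte[length - 1];
            Array.Copy(bytes, pos + 1, operands, 0, operands.Length);
            yield return new SketchInstruction
            {
                Opcode = opcode, Length = length, OperandBytes = operands
            };
            pos += length;
        }
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

The compiler turns the iterator into an enumerator whose MoveNext decodes exactly one instruction per call, so callers can use foreach and stop early without decoding the rest of the image.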

Implementing a disassembler is a large task, especially for processors with large numbers of instructions, addressing modes, or both. The work can be made easier by exploiting regularities in the machine code encodings. We strongly recommend implementing the various instructions by creating a unit test for each instruction like this:

[Test]
public void MicroFoo_dasm_movi()
{
    AssertCode("movi\tr1,0x42", 0x12, 0x00, 0x42);
}

The test should be read as "when the byte sequence 0x12 0x00 0x42 is encountered, the disassembler should return the machine instruction movi r1,0x42". Consult the source code for PowerPCDisassembler for a good example of how to implement this.

Finally, you will need to implement an instruction rewriter MicroFooRewriter. Rewriters can be viewed as filters that take an input stream of machine instructions and return a stream of low-level register transfer level instructions (RTL) that model possibly very complex machine code with very simple operations. Rewriting a typical CISC instruction can result in many RTL instructions; for instance rewriting the M68k instruction:

    add.l -(a3),d0

results in the following RTL instructions:

    a3 = a3 - 4
    tmp1 = Mem[a3:word32]
    d0 = d0 + tmp1
    CVZNX = cond(d0)

which model the predecrement operator and the setting of condition codes. Later passes of the decompiler will strive to reduce this to a more compact high-level language representation.
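The one-to-many expansion described above can be sketched as another iterator. This is only a toy: RTL is represented here as strings, whereas a real rewriter builds RTL objects, and SketchMachineInstruction and SketchRewriter are hypothetical names that handle only the add.l -(aN),dM form from the example.

```csharp
using System.Collections.Generic;

// Hypothetical, heavily simplified machine instruction.
sealed class SketchMachineInstruction
{
    public string Mnemonic = "";
    public string AddressRegister = "";   // e.g. "a3" in -(a3)
    public string DataRegister = "";      // e.g. "d0"
}

static class SketchRewriter
{
    // Takes a stream of machine instructions, yields a stream of RTL text.
    public static IEnumerable<string> Rewrite(
        IEnumerable<SketchMachineInstruction> instrs)
    {
        foreach (var instr in instrs)
        {
            if (instr.Mnemonic == "add.l")
            {
                string a = instr.AddressRegister;
                string d = instr.DataRegister;
                yield return a + " = " + a + " - 4";            // predecrement
                yield return "tmp1 = Mem[" + a + ":word32]";   // memory fetch
                yield return d + " = " + d + " + tmp1";        // the addition
                yield return "CVZNX = cond(" + d + ")";        // condition codes
            }
            else
            {
                // Anything this toy does not understand is flagged.
                yield return "<invalid>";
            }
        }
    }
}
```

Each side effect of the machine instruction becomes its own simple statement, which is what lets later passes reason about them independently.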

When you've created a disassembler and a rewriter, you can implement the CreateDisassembler() and CreateRewriter() methods on MicroFooArchitecture. Now you're ready to make the Decompiler aware of your processor architecture by adding a reference to it in the configuration file for the Decompiler. In the app.config file, look for the <Architectures> section, and add the following element:

<Architecture
    Name="uFoo"
    Description="MicroFoo Architecture"
    Type="Reko.Arch.MicroFoo.MicroFooArchitecture,Reko.Arch.MicroFoo" />

To test that your new architecture is working, you need to make the build process copy your architecture into the directory where Decompiler is built. You'll need to modify the WindowsDecompiler.csproj file manually and add an entry in the <Architectures> element.

...implement a new binary image file format

Make sure you have access to a specification of the image file format. In the following discussion, we assume you are implementing support for the (fictitious) image file format FooExe.

First, create a new project in the Reko solution, under the ~/src/ImageLoaders directory. Your code should use a namespace like Reko.ImageLoaders.FooExe, and your assembly should be named Reko.ImageLoaders.FooExe.

The central class of your project will be responsible for loading an image from an array of bytes that the Reko framework will have read from a file. In our example, it would be named FooExeLoader and it must have Reko.Core.ImageLoader as its base class.

The constructor of the image loader must take exactly three parameters:

  • an IServiceProvider reference, which you can use to access services provided by the core decompiler. For instance, if your image loader needs to show a user interface, like a dialog box, while loading, you can use the IServiceProvider reference to access the IDecompilerUIService service.
  • the name of the file that contained the image. Note that you don't need to open this file; the file name is provided in the case that the image loader loads differently depending on, say, the file extension.
  • an array of raw bytes loaded from the file.

Your image loader needs to implement the abstract Load and Relocate methods. The Load method is responsible for verifying that the image is valid and for creating a LoadedImage, which represents what the executable looks like once it has been loaded into memory; in general, the byte array in the LoadedImage will be different from the raw bytes in the binary file. Your Load method must also determine what processor architecture and what operating environment the program was intended for. Finally, if the image format supports segments, you must add them to an ImageMap object.

The Load method finishes by returning an instance of the Program class, which will contain

  • a processor architecture for the image
  • a LoadedImage containing the post-load layout of the program
  • an ImageMap that describes any segments described in the image format.
  • an instance of a class derived from Platform that models the operating environment the program was written for.

The Relocate() method applies any relocations that may be necessary. Some executables don't have any relocations, while others do. Consult the specification and model your implementation after an existing one, say, the loader for PE executables.

Once the loader is complete, you need to make Reko aware of it and give rules for deciding whether a given file is in fact a FooExe file. Very often, an executable file can be identified by the presence of a magic number, often (but not always!) located at the start of the file. For instance, a large class of Microsoft executable files start with the bytes 4D 5A (interpreted as ASCII, these are the initials of Mark Zbikowski, who designed the image file format for MS-DOS 2.0), while ELF images start with the bytes 7F 45 4C 46. The appropriate section in the configuration file is called <Loaders>. For our sample loader, we would add the following sub-element:

<Loader
    MagicNumber="464F4F0A"
    Offset="0"
    Type="Reko.ImageLoaders.FooExe.FooExeLoader,Reko.ImageLoaders.FooExe" />

which specifies that if the four bytes at offset 0 of the image file match the magic number specified, use a Reko.ImageLoaders.FooExe.FooExeLoader to load the image.
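To make the matching rule concrete, here is a minimal sketch of comparing a hex-encoded magic number against the bytes at a given offset. MagicNumberSketch and its methods are hypothetical helpers written for this page; they are not how Reko itself is known to implement the check.

```csharp
using System;

static class MagicNumberSketch
{
    // Converts a string such as "464F4F0A" into the bytes 46 4F 4F 0A.
    public static byte[] ParseHex(string hex)
    {
        if (hex.Length % 2 != 0)
            throw new ArgumentException("Hex string must have an even length.");
        var result = new byte[hex.Length / 2];
        for (int i = 0; i < result.Length; i++)
            result[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        return result;
    }

    // True if the image contains the magic number at the given offset.
    public static bool Matches(byte[] image, string magicHex, int offset)
    {
        byte[] magic = ParseHex(magicHex);
        if (offset < 0 || offset + magic.Length > image.Length)
            return false;   // image too short to contain the magic number
        for (int i = 0; i < magic.Length; i++)
        {
            if (image[offset + i] != magic[i])
                return false;
        }
        return true;
    }
}
```

For example, MagicNumberSketch.Matches(bytes, "464F4F0A", 0) would be true for a file that begins with the ASCII characters "FOO" followed by a line feed.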

...incorporate a new OllyDBG unpacker script

Assume that you have an OllyScript file called FooUnpacker.osc and you wish to have it executed automatically when an image file packed by the (fictitious) FooPacker version 1.0 is loaded. You must first identify a signature, that is, a sequence of byte values that uniquely identifies the packer in question. Given a signature, you need to add an XML element in the file ~/src/Decompiler/Loading/Signatures/IMAGE_FILE_MACHINE_I386.xml like this:

  <ENTRY>
    <NAME>FooPacker v1.0</NAME>
    <COMMENTS />
    <ENTRYPOINT>45A4EA??????D3</ENTRYPOINT>
    <ENTIREPE />
  </ENTRY>

The <ENTRYPOINT> sub-element specifies that the given pattern of bytes must be present at the entry point of the program for this to be considered a match for an image packed by "FooPacker v1.0".
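The following sketch shows one way such a pattern could be tested against the bytes at the entry point. It assumes that each "??" pair is a wildcard matching any byte value, which the text above does not state explicitly; SignatureSketch is a hypothetical helper, not Reko's actual signature matcher.

```csharp
using System;

static class SignatureSketch
{
    // pattern: pairs of hex digits, or "??" for (assumed) "any byte".
    public static bool MatchesAt(byte[] image, int entryOffset, string pattern)
    {
        if (pattern.Length % 2 != 0)
            throw new ArgumentException("Pattern must have an even length.");
        int count = pattern.Length / 2;
        if (entryOffset < 0 || entryOffset + count > image.Length)
            return false;   // not enough bytes left to hold the pattern
        for (int i = 0; i < count; i++)
        {
            string pair = pattern.Substring(i * 2, 2);
            if (pair == "??")
                continue;   // wildcard: accept whatever byte is there
            if (image[entryOffset + i] != Convert.ToByte(pair, 16))
                return false;
        }
        return true;
    }
}
```

With the pattern 45A4EA??????D3, the first three bytes and the seventh byte at the entry point must match exactly, while the three bytes in between may hold any value.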

After specifying the signature in the signature file, you need to tell Reko what script to use to unpack it. This is done by adding an element like the following to the <Loaders> section of the Reko configuration file:

<Loader
   Label="FooPacker v1.0"
   Argument="FooUnpacker.osc"
   Type="Reko.ImageLoaders.OdbgScript.OdbgScriptLoader,Reko.ImageLoaders.OdbgScript" />

Here we're stating that if we have detected a "FooPacker v1.0" signature, then we will use the OdbgScriptLoader to load the unpacker file FooUnpacker.osc.

...add support for more instructions for a particular processor

The development team makes an effort to provide disassembler and rewriter support for all instructions handled by each processor, but resource constraints sometimes cause us to fall short of this goal. If you have discovered that a particular processor's disassembler is not able to disassemble what you know is a valid sequence of machine code bytes, you can add support for it yourself.

Start by following the Test Driven Development methodology and creating a unit test for the missing instruction. Locate the disassembler unit tests for the processor architecture in question. They typically look like this:

[Test]
public void X86_xor()
{
    AssertCode("xor\teax,eax", 0x33, 0xC0);
}

Here the test is asserting that when the disassembler encounters the bytes 33 C0, it should emit an instruction which, when converted to a string, reads xor eax,eax.

Run the unit tests. If the byte sequence you provided is not yet supported, the unit test will fail. Now you need to implement the disassembly of the instruction. Use the other, already implemented instructions as a guideline. Disassemblers vary widely in their implementation, but the majority perform lookups in arrays and/or dictionaries to map byte values to disassembled instructions.

Once the disassembler unit test is passing, it's time to change the corresponding RTL rewriter. Locate the rewriter unit tests; they will likely look something like this:

[Test]
public void X86_Rewrite_Xor()
{
    AssertCode(0x33, 0xC0,
        "0|00100000(2): 3 instructions",
        "1|L--|eax = eax ^ eax",
        "2|L--|SZ = cond(eax)",
        "3|L--|C = false");
}

The first line states that an instruction starting at address 00100000 and being 2 bytes long was rewritten into 3 RTL instructions. The remaining lines are those RTL instructions. The field after the line number states that this instruction is classified L (for 'linear'). A jump or call statement might have been classified as T (for 'transfer').

Note how the RTL rewriter must be careful to model the effects of the machine code exactly. Many x86 programs depend on the carry flag being clear after certain logic operations. The translation

[Test]
public void X86_Rewrite_Xor()
{
    AssertCode(0x33, 0xC0,
        "0|00100000(2): 2 instructions",
        "1|L--|eax = eax ^ eax",
        "2|L--|SZC = cond(eax)");
}

while in a strict sense as correct as the previous translation, is not as good, because it records only that C depends on the result. Later stages of the decompiler then lose the opportunity to exploit the fact that the C (carry) flag is known to be clear.