This is a tool to benchmark compiler correctness by testing a compiled function's output on a given set of I/O pairs for that function. This library provides a method of generating these pairs using GCC. A function can then be checked against these I/O pairs to see whether the compiled version produces the correct output.
This project supports Python 3.9 only. It depends on synthesis-eval for an example set.
The main script can be found in "eval.py", which can be used as follows.
```
python3 eval.py REFERENCES IMPLEMENTATIONS
```
Here `REFERENCES` is the directory containing all reference directories, and `IMPLEMENTATIONS` is the directory containing all implementation directories.
As an example, the script would be called as `python3 eval.py refs impls` if the following were the directory structure.
```
.
+-- refs/
|   +-- a/
|   |   +-- ref.c
|   |   +-- props
|   +-- b/
|   |   +-- ref.c
|   |   +-- props
+-- impls/
|   +-- a/
|   |   +-- a-1.s
|   |   +-- a-2.s
|   |   +-- a-3.s
|   +-- b/
|   |   +-- b-1.s
|   |   +-- b-2.s
|   |   +-- b-3.s
|   |   +-- b-4.s
```
The output of the benchmarks is written as a report to stdout, and any issues encountered in the process are written to "benchmarking.log".
The library is modular, made up of a few different components:
- reference_parser.py
- runner.py
- examples.py
- randomiser.py
- evaluation.py
- helper_types.py
- lumberjack.py
reference_parser
*can be used as a command line tool*

This component extracts useful information from a reference, and can check the validity of a reference. Note: a reference is a directory containing a "ref.c" and a "props" file.
As a module this exposes the convenience function `load_reference`, which provides a simpler interface for parsing a reference from a directory.
As a command line tool it provides functionality to check a single reference, or all references, in a given location. Usage is as follows:

```
python3 reference_parser.py [-p PATH] [-a | PROGRAM]
```
Here `PATH` denotes the location of the reference functions; it defaults to the current directory.
If the flag `-a` is given then all references in `PATH` are checked. This outputs the signatures of all the functions to stdout (useful for testing), and writes any issues with the references to stderr.
If instead `PROGRAM` is given then it denotes the single reference to check. This will output a JSON representation of the reference if it is valid; otherwise it will write any issues to stderr.
In the following directory structure, where `*` denotes the current directory:
```
.
+-- src *
|   +-- reference_parser.py
+-- examples
|   +-- a
|   |   +-- ref.c
|   |   +-- props
|   +-- b
|   |   +-- ref.c
|   |   +-- props
```
Then you could run:

```
$ python3 reference_parser.py -p ../examples a
$ python3 reference_parser.py -p ../examples b
```

to produce output for `a`/`b` individually.
You could also run:

```
$ python3 reference_parser.py -p ../examples -a
```

which would produce a simplified output for both `a` and `b`.
runner

This component contains the functionality for executing a reference function. It uses the `ctypes` library, which is able to run shared objects. All conversions between native and foreign types are handled here, so the use of the library is opaque to the programmer.
The method `create` is also exposed; this is a helper to set up an executable reference function.
The implementations being tested may have problems, some of which will be fatal; if they were run directly by the benchmarker then any crash would also crash the benchmark, leaving all the following implementations untested. To prevent this, a new process is spawned whenever the function is called. The implementation runs in this process, and its outputs are extracted into the parent process (the benchmark). The function `run_safe` is exposed to do this.
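The real `run_safe` lives in runner.py; as a rough sketch of the idea (the names and signature here are illustrative assumptions, not the actual API), isolating a crash-prone call looks something like:

```python
import multiprocessing as mp

def _worker(func, args, queue):
    # Runs in the child process: a segfault here kills the child, not the parent.
    queue.put(func(*args))

def run_isolated(func, args, timeout=5):
    """Sketch: call func(*args) in a fresh process and collect its result."""
    queue = mp.Queue()
    proc = mp.Process(target=_worker, args=(func, args, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():   # hung: kill the child rather than hang the benchmark
        proc.terminate()
        proc.join()
    if proc.exitcode != 0 or queue.empty():
        return None       # crashed, timed out, or produced nothing
    return queue.get()
```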
examples

This component handles examples, which in this use case are the I/O pairs for a reference. It exposes an `ExampleInstance` class to structure these, as well as the functions `parse` (which maps a string to an `ExampleInstance`) and `form` (which maps an `ExampleInstance` to a string). These two functions are inverses of each other.
The format for serialising an example is as follows:
```
EXAMPLE ::= "(" INPUTS ")" RETURN "(" OUTPUTS ")"
INPUTS  ::= VALUES
RETURN  ::= "_" | VALUE
OUTPUTS ::= EMPTY | VALUES
VALUES  ::= VALUE | VALUE "," VALUES
VALUE   ::= SCALAR | ARRAY
SCALAR  ::= INT | FLOAT | CHAR | BOOL
INT     ::= <any decimal integer>
FLOAT   ::= <any decimal floating point number>
CHAR    ::= "'" <a single character> "'"
BOOL    ::= True | False
ARRAY   ::= '"' <a string of characters> '"' | "[" (SCALAR ",")+ SCALAR "]" | "[]"
```
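For instance, for a hypothetical reference `int sum(int *arr, int n)` whose `arr` is marked as an output parameter, one serialised example might read (exact whitespace handling aside):

```
([1,2,3], 3) 6 ([1,2,3])
```

Here `([1,2,3], 3)` are the inputs, `6` is the return value, and the trailing `([1,2,3])` is the value of the output parameter after the call.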
randomiser

This component handles random element generation. It exposes the `Randomiser` class, which has a method to generate a random value for each of the supported types.
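The actual class is defined in randomiser.py; a minimal sketch of the idea (method names assumed, not the real API) might be:

```python
import random
import string

class Randomiser:
    """Sketch: one generation method per supported C type."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)  # seedable for reproducible examples

    def random_int(self, lo=-100, hi=100):
        return self.rng.randint(lo, hi)

    def random_float(self, lo=-100.0, hi=100.0):
        return self.rng.uniform(lo, hi)

    def random_char(self):
        return self.rng.choice(string.ascii_letters)

    def random_bool(self):
        return self.rng.random() < 0.5

    def random_array(self, element, size):
        # size would typically come from the reference's "props" file
        return [element() for _ in range(size)]
```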
evaluation

This component contains the functionality to generate and evaluate examples for a reference. It provides functions for the main tasks: `generate`, `write_examples`, and `evaluate`.
As the name suggests, `generate` is used to generate examples, given a reference and an implementation to use; these examples are produced by the method described in generation. `write_examples` can then be used to store these in a file if necessary. Again, `evaluate` does as it says, and evaluates a set of examples on a particular implementation.
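The exact signatures live in evaluation.py; the intended flow is roughly as follows (the argument lists here are illustrative assumptions, not the module's real API):

```python
# Hypothetical sketch of the flow, not the module's actual signatures.
ref = load_reference("refs/a")                      # from reference_parser
examples = generate(ref, "refs/a/ref.c", n=50)      # run the GCC-compiled reference
write_examples(examples, "a.examples")              # optionally persist them
results = evaluate(ref, "impls/a/a-1.s", examples)  # check an implementation
```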
helper_types

To avoid circular dependencies all common data is extracted into a separate file.
`helper_types` exposes the types `ScalarValue`, `ArrayValue`, `SomeValue`, and `AnyValue`. These are used to denote the native Python versions of the C types supported in this library. The meanings of scalar and array values should be obvious, except note that as Python lacks a `char` type, both it and `char *` are defined as Python `str`s.
`SomeValue` is either a scalar or array value, and `AnyValue` differs from `SomeValue` only in that it can be `None`/`void`.
It also exposes the aliases `Name` and `Parser`, which are just used to clarify the purpose of a value in certain places. Similarly, the type `ParameterMapping` is commonly used to store parameter names and their values in a dictionary.
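Based on those descriptions, the aliases plausibly look something like this reconstruction, not the file's exact definitions (`Parser` is omitted since its shape isn't described):

```python
from typing import Dict, List, Optional, Union

ScalarValue = Union[int, float, str, bool]  # int, float, char, bool
ArrayValue = Union[str, List[ScalarValue]]  # char * is also a Python str
SomeValue = Union[ScalarValue, ArrayValue]
AnyValue = Optional[SomeValue]              # like SomeValue, but allows None/void

Name = str                                  # clarifies that a str is a name
ParameterMapping = Dict[Name, SomeValue]    # parameter name -> its value
```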
There are also many errors defined in `helper_types`. These are used throughout, and depending on what they are used for they may be either a simple subclass of `Exception` or may actually provide some functionality.
lumberjack

`lumberjack` is a simple interface to `logging`. It is used to ensure all loggers are set up correctly, whichever module is used.
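A minimal sketch of such a helper, assuming it routes every module's logger to "benchmarking.log" (the function name here is made up):

```python
import logging

def get_logger(name: str) -> logging.Logger:
    """Sketch: hand out loggers that all write to benchmarking.log."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers on re-import
        handler = logging.FileHandler("benchmarking.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```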
There are a few tricky concepts in this project, so what follows is more detail on the most important of them.
Benchmarking a function involves a lot of phases, spread out across the different components, and understanding them is key to understanding the code at all.
The flow for benchmarking a single function, contained in an implementation file with a corresponding reference directory, uses the following process:
- parse reference: `<reference> -> ref`
- build reference function: `ref, <reference/ref.c> -> run`
- build generator: `ref, run -> generator`
- generate examples: `generator -> examples`
- build real function: `ref, <implementation> -> run'`
- build evaluator: `ref, run' -> evaluator`
- check examples: `evaluator, examples -> results`
Note that this is the idealised flow; due to sand-boxing the actual flow looks slightly different (and less obvious). In practice, building a function (creating a `run`) is typically delayed until the function has to be run, and then a new function is created specifically for that run. This is true for both generation and evaluation.
The process for running the benchmark on many references and implementations can also be thought of in phases. These steps look distinct from those laid out for a single function; that procedure is still taking place here, only beneath the surface.
To benchmark many functions this library will:
- retrieve - fetches all folders and files which could be testable
  - get all reference folders which contain "ref.c" and "props"
  - get all implementation files from the folder with the same name which end with a valid extension
- build - iterates over all folders and files found and converts them to usable objects
  - turn each folder into a `FunctionReference`, skipping the whole reference if this fails
  - turn each implementation for that reference into a `Function`, also skipping it if this fails
- test - tests each reference and stores all the results
  - build a `Generator` and generate examples for each reference
  - build an `Evaluator` for each implementation and test it on the examples
Knowing when references and implementations are skipped is tricky, so here is an example flow covering most cases. It will show which entries are being considered after each pass, given the starting directories:
```
.
+-- refs/
|   +-- a/
|   |   +-- ref.c
|   |   +-- props
|   +-- b/
|   |   +-- ref.c
|   |   +-- props
|   +-- c/
|   |   +-- ref.c
|   |   +-- props
|   +-- d/
|   |   +-- ref.c [invalid]
|   |   +-- props
|   +-- e/
|   |   +-- props
|   +-- f/
|   |   +-- README.md
|   +-- g.txt
+-- impls/
|   +-- a/
|   |   +-- a-1.s
|   |   +-- a-2.s [invalid]
|   |   +-- a-3.s [always segfaults]
|   +-- b/
|   |   +-- b-1.s [invalid]
|   |   +-- b-2.s [invalid]
|   +-- c/
|   +-- d/
|   |   +-- d-1.s
|   |   +-- d-2.s
|   +-- e/
|   |   +-- e-1.s
|   |   +-- e-2.s
|   +-- f/
|   |   +-- f-1.s
|   |   +-- f-2.s
```
after fetching:

```
a/ [a-1.s, a-2.s, a-3.s]
b/ [b-1.s, b-2.s]
d/ [d-1.s, d-2.s]
```

after building:

```
a/ [a-1.s, a-3.s]
```

final results:

```
PASSES  1/2: a-1.s 50/50 OK
FAILS   0/2:
TRIVIAL 1/2: a-3.s 0/0 OK
```

Here c/ is dropped while fetching because it has no implementation files, e/ and f/ because they are not complete references, and g.txt because it is not a directory. d/ survives fetching but is dropped while building because its "ref.c" is invalid; likewise a-2.s, b-1.s, and b-2.s are dropped, which removes b/ entirely. a-3.s builds but segfaults on every run, so it ends up with the trivial 0/0 result.
generation

This library provides the ability to generate examples for a reference function. The method for achieving this is simple but still powerful, although it does come with drawbacks. Examples are generated as follows:
- parse the function reference to find out what the I/O specification is
- build an executable version of the function (see executing functions)
- randomly generate an input tuple
- run the function on the random input
- extract the return value and output parameters of that function call
- build an example from that input/output evaluation
- if more examples are required repeat from step 3
In this way many examples can be produced easily. The only dependency of this method is that building the executable in step 2 produces a correct function; I have used GCC as my "ground truth". Therefore, as long as functions compiled with GCC produce the correct output, examples generated in this way will have the correct result.
This approach is also agnostic to how the inputs are generated, which means a different technique could be used to sample the input space to test the function more efficiently.
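As a compact illustration of that loop (using a plain Python callable to stand in for the compiled reference; the helper names here are made up):

```python
import random

def generate_examples(func, arg_makers, n=50):
    """Sketch of steps 3-7: run func on random inputs and record I/O pairs."""
    examples = []
    for _ in range(n):
        inputs = tuple(make() for make in arg_makers)  # step 3: random inputs
        output = func(*inputs)                         # steps 4-5: run and observe
        examples.append((inputs, output))              # step 6: build the example
    return examples

# Toy usage: the "ground truth" here is a Python lambda instead of a GCC build.
pairs = generate_examples(lambda a, b: a + b,
                          [lambda: random.randint(-100, 100)] * 2,
                          n=5)
```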
executing functions

In order to run a reference function, which does not necessarily exist in a valid stand-alone C file, this library interacts with functions as shared objects. Compilation into this format varies between Linux and macOS, and this library makes no effort to support multiple platforms, so take care if you intend to use it on a different system. The whole shared-object infrastructure seems different on Windows, so even more work may be required to run this library there.
The `ctypes` library is used to interface with these shared objects; for more information on the internals, see its documentation.
The approach this library uses is to compile and extract a function definition from a reference. At this point it is a callable Python object, which acts as an interface to the C function. Unfortunately, by default the function object has very little information about the function, so the next step is to provide it with type information. "Type information" in this case means registering a class which can transform native Python types into an object that `ctypes` can work with. For scalar types this is simple; however, as array types include their sizes, these must be generated dynamically as values are passed to the function. Because these `ctypes` objects are created dynamically, they have to be registered somewhere if you would like to check their value outside the function call, as is the case with output parameters. In summary, the procedure for calling a reference function from Python is:
- parse the reference to find out I/O types
- compile the implementation to a shared object and load it as a Python object
- generate and attach the input and return types to the function object
- call the function with input arguments
  - dynamically convert these native Python arguments to their `ctypes` counterparts
  - keep track of these `ctypes` objects if their value is required later
- extract the return value and output values after the function call
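A condensed sketch of that procedure, for a hypothetical reference `int sum(int *arr, int n)` compiled with `gcc -shared -fPIC ref.c -o ref.so`:

```python
import ctypes

lib = ctypes.CDLL("./ref.so")   # load the shared object
func = lib.sum

# Attach type information so ctypes can convert arguments correctly.
func.argtypes = [ctypes.POINTER(ctypes.c_int), ctypes.c_int]
func.restype = ctypes.c_int

# Array types include their size, so build one dynamically for this call,
# and keep a reference to it in order to read the output parameter back.
values = [1, 2, 3]
arr = (ctypes.c_int * len(values))(*values)

result = func(arr, len(values))
print(result, list(arr))  # return value, plus the (possibly mutated) array
```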
Each function reference has certain properties, in a "props" file, which are used in various ways throughout the code. The grammar for this "props" file is given below. This clarifies what is allowed in a "props" file but, as productions map nicely onto classes, it also gives context for certain types.
```
PROPS      ::= <signature> INFO
INFO       ::= OUTPUT | SIZE | CONSTRAINT
OUTPUT     ::= "output" <var_name>
SIZE       ::= "size" <var_name> "," <var_name>
             | "size" <var_name> "," <integer>
             | "size" <var_name> "," "{" <expr> "}"
             | "size" <var_name> "," <integer> "," "{" <expr> "}"
CONSTRAINT ::= "constraint" <var_name> OP <any value>
             | "constraint" "{" <expr> "}"
OP         ::= "==" | "!=" | "<" | "<=" | ">" | ">="
```
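As a made-up illustration (the exact `<signature>` syntax isn't specified here, and INFO lines are assumed to repeat), a "props" file for `int sum(int *arr, int n)` might read:

```
int sum(int *arr, int n)
output arr
size arr, n
constraint n >= 0
```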
As mentioned above, certain classes correspond precisely to these productions:

- `ParamSize` objects are constructed from SIZE productions
  - `VarSize`, `ConstSize`, `SimpleExprSize`, and `ExprSize` objects come from each branch in that production
- `Constraint` objects are constructed from CONSTRAINT productions
  - `ParamConstraint` and `GlobalConstraint` objects match each branch respectively
Current known issues are:

- process hangs after finishing
  - can CTRL-C if used manually
  - could wrap it in a timeout shell script?
- parsing for expressions is not perfect; it fails if a comma appears in a simple expression
  - could use a proper parser generator (the grammar is already written)
  - could change the current parser to handle those expressions better