-
Notifications
You must be signed in to change notification settings - Fork 9
Home
In this repository, we specify a formal semantics of Viper, a high-level language, on the top of KEVM. This is a stepping stone to the goal of "developing a formally verified compiler for a tiny language [such as Viper]" as mentioned in https://blog.ethereum.org/2016/09/01/formal-methods-roadmap/
Viper is a new experimental language for the smart contracts running on EVM. It is designed by Ethereum foundation, supposedly being simpler and more secure language than Solidity by factoring out the essential components in writing smart contracts, based on their experience of deploying and operating hundreds of thousands of smart contracts over the last two years. The design goal emphasizes security over efficiency, constant-bounded time function execution, and simplicity.
Specifically, Viper is a kind of object-oriented language in the sense that a smart contract is represented as a class whose data fields hold the contract states and methods provide the contract functionalities. Notably, however, it has no explicit inheritance --- one of the main design decision made based on their extensive experience of writing smart contracts. They found that security is more important than programming convenience, thus the minimality has obvious advantages.
Viper claims to be strongly and statically typed, with support for type inference.
Types consist of primitive types (num
of 128-bits signed integers with overflow exception, num256
of 256-bits unsigned integers with overflow exception, decimal
with 128-bits of integer numbers and 10-bits of fractional numbers, bool
, address
of 20-bytes, and bytes32
), and compound types (fixed-size arrays, structs, and maps with primitive typed keys).
A local variable has the function scope (i.e., no variable hiding in local blocks) but no declaration is needed, a decision that complicates code readability.
It does not allow recursive functions nor unbounded-loops, but only bounded-loops and break
statement.
It adopts Python syntax, e.g., being white-space sensitive.
Specifying Viper is challenging due to its experimental status. Viper does not have an official specification document. It derives its specification mainly from its compiler to EVM, which is usually in flux. Therefore, we took a snapshot (a886850c1d) of the latest version of the compiler at the time of writing, and inferred the semantics via reverse-engineering the implementation and communicating with the language designers.
Using KEVM, we were able to capture the Viper language semantics by formalizing the compiler translation from Viper to EVM. This "semantics by translation" approach also helps to emphasize the essence of Viper semantics by factoring it out from EVM semantics.
Our formalization of the translation from Viper to EVM mirrors the two-stage approach of the Viper's own compiler, with the first stage translating Viper to LLL and the second translating LLL to EVM.
LLL is an intermediate Lisp-like language originally designed for the Solidity compiler, but is much simpler and lower-level than Lisp and closer to EVM. An LLL expression (like an S-expression) is a list of an operator symbol followed by zero or more arguments. Some operators represent EVM opcodes with the same name. Only a fragment of LLL is used for the intermediate representation of Viper.
The Viper to LLL translation mainly consists of: lowering function calls using the ABI encoding; encoding logical data structures such as structs into byte-array representation (with 32-byte alignment); allocating memory and storage for local and state variables, resp.; inlining builtin-functions; type checking; and inserting runtime checks for arithmetic overflow. The control-flow statements are simply translated to that of LLL.
The LLL to EVM translation is rather straightforward.
An LLL expression with the EVM opcode is translated to a sequence of EVM opcodes that amounts to evaluating the arguments and pushing the results into the stack in the reverse order followed by the corresponding EVM opcode.
A control-flow operator such as if
and repeat
is translated in the usual way using the jump opcodes.
We found that the Viper language is closely bound to EVM, and inevitably inherits its limitations. We believe that a high-level language such as Viper should have a standalone, abstract semantics that is completely separate from the semantics of the underlying compilation target (EVM in this instance). This helps to better understand and reason about the language semantics, making it easier to apply the language to different targets. Also, this separation of concerns assists formal reasoning. Ideally, one should verify the high-level logic of a Viper program at a high level of abstraction, and then prove that the original Viper program simulates the translated EVM code, i.e., the EVM code does not admit exceptional behaviors (including EVM-specific low-level exceptions) until they are defined in the Viper program.
For example, an internal call (i.e., an intra-contract call) propagates the occurrence of exception all the way up to the root caller. This design seems to be due to quick prototyping for the compiler to EVM. However, we think that the language should provide a way of properly catching exceptions, and leave the logic of exception handling to the programmer.
Another example is the address space of maps. A map of Viper is a total function whose domain is the entire address space of the local memory, which may lead to conflict with other maps. Currently, Viper simply relies on a hashing scheme to avoid the conflict, hoping that the hash collision will not happen, but in theory, the collision may happen and the conflict can occur. Despite the fact that the probability of the hash collision is very low and the blockchain will not work without the underlying probabilistic guarantee, we do believe that a high-level language semantics should not depend on that, but rather admit a purely logical map semantics and let its compiler properly implement it, e.g., by having an extra protection layer for detecting and resolving the hash collision.
While formalizing the language semantics, we found several bugs in the production Viper compiler, including a critical security vulnerability for smart contracts. Those bugs (listed below) were reported, confirmed, and fixed.
We omit details of the security critical bug mentioned above; at the time of writing, the Viper team is still working on resolving the issue, which has the potential to cause major loss of deployed funds on a wide variety of contracts. We plan to release information on the bug found after a suitable patch has been applied to the Viper compiler.
One example bug was found in typecasting, where the compiler failed to introduce a runtime check.
For example, as_num256
is a typecast to num256
, a 256-bits unsigned integer, and is not supposed to allow negative input since it will be coerced to a quite different positive number.
The compiler statically rejects invalid casts when they are detected at compile time, but failed to add the run-time check when they are not.
For example, as_num256(-1)
is statically rejected, but as_num256(x - y)
was compiled to x - y
even if x < y
.
Another example bug was found in compiling the hash expression, sha3(e)
.
For example, sha3(0)
is supposed to be translated into EVM code, PUSH1, 0, PUSH1, 192, MSTORE, PUSH1, 32, PUSH1, 192, SHA3
, which means that the code first stores the value to be hashed, '0', in the local memory at the pre-designated location (here 192), and then computes SHA3 hash of the memory range (with starting location 192 and width 32, i.e., MEM[192..223]
).
But the actually generated code was: PUSH1, 0, PUSH1, 192, MSTORE, PUSH1, 192, PUSH1, 32, SHA3
(i.e., PUSH1 192
and PUSH1 32
were flipped), which means that it computes SHA3 hash of the wrong memory range (with starting location 32 and width 192, i.e., MEM[32..223]
).
This seemingly obvious bug had not been found as the following reason. Fortunately (or unfortunately) the unintended memory range MEM[32...191]
that is prepended to the value to be hashed (i.e., MEM[192...223]
) is fixed over all Viper programs.
Indeed, they are pre-occupied to store five values of size limits used to clamp values: ADDRSIZE
, MAXNUM
, MINNUM
, MAXDECIMAL
, MINDECIMAL
.