Skip to content
Philip Daian edited this page Dec 5, 2017 · 22 revisions

Semantics of Viper in K

In this repository, we specify a formal semantics of Viper, a high-level language for smart contracts, on the top of KEVM. This is a stepping stone to the goal of "developing a formally verified compiler for a tiny language [such as Viper]" as mentioned in https://blog.ethereum.org/2016/09/01/formal-methods-roadmap/

Viper Language

Viper is a new experimental language for smart contracts running on EVM. It is designed with the goal of being a simpler and more secure alternative to Solidity. The design goals emphasize security over efficiency, constant-bounded time function execution, and simplicity.

Viper is a "semi object-oriented" language: a class represents a contract's blueprint, and the class's fields represent the contract's data and functionality. An object represents a unique contract class instance on the blockchain. However, Viper does not support explicit inheritance, expressing the opinion that smart contracts should be self contained and reasoned about as a unit.

Viper claims to be strongly and statically typed, with support for type inference. Types consist of primitive types (num of 128-bit signed integers, num256 of 256-bit unsigned integers, decimal with 128-bit integer and 10-bit fractional numbers, bool, address with a size of 20 bytes, and bytes of byte-arrays), and compound types (fixed-size arrays, structs, and maps with primitive typed keys). Unlike EVM, an arithmetic overflow triggers a dynamic guard, throwing an exception.

A local variable has the function scope (i.e., no variable hiding in local blocks) but no declaration is needed, a decision that complicates code readability. It does not allow recursive functions nor unbounded loops, but allows bounded loops. The language syntax is influenced by that of Python.

Viper Semantics Specification

Specifying Viper is challenging due to its experimental status. Viper does not have an official full specification document, either English-language or formal. It derives its specification mainly from its compiler to EVM, which is usually in flux. Therefore, we took a snapshot (a886850c1d) of the latest version of the compiler at the time of writing, and inferred the semantics via reverse-engineering the implementation and communicating with the language designers.

Using KEVM, we were able to capture the Viper language semantics by formalizing the compiler translation from Viper to EVM. This "semantics by translation" approach helps emphasize the core Viper semantics without re-defining EVM-specific constructs.

Our formalization of the translation from Viper to EVM mirrors the two-stage approach of the Viper's own compiler, with the first stage translating Viper to LLL and the second translating LLL to EVM.

LLL is an intermediate Lisp-like language originally designed for the Solidity compiler, but is much simpler and lower-level than Lisp and closer to EVM. An LLL expression (like an S-expression) is a list of an operator symbol followed by zero or more arguments. Some operators represent EVM opcodes with the same name. Only a fragment of LLL is used for the intermediate representation of Viper.

Viper to LLL translation

The Viper to LLL translation mainly consists of:

  • lowering function calls using the ABI encoding
  • encoding logical data structures such as structs into byte-array representation (with 32-byte alignment)
  • allocating memory and storage for local and state variables, respectively
  • inlining builtin-functions
  • type checking
  • inserting runtime checks for arithmetic overflow.

The control-flow statements are simply translated to those of LLL.

LLL to EVM translation

The LLL to EVM translation is rather straightforward. An LLL expression with an EVM opcode is translated to a sequence of EVM opcodes that amounts to evaluating the arguments and pushing the results into the stack in the reverse order, followed by the EVM opcode. A control-flow operator such as if and repeat is translated in the usual way using the jump opcodes.

Discussion on Language Design

We found that the Viper language is closely bound to EVM, and inevitably inherits its limitations. We believe that a high-level language such as Viper should have a standalone, abstract semantics that is completely separate from the semantics of the underlying compilation target (EVM in this instance). This helps to better understand and reason about the language semantics, making it easier to apply the language to different targets. Also, this separation of concerns assists formal reasoning. Ideally, one should verify the high-level logic of a Viper program at a high level of abstraction, and then prove that the original Viper program simulates the translated EVM code, i.e., the EVM code does not admit exceptional behaviors (including EVM-specific low-level exceptions) until they are defined in the Viper program.

For example, an internal call (i.e., an intra-contract call) currently propagates the occurrence of a call failure all the way up to the root caller. This design seems to be due to quick prototyping for the compiler to EVM. However, we think that the language should provide a way of properly catching call failures, and leave the logic of exception handling to the programmer.

Another example is the address space of maps. A map of Viper is a total function whose domain is the entire address space of the local memory, which may lead to conflict with other maps. Currently, Viper simply relies on a hashing scheme to avoid the conflict, hoping that the hash collision will not happen, but in theory, the collision may happen and the conflict can occur. Despite the fact that the probability of the hash collision is very low and the blockchain will not work without the underlying probabilistic guarantee, we do believe that a high-level language semantics should not depend on that, but rather admit a purely logical map semantics and let the compiler properly implement it, e.g., by having an extra protection layer for detecting and resolving the hash collision.

Discovering Bugs in the Viper Compiler

While formalizing the language semantics, we found several bugs in the production Viper compiler, including a critical security vulnerability for smart contracts. Those bugs (listed below) were reported, confirmed, and fixed. Such bugfinding is a normal consequence of the formalization and independent review of any language, and the majority of bugs were likely not security critical (apart from perhaps simplifying some antipatterns and allowing for easier intentionally malicious disguised code).

We omit details of the security critical bug mentioned above; at the time of writing, the Viper team is still working on resolving the issue, which has the potential to cause major loss of deployed funds on a wide variety of contracts. We plan to release information on the bug found after a suitable patch has been applied to the Viper compiler.

One example bug was found in typecasting, where the compiler failed to introduce a runtime check. For example, as_num256 is a typecast to num256, a 256-bits unsigned integer, and is not supposed to allow negative input since it will be coerced to a likely unexpected positive number. The compiler statically rejects invalid casts when they are detected at compile time, but failed to add the run-time check when they are not. For example, as_num256(-1) is statically rejected, but as_num256(x - y) was compiled to x - y even if x < y.

Another example bug was found in compiling the hash expression, sha3(e). For example, sha3(0) should be translated into the EVM code PUSH1, 0, PUSH1, 192, MSTORE, PUSH1, 32, PUSH1, 192, SHA3. The code first stores the value to be hashed, '0', in the local memory at the pre-designated location (here 192), and then computes SHA3 hash of the memory range (with starting location 192 and width 32, i.e., MEM[192..223]). But, the Viper compiler generated the output PUSH1, 0, PUSH1, 192, MSTORE, PUSH1, 192, PUSH1, 32, SHA3 (i.e., PUSH1 192 and PUSH1 32 were flipped). This code computes the SHA3 hash of the wrong memory range (with starting location 32 and width 192, i.e., MEM[32..223]). Fortunately (or unfortunately) the unintended memory range MEM[32...191] that is prepended to the value to be hashed (i.e., MEM[192...223]) is fixed over all Viper programs. These memory locations are pre-populated with five constant values of size limits used to clamp values: ADDRSIZE, MAXNUM, MINNUM, MAXDECIMAL, MINDECIMAL. A logical compiler optimization which only stores the values required in the program being compiled would have the potential to introduce unexpected, difficult to test for and detect behavior, and could pose a serious security risk.

In the process of our formalization, the following bugs were reported to the Viper team (non exhaustive):

Clone this wiki locally