Skip to content
Daejun Park edited this page Nov 27, 2017 · 22 revisions

Semantics of Viper in K

In this repository, we specify a formal semantics of Viper, a high level language, on the top of KEVM. This is a stepping stone to the goal of "developing a formally verified compiler for a tiny language [such as Viper]" as mentioned in https://blog.ethereum.org/2016/09/01/formal-methods-roadmap/

Viper Language

Viper is a new experimental language for the smart contracts running on EVM. It is designed by Ethereum foundation, supposedly being simpler and more secure language than Solidity by factoring out the essential components in writing smart contracts, based on their experience of deploying and operating hundreds of thousands of smart contracts over the last two years. The design goal emphasizes security over efficiency, constant-bounded time function execution, and simplicity.

Specifically, Viper is a kind of object-oriented language in the sense that a smart contract is represented as a class whose data fields hold the contract states and methods provide the contract functionalities. Notably, however, it has no explicit inheritance --- one of the main design decision made based on their extensive experience of writing smart contracts. They found that security is more important than programming convenience, thus the minimality has obvious advantages.

Viper claims to be strongly and statically typed, with support for type inference. Types consist of primitive types (num of 128-bits signed integers with overflow exception, num256 of 256-bits unsigned integers with overflow exception, decimal with 128-bits of integer numbers and 10-bits of fractional numbers, bool, address of 20-bytes, and bytes32), and compound types (fixed-size arrays, structs, and maps with primitive typed keys).

A local variable has the function scope (i.e., no variable hiding in local blocks) but no declaration is needed, a decision that complicates code readability. It does not allow recursive functions nor unbounded-loops, but only bounded-loops with break. It adopts Python syntax, e.g., being white space sensitive.

Viper Semantics Specification

Specifying Viper is challenging due to its experimental status. Viper does not have an official specification document. It derives its specification mainly from its compiler to EVM, which is usually in flux. Therefore, we took a snapshot (a886850c1d) of the latest version of the compiler at the time of writing, and inferred the semantics via reverse-engineering the implementation and communicating with the language designers.

Using KEVM, we were able to capture the Viper language semantics by formalizing the compiler translation from Viper to EVM. This "semantics by translation" approach also helps to emphasize the essence of Viper semantics by factoring it out from EVM semantics.

Our formalization of the translation from Viper to EVM mirrors the two stage approach of the Viper's own compiler, with the first stage translating Viper to LLL and the second translating LLL to EVM.

LLL is an intermediate Lisp-like language originally designed for the Solidity compiler, but is much simpler and lower-level than Lisp and closer to EVM. An LLL expression (like an S-expression) is a list of an operator symbol followed by zero or more arguments. Some operators represent EVM opcodes with the same name. Only a fragment of LLL is used for the intermediate representation of Viper.

Viper to LLL translation

The Viper to LLL translation mainly consists of: lowering function calls using the ABI encoding; encoding logical data structures such as structs into byte-array representation (with 32-byte alignment); allocating memory and storage for local and state variables, resp.; inlining builtin-functions; type checking; and inserting runtime checks for arithemtic overflow. The control-flow statements are simply translated to that of LLL.

LLL to EVM translation

The LLL to EVM translation is rather straightforward. An LLL expression with the EVM opcode is translated to a seqeunce of EVM opcodes that amounts to evaluating the arguments and pushing the results into the stack in the reverse order followed by the corresponding EVM opcode. A control-flow operator such as if and repeat is translated in the usual way using the jump opcodes.

Viper language design issue

We found that Viper language is tied too much to EVM, inevitably inheriting the limitations of EVM. We stongly believe that such a high-level language as Viper should have a standalone, abstract semantics that is separate from its instantiation (i.e., translation) to its low-level target language(s) such as EVM. This helps to understand and reason about the language semantics in the proper level, as well as make it easier to apply the language to different targets. Also, this separation of concerns will help to formally verify Viper programs. One can verify the high-level logic of the smart contract in the right level of abstraction, and then prove that the Viper program simulates the translated EVM code, i.e., the EVM code does not admit exceptional behaviors (including the EVM-specific low-level exceptions) until they are defined in the original Viper program.

We discuss a couple of specific design issues to be improved.

Address space of maps. A map of Viper is a total function with domain of the entire address space of the local memory, which may lead to conflict between different maps. Currently, Viper simply relies on the hashing scheme to avoid the conflict, hoping the hash collision will not happen, but in theory, it can be broken thus not ultimately reliable. Despite the fact that the probability of the hash collision is very low and if that happens the entire block chain will be broken, we do believe that the language should provide an extra protection layer to completely prevent the conflict even in the presence of the hash collision. For example, monitoring the hashed values can detect the hash collision and the value can be re-hashed with a differnt salt if that happens.

Currently, the internal calls propagate exceptions all the way up to the root callers, which we think is induced by EVM semantics. However, we think that Viper should provide a way of catching those exceptions so that one can have more control of handling those exceptional cases in a secure way.

Bugs found

While formalizing the lauguage semantics, we found several bugs of Viper compiler, including a critical security vulnerability that is being fixed by Viper team. Those bugs were reported, confirmed, and fixed.

We cannot disclose the security vulnerability until it is fixed. Below we discuss a couple of the other bugs found.

One example bug was found in typecasting, where the compiler failed to introduce a runtime check. For example, as_num256 is a typecast to num256, a 256-bits unsigned integer, and is not supposed to allow negative input since it will be coerced to a quite different positive number. The compiler statically rejects invalid casts when they are detected at compile time, but failed to add the run-time check when they are not. For example, as_num256(-1) is statically rejected, but as_num256(x - y) was compiled to x - y even if x < y.

Another example bug was found in compiling the hash expression, sha3(e). For example, sha3(0) is supposed to be translated into EVM code, PUSH1, 0, PUSH1, 192, MSTORE, PUSH1, 32, PUSH1, 192, SHA3, which means that the code first stores the value to be hashed, '0', in the local memory at the pre-designated location (here 192), and then computes SHA3 hash of the memory range (with starting location 192 and width 32, i.e., MEM[192..223]). But the actually genereated code was: PUSH1, 0, PUSH1, 192, MSTORE, PUSH1, 192, PUSH1, 32, SHA3 (i.e., PUSH1 192 and PUSH1 32 were flipped), which means that it computes SHA3 hash of the wrong memory range (with starting location 32 and witdh 192, i.e., MEM[32..223]). This seemingly obvious bug had not been found as the following reason. Fortunately (or unfortunately) the unintended memory range MEM[32...191] that is prepended to the value to be hashed (i.e., MEM[192...223]) is fixed over all Viper programs. Indeed, they are pre-occupied to store five values of size limits used to clamp values: ADDRSIZE, MAXNUM, MINNUM, MAXDECIMAL, MINDECIMAL.

Clone this wiki locally