
Introduction

From a user's point of view, debugging a general distributed program can be tedious and confusing. Many distributed programs are nondeterministic; their outcome depends on the interleaving of computation and message passing across multiple machines. And because the program runs on a cluster of hundreds or thousands of machines, it is hard to understand the program state and pinpoint the location of problems.

In order to tame this nondeterminism, a distributed debugger has to log a large amount of information, which imposes a serious performance penalty on the application being debugged.

But the Spark programming model lets us provide replay debugging at almost zero overhead. A Spark program is a series of RDDs built by deterministic transformations, so we don't have to debug it all at once -- instead, we can debug each transformation individually. Broadly, the debugger lets us do the following two things:

  • Recompute and inspect intermediate RDDs after the program has finished.
  • Re-run a particular task in a single-threaded debugger to find exactly what went wrong.

At least in theory, debugging a Spark program is now as easy as debugging a single-threaded one.
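To make the "series of deterministic transformations" idea concrete, here is a minimal word-count sketch written against the Scala API of that era (the `spark.SparkContext` import path, the master string, and the input/output paths are assumptions for illustration, not part of the debugger itself). Each intermediate RDD is defined purely by a deterministic transformation of its parent, which is what makes it cheap to recompute and inspect after the program has run.

```scala
import spark.SparkContext
import spark.SparkContext._   // brings reduceByKey and other pair-RDD operations into scope

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "Word Count")

    // Every step below is a deterministic transformation, so each
    // intermediate RDD can be recomputed from its parent on demand --
    // for example, to inspect it after the job has finished.
    val lines  = sc.textFile("input.txt")        // hypothetical input path
    val words  = lines.flatMap(_.split(" "))     // intermediate RDD
    val pairs  = words.map(word => (word, 1))    // intermediate RDD
    val counts = pairs.reduceByKey(_ + _)        // final RDD

    counts.saveAsTextFile("counts")              // hypothetical output path
  }
}
```

Because none of these transformations depend on timing or message interleaving, replaying any one of them (say, the `reduceByKey` task that produced a suspicious partition) yields the same result as the original run, which is the property the debugger relies on.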