What makes computer programs fast or slow? To answer this question, this book gets behind the abstractions of programming languages and looks at how a computer really works. It examines and explains a variety of scientific programming models (programming models relevant to scientists) with an emphasis on how programming constructs map to different parts of the computer's architecture. Two themes emerge: program speed and program modularity. Throughout this book, the premise is to "get under the hood," and the discussion is tied to specific programs.
The book digs into linkers, compilers, operating systems, and computer architecture to understand how the different parts of the computer interact with programs. It begins with a review of C/C++ and explanations of how libraries, linkers, and Makefiles work. Programming models covered include Pthreads, OpenMP, MPI, TCP/IP, and CUDA. The emphasis on how computers work leads the reader into computer architecture and occasionally into the operating system kernel. The operating system studied is Linux, the preferred platform for scientific computing. Linux is also open source, which allows users to peer into its inner workings. A brief appendix provides a useful table of machines used to time programs.
The complete text of this book is available in html at this link. The html is published under a Creative Commons license. The copyright remains with the publisher. The html has a few blemishes. For example, occasionally when the same source is split between multiple listings, the line numbering is not continued between the listings as it should be. The html does not have an index, although it is searchable. For a more complete version, see the printed book.
This README document provides context for the source code in this GIT repository, with links to the html text as well as sources in the repository.
- C/C++ Review (html)
- C/C++: Libraries and Makefiles (html)
- The Processor (html)
- Memory (html)
- Threads and Shared Memory (html)
- Special Topic: Networks and Message Passing (html)
- Special Topic: The Xeon Phi Coprocessor (html)
- Special Topic: Graphics Coprocessor (html)
- Appendix: Machines used, Plotting, Python, GIT, Cscope, and gcc (html)
1 C/C++ Review
The review of C/C++ in this chapter attempts to bring out certain features that people often do not learn from a single course or two.
1.1 An example: The Aitken iteration
The Aitken transformation maps a sequence of numbers to another sequence of numbers. It serves as the vehicle to introduce aspects of C, C++, and Fortran in this chapter.
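For reference, a minimal sketch of the Aitken delta-squared formula (the function name and signature are illustrative, not the ones used in the book's sources): given a sequence x_n, the transformed sequence is x_n - (x_{n+1} - x_n)^2 / (x_{n+2} - 2 x_{n+1} + x_n).

```c
/* Aitken delta-squared transformation (illustrative sketch, not the
 * book's listing): maps seq[0..n-1] to out[0..n-3].                 */
void aitken_once(const double *seq, double *out, long n) {
    for (long i = 0; i + 2 < n; i++) {
        double d1 = seq[i + 1] - seq[i];
        double d2 = seq[i + 2] - 2.0 * seq[i + 1] + seq[i];
        out[i] = seq[i] - d1 * d1 / d2;   /* assumes d2 != 0 */
    }
}
```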
1.1.1 Leibniz series and the logarithmic series
1.1.2 Modular organization of sources
Before we delve into the syntax of C/C++, let us look at how to structure a program for the Aitken iteration.
1.2 C Review
In this review, we go over a few features of C, emphasizing points that introductory classes often miss. Thus, we discuss header files, arrays, and pointers, with emphasis on the distinction between lvalues and rvalues, as well as the distinction between declarations and definitions.
1.2.1 Header files
A header file in C is an interface to one or more source files.
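As a minimal illustration (the file and function names are hypothetical, not taken from the repository), a header declares what a source file defines and guards against multiple inclusion:

```c
/* aitken.h -- hypothetical interface to a corresponding aitken.c */
#ifndef AITKEN_H
#define AITKEN_H

/* declaration only; the definition lives in the .c file */
void aitken_once(const double *seq, double *out, long n);

#endif
```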
1.2.2 Arrays and pointers
Arrays and pointers are the heart of the C language. In C, arrays and pointers are almost interchangeable.
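A tiny sketch of what "almost interchangeable" means in practice (illustrative only): an array name decays to a pointer to its first element, and indexing is pointer arithmetic.

```c
#include <assert.h>

int main() {
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double *p = a;              /* array name decays to &a[0]       */
    assert(p[2] == a[2]);       /* p[2] is *(p + 2), same element   */
    assert(*(a + 3) == a[3]);   /* indexing is pointer arithmetic   */
    /* but the array name is not assignable: a = p would not compile */
    return 0;
}
```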
1.2.3 The Aitken iteration using arrays and pointers
Here we use the Aitken example to illustrate how arrays may be passed as pointers.
1.2.4 Declarations and definitions
Names introduced into a C program are for the most part names of either functions or variables. The names can be introduced as either declarations or definitions.
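A brief illustration of the distinction (the names are hypothetical): a declaration introduces a name without allocating storage or code, whereas a definition does.

```c
/* declarations */
extern int verbose;                    /* variable defined elsewhere */
double sumsq(const double *v, int n);  /* function declaration       */

/* definitions */
int verbose = 0;
double sumsq(const double *v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += v[i] * v[i];
    return s;
}
```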
1.2.5 Function calls and the compilation process
Here we take a look at the mechanism of function calls and the compilation process.
1.3 C++ Review
The C++ language is something of a compromise to provide the facilities of high-level languages without sacrificing the speed of C. Despite its name, it is not an incremental extension of C. It is a colossal expansion of C syntax.
1.3.1 The Vector class
We begin with a general type of class, namely, the Vector class. This class helps us review a few of the features of C++ and is used to implement the Aitken iteration later. In a later chapter, we criticize the use of this class and show it to be slow. Except for this class, our classes are narrowly defined and follow the "one class, one task" principle.
Vector.hh illustrates some of the following:
- Header file with class definition
- Private section
- Member functions
- References
- Operator and function name overloading
- Function call inlining
- Constructors and destructors

- The const qualifier
- Default arguments
- The -> operator
- Abstraction features of C++
1.3.2 Aitken transformation in C++
1.4 A little Fortran
The syntax is deliberately Fortran 77 and not the newer varieties. When the need to use old Fortran codes arises, it is often Fortran of this variety. We do not recommend programming in Fortran.
2 C/C++: Libraries and Makefiles
- Mixed-language programming (html)
- Using BLAS and LAPACK libraries (html)
- Building programs using GNU Make (html)
- The Fast Fourier Transform (html)
There are two powerful ideas for bringing greater modularity into C/C++ programs, and both of them will be introduced in this chapter. The first idea is to combine object files into libraries, and the second idea is to organize program sources into a source tree.
2.1 Mixed-language programming
We look at the map from sources to object files as a precursor to mixed-language programming. Beyond the transmutation of names, which differs between the three languages, the additional issue of runtime libraries has to be dealt with.
2.1.1 Transmutation of names from source to object files
If the source has a statement such as a=b+c, for example, the names a, b, c typically disappear from the object file. Not all names present in the source disappear, however. Those names present in the source that survive in the object file are some of the most important.
2.1.2 Linking Fortran programs with C and C++
To use Fortran functions within C or C++ programs, the naming used for the Fortran functions in C or C++ has to be cognizant of the way the names in the source files are altered in the object file. We want the names to agree in the object files because it is the object files and not the source files that get linked against each other.
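With gfortran and many other Unix Fortran compilers, a Fortran subroutine's name is lowercased and given a trailing underscore in the object file, and every argument is passed by reference. A hedged sketch of calling a Fortran subroutine from C++ under that convention (the subroutine name is hypothetical):

```cpp
// Fortran side (hypothetical):  subroutine daxpy_like(n, a, x, y)
// Under the common lowercase-plus-underscore convention, the object
// file symbol is daxpy_like_, and every argument is a pointer.
extern "C" void daxpy_like_(const int *n, const double *a,
                            const double *x, double *y);

void call_from_cpp(double a, const double *x, double *y, int n) {
    daxpy_like_(&n, &a, x, y);   // pass everything by address
}
```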
2.2 Using BLAS and LAPACK libraries
BLAS and LAPACK are widely used numerical linear algebra libraries.
2.2.1 Arrays, matrices, and leading dimensions
We look at multidimensional arrays in C/C++ briefly and then at the distinction between column-major format and row-major format.
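A minimal sketch of column-major indexing with a leading dimension, the convention BLAS and LAPACK assume (the function name is illustrative):

```c
/* Entry (i, j) of an m-by-n matrix stored column-major with leading
 * dimension lda >= m (0-based indices).                              */
static inline double get(const double *a, int lda, int i, int j) {
    return a[i + (long)j * lda];   /* columns are contiguous */
}
/* In a row-major C array, by contrast, entry (i, j) is a[i*ncols + j]. */
```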
2.2.2 BLAS and LAPACK
BLAS and LAPACK functions, to which we now turn, typically have long argument lists.
2.2.3 C++ class interface to BLAS/LAPACK
The Vector class is an attempt to capture the general concept of vectors. In contrast, the LU_Solve class of this section is narrowly defined. It does just one thing, which is to provide an easy interface to LAPACK's LU solver functions dgetrf() and dgetrs().
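A hedged sketch of the underlying LAPACK calls such a class wraps, assuming the usual Fortran naming convention with a trailing underscore (error handling and hidden string-length arguments omitted; this is not the LU_Solve listing itself):

```cpp
extern "C" {
  void dgetrf_(const int *m, const int *n, double *a, const int *lda,
               int *ipiv, int *info);
  void dgetrs_(const char *trans, const int *n, const int *nrhs,
               const double *a, const int *lda, const int *ipiv,
               double *b, const int *ldb, int *info);
}

// Solve A x = b for one right-hand side; A is n-by-n, column-major,
// and is overwritten by its LU factors. Sketch only.
void lu_solve(double *A, double *b, int n) {
    int info;
    int *ipiv = new int[n];
    dgetrf_(&n, &n, A, &n, ipiv, &info);        // factor A = P L U
    const char trans = 'N';
    const int nrhs = 1;
    dgetrs_(&trans, &n, &nrhs, A, &n, ipiv, b, &n, &info);  // solve
    delete[] ipiv;
}
```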
2.3 Building programs using GNU Make
The organization of source files into directories and subdirectories is the heart of modular programming. The make utility provides a method for building a program from its source tree.
2.3.1 The utils/ folder
The modules in utils/ facilitate timing, generation of random numbers, gathering statistics, making tables, and manipulation of double arrays. All the modules in utils/ are used extensively. We avoid mentioning the modules in utils for the most part, but a brief discussion is given here.
test_lusolve.cpp (shows how the utils modules are used)
2.3.2 Targets, prerequisites, and dependency graphs
The first purpose of a Makefile is to capture the dependency graph between headers, sources, object files, and executables.
2.3.3 Make variables in makevars.mk
Almost all Makefiles have make variables.
2.3.4 Pattern rules in makevars.mk
The recipes have a formulaic character. For example, if the target is an object file to be built from a C++ source, the recipe generally invokes the C++ compiler specified by CPP using the options listed in CFLAGS. Pattern rules take advantage of the repetitive nature of recipes to simplify their specification.
2.3.5 Phony targets in makevars.mk
Phony targets are always assumed to be out of date.
2.3.6 Recursive make and .d files
When the source and object files required to build a single executable reside in several subdirectories, recursive make may be used to complete the build.
Makefile that is called recursively
2.3.7 Beyond recursive make
Modularity and simplicity are two virtues of recursive make. However, there are several problems with it.
2.3.8 Building your own library
We end our discussion of make by showing how to build and link static and shared libraries. Libraries provide a level of modularity beyond what is possible within a source tree. Any program that is linked against a library in effect treats the external library as a module.
Makefile for building static/shared library
2.4 The Fast Fourier Transform
In this last section, we look at the speed of a few implementations of the Fast Fourier Transform (FFT). Program speed is a major theme of the rest of this book.
2.4.1 The FFT algorithm in outline
The power-of-2 FFT moves through lg N levels. The arithmetic in each level consists of N/2 twiddle factor multiplications, N/2 additions, and N/2 subtractions.
2.4.2 FFT using MKL
2.4.3 FFT using FFTW
2.4.4 Cycles and histograms
How many cycles does a one-dimensional complex FFT of dimension 2^10 = 1024 take? Program performance is influenced by so many factors that the question is too simple to be answered.
2.4.5 Optimality of FFT implementations
What is the fastest possible speed of an FFT implementation? The many system features that intrude into the histograms hint that this question may not have a straightforward answer.
nr.cpp (FFT implementation from Numerical Recipes)
3 The Processor
- Overview of the x86 architecture (html)
- Compiler optimizations (html)
- Optimizing for the instruction pipeline (html)
Even for many simple programs, the compilers of today do not generate optimal code or anything like it. Good, or nearly optimal, programs for modern platforms can be written only with a knowledge of the nuances of the computer’s hardware. With that in mind, this chapter, indeed the rest of this book, will introduce programming models in close concert with computer architecture.
3.1 Overview of the x86 architecture
This section is a basic introduction to the x86 instruction architecture, which has dominated since its introduction in 1978. We look at registers and a little bit of assembly programming.
3.1.1 64-bit x86 architecture
When we write programs, we do so with an awareness of memory. Every variable name is ultimately the name for a segment of memory. However, in most programming, we have no awareness of the registers at all. Registers are very special locations because they are wired into the circuits for carrying out arithmetic and logical operations.
3.1.2 64-bit x86 assembly programming
The reader may find it a little puzzling that “double” (or “long”) is used for 32-bit operands and “quad” for 64-bit operands. A word in the original 8086 machine of 1978 was 16 bits. Thus, double words are 32 bits and quad words are 64 bits, as a result of a choice made long ago.
3.1.3 The Time Stamp Counter
To do anything useful with assembly code, it helps to have a method to make it part of a C or C++ program. The asm directive allows us to embed assembly code in a C or C++ program. RDTSC or rdtsc is an x86 machine instruction for reading the Time Stamp Counter. This instruction can be embedded in C/C++ programs.
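A minimal sketch of the standard rdtsc idiom with GCC-style inline assembly (not necessarily the exact code in the book): the instruction leaves the low 32 bits of the counter in eax and the high 32 bits in edx.

```c
#include <stdint.h>

/* Read the Time Stamp Counter. */
static inline uint64_t read_tsc(void) {
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Usage: t0 = read_tsc(); ...work...; cycles = read_tsc() - t0; */
```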
3.1.4 Cache parameters and the CPUID instruction
We will take a closer look at cache memory in the next chapter. Here we will use the cpuid instruction to ask the processor to report information about its cache and thus give us a preliminary idea of cache memory.
3.2 Compiler Optimizations
Anyone who seeks to write fast programs must begin by grasping what compilers do. C/C++ programs are turned into machine instructions by the compiler. The speed of the program can vary by a factor of 10, or even more, depending on the instruction stream the compiler generates. While generating machine instructions, compilers can alter the structure of the program considerably.
3.2.1 Preliminaries
Cache flush, which we describe here, is one way to eliminate cache effects. We also review compiler options. The C++ classes PyPlot for plotting, StatVector for gathering simple statistics, and Table for making tables are used extensively in the source code for this book. However, they appear only rarely in the text. All three classes are described in the appendix.
3.2.2 Loop unrolling
Table 3.2 shows that turning on compiler optimization doubles the speed of the code. Rewriting leibniz() to enable loop unrolling, which is one of the most important optimizations, doubles the speed once again. We will use the leibniz() function and its variants to explain how loop unrolling works.
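To fix ideas, a hedged sketch of the kind of rewriting meant here (not the book's leibniz() listing): the unrolled version keeps two independent partial sums, which breaks the dependence on a single accumulator and exposes instruction-level parallelism.

```c
/* Sum n terms of the Leibniz series 1 - 1/3 + 1/5 - ... (= pi/4). */
double leibniz(long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += (i % 2 == 0 ? 1.0 : -1.0) / (2.0 * i + 1.0);
    return s;
}

/* Unrolled by 2: two independent accumulators, assuming n is even. */
double leibniz_unrolled(long n) {
    double s0 = 0.0, s1 = 0.0;
    for (long i = 0; i < n; i += 2) {
        s0 += 1.0 / (2.0 * i + 1.0);     /* positive terms */
        s1 -= 1.0 / (2.0 * i + 3.0);     /* negative terms */
    }
    return s0 + s1;
}
```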
3.2.3 Loop fusion
Because the speed of modern x86 computers relies on instruction-level parallelism and vector registers, it is often a good idea to have a lot of parallelism in the innermost loops. Loop fusion addresses the situation where we have two distinct loops. It is assumed that the iterations of each loop have dependencies that make loop unrolling ineffective. In such a scenario, merging the bodies of the two loops may be the best way to produce a loop body that is amenable to instruction-level parallelism. The transformation where distinct loop bodies are merged is called loop fusion.
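A hedged sketch of the transformation (the arrays and recurrences are made up for illustration): each loop below carries a dependence through its own accumulator, so neither benefits from unrolling, but fusing them gives the processor two independent chains to work on concurrently.

```c
/* Two separate loops, each with a loop-carried dependence. */
void before(const double *a, const double *b, long n,
            double *suma, double *prodb) {
    double sa = 0.0, pb = 1.0;
    for (long i = 0; i < n; i++) sa += a[i];
    for (long i = 0; i < n; i++) pb *= b[i];
    *suma = sa; *prodb = pb;
}

/* After loop fusion: the two dependence chains are interleaved. */
void after(const double *a, const double *b, long n,
           double *suma, double *prodb) {
    double sa = 0.0, pb = 1.0;
    for (long i = 0; i < n; i++) {
        sa += a[i];
        pb *= b[i];
    }
    *suma = sa; *prodb = pb;
}
```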
3.2.4 Unroll and jam
In this section, we will study a compiler optimization that applies to nested loops. So far we have considered loop unrolling and loop fusion. In loop nests, it may be desirable to unroll the outer loop and fuse several copies of the inner loop that result. That is the unroll and jam transformation.
3.2.5 Loop interchange
Loop interchange is the next compiler optimization for discussion, and matrix multiplication is the example we use to bring it out.
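A minimal sketch of the idea on matrix multiplication in row-major C storage (illustrative, not the book's listing): interchanging the two inner loops turns the strided access of b into a unit-stride access.

```c
/* C = C + A*B, all n-by-n, row-major: entry (i, j) of c is c[i*n + j]. */
void mult_ijk(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];  /* b strided by n */
}

/* After interchanging the j and k loops, b and c are accessed with
 * unit stride in the innermost loop.                                  */
void mult_ikj(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}
```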
3.2.6 C++ overhead
Makefile (showing inlining for C++)
3.2.7 A little compiler theory
The compiler’s view of a program is quite different from that of a human programmer. A human programmer has a problem to be solved and an idea to solve that problem that is expressed as a computer program. What the compiler sees is a sequence of statements that obey the syntactic rules of the programming language. The global view of what the program does is lost.
3.3 Optimizing for the instruction pipeline
Earlier, we found that MKL’s fast Fourier transform and matrix multiplication can be 10 times faster than ordinary C/C++ programs. If the C++ program is written without care, the speedup can even be a factor of 100. What does Intel’s MKL do to be so much faster than ordinary C/C++ programs? The biggest part of the answer to that question is optimizing for the instruction pipeline, which is the topic of the present section.
3.3.1 Instruction pipelines
The automobile assembly line offers a useful point of comparison to clarify concepts. An automobile assembly line may take 48 hours to assemble a single car. However, thanks to pipelining and multiple assembly lines, the factory may produce a car every minute. For such a factory, the latency would be 48 hours and the bandwidth 1 car per minute.
3.3.2 Chipsets
The processors are central to the computer but are only one among the many components that make up a computer or compute node. Some of the other components are DRAM memory, graphics processor, hard disk, solid state storage, network adapter, keyboard, and monitor. To understand how the processor is connected to all the components, we have to look at chipsets. Chipsets are chips used to assemble computers from many components.
3.3.3 Peak floating point performance
In this section, we write a few programs that do nothing meaningful but that get close to the peak performance of 4 flops per cycle using SSE2 instructions. Optimization for the more recent AVX2 instruction set is dealt with in the exercises.
3.3.4 Microkernel for matrix multiplication
In this section, we write programs that multiply matrices of dimensions 4 × n and n × 4 using slightly more than 8n cycles. The microkernel with n = 200 is the basis of matrix multiplication routines given in the next chapter. The programs are limited to SSE2 instructions.
4 Memory
- DRAM and cache memory (html)
- Optimizing memory access (html)
- Reading from and writing to disk (html)
- Page tables and virtual memory (html)
In modern computers, including supercomputers, desktops, laptops, and all kinds of mobile devices, memory invariably refers to DRAM. Although file systems reside on hard disk or some other storage medium, when a program is running, much of the data that is handled is from DRAM.
4.1 DRAM and cache memory
To access a register, the latency is 1 cycle. To access the L1 cache, the latency is 4 cycles. The latency to DRAM, however, can be hundreds of cycles. It is estimated that 40% of the instructions access DRAM, and therefore hiding this large latency to DRAM is a major part of computer architectural design. It is important for the programmer to understand when this latency to DRAM can be hidden. Briefly, the latency to DRAM can be effectively hidden when the data access is sequential and predictable. It cannot be hidden in linked lists and other dynamic data structures because the location of the next item is determined by a link from the present item. However, even in such situations, programming techniques can mitigate the latency to DRAM.
4.1.1 DRAM memory
At the finest level, DRAM is an array of bits. Arrays of bits are organized into banks, ranks, and channels. There is a memory controller on each processor package that drives the memory channels.
4.1.2 Cache memory
An instruction such as movq %rax, %rdx, which moves one register to another, takes a single cycle. However, a load instruction such as movq (%rsi), %rdx, which moves a quad word from memory to a register, can take more than 100 cycles.
4.1.3 Physical memory and virtual memory
Virtual memory is implemented by the operating system and the processor hardware working in concert. Virtual memory is an illusion that simplifies programming, compilation, and linking in addition to keeping programs from interfering with each other. As the program runs, the memory addresses generated by a program are automatically mapped to physical addresses in the DRAM memory.
4.1.4 Latency to DRAM memory: First attempts
In this section, we make our first attempts at measuring the access time to DRAM memory. All our attempts fail, but we are led into certain aspects of the memory system that have a bearing on program performance.
array_walk.hh (accessList in text renamed to chain_walk in code)
4.1.5 Latency to DRAM
Our earlier attempts to measure latency to DRAM memory failed because we did not account for the overhead of creating and accessing page table entries. The more careful program in this section breaks instruction-level parallelism, ensures that none of the cache lines accessed is from L1, L2, or L3 cache, and accesses all of the 256 cache lines within four pages of memory (so that TLB misses are not a factor).
4.2 Optimizing memory access
In this section, we look at optimization of memory access using three examples.
4.2.1 Bandwidth to DRAM
The most predictable and common pattern of memory access is to access a long line of data in sequence. Every memory controller is likely to be optimized to handle that pattern efficiently.
4.2.2 Matrix transpose
The example we study here is out-of-place matrix transpose. It is always best to access data sequentially with unit stride, but when a matrix is transposed to another matrix stored in the same column-major format, there is no way to access both matrices with unit stride. In many problems of this type, it is useful to break up the data into blocks.
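A hedged sketch of blocking an out-of-place transpose (column-major storage, square blocks; the names and the block size are illustrative): within a block, both matrices stay within a small working set, so the non-unit-stride accesses still hit cache.

```c
#define B 32   /* block size, chosen for illustration */

/* bt = transpose(a); both n-by-n, column-major, n a multiple of B. */
void transpose_blocked(const double *a, double *bt, int n) {
    for (int jb = 0; jb < n; jb += B)
        for (int ib = 0; ib < n; ib += B)
            for (int j = jb; j < jb + B; j++)
                for (int i = ib; i < ib + B; i++)
                    bt[j + (long)i * n] = a[i + (long)j * n];
}
```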
4.2.3 Optimized matrix multiplication
Here we assume the matrices to be in DRAM memory and not in cache, and we show how to optimize matrix multiplication. In outline, the basic idea remains to multiply in blocks, but intermediate blocks are stored in scratch space and in convenient formats to minimize TLB and cache misses. As far as possible, data is stored in a format that enables sequential access with unit stride. The actual execution of this idea can make it look more complicated than it is, but the idea is elegant as well as possibly applicable to many other problems.
4.3 Reading from and writing to disk
The data in registers and DRAM memory disappears when the computer is powered off. In contrast, hard disk storage is permanent. The hard disk is a collection of platters. Bits stored on circular tracks on either side of the platters are turned on and off using a magnetic field. More than 100 billion bits can be packed into a single square inch of a platter.
4.3.1 C versus C++
With regard to the programming technique for reading and writing files, the simplest lesson is also the most valuable. The C interface can be much faster than the C++ interface as we show and for reasons we explain.
4.3.2 Latency to disk
The measurement of latency to hard disk brings up issues not unlike the ones encountered in measuring the latency to DRAM memory. The true latency to hard disk is not easily visible to simple computer programs.
4.3.3 Bandwidth to disk
A single array cannot exceed the available DRAM, although it is desirable to work with files much larger than the size of DRAM memory when estimating bandwidth to disk.
4.4 Page tables and virtual memory
In this section, we make our first foray into the operating system kernel. The cost of a single write to memory by an instruction such as movq %rax, (%rsi) brings in many layers of complexity. It could be a cache hit or a cache miss. If the DRAM memory is accessed, the cost of the access depends on preceding and following memory accesses. It depends on the manner in which the memory controllers operate the DRAM devices. It depends on the parallelism in the instruction stream. It depends on the pressure on dispatch ports in the instruction pipeline among other factors. On top of the layers of complexity engineered into hardware, the operating system introduces yet more complexity.
4.4.1 Partitioning the virtual address space
Much of virtual address space is an unclaimed wilderness. Figure 4.8 is an incomplete schematic view of the partitioning of the virtual address space of a typical user process. Much of the virtual address space is taken up by the unutilized regions shown as empty gaps.
4.4.2 Physical address space and page tables
Program speed is influenced by the paging system in several ways. If a memory word is found in L1 cache and its virtual address is found in TLB, the latency of the memory access would be just 4 cycles. However, if there is a TLB miss, the latency can go up to several hundred cycles. If there is a page fault, the latency of a single memory access can go up to millions, even billions, of cycles.
5 Threads and Shared Memory
- Introduction to OpenMP (html)
- Optimizing OpenMP programs (html)
- Introduction to Pthreads (html)
- Program memory (html)
Programming with threads is a paradigm of great range and utility that encompasses everything from cell phones to web servers to supercomputers.
5.1 Introduction to OpenMP
OpenMP programs look much like sequential programs, and the syntax is easy to learn. The simplicity is mostly illusory, however. Whenever concurrent programs share memory, as OpenMP programs do, the programming model inevitably becomes very subtle.
5.1.1 OpenMP syntax
OpenMP syntax is beguilingly simple.
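As a taste of that simplicity, a hedged sketch (not a listing from the book) of parallelizing a reduction with a single directive:

```c
#include <omp.h>

/* Sum the Leibniz series with one OpenMP pragma (illustrative). */
double leibniz_omp(long n) {
    double s = 0.0;
#pragma omp parallel for reduction(+:s)
    for (long i = 0; i < n; i++)
        s += (i % 2 == 0 ? 1.0 : -1.0) / (2.0 * i + 1.0);
    return s;
}
```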
5.1.2 Shared variables and OpenMP's memory model
Implicit flushes are a vital part of OpenMP's memory model.
5.1.3 Overheads of OpenMP constructs
The parallel, barrier, and for constructs introduce overheads. Work may need to be assigned to threads, threads may need to be created and destroyed, or synchronization and serialization may need to be implemented using system calls to the operating system kernel. These activities consume cycles. If the parallelized task is too small, the benefits of parallelization will be overwhelmed by the overheads. Effective programming requires knowledge of the overheads of OpenMP constructs.
5.2 Optimizing OpenMP programs
It is commonly believed that a program running on 16 cores is 16 times faster than a program that runs on 1 core, ignoring communication costs. This belief is completely mistaken. Bandwidth to memory does not scale linearly with the number of cores. Although linear speedup is often claimed in scientific computing research, such claims are a consequence of the program not going out of cache, a surprisingly common occurrence. Algorithms that use memory in a nontrivial manner and achieve linear speedup are rare.
5.2.1 Near memory and far memory
It is advantageous if a page frame that is mostly used by a thread resides in near memory. If page frames are in far memory, the speed of the program can degrade by more than a factor of two.
5.2.2 Bandwidth to DRAM memory
Table 5.3 lists the read, write, and copy bandwidths for two different computers. In neither case do we get anything close to linear speedup for reading and copying.
5.2.3 Matrix transpose
The bandwidth realized in transposing is nearly 80% of the bandwidth for copying. Getting to 80% of the best possible in a matrix transpose is quite good.
5.2.4 Fast Fourier transform
The FFT speedups are quite good and reach 75% of linear speedup.
5.3 Introduction to Pthreads
OpenMP is a limited programming model. It applies mainly to those situations where the data is static and the access patterns are regular. Pthreads are a more powerful programming model.
The Pthread interface for creating and running threads is supported by the Linux kernel. In Linux, each thread is a separate process. Thread creation is the same as process creation. The distinguishing feature of a group of threads is that they all use the same page tables.
5.3.1 Pthreads
Pthreads are initiated using functions that take a single argument of type void * and return a single value also of type void *.
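A minimal create/join sketch showing that void*-to-void* shape (illustrative; the thread count and payload are arbitrary):

```c
#include <pthread.h>
#include <stdio.h>

void *work(void *arg) {             /* takes void*, returns void* */
    long rank = (long)arg;
    printf("hello from thread %ld\n", rank);
    return NULL;
}

int main() {
    pthread_t t[4];
    for (long r = 0; r < 4; r++)
        pthread_create(&t[r], NULL, work, (void *)r);
    for (long r = 0; r < 4; r++)
        pthread_join(t[r], NULL);   /* wait for each thread to finish */
    return 0;
}
```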
5.3.2 Overhead of thread creation
The typical cost of creating and destroying Pthreads appears to be somewhat less than 10^5 cycles. That number is not unreasonable given that each process descriptor used by the kernel is nearly 6 KB. The cost of creating threads will vary from system to system, but the numbers are qualitatively the same on many different systems. The cost of creating a thread per core may be expected to be higher on computers with multiple processor packages.
5.3.3 Parallel regions using Pthreads
OpenMP-style parallel regions are implemented using Pthreads in this section. The first implementation is plain C, except for creating and launching Pthreads. The second and third implementations use spinlocks and mutexes, respectively. The final implementation uses condition variables. The C implementation highlights the role of cache coherence, which is fundamental to multithreaded programming. Propagating writes from cache to cache can cause significant overhead. The C implementation also introduces memory fences.
5.4 Program memory
Programs rely on operating systems in many more ways than most programmers realize. The little forays we make into the Linux kernel will help us understand segmentation faults, memory errors, and thread creation.
5.4.1 An easy system call
The operating system kernel offers its services to user programs through system calls. There are system calls for dealing with every single part of the computing system. There are system calls related to the file system, networks, memory management, and process creation and scheduling. Every printf() or malloc() begins life in the C library but finds its way into the operating system kernel through a system call.
linux.kernel.patch.11 (defines a new system call to control printing of messages).
linux.kernel.patch.22 (more modifications to print messages)
5.4.2 Stacks
Stacks are useful for maintaining the state of running processes. This application is so important that the stack is hardwired into the x86 instruction set as well as most other instruction sets.
In Linux, every process gets a kernel mode stack, in addition to its user mode stack. The user mode stack occupies a high region in the user area, and the kernel mode stack is in the kernel area of virtual memory.
5.4.3 Segmentation faults and memory errors
The use of pointers exposes the programmer to errors that corrupt memory. Memory errors can befuddle even expert programmers. An erroneous program with memory errors may work on some systems but crash on others. The point where the program crashes can be far away from the point where the error occurs. The program may work for some inputs but not for others. In multithreaded programs, memory errors may be triggered by race conditions that are hard to reproduce, making debugging difficult. In this section, we take a detailed look at segmentation faults and memory errors.
6 Special Topic: Networks and Message Passing
- MPI: Getting started (html)
- High-performance network architecture (html)
- MPI examples (html)
- The Internet (html)
The principal framework for programming high-performance networks is Message Passing Interface (MPI). MPI is a library that allows processes running concurrently on different nodes to communicate by sending and receiving messages.
The Internet is a different kind of network from the high-performance clusters targeted by MPI. The Internet's relevance to emerging areas of science is unquestionable. The huge volumes of investment that are flowing, and will continue to flow, into the Internet imply that it will be integrated more and more deeply into science and scientific computing. Conversely, one should not forget that the World Wide Web was invented by a physicist.
6.1 MPI: Getting started
MPICH, MVAPICH2, and Open MPI are the three major MPI implementations. The details of compilation, linking, and running vary between these implementations as well as between different sites. However, the general picture is the same.
6.1.1 Initializing MPI
MPI calls itself a fully featured library, with some justice: it has a lot of features. We will use only a thin sliver of MPI.
6.1.2 Unsafe communication in MPI
MPI syntax allows a process to use either blocking or nonblocking function calls to send and receive. A blocking send or receive waits until the operation is complete. Blocking communication is more vulnerable to deadlocks.
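A hedged sketch of the classic pitfall (not a listing from the book): if two processes both call a blocking MPI_Send before either posts a receive, the exchange can deadlock once the messages are too large to buffer. Ordering the calls by rank, as below, or using MPI_Sendrecv avoids it.

```c
#include <mpi.h>

/* Exchange an array with a partner rank without risking deadlock. */
void exchange(double *sendbuf, double *recvbuf, int n,
              int partner, int myrank) {
    if (myrank < partner) {          /* lower rank sends first */
        MPI_Send(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {                         /* higher rank receives first */
        MPI_Recv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    }
}
```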
6.2 High-performance network architecture
A network connects many computers together. Each computer is an independent entity. Thus, every communication between two computers requires coordination. The coordination is much looser in the vast and decentralized Internet than it is in high-performance networks used in scientific computing.
6.2.1 Fat-tree network
In many high-performance clusters, nodes are networked using fat-trees. A defining property of fat-trees is that the number of links between levels is roughly a constant.
6.2.2 Infiniband network architecture
In packet switched networks, such as Infiniband or the Ethernet, data in DRAM memory is converted to a sequence of packets by the source. The packets are transmitted across the network. The packets are reassembled at the destination and stored once again in DRAM memory. The change in data format is fundamental to packet switched networks.
6.3 MPI examples
In this section, we get more deeply into MPI syntax. Our account of MPI syntax is not extensive. However, every bit of syntax introduced is related to computer architecture. The discussion of Infiniband network architecture in the previous section will help us learn MPI with a double focus on syntax and its effectiveness.
6.3.1 Variants of MPI send and receive
Function calls that send and receive messages are the heart of the MPI library. An MPI user who masters sending and receiving messages, but not much more, can get by quite well.
6.3.2 Jacobi iteration
The Jacobi iteration is a technique for solving linear systems that arise when certain partial differential equations are discretized. Jacobi is an iterative method.
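For reference, a hedged single-process sketch of one Jacobi sweep for A x = b (the MPI version discussed in the book partitions the work across processes; the storage convention here is illustrative):

```c
/* One Jacobi sweep: xnew[i] = (b[i] - sum_{j != i} a(i,j) x[j]) / a(i,i),
 * with A stored row-major as a[i*n + j].                                 */
void jacobi_sweep(const double *a, const double *b,
                  const double *x, double *xnew, int n) {
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < n; j++)
            if (j != i) s -= a[i*n + j] * x[j];
        xnew[i] = s / a[i*n + i];
    }
}
```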
6.3.3 Matrix transpose
In this section, we implement the transposition of large matrices stored on several host computers connected by a QDR Infiniband network. The number of host computers ranges from 10 to 100. This example illustrates the importance of optimizing copying into and out of buffers, as well as the manner in which MPI can be made to issue RDMA reads and writes so that network activity occurs in parallel with processor activity.
6.3.4 Collective communication
The MPI library provides several function calls for collective communication. A collective function call must be made by all processes in a group. The result of collective calls such as Bcast, Scatter, Gather, and Alltoall is data transfer involving all the processes.
6.3.5 Parallel I/O in MPI
In parallel I/O, a number of MPI processes simultaneously write to or read from the same file. Typical file systems lock a file when it is being accessed by one process, which precludes other processes from accessing it. Parallel I/O is possible only if the file system allows it.
6.4 The Internet
The main function of the Internet, as it exists today, is to connect people. There is no doubt that the Internet is one of the most successful technologies. The Internet is not the result of a single act of invention. Even now its design is evolving to accommodate new uses and technologies.
6.4.1 IP addresses
The IP address is crucial to the architecture of the Internet. Data leaves the source computer in the form of TCP/IP packets. The TCP/IP packets hop from router to router until they end up at the destination.
6.4.2 Send and receive
Every network interface comes down to sending and receiving data.
6.4.3 Server
The server described here receives a sequence of double-precision numbers from its client and sends back their partial sums.
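A hedged sketch of the server-side socket calls such a program rests on (the port number is arbitrary and error handling is omitted; the book's server does considerably more):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Minimal TCP server skeleton: bind to a port, accept one client. */
int accept_one_client(unsigned short port) {
    int listenfd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);  /* any local interface */
    addr.sin_port = htons(port);

    bind(listenfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listenfd, 1);                       /* queue of length 1 */

    int connfd = accept(listenfd, NULL, NULL); /* blocks until a client connects */
    close(listenfd);
    return connfd;   /* caller recv()s doubles and send()s partial sums */
}
```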
6.4.4 Client
The client mirrors the server in some ways but differs in others. The client begins by establishing a connection with the server.
6.4.5 Internet latency
All measurements were performed by running the server at the Texas Advanced Computing Center (TACC) and the client at the University of Michigan (UM). The median time for send + recv was close to 55 milliseconds in several trials. Understanding why the latency is 55 milliseconds is a fascinating problem.
6.4.6 Internet bandwidth
The realized bandwidth is determined to a great extent by TCP’s congestion control algorithm. TCP implements both flow control and congestion control. In flow control, the sender keeps track of the available room in the receiver’s buffer. The sender slows down if there is too little room in the receiver’s buffer. The sender continually adjusts its speed to avoid overwhelming (or starving) the receiver with packets.
7 Special Topic: The Xeon Phi Coprocessor
The Intel Many Integrated Cores (MIC) or Xeon Phi coprocessor supplements the processor nodes and increases floating point throughput.
7.1 Xeon Phi architecture
In this section, we give an overview of the Xeon Phi's architecture in three steps. The first step is to calculate the peak floating point bandwidth of the Phi. The second step is to introduce the thread picker. Finally, we reuse OpenMP programs to measure the Phi's latency and bandwidth to memory.
7.1.1 Peak floating point bandwidth
The peak bandwidth is approached by dense matrix multiplication and LU factorization. However, the peak floating point bandwidth is an irrelevant theoretical number for almost every other application, and even the FFT realizes only 15% of the peak, as we will see later.
7.1.2 A simple Phi program
From here onward, we use Phi device and MIC device interchangeably. We begin by explaining why the right number of threads on a MIC/Phi device is typically four times the number of cores.
7.1.3 Xeon Phi memory system
Latency and bandwidth to memory are measured by making slight changes to earlier programs.
7.2 Offload
In offload mode, the master thread of the main program is assumed to be on the host. However, the program holds several segments that are meant to be outsourced to the Phi devices.
Note: offloading may become irrelevant in newer generations of the Phi.
7.2.1 Initializing to use the MIC device
In all the offloading programs, it is assumed that mic_init() is called at the beginning.
7.2.2 The target(mic) declaration specification
The icpc compiler has the ability to compile a function written in C/C++ to produce assembly for both the host computer and MIC device.
7.2.3 Summing the Leibniz series
We will discuss four different programs to sum the Leibniz series to compute pi. Together, the programs expose nearly all the offload syntax one needs to know.
7.2.4 Offload bandwidth
The PCIe bus, which connects the host with the MIC devices, has low bandwidth. Its bandwidth is more than an order of magnitude less than the bandwidth to memory of the MIC devices or even the host. Data transfer between the host and the Xeon Phi coprocessors imposes serious limits on the advantages as well as utility of the coprocessor model.
7.3 Two examples: FFT and matrix multiplication
We look at the FFT, the Phi's instruction pipeline, a microkernel in assembly for matrix multiplication, and the MKL library.
7.3.1 FFT
The FFT is a good basis for comparing the Phi against its AVX host. It is one of the most frequently employed scientific algorithms. Unlike in the LINPACK benchmark employed to rate supercomputers, the cost of memory accesses cannot be completely hidden.
7.3.2 Matrix multiplication
Algorithms that are capable of approaching the theoretical peak are not many. These algorithms must involve a great number of arithmetic operations relative to the data they access. In addition, their structure must permit hiding the cost of multiple accesses of the same data item. Dense numerical linear algebra is the main source of such algorithms.
8 Special Topic: Graphics Coprocessor
The graphics libraries provide a number of functions for rendering, shading, texture manipulation, and similar tasks. Graphics devices accelerate the execution of such library functions. In 2007, NVIDIA introduced the Compute Unified Device Architecture (CUDA) framework. CUDA added software layers to the device drivers and the GNU C/C++ compiler to greatly simplify the task of programming graphics coprocessors for scientific use.
8.1 Graphics coprocessor architecture
Basic knowledge of registers, warps, and thread blocks is essential for CUDA programming.
8.1.1 Graphics processor capability
NVIDIA graphics processors come in a great variety. The number of devices that are CUDA enabled and are of compute capability anywhere from 1.0 to 3.5 is more than 100. The first task is to find out exactly which graphics processor is available and what its capability is.
8.1.2 Host and device memory
To set up any computation on the graphics device, data must be transferred from host to device memory. To retrieve the result of a computation on the graphics device, data must be transferred in the other direction from device to host memory.
dhstmem.hh
8.1.3 Timing CUDA kernels
The simplest programs have much to teach us as long as we time them carefully. Therefore, we begin by looking at mechanisms to time kernels that run on the graphics coprocessor.
8.1.4 Warps and thread blocks
The K20’s design is so different from that of conventional processors that to program it we need to understand a few things about its architecture. The heart of the microarchitecture is the streaming multiprocessor, and there are 13 of these in Kepler.
8.2 Introduction to CUDA
The first program for summing the Leibniz series is perhaps not so much harder to code than the corresponding OpenMP program, which runs on the Phi or any x86 machine. However, the second program, in which the entire sum is found on the K20, is easily 100 times harder to code.
8.2.1 Summing the Leibniz series
Threads are grouped into warps and warps into thread blocks. Thread blocks are arranged in a grid. The hierarchical grouping of threads can never be forgotten in writing CUDA programs.
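A hedged CUDA sketch of the first kind of program described here (each thread sums its own strided slice of the series and writes a partial sum; the host code that copies the partial sums back and adds them is not shown, and the launch configuration is arbitrary):

```cuda
// Each thread sums the terms of the Leibniz series assigned to it by
// its global index, striding by the total number of threads (sketch).
__global__ void leibniz_kernel(long n, double *partial) {
    long tid = blockIdx.x * blockDim.x + threadIdx.x;
    long nthreads = (long)gridDim.x * blockDim.x;
    double s = 0.0;
    for (long i = tid; i < n; i += nthreads)
        s += (i % 2 == 0 ? 1.0 : -1.0) / (2.0 * i + 1.0);
    partial[tid] = s;   // one partial sum per thread
}
// Launch example: leibniz_kernel<<<56, 256>>>(n, d_partial);
```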
8.2.2 CUDA compilation
The improvements made by NVIDIA from Tesla to Fermi to Kepler instruction set architectures appear quite substantial. The freedom to make such substantial changes while still being able to run older executables on newer devices is a consequence of just-in-time compilation of PTX embedded in the executables.
8.3 Two examples
In this section, we look at two more examples to better understand the CUDA programming model and the speed of the K20 graphics device.
8.3.1 Bandwidth to memory
Blocking memory accesses is the right thing to do on x86 processors. On the K20 and other graphics devices, blocking will lose more than 90% of the bandwidth. Partly because instructions are executed warp by warp, memory accesses must be interleaved.
8.3.2 Matrix multiplication
To approach peak bandwidth, there is no choice but to optimize for the instruction pipeline. We do not consider such optimization here, although the basic principles of optimizing for the instruction pipeline do not change drastically from what is described in chapter 3.
Accesses of shared memory are considerably faster than global memory accesses.
A Machines Used, Plotting, Python, GIT, Cscope, and gcc
- Machines used (html)
- Plotting in C/C++ and other preliminaries (html)
- C/C++ versus Python versus MATLAB (html)
- GIT (html)
- Cscope (html)
- Compiling with gcc/g++ (html)
In this appendix, we give a list of machines used in this book as well as several pointers for downloading and using the program code.
cln (program to run "make clean" recursively, "cln +h" for help)
A.1 Machines used
A table of machines we used to time programs is presented.
A.2 Plotting in C/C++ and other preliminaries
In this section, we describe C++ classes for plotting, gathering statistics, and making tables. These classes are used throughout the book. However, in almost every instance, the code showing how these facilities are invoked is suppressed.
A.3 C/C++ versus Python versus MATLAB
How much faster C/C++ can be relative to interpreted languages such as Python and MATLAB is often not understood.
A.4 GIT
The git utility is a tool for managing sources. Although there are many facilities in GIT, GIT can be grasped easily by paying attention to its internal design.
A.5 Cscope
The cscope utility is invaluable for browsing source code.
A.6 Compiling with gcc/g++
The Makefiles in this GIT repository use Intel compilers for the most part. The switch to gcc/g++ is not too complicated.