\QuickQAC{chp:Introduction}{Introduction}
\QuickQ{}
Come on now!!!
Parallel programming has been known to be exceedingly
hard for many decades.
You seem to be hinting that it is not so hard.
What sort of game are you playing?
\QuickA{}
If you really believe that parallel programming is exceedingly
hard, then you should have a ready answer to the question
``Why is parallel programming hard?''
One could list any number of reasons, ranging from deadlocks to
race conditions to testing coverage, but the real answer is that
{\em it is not really all that hard}.
After all, if parallel programming was really so horribly difficult,
how could a large number of open-source projects, ranging from Apache
to MySQL to the Linux kernel, have managed to master it?
A better question might be: ``Why is parallel programming {\em
perceived} to be so difficult?''
To see the answer, let's go back to the year 1991.
Paul McKenney was walking across the parking lot to Sequent's
benchmarking center carrying six dual-80486 Sequent Symmetry CPU
boards, when he suddenly realized that he was carrying several
times the price of the house he had just purchased.\footnote{
Yes, this sudden realization {\em did} cause him to walk quite
a bit more carefully.
Why do you ask?}
This high cost of parallel systems meant that
parallel programming was restricted to a privileged few who
worked for an employer who either manufactured or could afford to
purchase machines costing upwards of \$100,000 --- in 1991 US dollars.
In contrast, in 2006, Paul finds himself typing these words on a
dual-core x86 laptop.
Unlike the dual-80486 CPU boards, this laptop also contains
2GB of main memory, a 60GB disk drive, a display, Ethernet,
USB ports, wireless, and Bluetooth.
And the laptop is more than an order of magnitude cheaper than
even one of those dual-80486 CPU boards, even before taking inflation
into account.
Parallel systems have truly arrived.
They are no longer the sole domain of a privileged few, but something
available to almost everyone.
The earlier restricted availability of parallel hardware is
the \emph{real} reason that parallel programming is considered
so difficult.
After all, it is quite difficult to learn to program even the simplest
machine if you have no access to it.
Since the age of rare and expensive parallel machines is for the most
part behind us, the age during which
parallel programming is perceived to be mind-crushingly difficult is
coming to a close.\footnote{
Parallel programming is in some ways more difficult than
sequential programming, for example, parallel validation
is more difficult.
But no longer mind-crushingly difficult.}
\QuickQ{}
How could parallel programming \emph{ever} be as easy
as sequential programming?
\QuickA{}
It depends on the programming environment.
SQL~\cite{DIS9075SQL92} is an underappreciated success
story, as it permits programmers who know nothing about parallelism
to keep a large parallel system productively busy.
We can expect more variations on this theme as parallel
computers continue to become cheaper and more readily available.
For example, one possible contender in the scientific and
technical computing arena is MATLAB*P,
which is an attempt to automatically parallelize common
matrix operations.
Finally, on Linux and UNIX systems, consider the following
shell command:
{\small \tt get\_input | grep "interesting" | sort}
This shell pipeline runs the \co{get_input}, \co{grep},
and \co{sort} processes in parallel.
There, that wasn't so hard, now was it?
\QuickQ{}
Oh, really???
What about correctness, maintainability, robustness, and so on?
\QuickA{}
These are important goals, but they are just as important for
sequential programs as they are for parallel programs.
Therefore, important though they are, they do not belong on
a list specific to parallel programming.
\QuickQ{}
And if correctness, maintainability, and robustness don't
make the list, why do productivity and generality?
\QuickA{}
Given that parallel programming is perceived to be much harder
than is sequential programming, productivity is paramount and
therefore must not be omitted.
Furthermore, high-productivity parallel-programming environments
such as SQL have been special purpose, hence generality must
also be added to the list.
\QuickQ{}
Given that parallel programs are much harder to prove
correct than are sequential programs, again, shouldn't
correctness \emph{really} be on the list?
\QuickA{}
From an engineering standpoint, the difficulty in proving
correctness, either formally or informally, would be important
insofar as it impacts the primary goal of productivity.
So, in cases where correctness proofs are important, they
are subsumed under the ``productivity'' rubric.
\QuickQ{}
What about just having fun?
\QuickA{}
Having fun is important as well, but, unless you are a hobbyist,
would not normally be a \emph{primary} goal.
On the other hand, if you \emph{are} a hobbyist, go wild!
\QuickQ{}
Are there no cases where parallel programming is about something
other than performance?
\QuickA{}
There are certainly cases where the problem to be solved is
inherently parallel, for example, Monte Carlo methods and
some numerical computations.
Even in these cases, however, there will be some amount of
extra work managing the parallelism.
\QuickQ{}
Why all this prattling on about non-technical issues???
And not just \emph{any} non-technical issue, but \emph{productivity}
of all things?
Who cares?
\QuickA{}
If you are a pure hobbyist, perhaps you don't need to care.
But even pure hobbyists will often care about how much they
can get done, and how quickly.
After all, the most popular hobbyist tools are usually those
that are the best suited for the job, and an important part of
the definition of ``best suited'' involves productivity.
And if someone is paying you to write parallel code, they will
very likely care deeply about your productivity.
And if the person paying you cares about something, you would
be most wise to pay at least some attention to it!
Besides, if you \emph{really} didn't care about productivity,
you would be doing it by hand rather than using a computer!
\QuickQ{}
Given how cheap parallel hardware has become, how can anyone
afford to pay people to program it?
\QuickA{}
There are a number of answers to this question:
\begin{enumerate}
\item Given a large computational cluster of parallel machines,
the aggregate cost of the cluster can easily justify
substantial developer effort, because the development
cost can be spread over the large number of machines.
\item Popular software that is run by tens of millions of users
can easily justify substantial developer effort,
as the cost of this development can be spread over the tens
of millions of users.
Note that this includes things like kernels and system
libraries.
\item If the low-cost parallel machine is controlling the operation
of a valuable piece of equipment, then the cost of this
piece of equipment might easily justify substantial
developer effort.
\item If the software for the low-cost parallel machine produces an
extremely valuable result (e.g., mineral exploration),
then the valuable result might again justify substantial
developer cost.
\item Safety-critical systems protect lives, which can clearly
justify very large developer effort.
\item Hobbyists and researchers might seek knowledge, experience,
fun, or glory rather than mere money.
\end{enumerate}
So it is not the case that the decreasing cost of hardware renders
software worthless, but rather that it is no longer possible to
``hide'' the cost of software development within the cost of
the hardware, at least not unless there are extremely large
quantities of hardware.
\QuickQ{}
This is a ridiculously unachievable ideal!
Why not focus on something that is achievable in practice?
\QuickA{}
This is eminently achievable.
The cellphone is a computer that can be used to make phone
calls and to send and receive text messages with little or
no programming or configuration on the part of the end user.
This might seem to be a trivial example at first glance,
but if you consider it carefully you will see that it is
both simple and profound.
When we are willing to sacrifice generality, we can achieve
truly astounding increases in productivity.
Those who cling to generality will therefore fail to set
the productivity bar high enough to succeed in production
environments.
\QuickQ{}
What other bottlenecks might prevent additional CPUs from
providing additional performance?
\QuickA{}
There are any number of potential bottlenecks:
\begin{enumerate}
\item Main memory. If a single thread consumes all available
memory, additional threads will simply page themselves
silly.
\item Cache. If a single thread's cache footprint completely
fills any shared CPU cache(s), then adding more threads
will simply thrash the affected caches.
\item Memory bandwidth. If a single thread consumes all available
memory bandwidth, additional threads will simply
result in additional queuing on the system interconnect.
\item I/O bandwidth. If a single thread is I/O bound,
adding more threads will simply result in them all
waiting in line for the affected I/O resource.
\end{enumerate}
Specific hardware systems might have any number of additional
bottlenecks.
\QuickQ{}
What besides CPU cache capacity might require limiting the
number of concurrent threads?
\QuickA{}
There are any number of potential limits on the number of
threads:
\begin{enumerate}
\item Main memory. Each thread consumes some memory
(for its stack if nothing else), so that excessive
numbers of threads can exhaust memory, resulting
in excessive paging or memory-allocation failures.
\item I/O bandwidth. If each thread initiates a given
amount of mass-storage I/O or networking traffic,
excessive numbers of threads can result in excessive
I/O queuing delays, again degrading performance.
Some networking protocols may be subject to timeouts
or other failures if there are so many threads that
networking events cannot be responded to in a timely
fashion.
\item Synchronization overhead.
For many synchronization protocols, excessive numbers
of threads can result in excessive spinning, blocking,
or rollbacks, thus degrading performance.
\end{enumerate}
Specific applications and platforms may have any number of additional
limiting factors.
\QuickQ{}
Are there any other obstacles to parallel programming?
\QuickA{}
There are a great many other potential obstacles to parallel
programming.
Here are a few of them:
\begin{enumerate}
\item The only known algorithms for a given project might
be inherently sequential in nature.
In this case, either avoid parallel programming
(there being no law saying that your project \emph{has}
to run in parallel) or invent a new parallel algorithm.
\item The project allows binary-only plugins that share the same
address space, such that no one developer has access to
all of the source code for the project.
Because many parallel bugs, including deadlocks, are
global in nature, such binary-only plugins pose a severe
challenge to current software development methodologies.
This might well change, but for the time being, all
developers of parallel code sharing a given address space
need to be able to see \emph{all} of the code running in
that address space.
\item The project contains heavily used APIs that were designed
without regard to
parallelism~\cite{HagitAttiya2011LawsOfOrder}.
Some of the more ornate features of the System V
message-queue API form a case in point.
Of course, if your project has been around for a few
decades, and if its developers did not have access to
parallel hardware, your project undoubtedly has at least
its share of such APIs.
\item The project was implemented without regard to parallelism.
Given that there are a great many techniques that work
extremely well in a sequential environment, but that
fail miserably in parallel environments, if your project
ran only on sequential hardware for most of its lifetime,
then your project undoubtedly has at least its share of
parallel-unfriendly code.
\item The project was implemented without regard to good
software-development practice.
The cruel truth is that shared-memory parallel
environments are often much less forgiving of sloppy
development practices than are sequential environments.
You may be well-served to clean up the existing design
and code prior to attempting parallelization.
\item The people who originally did the development on your
project have since moved on, and the people remaining,
while well able to maintain it or add small features,
are unable to make ``big animal'' changes.
In this case, unless you can work out a very simple
way to parallelize your project, you will probably
be best off leaving it sequential.
That said, there are a number of simple approaches that
you might use
to parallelize your project, including running multiple
instances of it, using a parallel implementation of
some heavily used library function, or making use of
some other parallel project, such as a database.
\end{enumerate}
One can argue that many of these obstacles are non-technical
in nature, but that does not make them any less real.
In short, parallelization of a large body of code
can be a large and complex effort.
As with any large and complex effort, it makes sense to
do your homework beforehand.
\QuickQ{}
Where are the answers to the Quick Quizzes found?
\QuickA{}
In Appendix~\ref{chp:Answers to Quick Quizzes} starting on
page~\pageref{chp:Answers to Quick Quizzes}.
Hey, I thought I owed you an easy one!
\QuickQ{}
Some of the Quick Quiz questions seem to be from the viewpoint
of the reader rather than the author.
Is that really the intent?
\QuickA{}
Indeed it is!
Many are questions that Paul E. McKenney would probably have
asked if he was a novice student in a class covering this material.
It is worth noting that Paul was taught most of this material by
parallel hardware and software, not by professors.
In Paul's experience, professors are much more likely to provide
answers to verbal questions than are parallel systems,
Watson notwithstanding.
Of course, we could have a lengthy debate over which of professors
or parallel systems provide the most useful answers to these sorts
of questions,
but for the time being let's just agree that usefulness of
answers varies widely across the population both of professors
and of parallel systems.
Other quizzes are quite similar to actual questions that have been
asked during conference presentations and lectures covering the
material in this book.
A few others are from the viewpoint of the author.
\QuickQ{}
These Quick Quizzes just are not my cup of tea.
What do you recommend?
\QuickA{}
There are a number of alternatives available to you:
\begin{enumerate}
\item Just ignore the Quick Quizzes and read the rest of
the book.
You might miss out on the interesting material in
some of the Quick Quizzes, but the rest of the book
has lots of good material as well.
This is an eminently reasonable approach if your main
goal is to gain a general understanding of the material
or if you are skimming through the book to find a
solution to a specific problem.
\item If you find the Quick Quizzes distracting but impossible
to ignore, you can always clone the \LaTeX{} source for
this book from the git archive.
Then modify \co{Makefile} and \co{qqz.sty} to eliminate
the Quick Quizzes from the PDF output.
Alternatively, you could modify these two files so as
to pull the answers inline, immediately following
the questions.
\item Look at the answer immediately rather than investing
a large amount of time in coming up with your own
answer.
This approach is reasonable when a given Quick Quiz's
answer holds the key to a specific problem you are
trying to solve.
This approach is also reasonable if you want a somewhat
deeper understanding of the material, but when you do not
expect to be called upon to generate parallel solutions given
only a blank sheet of paper.
\item If you prefer a more academic and rigorous treatment of
parallel programming,
you might like Herlihy's and Shavit's
textbook~\cite{HerlihyShavit2008Textbook}.
This book starts with an interesting combination
of low-level primitives at high levels of abstraction
from the hardware, and works its way through locking
and simple data structures including lists, queues,
hash tables, and counters, culminating with transactional
memory.
\item If you would like an academic treatment of parallel
programming from a programming-language-pragmatics viewpoint,
you might be interested in the concurrency chapter from Scott's
textbook~\cite{MichaelScott2006Textbook}
on programming languages.
\item If you are interested in an object-oriented patternist
treatment of parallel programming focusing on C++,
you might try Volumes~2 and 4 of Schmidt's POSA
series~\cite{SchmidtStalRohnertBuschmann2000v2Textbook,
BuschmannHenneySchmidt2007v4Textbook}.
Volume 4 in particular has some interesting chapters
applying this work to a warehouse application.
The realism of this example is attested to by
the section entitled ``Partitioning the Big Ball of Mud'',
wherein the problems inherent in parallelism often
take a back seat to the problems inherent in getting
one's head around a real-world application.
\item If your primary focus is scientific and technical computing,
and you prefer a patternist approach,
you might try Mattson et al.'s
textbook~\cite{Mattson2005Textbook}.
It covers Java, C/C++, OpenMP, and MPI.
Its patterns are admirably focused first on design,
then on implementation.
\item If you are interested in POSIX Threads, you might take
a look at David R. Butenhof's book~\cite{Butenhof1997pthreads}.
\item If you are interested in C++, but in a Windows environment,
you might try Herb Sutter's ``Effective Concurrency''
series in
Dr.~Dobb's Journal~\cite{HerbSutter2008EffectiveConcurrency}.
This series does a reasonable job of presenting a
commonsense approach to parallelism.
\item If you want to try out Intel Threading Building Blocks,
then perhaps James Reinders's book~\cite{Reinders2007Textbook}
is what you are looking for.
\item Finally, those preferring to work in Java might be
well-served by Doug Lea's
textbooks~\cite{DougLea1997Textbook,Goetz2007Textbook}.
\end{enumerate}
In contrast, this book meshes real-world machines with real-world
algorithms.
If your sole goal is to find (say) an optimal parallel queue, you
might be better served by one of the above books.
However, if you are interested in principles of parallel design
that allow multiple such queues to operate in parallel, read on!
Coming back to the topic of Quick Quizzes, if you need a deep
understanding of the material, then you might well need to
learn to tolerate the Quick Quizzes.
Don't get me wrong, passively reading the material can be quite
valuable, but gaining full problem-solving capability really
does require that you practice solving problems.
I learned this the hard way during coursework for my late-in-life
Ph.D.
I was studying a familiar topic, and was surprised at how few of
the chapter's exercises I could solve off the top of my head.
Forcing myself to answer the questions greatly increased my
retention of the material.
So with these Quick Quizzes I am not asking you to do anything
that I have not been doing myself!
\QuickQAC{chp:Hardware and its Habits}{Hardware and its Habits}
\QuickQ{}
Why should parallel programmers bother learning low-level
properties of the hardware?
Wouldn't it be easier, better, and more general to remain at
a higher level of abstraction?
\QuickA{}
It might well be easier to ignore the detailed properties of
the hardware, but in most cases it would be quite foolish
to do so.
If you accept that the only purpose of parallelism is to
increase performance, and if you further accept that
performance depends on detailed properties of the hardware,
then it logically follows that parallel programmers are going
to need to know at least a few hardware properties.
This is the case in most engineering disciplines.
Would \emph{you} want to use a bridge designed by an
engineer who did not understand the properties of
the concrete and steel making up that bridge?
If not, why would you expect a parallel programmer to be
able to develop competent parallel software without at least
\emph{some} understanding of the underlying hardware?
\QuickQ{}
What types of machines would allow atomic operations on
multiple data elements?
\QuickA{}
One answer to this question is that it is often possible to
pack multiple elements of data into a single machine word,
which can then be manipulated atomically.
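For example, the following minimal sketch packs a hypothetical pair of
16-bit counters into a single 32-bit word, then updates both atomically
using the gcc \co{__sync_bool_compare_and_swap()} builtin:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <stdint.h>

uint32_t packed; /* counter A in low 16 bits, B in high 16 */

void inc_both(void)
{
  uint32_t old, new;
  uint16_t a, b;

  do {
    old = packed;
    a = (old & 0xffff) + 1;  /* wraps at 16 bits */
    b = (old >> 16) + 1;
    new = ((uint32_t)b << 16) | a;
  } while (!__sync_bool_compare_and_swap(&packed, old, new));
}
\end{verbatim}
\end{minipage}
\vspace{5pt}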
A more trendy answer would be machines supporting transactional
memory~\cite{DBLomet1977SIGSOFT}.
However, such machines are still research curiosities,
although as of early 2012 it appears that commodity systems
supporting limited forms of hardware transactional memory
will be commercially available within a couple of years.
The jury is still out on the applicability of software transactional
memory~\cite{McKenney2007PLOSTM,DonaldEPorter2007TRANSACT,
ChistopherJRossbach2007a,CalinCascaval2008tmtoy,
AleksandarDragovejic2011STMnotToy,AlexanderMatveev2012PessimisticTM}.
\QuickQ{}
So have CPU designers also greatly reduced the overhead of
cache misses?
\QuickA{}
Unfortunately, not so much.
There has been some reduction given constant numbers of CPUs,
but the finite speed of light and the atomic nature of
matter limits their ability to reduce cache-miss overhead
for larger systems.
Section~\ref{sec:cpu:Hardware Free Lunch?}
discusses some possible avenues for possible future progress.
\QuickQ{}
This is a \emph{simplified} sequence of events?
How could it \emph{possibly} be any more complex?
\QuickA{}
This sequence ignored a number of possible complications,
including:
\begin{enumerate}
\item Other CPUs might be concurrently attempting to perform
CAS operations involving this same cacheline.
\item The cacheline might have been replicated read-only in
several CPUs' caches, in which case, it would need to
be flushed from their caches.
\item CPU~7 might have been operating on the cache line when
the request for it arrived, in which case CPU~7 might
need to hold off the request until its own operation
completed.
\item CPU~7 might have ejected the cacheline from its cache
(for example, in order to make room for other data),
so that by the time that the request arrived, the
cacheline was on its way to memory.
\item A correctable error might have occurred in the cacheline,
which would then need to be corrected at some point before
the data was used.
\end{enumerate}
Production-quality cache-coherence mechanisms are extremely
complicated due to these sorts of considerations.
\QuickQ{}
Why is it necessary to flush the cacheline from CPU~7's cache?
\QuickA{}
If the cacheline was not flushed from CPU~7's cache, then
CPUs~0 and 7 might have different values for the same set
of variables in the cacheline.
This sort of incoherence would greatly complicate parallel
software, and so hardware architects have been convinced to
avoid it.
\QuickQ{}
Surely the hardware designers could be persuaded to improve
this situation!
Why have they been content with such abysmal performance
for these single-instruction operations?
\QuickA{}
The hardware designers \emph{have} been working on this
problem, and have consulted with no less a luminary than
the physicist Stephen Hawking.
Hawking's observation was that the hardware designers have
two basic problems~\cite{BryanGardiner2007}:
\begin{enumerate}
\item the finite speed of light, and
\item the atomic nature of matter.
\end{enumerate}
\begin{table}
\centering
\begin{tabular}{l||r|r}
Operation & Cost (ns) & Ratio \\
\hline
\hline
Clock period & 0.4 & 1.0 \\
\hline
``Best-case'' CAS & 12.2 & 33.8 \\
\hline
Best-case lock & 25.6 & 71.2 \\
\hline
Single cache miss & 12.9 & 35.8 \\
\hline
CAS cache miss & 7.0 & 19.4 \\
\hline
Off-Core & & \\
\hline
Single cache miss & 31.2 & 86.6 \\
\hline
CAS cache miss & 31.2 & 86.5 \\
\hline
Off-Socket & & \\
\hline
Single cache miss & 92.4 & 256.7 \\
\hline
CAS cache miss & 95.9 & 266.4 \\
\hline
Comms Fabric & 4,500 & 7,500 \\
\hline
Global Comms & 195,000,000 & 324,000,000 \\
\end{tabular}
\caption{Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
\label{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
\end{table}
The first problem limits raw speed, and the second limits
miniaturization, which in turn limits frequency.
And even this sidesteps the power-consumption issue that
is currently holding production frequencies to well below
10 GHz.
Nevertheless, some progress is being made, as may be seen
by comparing
Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
with
Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}
on
page~\pageref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}.
Integration of hardware threads in a single core and multiple
cores on a die have improved latencies greatly, at least within the
confines of a single core or single die.
There has been some improvement in overall system latency,
but only by about a factor of two.
Unfortunately, neither the speed of light nor the atomic nature
of matter has changed much in the past few years.
Section~\ref{sec:cpu:Hardware Free Lunch?}
looks at what else hardware designers might be
able to do to ease the plight of parallel programmers.
\QuickQ{}
These numbers are insanely large!
How can I possibly get my head around them?
\QuickA{}
Get a roll of toilet paper.
In the USA, each roll will normally have somewhere around 350--500
sheets.
Tear off one sheet to represent a single clock cycle, setting it aside.
Now unroll the rest of the roll.
The resulting pile of toilet paper will likely represent a single
CAS cache miss.
For the more-expensive inter-system communications latencies,
use several rolls (or multiple cases) of toilet paper to represent
the communications latency.
Important safety tip: make sure to account for the needs of
those you live with when appropriating toilet paper!
\QuickQ{}
But individual electrons don't move anywhere near that fast,
even in conductors!!!
The electron drift velocity in a conductor under the low voltages
found in semiconductors is on the order of only one \emph{millimeter}
per second.
What gives???
\QuickA{}
Electron drift velocity tracks the long-term movement of individual
electrons.
It turns out that individual electrons bounce around quite
randomly, so that their instantaneous speed is very high, but
over the long term, they don't move very far.
In this, electrons resemble long-distance commuters, who
might spend most of their time traveling at full highway
speed, but over the long term going nowhere.
These commuters' speed might be 70 miles per hour
(113 kilometers per hour), but their long-term drift velocity
relative to the planet's surface is zero.
When designing circuitry, electrons' instantaneous speed is
often more important than their drift velocity.
When a voltage is applied to a wire, more electrons enter the
wire than leave it, but the electrons entering cause the
electrons already there to move a bit further down the wire,
which causes other electrons to move down, and so on.
The result is that the electric field moves quite quickly down
the wire.
Just as the speed of sound in air is much greater than is
the typical wind speed, the electric field propagates down
the wire at a much higher velocity than the electron drift
velocity.
\QuickQ{}
Given that distributed-systems communication is so horribly
expensive, why does anyone bother with them?
\QuickA{}
There are a number of reasons:
\begin{enumerate}
\item Shared-memory multiprocessor systems have strict size limits.
If you need more than a few thousand CPUs, you have no
choice but to use a distributed system.
\item Extremely large shared-memory systems tend to be
quite expensive and to have even longer cache-miss
latencies than does the small four-CPU system
shown in
Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}.
\item The distributed-systems communications latencies do
not necessarily consume the CPU, which can often allow
computation to proceed in parallel with message transfer.
\item Many important problems are ``embarrassingly parallel'',
so that extremely large quantities of processing may
be enabled by a very small number of messages.
SETI@HOME~\cite{SETIatHOME2008}
is but one example of such an application.
These sorts of applications can make good use of networks
of computers despite extremely long communications
latencies.
\end{enumerate}
It is likely that continued work on parallel applications will
increase the number of embarrassingly parallel applications that
can run well on machines and/or clusters having long communications
latencies.
That said, greatly reduced hardware latencies would be an
extremely welcome development.
\QuickQ{}
OK, if we are going to have to apply distributed-programming
techniques to shared-memory parallel programs, why not just
always use these distributed techniques and dispense with
shared memory?
\QuickA{}
Because it is often the case that only a small fraction of
the program is performance-critical.
Shared-memory parallelism allows us to focus distributed-programming
techniques on that small fraction, allowing simpler shared-memory
techniques to be used on the non-performance-critical bulk of
the program.
\QuickQAC{chp:Tools of the Trade}{Tools of the Trade}
\QuickQ{}
But this silly shell script isn't a \emph{real} parallel program!
Why bother with such trivia???
\QuickA{}
Because you should \emph{never} forget the simple stuff!
Please keep in mind that the title of this book is
``Is Parallel Programming Hard, And, If So, What Can You Do About It?''.
One of the most effective things you can do about it is to
avoid forgetting the simple stuff!
After all, if you choose to do parallel programming the hard
way, you have no one but yourself to blame.
\QuickQ{}
Is there a simpler way to create a parallel shell script?
If so, how? If not, why not?
\QuickA{}
One straightforward approach is the shell pipeline:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
grep $pattern1 | sed -e 's/a/b/' | sort
\end{verbatim}
\end{minipage}
\vspace{5pt}
For a sufficiently large input file,
\co{grep} will pattern-match in parallel with \co{sed}
editing and with the input processing of \co{sort}.
See the file \co{parallel.sh} for a demonstration of
shell-script parallelism and pipelining.
\QuickQ{}
But if script-based parallel programming is so easy, why
bother with anything else?
\QuickA{}
In fact, it is quite likely that a very large fraction of
parallel programs in use today are script-based.
However, script-based parallelism does have its limitations:
\begin{enumerate}
\item Creation of new processes is usually quite heavyweight,
involving the expensive \co{fork()} and \co{exec()}
system calls.
\item Sharing of data, including pipelining, typically involves
expensive file I/O.
\item The reliable synchronization primitives available to
scripts also typically involve expensive file I/O.
\end{enumerate}
These limitations require that script-based parallelism use
coarse-grained parallelism, with each unit of work having
execution time of at least tens of milliseconds, and preferably
much longer.
Those requiring finer-grained parallelism are well advised to
think hard about their problem to see if it can be expressed
in a coarse-grained form.
If not, they should consider using other parallel-programming
environments, such as those discussed in
Section~\ref{sec:toolsoftrade:POSIX Multiprocessing}.
\QuickQ{}
Why does this \co{wait()} primitive need to be so complicated?
Why not just make it work like the shell-script \co{wait} does?
\QuickA{}
Some parallel applications need to take special action when
specific children exit, and therefore need to wait for each
child individually.
In addition, some parallel applications need to detect the
reason that the child died.
As we saw in Figure~\ref{fig:toolsoftrade:Using the wait() Primitive},
it is not hard to build a \co{waitall()} function out of
the \co{wait()} function, but it would be impossible to
do the reverse.
Once the information about a specific child is lost, it is lost.
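For reference, one minimal form of such a \co{waitall()} (not
necessarily identical to the one in the figure, and simply discarding
each child's exit status) might look as follows:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <sys/types.h>
#include <sys/wait.h>

static void waitall(void)
{
  int status;

  /* wait() returns -1 once no children remain. */
  while (wait(&status) > 0)
    continue;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}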
\QuickQ{}
Isn't there a lot more to \co{fork()} and \co{wait()}
than discussed here?
\QuickA{}
Indeed there is, and
it is quite possible that this section will be expanded in
future versions to include messaging features (such as UNIX
pipes, TCP/IP, and shared file I/O) and memory mapping
(such as \co{mmap()} and \co{shmget()}).
In the meantime, there are any number of textbooks that cover
these primitives in great detail,
and the truly motivated can read manpages, existing parallel
applications using these primitives, as well as the
source code of the Linux-kernel implementations themselves.
\QuickQ{}
If the \co{mythread()} function in
Figure~\ref{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory}
can simply return, why bother with \co{pthread_exit()}?
\QuickA{}
In this simple example, there is no reason whatsoever.
However, imagine a more complex example, where \co{mythread()}
invokes other functions, possibly separately compiled.
In such a case, \co{pthread_exit()} allows these other functions
to end the thread's execution without having to pass some sort
of error return all the way back up to \co{mythread()}.
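For example, a hypothetical helper such as the following could
terminate the thread directly, no matter how many call levels
separate it from \co{mythread()}:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <pthread.h>

/* Hypothetical helper, perhaps separately compiled. */
void die_on_error(int err)
{
  if (err != 0)
    pthread_exit(NULL); /* no error-return unwinding */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}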
\QuickQ{}
If the C language makes no guarantees in the presence of a data
race, then why does the Linux kernel have so many data races?
Are you trying to tell me that the Linux kernel is completely
broken???
\QuickA{}
Ah, but the Linux kernel is written in a carefully selected
superset of the C language that includes special gcc
extensions, such as asms, that permit safe execution even
in the presence of data races.
In addition, the Linux kernel does not run on a number of
platforms where data races would be especially problematic.
For an example, consider embedded systems with 32-bit pointers
and 16-bit busses.
On such a system, a data race involving a store to and a load
from a given pointer might well result in the load returning the
low-order 16 bits of the old value of the pointer concatenated
with the high-order 16 bits of the new value of the pointer.
\QuickQ{}
What if I want several threads to hold the same lock at the
same time?
\QuickA{}
The first thing you should do is to ask yourself why you would
want to do such a thing.
If the answer is ``because I have a lot of data that is read
by many threads, and only occasionally updated'', then
POSIX reader-writer locks might be what you are looking for.
These are introduced in
Section~\ref{sec:toolsoftrade:POSIX Reader-Writer Locking}.
Another way to get the effect of multiple threads holding
the same lock is for one thread to acquire the lock, and
then use \co{pthread_create()} to create the other threads.
The question of why this would ever be a good idea is left
to the reader.
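For whatever it might be worth, a minimal sketch of this trick
follows, with the child thread running entirely within the
parent's critical section:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <pthread.h>
#include <stdio.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *child(void *arg)
{
  printf("running while parent holds the lock\n");
  return NULL;
}

int main(void)
{
  pthread_t tid;

  pthread_mutex_lock(&lock);
  pthread_create(&tid, NULL, child, NULL);
  pthread_join(tid, NULL);
  pthread_mutex_unlock(&lock);
  return 0;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}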
\QuickQ{}
Why not simply make the argument to \co{lock_reader()}
on line~5 of
Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}
be a pointer to a \co{pthread_mutex_t}?
\QuickA{}
Because we will need to pass \co{lock_reader()} to
\co{pthread_create()}.
Although we could cast the function when passing it to
\co{pthread_create()}, function casts are quite a bit
uglier and harder to get right than are simple pointer casts.
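One way such a pointer cast might look (a sketch, not necessarily
the exact code in the figure, with \co{lock_a} as an assumed
global lock):

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <pthread.h>
#include <stddef.h>

pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;

/* pthread_create() requires this exact signature,
 * so the lock arrives as a void pointer. */
void *lock_reader(void *arg)
{
  pthread_mutex_t *lp = (pthread_mutex_t *)arg;

  pthread_mutex_lock(lp);
  /* ... read the data protected by *lp ... */
  pthread_mutex_unlock(lp);
  return NULL;
}

/* pthread_create(&tid, NULL, lock_reader, &lock_a); */
\end{verbatim}
\end{minipage}
\vspace{5pt}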
\QuickQ{}
Writing four lines of code for each acquisition and release
of a \co{pthread_mutex_t} sure seems painful!
Isn't there a better way?
\QuickA{}
Indeed!
And for that reason, the \co{pthread_mutex_lock()} and
\co{pthread_mutex_unlock()} primitives are normally wrapped
in functions that do this error checking.
Later on, we will wrap them with the Linux kernel
\co{spin_lock()} and \co{spin_unlock()} APIs.
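For example, such wrappers (shown here with the Linux-kernel-style
names just mentioned, as a sketch rather than the book's exact
definitions) might simply abort on error:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void spin_lock(pthread_mutex_t *lp)
{
  int en = pthread_mutex_lock(lp);

  if (en != 0) {
    fprintf(stderr, "lock: %s\n", strerror(en));
    abort();
  }
}

static void spin_unlock(pthread_mutex_t *lp)
{
  int en = pthread_mutex_unlock(lp);

  if (en != 0) {
    fprintf(stderr, "unlock: %s\n", strerror(en));
    abort();
  }
}
\end{verbatim}
\end{minipage}
\vspace{5pt}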
\QuickQ{}
Is ``x = 0'' the only possible output from the code fragment
shown in
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}?
If so, why?
If not, what other output could appear, and why?
\QuickA{}
No.
The reason that ``x = 0'' was output was that \co{lock_reader()}
acquired the lock first.
Had \co{lock_writer()} instead acquired the lock first, then
the output would have been ``x = 3''.
However, because the code fragment started \co{lock_reader()} first
and because this run was performed on a multiprocessor,
one would normally expect \co{lock_reader()} to acquire the
lock first.
However, there are no guarantees, especially on a busy system.
\QuickQ{}
Using different locks could cause quite a bit of confusion,
what with threads seeing each other's intermediate states.
So should well-written parallel programs restrict themselves
to using a single lock in order to avoid this kind of confusion?
\QuickA{}
Although it is sometimes possible to write a program using a
single global lock that both performs and scales well, such
programs are exceptions to the rule.
You will normally need to use multiple locks to attain good
performance and scalability.
One possible exception to this rule is ``transactional memory'',
which is currently a research topic.
Transactional-memory semantics can be loosely thought of as those
of a single global lock with optimizations permitted and
with the addition of rollback~\cite{HansJBoehm2009HOTPAR}.
\QuickQ{}
In the code shown in
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks},
is \co{lock_reader()} guaranteed to see all the values produced
by \co{lock_writer()}?
Why or why not?
\QuickA{}
No.
On a busy system, \co{lock_reader()} might be preempted
for the entire duration of \co{lock_writer()}'s execution,
in which case it would not see \emph{any} of \co{lock_writer()}'s
intermediate states for \co{x}.
\QuickQ{}
Wait a minute here!!!
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
didn't initialize shared variable \co{x},
so why does it need to be initialized in
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}?
\QuickA{}
See line~3 of
Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}.
Because the code in
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
ran first, it could rely on the compile-time initialization of
\co{x}.
The code in
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}
ran next, so it had to re-initialize \co{x}.
\QuickQ{}
Isn't comparing against single-CPU throughput a bit harsh?
\QuickA{}
Not at all.
In fact, this comparison was, if anything, overly lenient.
A more balanced comparison would be against single-CPU
throughput with the locking primitives commented out.
\QuickQ{}
But 1,000 instructions is not a particularly small size for
a critical section.
What do I do if I need a much smaller critical section, for
example, one containing only a few tens of instructions?
\QuickA{}
If the data being read \emph{never} changes, then you do not
need to hold any locks while accessing it.
If the data changes sufficiently infrequently, you might be
able to checkpoint execution, terminate all threads, change
the data, then restart at the checkpoint.
Another approach is to keep a single exclusive lock per
thread, so that a thread read-acquires the larger aggregate
reader-writer lock by acquiring its own lock, and write-acquires
by acquiring all the per-thread locks~\cite{WilsonCHsieh92a}.
This can work quite well for readers, but causes writers
to incur increasingly large overheads as the number of threads
increases.
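A minimal sketch of this per-thread-lock approach, assuming a
fixed number of threads each identified by a small integer:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <pthread.h>

#define NR_THREADS 4 /* assumed fixed for this sketch */

static pthread_mutex_t rwl[NR_THREADS] = {
  PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
  PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};

/* Readers acquire only their own lock... */
static void read_lock(int me)
{
  pthread_mutex_lock(&rwl[me]);
}

static void read_unlock(int me)
{
  pthread_mutex_unlock(&rwl[me]);
}

/* ...writers acquire every lock, in index order
 * to avoid deadlock. */
static void write_lock(void)
{
  int i;

  for (i = 0; i < NR_THREADS; i++)
    pthread_mutex_lock(&rwl[i]);
}

static void write_unlock(void)
{
  int i;

  for (i = 0; i < NR_THREADS; i++)
    pthread_mutex_unlock(&rwl[i]);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}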
Some other ways of handling very small critical sections are
described in Section~\ref{sec:defer:Read-Copy Update (RCU)}.
\QuickQ{}
In
Figure~\ref{fig:intro:Reader-Writer Lock Scalability},
all of the traces other than the 100M trace deviate gently