Ever since there were eight computers and a network, programmers have wished that eight computers could solve problems twice as fast as four computers. The problem comes when the computer over here needs data from the one over there, or that one needs data from this one, and so forth.
In the example of the Elves, the data was organized in three different ways at different points:
- In mailbags corresponding to the post office and date each letter was mailed.
- In pygmy-elephant batches according to toy type, sorted by toy detail.
- In sacks of gifts for each household, ordered by Santa’s delivery path.
The difficulty the elves faced wasn’t in processing the records; Workforms and Gift Making are what they were born to do. Their problem was, first, that toy requests were organized by mail route rather than by toy type; and second, that the gifts for each household were organized by toy type rather than by geography. In both cases, the chimps’ job was to transform the data and then label each record with its key locality, the attribute that decides which records must end up in the same place to be processed together: toy type and toy detail in the first case, delivery path and household in the second.
This book is fundamentally about just that thing: helping you identify the key locality of a data problem. For folks who learn by reading, what we mean by "key locality" is starting to make sense, and so I’ll give you some brief examples. For folks who learn by doing, read the examples and then work through the short code sketches that follow them.
- word count: You can count the occurrence of every word in a large body of text by (a) grouping all occurrences of each word together and (b) counting each group. The words start out organized by their position within the text; the key locality is the word itself.
- total sort: To sort a large number of names, group each name by its first few letters (e.g. "Aaron" goes to "aa", "Zoe" to "zo"), making each group small enough to sort efficiently in memory. Reading the output of each group in the order of its label produces the whole dataset in order. The key locality is the name’s prefix.
- network graph statistics: Counting the average number of Twitter messages per day in each user’s timeline requires two steps. First, take every 'A follows B' link and get it into the same locality as the user record for its 'B', so the link picks up B’s messages per day. Next, group those enriched links by 'A' and sum their messages per day. The key locality is first the followed user, then the follower.
- correlation: To correlate stock prices with the pageview counts of each corporation’s Wikipedia pages, bring the stock prices and pageview counts for each stock symbol together, then sort them together by time. The key locality is the stock symbol.
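To give the learn-by-doing folks something to chew on, here is a minimal sketch of the word count example in plain Python. It only stands in for what a cluster would do at scale: re-label each word occurrence by its key locality (the word), gather each group, and count it. The function name and sample sentence are invented for illustration.

```python
from collections import defaultdict

def word_count(text):
    """Count every word by grouping occurrences on their key locality: the word itself."""
    groups = defaultdict(list)
    for position, word in enumerate(text.lower().split()):
        # Records arrive organized by position in the text; re-label each one by its word.
        groups[word].append(position)
    # Every occurrence of a word is now in one place, so counting a group is trivial.
    return {word: len(positions) for word, positions in groups.items()}

print(word_count("the quick brown fox jumps over the lazy dog and the cat"))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 1, ...}
```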
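The total sort example works the same way, sketched below under the same caveat: a toy, single-machine stand-in, with invented names. Each name is labelled by its two-letter prefix, each small partition is sorted in memory, and reading the partitions back in label order yields the full sorted list.

```python
from collections import defaultdict

def total_sort(names, prefix_len=2):
    """Sort many names by partitioning on a short prefix, then sorting each partition."""
    partitions = defaultdict(list)
    for name in names:
        # Label each record by its key locality: the first few letters of the name.
        partitions[name[:prefix_len].lower()].append(name)
    result = []
    for prefix in sorted(partitions):              # walk the partitions in label order...
        result.extend(sorted(partitions[prefix]))  # ...sorting each small partition in memory
    return result

print(total_sort(["Zoe", "Aaron", "Abigail", "Zachary", "Amy"]))
# ['Aaron', 'Abigail', 'Amy', 'Zachary', 'Zoe']
```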
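The Twitter timeline example needs two locality shifts, one per step. The sketch below uses a handful of invented users and follow links: first every 'A follows B' link visits the locality of B’s user record to pick up B’s messages per day, then the enriched links are grouped by A and summed.

```python
from collections import defaultdict

# Invented sample data: each user record carries that user's own messages per day.
users   = {"alice": 4.0, "bob": 12.5, "carol": 0.5}                  # user -> messages/day
follows = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]   # (A, B) means A follows B

def timeline_message_rates(users, follows):
    """Messages per day arriving in each user's timeline, via two locality shifts."""
    # Step 1: key locality is B -- attach B's rate to every 'A follows B' link.
    enriched = [(a, users[b]) for a, b in follows]
    # Step 2: key locality is A -- group the enriched links by A and sum their rates.
    totals = defaultdict(float)
    for a, rate in enriched:
        totals[a] += rate
    return dict(totals)

print(timeline_message_rates(users, follows))
# {'alice': 13.0, 'bob': 0.5}
```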
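Finally, the correlation example is a matter of bringing two different kinds of records to the same locality. In the sketch below (invented prices and pageview counts, for illustration only), both record types are labelled by the stock symbol, and each symbol’s group is sorted by date so the two series line up in time.

```python
from collections import defaultdict

# Invented sample records: (symbol, date, closing price) and (symbol, date, pageviews).
prices    = [("AAPL", "2013-06-03", 450.72), ("IBM", "2013-06-03", 208.02)]
pageviews = [("AAPL", "2013-06-03", 91834),  ("IBM", "2013-06-03", 27415)]

def correlate_by_symbol(prices, pageviews):
    """Bring every record for a symbol to one locality, then order each group by date."""
    groups = defaultdict(list)
    for symbol, date, price in prices:
        groups[symbol].append((date, "price", price))
    for symbol, date, views in pageviews:
        groups[symbol].append((date, "pageviews", views))
    # Within each symbol's group, sort by date so prices and pageviews interleave in time.
    return {symbol: sorted(records) for symbol, records in groups.items()}

print(correlate_by_symbol(prices, pageviews))
```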
You can try to make it efficient for any computer to talk to any other computer. But it requires top-of-the-line hardware, and clever engineers to build it, and a high priesthood to maintain it, and this attracts project managers, which means meetings, and soon everything is quite expensive, so expensive that only nation-states and huge institutions can afford it. This of course means you can only use these expensive supercomputers for Big Important Problems, so unless you take drastic action, like joining the NSA or going to grad school, you don’t get to play on them.
Instead of being clever, be simple.
Map/Reduce proposes this fair bargain. You must agree to write only one type of program, one that’s as simple and constrained as a haiku.
data flutters by
elephants make sturdy piles
insight shuffles forth
If you do so, Hadoop will provide near-linear scaling on massive numbers of machines, a framework that hides and optimizes the complexity of distributed computing, a rich ecosystem of tools, and one of the most adorable open-source project mascots to date.
More prosaically,
- label: turn each input record into any number of labelled records.
- group/sort: Hadoop groups those records uniquely under each label, in a sorted order.
- reduce: for each group, process its records in order; emit anything you want.
The trick lies in the 'group/sort' step: assigning the same label to two records in the 'label' step ensures that they will become local in the reduce step.
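As a rough picture of how those three steps fit together, here is a toy, single-machine stand-in written in Python. It is only a sketch of the shape of a Map/Reduce job, not Hadoop’s actual API: the label function plays the mapper, the dictionary plus sort plays Hadoop’s group/sort machinery, and the reduce function sees each group’s records together and in order.

```python
from collections import defaultdict

def map_reduce(records, label, reduce_group):
    """A toy, single-machine stand-in for the three steps of a Map/Reduce job."""
    # label: turn each input record into any number of labelled records.
    labelled = []
    for record in records:
        labelled.extend(label(record))
    # group/sort: gather the records uniquely under each label, in sorted order.
    groups = defaultdict(list)
    for key, value in labelled:
        groups[key].append(value)
    # reduce: for each group, process its records in order; emit anything you want.
    return [reduce_group(key, sorted(values)) for key, values in sorted(groups.items())]

# Word count once more, this time phrased explicitly as label / group-sort / reduce.
lines = ["data flutters by", "elephants make sturdy piles", "insight shuffles forth"]
print(map_reduce(
    lines,
    label=lambda line: [(word, 1) for word in line.split()],
    reduce_group=lambda word, ones: (word, sum(ones)),
))
# [('by', 1), ('data', 1), ('elephants', 1), ('flutters', 1), ...]
```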
Let’s join back up with the Chimpanzee and Elephant Shipping Company and see this in practice.