First, in \ref{sec:error-correcting-codes}, we introduce \glossplural{erasure code}. We then walk through in \ref{sec:erasure-bmt} how they are applied to files in Swarm. In \ref{sec:dispersed-replicas}, we present a construct that enables cross-neighbourhood redundancy for singleton chunks, completing the erasure coding scheme. Finally, in \ref{sec:strategies}, we explore systematic codes that facilitate various retrieval strategies for erasure-coded files while preserving random access capabilities.
\subsection{Error correcting codes \statusgreen}\label{sec:error-correcting-codes}
% \enlargethispage*{\baselineskip}
%
\emph{Error correcting codes} are widely utilised in the context of data storage and transfer to ensure data integrity even in the face of a system fault. Error correction schemes define how to rearrange the original data by adding redundancy to its representation before upload or transmission (\emph{encoding}\,) so that it can correct corrupted data or recover missing content upon retrieval or reception (\emph{decoding}\,). The different schemes are evaluated by quantifying their strength (\emph{tolerance}, in terms of the rate of data corruption and loss) as a function of their cost (\emph{overhead}, in terms of storage and computation).
In the context of computer hardware architecture, synchronising arrays of disks is crucial to provide resilient storage in data centres.
In \emph{erasure coding}, %
%
%\footnote{\cite{buterin2014erasure}}
%
in particular, the problem can be framed as follows: How does one encode the stored data into shards distributed across the disks so that the data remains fully recoverable in the face of an arbitrary probability that any one disk becomes faulty?
Similarly, in the context of Swarm's distributed immutable chunk store, the problem can be reformulated as follows: How does one encode the stored data into chunks distributed across neighbourhoods in the network so that the data remains fully recoverable in the face of an arbitrary probability that any one chunk is not retrievable?%
%
\footnote{%
It is safe to assume that the retrieval of any one chunk will fail with equal and independent probability.}
\gloss{Reed-Solomon coding} (\gloss{RS}) \citep{lubyetal1995CRS,plank2006optimizing,li2013erasure}
is among the oldest and most influential error correcting codes and remains the most widely used and implemented.%
%
\footnote{%
For a thorough comparison of an earlier generation of implementations of RS, see \citet{plank2009performance}.}
%
When applied to data of $m$ fixed-size blocks (message of length $m$), it produces an encoding of $m+k$ \emph{codewords} (blocks of the same size)
in such a way that having any $m$ out of $m+k$ blocks is enough to reconstruct the original data. Conversely, $k$ puts an upper bound on the number of \emph{erasures} allowed (number of blocks unavailable) for full recoverability, i.e., it expresses (the maximum) \emph{loss tolerance}.%
%
\footnote{Error correcting codes that focus on correcting data loss are referred to as \emph{erasure codes}, a typical scheme of choice for distributed storage systems \citep{balaji2018erasure}.}
%
$k$ is also the count of \emph{parities}, quantifying the data blocks added during the encoding on top of the original volume, i.e., it expresses \emph{storage overhead}. While RS is, therefore, optimal for storage (since loss tolerance cannot exceed the storage overhead),
it has high bandwidth demands%
%
\footnote{Both the encoding and the decoding of RS codes take $O(mk)$ time (with $m$ data chunks and $k$ parities). However, we found the computational overhead insignificant compared to the network transfer involved in chunk retrieval.}
%
for local repair processes.%
%
\footnote{Entanglement codes \citep{estrada2018alpha, estrada2019building} require a minimal bandwidth overhead for a local repair, but at the cost of storage overhead that is in multiples of $100\%$.}
%
The decoder needs to retrieve $m$ chunks to recover a particular unavailable chunk.
Hence, ideally, RS is used on files which are supposed to be downloaded in full,%
%
\footnote{Or in fragments large enough to include the data span over which the encoding is defined, such as videos.}
%
but it is inappropriate for use cases needing only local repairs.%
%
\footnote{Use cases requiring random access to small amounts of data (e.g., path lookup) benefit from simple replication to optimise on bandwidth, which is suboptimal in terms of storage \citep{weatherspoon2002erasure}.}
When using RS, it is customary to use \emph{systematic} encoding, which means that the original data forms part of the encoding, i.e., the parities are actually added to it.%
%
%
%\footnote{Our library of choice implementing exactly such a scheme is \url{https://github.com/klauspost/reedsolomon}.}
%
%
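To make the $m\text{-out-of-}(m+k)$ property concrete, the following minimal Go sketch uses the \texttt{github.com/klauspost/reedsolomon} library to encode $m$ data shards into $m+k$ shards systematically and to reconstruct the data after $k$ erasures. The parameters and the dummy payload are illustrative only and do not reflect Swarm's actual settings.
\begin{verbatim}
package main

import (
    "bytes"
    "fmt"
    "log"

    "github.com/klauspost/reedsolomon"
)

func main() {
    const (
        m  = 4  // data shards (up to 112 in Swarm's b = 128 setting)
        k  = 2  // parity shards
        sz = 16 // shard size in bytes (4096 for a Swarm chunk)
    )
    enc, err := reedsolomon.New(m, k)
    if err != nil {
        log.Fatal(err)
    }

    // Allocate m+k equal-sized shards; the first m hold the original data.
    shards := make([][]byte, m+k)
    for i := range shards {
        shards[i] = make([]byte, sz)
        if i < m {
            for j := range shards[i] {
                shards[i][j] = byte(i) // dummy payload
            }
        }
    }
    orig := append([]byte(nil), shards[1]...)

    // Systematic encoding: data shards stay as they are, parities are filled in.
    if err := enc.Encode(shards); err != nil {
        log.Fatal(err)
    }

    // Simulate k erasures: any k of the m+k shards may be lost.
    shards[1], shards[m] = nil, nil

    // Any m surviving shards suffice to reconstruct the missing ones.
    if err := enc.Reconstruct(shards); err != nil {
        log.Fatal(err)
    }
    fmt.Println("recovered:", bytes.Equal(shards[1], orig))
}
\end{verbatim}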
\subsection{Erasure coding in the Swarm hash tree \statusgreen}
\label{sec:erasure-bmt}
Swarm uses the \emph{Swarm hash tree} to represent files. This structure is a Merkle tree \citep{merkle1980protocols}, whose leaves are the consecutive segments of the input data stream. These segments are turned into chunks and are distributed among the Swarm nodes for storage. The consecutive chunk references (either in the form of an address or an address and an encryption key) are written into a chunk on a higher level.
These so-called \emph{packed address chunks} (PACs) constitute the intermediate chunks of the tree.
The branching factor $b$ is chosen so that the references to its children fill up a full chunk.
With a chunk size of 4096 bytes and a reference size of 32 bytes (the hash) for unencrypted content or 64 bytes (hash plus decryption key) for encrypted content, $b$ is 128 for unencrypted, and 64 for encrypted content
(see Figure \ref{fig:Swarm-hash-split}).
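For illustration, assuming full 4\,KB chunks throughout: with $n$ levels of intermediate chunks, a hash tree addresses up to $4\cdot b^n$ kilobytes of data, i.e., for $b=128$ roughly $64\,$MB at $n=2$ and $8\,$GB at $n=3$.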
\begin{figure}[!ht]
\centering
\input{fig/Swarm-hash-split.tex}
\caption[Swarm hash split \statusgreen]{The Swarm tree is the data structure encoding how a document is split into chunks.}
\label{fig:Swarm-hash-split}
\end{figure}
Note that on the right edge of the hash tree, the last chunk of each level may be shorter than 4K: in fact, unless the file is exactly $4\cdot b^n$ kilobytes long, there is always at least one \emph{incomplete chunk}. Importantly, it makes no sense to wrap a single chunk reference in a PAC, so such a chunk is attached to the first level where there are open chunks. Such ``dangling'' chunks will appear if and only if the file has a zero digit in its $b$-ary representation.
During file retrieval, a Swarm client starts from the root hash reference and retrieves the corresponding chunk. Interpreting the metadata as encoding the span of data subsumed under the chunk, it decides that the chunk is a PAC if the span exceeds the maximum chunk size.
In case of standard file download, all the references packed within the PAC are followed, i.e., the referenced chunk data is retrieved.
PACs offer a natural and elegant way to achieve consistent redundancy within the Swarm hash tree.
The input data for an instance of erasure coding is the chunk data of the children, with the equal-sized bins corresponding to the data of the chunks referenced consecutively in the PAC. The idea is that instead of all $b$ packed references pointing to data-carrying children, only $m$ of them do, while the remaining $k=b-m$ reference RS parity chunks (see Figure \ref{fig:Swarm-hash-erasure}).
The \emph{chunker} algorithm that incorporates PAC-scoped RS encoding would work as follows (a code sketch of one encoding scope follows the list):
\begin{enumerate}[noitemsep]
\item Set the input to the actual data level and produce a sequence of chunks from the consecutive 4K segments of the data stream. Choose $m$ and $k$ such that $m+k=b$ is the branching factor (128 for unencrypted, and 64 for encrypted content).
\item Read the input one chunk at a time. Count the chunks by incrementing a counter $i$.
\item Repeat Step 2 until either $i = m$ or there is no more data left.
\item Use the RS scheme on the last $i\leq m$ chunks to produce $k$ parity chunks resulting in a total of $n = i+k \leq b$ chunks.
\item Concatenate the references of all these chunks to result in a packed address chunk (of size $h\cdot n$ of the next level) of the level above. If this is the first chunk on that level, set the input to this level and spawn this same procedure from Step 2.
\item When the input is consumed, signal the end of input to the next level and quit the routine. If there is no next level, record the single chunks as the root chunk and use the reference to refer to the entire file.
\end{enumerate}
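The Go sketch below illustrates one encoding scope of the procedure above (Steps 2--5): up to $m$ child chunks are read, $k$ RS parity chunks are generated over them, and the references of all $m+k$ chunks are concatenated into the payload of the packed address chunk on the level above. It is a simplified illustration under stated assumptions: \texttt{ref} stands in for Swarm's BMT chunk hash using plain SHA-256, all chunks are assumed to be full 4\,KB (a shorter trailing chunk would need padding), and none of the names correspond to the actual client API.
\begin{verbatim}
package main

import (
    "crypto/sha256"
    "fmt"
    "log"

    "github.com/klauspost/reedsolomon"
)

const (
    chunkSize = 4096
    b         = 128 // branching factor for unencrypted content
    m         = 112 // data chunks per scope
    k         = 16  // parity chunks per scope, m + k == b
)

// ref stands in for a Swarm chunk reference; a real client would use the
// BMT hash of the chunk, plain SHA-256 is only a placeholder here.
func ref(chunk []byte) []byte {
    h := sha256.Sum256(chunk)
    return h[:]
}

// encodeScope takes one scope of up to m equal-sized child chunks
// (Steps 2-4 above), computes k RS parity chunks over them and returns
// the PAC payload: the concatenated references of data and parity chunks.
func encodeScope(children [][]byte) ([]byte, error) {
    enc, err := reedsolomon.New(len(children), k)
    if err != nil {
        return nil, err
    }
    shards := make([][]byte, len(children)+k)
    for i := range shards {
        if i < len(children) {
            shards[i] = children[i]
        } else {
            shards[i] = make([]byte, chunkSize) // room for parity
        }
    }
    // Systematic encoding: data chunks are untouched, parities filled in.
    if err := enc.Encode(shards); err != nil {
        return nil, err
    }
    pac := make([]byte, 0, len(shards)*32)
    for _, s := range shards {
        pac = append(pac, ref(s)...) // Step 5: pack all references
    }
    return pac, nil
}

func main() {
    // One full scope of m data chunks (zero-filled for the example).
    children := make([][]byte, m)
    for i := range children {
        children[i] = make([]byte, chunkSize)
    }
    pac, err := encodeScope(children)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("PAC payload: %d references, %d bytes\n", m+k, len(pac))
}
\end{verbatim}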
\begin{figure}[htbp]
\centering
\resizebox{1\textwidth}{!}{
\input{fig/Swarm-hash-erasure.tex}
}
\caption[Swarm hash erasure \statusgreen]{The Swarm tree with extra parity chunks using $m=112$ out of $n=128$ RS encoding. Chunks $P_{0}$ through $P_{15}$ are parity data for chunks $H_0 $ through $H_{111}$ on every level of intermediate chunks.}
\label{fig:Swarm-hash-erasure}
\end{figure}
This pattern repeats itself all the way down the tree. Thus, on each intermediate level, the $k$ parity references $P_0$ through $P_{k-1}$ point to parity data for the chunks referenced by $H_0$ through $H_{m-1}$. Since parity chunks $P_i$ do not have children, the tree structure does not have uniform depth.
\subsection{Incomplete chunks and dispersed replicas}
\label{sec:dispersed-replicas}
If the number of file chunks is not a multiple of $m$, it is not possible to proceed with the last batch in the same way as the others. We propose that we encode the remaining chunks with an erasure code that guarantees at least the same level of security as the others.%
%
\footnote{Note that this is not as simple as choosing the same redundancy. For example, a $50\text{-out-of-}100$ encoding is much more secure against loss than a $1\text{-out-of-}2$ encoding, even though the redundancy is $100\%$ in both cases.}
%
Overcompensating, we still require the same number of parity chunks even when there are fewer than $m$ data chunks.
This leaves us with only one corner case: it is not possible to use our $m\text{-out-of-}n$ scheme on a single chunk ($m=1$), because it would amount to $k+1$ copies of the same chunk. The problem is that copies of the same chunk all have the same hash and are therefore automatically deduplicated. Whenever a single chunk is left over ($m=1$), i.e., the root chunk itself, we need to replicate the chunk in a way that (1) ideally, the replicas are dispersed in the address space in a balanced way, yet (2) their addresses can be known by retrievers who only know the original chunk's address.
Our solution uses Swarm's special construct, the \emph{single owner chunk} (SOC; see~\ref{sec:single-owner-chunks}). Replicas of the root chunk are created by making the chunk data the payload of a number of SOCs. The addresses of these SOCs must be derivable from the original root hash following a deterministic convention shared by uploaders and downloaders.
The address of a SOC is the hash of its ID and the Ethereum address of its owner. In order to create valid SOCs, uploaders need to sign them with the owner's identity; therefore, the owner of the replica SOCs must be an identity agreed upon by convention, whose private key is publicly revealed.
%
\footnote{This has the added benefit that third parties can also upload replicas of any chunk.}
The other component of the address, the SOC ID, must satisfy two criteria: (1) it must match the payload's content address in its first 31 bytes, and (2) it must provide the entropy needed to mine the overall chunk into a sufficient number of distinct neighbourhoods. Criterion (1) is added as a validation rule for the special case of replica SOCs, while (2) ensures that replicas can be found uniformly dispersed within the address space.
This construct is called a \emph{dispersed replica} and is produced as follows (a code sketch of the loop follows the list).
Let $c$ be the content-addressed chunk we need to replicate and $n$ the number of bits of entropy available for finding the nonces that generate $2^k$ perfectly balanced replicas; initialise a chunk array $\rho$ of length $2^k$ and start with the $n$-bit integer $i=0$ and replica counter $C=0$.
\begin{enumerate}[noitemsep]
\item Create the SOC ID by taking $\mathit{addr}(c)$ and changing the last byte (at index position 31) to $i$.
\item Calculate the the SOC address by concatenating ID $id$ and owner $o$%
%
\footnote{The single-owner chunk representing the dispersed replicas must be signed by the arbitrary private key $\texttt{0x010000000...0000000}$. The corresponding ethereum address is
\texttt{0xdc5b20847f43d67928f49cd4f85d696b5a7617b5}.}
%
and hash the result using the Keccak256 base hash $a_i:=H(id\concat o)$, and record $c_i=\mathit{SOC}(id,o,c)$.
\item Calculate the bin this hash belongs to by taking the $k$-bit prefix as big-endian binary number $j$ between $0\leq j<2^k$.
\item If $\rho\idx{j}$ is unassigned, then let $\rho\idx{j}:=c_i$ and increment $C$.
\item If $C=2^k$, then quit.
\item Increment $i$ by one, if $i=2^n$, then quit.
\item Repeat from Step 1.
\end{enumerate}
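The Go sketch below mirrors the loop above under simplifying assumptions: the SOC address is taken to be the Keccak256 hash of the ID concatenated with the owner address (Step 2), signing and payload validation are omitted, the available entropy is restricted to the single tweakable ID byte (so $n=8$), and $k\leq 8$ so that the bin index fits in the first address byte; none of the names correspond to the actual client API.
\begin{verbatim}
package main

import (
    "encoding/hex"
    "fmt"

    "golang.org/x/crypto/sha3"
)

// Owner address corresponding to the conventional private key 0x0100...00
// quoted in the footnote above.
var owner, _ = hex.DecodeString("dc5b20847f43d67928f49cd4f85d696b5a7617b5")

// socAddress computes a single owner chunk address as Keccak256(id || owner).
func socAddress(id, owner []byte) []byte {
    h := sha3.NewLegacyKeccak256()
    h.Write(id)
    h.Write(owner)
    return h.Sum(nil)
}

// mineReplicas varies the last byte of the root address to derive SOC IDs
// until each of the 2^k bins (given by the k-bit address prefix, k <= 8)
// holds one replica, or the 2^n = 256 candidate IDs are exhausted.
func mineReplicas(rootAddr []byte, k uint) map[int][]byte {
    replicas := make(map[int][]byte) // bin j -> SOC ID of its replica
    for i := 0; i < 256 && len(replicas) < 1<<k; i++ {
        id := append([]byte(nil), rootAddr...)
        id[31] = byte(i)                     // Step 1: tweak the last byte
        addr := socAddress(id, owner)        // Step 2: SOC address
        j := int(addr[0] >> (8 - k))         // Step 3: k-bit prefix as bin
        if _, taken := replicas[j]; !taken { // Step 4: first hit per bin
            replicas[j] = id
        }
    }
    return replicas
}

func main() {
    root := make([]byte, 32) // placeholder root chunk address
    bins := mineReplicas(root, 2)
    fmt.Printf("replicas mined for %d of %d bins\n", len(bins), 1<<2)
}
\end{verbatim}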
With this solution, we are able to provide an arbitrary level of redundancy for the storage of data of any length.
%
%
\footnote{Note that if $n$ is small, then generating all $2^k$ balanced replicas may not be achievable, and if $n<k$, this is certainly not possible.
In general, given $n$ and $k$, the probability that a given set of $m$ of the $2^k$ replica slots all remain unfilled after $2^n$ attempts is $(1 - m/2^k)^{2^n}$.}
Then, depending on the strategy, the downloader can choose which address to retrieve the chunk from. The obvious choice is the replica closest to the requesting node's overlay address.
\subsection{Prefetching strategies for retrieval}
\label{sec:strategies}
When downloading, systematic per-level erasure codes allow for different \emph{prefetching strategies} (the \textsc{RACE} strategy is sketched in code after the list):
\begin{labelledlist}
\item[\textsc{NONE} = \emph{direct with no recovery; frugal}] No prefetching takes place and RS parity chunks are ignored if present; retrieval involves only the original chunks, with no recovery.
\item[\textsc{DATA} = \emph{prefetching data but no recovery; cheap}] Data chunks are prefetched, but RS parity chunks are ignored if present and no recovery takes place.
\item[\textsc{PROX} = \emph{distance-based selection; cheap}] For each intermediate chunk, first retrieve the $m$ chunks that are expected to be the fastest to download (e.g., the $m$ closest to the node).
\item[\textsc{RACE} = \emph{latency optimised; expensive}] Initiate requests for all chunks within the scope (at most $m+k$) and wait only for the first $m$ chunks to be delivered in order to proceed. This is equivalent to saying that the $k$ slowest chunk retrievals can be ignored; therefore, this strategy is optimal for latency, at the expense of cost.
\end{labelledlist}
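As an illustration of the \textsc{RACE} strategy, the Go sketch below issues all retrievals of a scope concurrently and returns as soon as any $m$ chunks have arrived; \texttt{fetch} is a placeholder for an actual chunk retrieval call and the surrounding names are illustrative, not a real API.
\begin{verbatim}
// Package prefetch sketches the RACE prefetching strategy.
package prefetch

import (
    "context"
    "fmt"
)

// race requests every chunk of a scope (at most m+k references) in parallel
// and returns once any m of them have been delivered; the k slowest
// retrievals are cancelled and, if needed, reconstructed via RS decoding.
func race(ctx context.Context, refs [][]byte, m int,
    fetch func(context.Context, []byte) ([]byte, error)) ([][]byte, error) {

    type result struct {
        idx  int
        data []byte
        err  error
    }
    ctx, cancel := context.WithCancel(ctx)
    defer cancel() // abandon outstanding retrievals once m have arrived

    results := make(chan result, len(refs))
    for i, ref := range refs {
        go func(i int, ref []byte) {
            data, err := fetch(ctx, ref)
            results <- result{i, data, err}
        }(i, ref)
    }

    got := make([][]byte, len(refs))
    ok, failed := 0, 0
    for ok < m {
        r := <-results
        if r.err != nil {
            if failed++; failed > len(refs)-m {
                return nil, fmt.Errorf("too many erasures: %d", failed)
            }
            continue
        }
        got[r.idx] = r.data
        ok++
    }
    // Missing entries (nil) can be recovered with the RS decoder.
    return got, nil
}
\end{verbatim}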
All in all, strategies using recovery can effectively overcome the occasional unavailability of chunks, be it due to faults such as network contention, connectivity gaps in the Kademlia table, node churn, overpriced neighbourhoods, or even malicious attacks targeting a specific neighbourhood.
Similarly, given a typical model of network latencies for chunk retrieval, erasure codes in \textsc{RACE} mode can guarantee an upper limit on retrieval latencies.%
%
\footnote{For instance, in the temporally sensitive case of real-time video streaming, for any quality setting (bitrate and FPS), buffering times can be guaranteed if the batch of chunks representing a time unit of media is encoded using its own scope(s) of erasure coding.}
%