Elucidation of implicit language models from common data compression algorithms.
A codec is an algorithm for compressing and decompressing data, often of a specific modality such as text or video. Codecs consist of an encoder (or *code*), which describes data in a more concise form, and a decoder, which reconstructs the data from its encoding. We'll be concerned mainly with the former.
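For concreteness, here is an encoder/decoder pair in action. This uses Python's `zlib` module purely as an illustration; the choice of codec is mine, not the original's:

```python
import zlib

# zlib's DEFLATE implementation as a concrete codec:
# compress() is the encoder, decompress() is the decoder.
data = b"abracadabra " * 20
encoded = zlib.compress(data)   # a more concise description of the data
decoded = zlib.decompress(encoded)

assert decoded == data          # lossless: decoding inverts encoding
print(f"{len(data)} bytes -> {len(encoded)} bytes")
```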
Let $C$ be a code, and let $L_C(x)$ denote the length, in bits, of the encoding of a sequence $x$ under $C$.

According to the Kraft-McMillan inequality, if $C$ is uniquely decodable, then $\sum_x 2^{-L_C(x)} \leq 1$.

Now, since these terms sum to at most one, $p_C(x) = 2^{-L_C(x)}$ defines a (sub-)probability distribution over sequences: the more concisely $C$ encodes $x$, the more probable $x$ is under $C$'s implicit model.
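As an illustrative sketch (not from the original), we can treat Python's `lzma` compressor as the code $C$ and read implicit probabilities off its output lengths. Note that a real compressor's fixed header overhead makes $\sum_x 2^{-L_C(x)}$ fall well short of 1, so $p_C$ is only a subprobability:

```python
import lzma
import os

def code_length_bits(x: bytes) -> int:
    """L_C(x): length in bits of x's encoding, with LZMA playing the code C."""
    return 8 * len(lzma.compress(x))

def implicit_prob(x: bytes) -> float:
    """p_C(x) = 2^(-L_C(x)). For long inputs this underflows to 0.0, so
    comparisons are better made on L_C itself (i.e. on -log2 of p_C)."""
    return 2.0 ** -code_length_bits(x)

regular = b"ab" * 1000        # highly compressible
irregular = os.urandom(2000)  # incompressible with overwhelming probability

# LZMA describes the regular sequence far more concisely, so the implicit
# model p_C deems it far more probable.
assert code_length_bits(regular) < code_length_bits(irregular)
```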
Let's be a little more precise about what we're attempting to do.
Given some code $C$, we can sample from its implicit language model as follows:

1. At each time step $t$, we compute the distribution $p_C(x_1, \dots, x_t) = 2^{-L_C(x_1, \dots, x_t)}$ over candidate sequences. We then use top-k sampling to redistribute the probability mass amongst the $k$ most likely sequences.
2. We randomly sample from the top-k distribution and return to step 1.
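The two steps above can be sketched as follows. This is an assumed implementation rather than the original's code: `zlib` stands in for the code $C$, the vocabulary is a set of bytes, and at each step only single-symbol extensions of the current prefix are scored. Since general-purpose compressors emit whole bytes, code lengths are coarse and ties between candidates are common:

```python
import heapq
import random
import zlib

def code_length_bits(seq: bytes) -> int:
    """L_C(seq): length in bits of seq's encoding (zlib plays the code C)."""
    return 8 * len(zlib.compress(seq))

def sample_sequence(vocab: bytes, length: int, k: int = 2) -> bytes:
    seq = b""
    for _ in range(length):
        # Step 1: score each one-symbol extension of the current prefix by
        # its implicit probability p_C(x_1..x_t) = 2^(-L_C(x_1..x_t)).
        candidates = [seq + bytes([b]) for b in vocab]
        # Top-k: the k shortest encodings are the k most likely under p_C.
        topk = heapq.nsmallest(k, candidates, key=code_length_bits)
        # Redistribute the mass amongst the top k. Subtracting the minimum
        # code length before exponentiating avoids float underflow.
        lmin = min(code_length_bits(c) for c in topk)
        weights = [2.0 ** (lmin - code_length_bits(c)) for c in topk]
        # Step 2: sample from the renormalized top-k distribution and
        # return to step 1 with the extended sequence.
        seq = random.choices(topk, weights=weights)[0]
    return seq

print(sample_sequence(b"ab", length=16))
```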
Our method successfully elucidated the implicit statistical assumptions in coding schemes such as LZMA and RLE. This form of feedback may aid in designing more effective coding schemes in the future.