Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arithmetic coding demo #631

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

cyntsh
Copy link
Contributor

@cyntsh cyntsh commented Aug 25, 2021

In this demo, recursive interval subdivision for arithmetic coding is reformulated using fold. Side note: I also found the types very nice to reason with, esp. Interval, which reduces the algorithm to manipulating intervals.

@google-cla google-cla bot added the cla: yes label Aug 25, 2021
First, model the probability of each letter given by the string to be encoded.

def cumProb (ps: n=>Float) : n=>Float =
withState 0.0 \total.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be computed with Accum instead of state.


def getFrequency (str: (Fin l)=>Word8) : Alphabet=>Int =
a: Alphabet => Int = zero
yieldState a \ref. for i.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one can also be computed with a parallel Accum.


'### Demo: Lossless compression on a test string

str' = "abbadcabccdd"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make a longer test? I'm not convinced that there aren't lurking floating-point issues in this implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A longer test here would fail. I could change all instances of Float to Float64 for more precision and pass the longer test, but the code would look bloated. i.e. this line,
top:Interval = (0., 1.)
would become
top:Interval = (FToF64 0.,FToF64 1.)
Is there a way to define a Float to arbitrary precision? I considered using integer arithmetic instead of floating-point arithmetic for better control over precision, but that code is still a WIP.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. Would an even longer test then fail even if you used F64? I get that this is just a demo, but if it only works for short sequences we should put some warnings in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I'll probably make a more serious attempt at integer arithmetic before I commit to this implementation and add warnings.


def getProbability (l: Int) (freq: Alphabet=>Int) : Alphabet=>(Float&Float) =
probs = for i. IToF freq.i / IToF l
cums = cumProb probs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to repeat work every time it's called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The probabilities are cached on line 92:
p = getProbability l $ getFrequency str
So it's only calculated once.

True -> Continue
False ->
(x, w) = rule.(j@_) in
case code >= x && code < (x+w) of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a binary search?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually just linear search on intervals, following after most code implementations I've seen. But maybe it'd scale better to larger alphabets if binary search were implemented instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to use searchSorted from the prelude:
https://github.com/google-research/dex-lang/blob/main/lib/prelude.dx#L1620

the encoded letter.
The decoding process retraces the steps of the encoding process to recover the correct letters.

def encode (str: (Fin l)=>Word8) (rule: Alphabet=>(Interval->Interval)) : Float =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why Fin l and not just in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gets us this error:

Type error:
Expected: ((Fin a) => Word8)
  Actual: (in => Word8)
(Solving for: [a:Int32])

  update = subdivide str rule

But yeah, it does look more succinct with in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Well it's not a big deal either way.

@j-towns
Copy link

j-towns commented Nov 10, 2021

@cyntsh are you familiar with asymmetric numeral systems (ANS)? This is a more recent and nowadays more widely used alternative to arithmetic coding, and on a high level the only difference is that AC is 'queue-like', or first-in-first-out, whereas ANS is 'stack-like', or last-in-first-out.

The power of ANS is roughly equivalent to that of AC, but it is significantly easier to implement (I'm talking here about an 'exact' implementation in terms of integer arithmetic). In case you're interested, I've written a short (< 50 lines), pure functional, pure Python (no imports) ANS implementation which I imagine would not be too difficult to port into Dex: https://github.com/j-towns/ans-notes/blob/master/rans.py.

@cyntsh
Copy link
Contributor Author

cyntsh commented Nov 10, 2021

Thanks for the suggestion, @j-towns ! Your ANS implementation looks incredibly compact - I'll have a read and give it a try sometime this week.

@j-towns
Copy link

j-towns commented Jan 6, 2022

@cyntsh how come you can't use the Word types and bit-wise operations defined in the Prelude? (these ones)

@cyntsh
Copy link
Contributor Author

cyntsh commented Jan 7, 2022

@j-towns good point, that does improve the compression accuracy, but still not quite at the level of your python implementation. I suspect that it has to do with comparators (<, >) when the Word64-type integers get large enough. After writing a Word64 instance for Ord, here’s what I get:

w: Word64 = (one .<<. 63)
:p w
> 0x8000000000000000
:p w > zero
> False

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants