Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Demo: Speech processing MFCC demo #603

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
381 changes: 381 additions & 0 deletions examples/alignment.dx
Original file line number Diff line number Diff line change
@@ -0,0 +1,381 @@
import fft
import parser
import plot

' # Speech Processing

' This notebook describes the basic steps behind speech processing.
The goal will be to walk through the process from reading in a wave
file to being ready to train a full speech recognition system.


' ## Wave Files
To begin we need to be able to read in .wav files. To do this
we follow the wave file spec and define a header and the raw
data format.

' [Wave Format Documentation](https://docs.fileformat.com/audio/wav/)


data WavHeader =
AsWavHeader {riff:(Fin 4=>Char) &
size:Int &
type:(Fin 4=>Char) &
chunk:(Fin 4=>Char) &
formatlength:Int &
channels:Int &
samplerate:Int &
bitspersample:Int &
datasize:Int}

data Wav = AsWav header:WavHeader dat:(List Int)

' ### Binary Manipulation
We will need a couple additional binary conversion functions. (These are naive for now.)

z = FToI (pow 2.0 15.0)
def W32ToI (x : Word32): Int =
y:Int = internalCast _ x
select (y <= z) y ((-1)*z + (y- z))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that meant to reinterpret the Word32 as a twos complement encoded integer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! This is super hacky, but I couldn't come up with a better way. Wav file stores int16's in this format.


def W8ToW32 (x : Word8): Word32 = internalCast _ x

def bytesToInt32 (chunk : Fin 4 => Byte) : Int =
[a, b, c, d] = map W8ToW32 chunk
W32ToI ((d .<<. 24) .|. (c .<<. 16) .|. (b .<<. 8) .|. a)


def bytesToInt16 (chunk : Fin 2 => Byte) : Int =
[a, b] = map W8ToW32 chunk
W32ToI ((b .<<. 8) .|. a)


' ### Parsing Types

' We also add a couple additional parser helpers.

def parseGroup (parser: Parser a) : (Parser (m => a)) = MkParser \h. for i. parse h parser
parse4: Parser (Fin 4=>Char) = parseGroup parseAny
parse2: Parser (Fin 2=>Char) = parseGroup parseAny

def parse16 : Parser Int = MkParser \h.
srush marked this conversation as resolved.
Show resolved Hide resolved
bytesToInt16 $ parse h parse2

def pString (x:String) : Parser Unit = MkParser \h.
(AsList n ls) = x
iter \i .
srush marked this conversation as resolved.
Show resolved Hide resolved
case i < n of
True ->
_ = parse h (pChar ls.(i@_))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to use unsafeFromOrdinal here. I guess we should add a !@! operator for that at some point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the benefit? It is faster?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, faster both to compile and to execute. With a downside that if you do end up casting an integer out of range then you can get memory corruption and segfaults.

Continue
False -> Done ()


' ### Wave Parsing and IO

' We define a parser to construct the header and to read in the data.

wavparser : Parser Wav = MkParser \h.
riff = parse h parse4
size = bytesToInt32 $ parse h parse4
type = parse h parse4
chunk = parse h parse4
_ = parse h parse16
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is totally fine, but if you're not interested in an output then you also can just skip the _ = part. But this is perfectly reasonable.

formatlength = bytesToInt32 $ parse h parse4
channels = bytesToInt16 $ parse h parse2
samplerate = bytesToInt32 $ parse h parse4
_ = bytesToInt32 $ parse h parse4
_ = bytesToInt16 $ parse h parse2
bitspersample = bytesToInt16 $ parse h parse2
_ = parse h $ pString (AsList 4 ['d', 'a', 't', 'a'])
datasize = bytesToInt32 $ parse h parse4
x = parse h $ parseMany parse16
header = AsWavHeader {riff=riff, size=size,
type=type, chunk=chunk,
formatlength=formatlength,
channels=channels, samplerate=samplerate,
bitspersample=bitspersample,
datasize=datasize}
srush marked this conversation as resolved.
Show resolved Hide resolved
AsWav header x


' We add a helper function to read in the whole file.

def fullstream (stream : Stream ReadMode) : {IO} String =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please copy this to the prelude? The semantics of fread are quite silly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

yieldState (AsList _ []) \results.
iter \_.
str = fread stream
(AsList n _) = str
results := (get results) <> str
case n == 0 of
True -> Done ()
False -> Continue

x = unsafeIO do runParserPartial (withFile "examples/speech2.wav" ReadMode fullstream) wavparser
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that a public wav file? We should have some way of getting the necessary data if we're going to check in this notebook

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do the same thing that I did for the tutorial. This is from an open licensed dataset https://github.com/Jakobovski/free-spoken-digit-dataset

(AsList len raw_sample_wav) = case x of
Nothing -> mempty
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you want to raise an error here?

(Just (AsWav header dat)) -> dat

Datsize = Fin len
samplerate = 8000.0


' ### The Audio


' We are going to work with an MNist style of audio that has short snippets of
people saying letters.

s = unsafeIO do base64Encode (withFile "examples/speech2.wav" ReadMode fullstream)

:html "<audio controls src='data:audio/wav;base64,"<> s <>"'/>"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is brilliant!



' Here is what this looks like as a wave.

:html showPlot $ xyPlot (for i. IToF (ordinal i)) (for i. IToF raw_sample_wav.i)


' ## MFCC Signal Processing

' Before doing machine learning on the data we need to process it.
We are going to follow a standard pipeline to produce MFCC features
for the data. The steps for this process are outlined in wikipedia.

' https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: make this into a markdown link?


' ### Signal Tools

' First we will define an unsafe window function that will let us easily take convolutions.

def window [Add out] (stride:Int) (ls: m => out) : (n => window => out) =
window_size = size window
for i: n.
case ((ordinal i) * stride + window_size) < (size m) of
True ->
j = ((ordinal i) * stride) @ m
k = ((ordinal i) * stride + window_size) @ m
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I'd recommend unsafeFromOrdinal here. You're already doing the checks, although it might be worth adding an assert $ stride > 0 somewhere at the beginning of this function

x = for l':(j..k).
l = %inject l'
ls.l
unsafeCastTable window x

False -> zero

' Add arbitrary padding based on type.

def pad [Add out] (tab: m => out) : (n => out) =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you can't just get away with using a sum-typed index set for padding? It doesn't seem like you're using this function a whole lot

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's a fun idea. I was thinking of adding that. Something like?

def Padded (left:Int) (a:Type ) (right:Int) : Type = { left_pad : Fin left | content : a  | right_pad: Fin right }  

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, exactly! Modulo the fact that the iteration order of a record is alphabetic, so you will have to add prefixes to make the fields ordered correctly... We'll fix that at some point.

for i.
case ordinal i < size m of
True -> tab.((ordinal i)@_)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend unsafe casts here as well.

False -> zero


' ### Step 0: Preemphasis Smoothing

' As a warmup, we will play around with windowing by running a simple
convolution filter over the data. This preemphasis filter smooths out the signal.

-- Hyperparam
smoothing_coef = 0.97
def smoothing (x:Fin 2=>Float) : Float = x.(1@_) - (smoothing_coef * x.(0@_))

signal : Datsize => Float = map smoothing $ window 1 (for i. IToF raw_sample_wav.i)


' ### Step 1: Movement to Frequency domain

' Next we move from time to frequency domain by applying an FFT.

' Instead of applying on FFT on the whole sequence we break it up into overlapping windows and apply an FFT on each one. The windows in speech are known as frames, they are typically 0.025 seconds long. We use a stride of 0.01 seconds for overlap.

-- Hyperparam
WindowSize = Fin (FToI (0.025 * samplerate))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps use an integer for sample rate and divide by 40 here and by 100 below? You can put the floats in the comments but this would be much more accurate.

stride = (FToI (0.01 * samplerate))


PostWindow = Fin (idiv (size Datsize) stride)

frame_split : PostWindow => WindowSize => Float = window stride signal
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to make window take n=>a and return List a with size containing all valid elements? This would become:

(AsList numWindows frame_split) = window stride signal
PostWindow = Fin numWindows

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's definitely better.

I was thinking I could maybe avoid List's entirely and pad before windowing to keep a known size. But that might be too cute.



' One issue though is that the FFT does not change the size of its input,
so if we have a short sequence we will end up with low frequency space
resolution. We get around this by padding out the time sequence.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I follow the argument you're trying to make here, sorry 😕


NFFT = Fin 512

' Finally we only keep the positive components of the FFT which are the first half
plus one.

Positive = Fin ((idiv 512 2) + 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it be better to use size NFFT here, just to make it easier to adjust parameters?


fft_frames = for i.
prefft: NFFT => Float = pad frame_split.i
ff = fft_real prefft
for j:Positive. (complex_abs ff.((ordinal j)@NFFT)) / (IToF (size NFFT))


:html matshow fft_frames


' ## Step 2 / 3: Construct Scales

' The next problem is that we have the data in frequency-space which is not that useful for the
problem of speech recognition. We would like to move the data to a scale that is more focused on
the range of human communication and our biology.


' ### Mel FilterBank

' To do this we use a nonlinear rescaling known as the Mel Scale.

' Mel scale looks like this.

Hertz = Float
Mel = Float
def mel (f:Hertz) : Mel = 1125.0 * log (1.0 + (f / 700.0))
def imel (f:Mel) : Hertz = 700.0 * (exp (f / 1125.0) - 1.0)

' However it is now so easy to move between frequency scales. To do this we will need
' a bit of Dex machinery.

' ### Linear Scales

' Let us define a scaled range as a linear mapping between ordinals and a range.

LinearScale = {start:Float & end:Float}
data ScaledRange a = AsScaledRange scale:LinearScale
srush marked this conversation as resolved.
Show resolved Hide resolved

' Given a scale, we can easily map from an index to a point, get the step, get the full domain,
and take the binned inverse to get the nearest index.

def step (AsScaledRange {start, end}:ScaledRange a) : Float =
(1.0 / (IToF ((size a) - 1) )) * (end - start)
def scaled (AsScaledRange {start, end}:ScaledRange a) (x:a): Float =
start + ((IToF (ordinal x)) / (IToF ((size a) - 1))) * (end - start)
def domain (scale:ScaledRange a) : a => Float =
for i. scaled scale i
def bin (scale:ScaledRange a) (x:Float) : a =
(AsScaledRange {start, end}) = scale
(FToI (floor ((x - start) / (step scale))))@_

' ## Mapping Between Scales

' We can then define a two linear ranges. One in Hertz (corresponding to the FFT) and one in Mel Space.

MelBins = Fin 28

hscale : ScaledRange Positive = AsScaledRange {start=0.0, end=samplerate / 2.0}
melscale : ScaledRange MelBins = AsScaledRange {start=mel 0.0, end=mel (samplerate / 2.0)}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth mentioning why do you pick samplerate / 2.0 as the end

domain melscale

mel_to_hscale = for i: MelBins.
mel = scaled melscale i
bin hscale (imel mel)


' ## Triangle Filter Bank

' In order to apply this mapping we need a way to handle the non-linear disconnect. To do this
We build a triangle windowing function.

:t fft_frames

def triangle_window (mpt:m) (mlower:m) (mupper:m) (scale: ScaledRange m) : m => Float =
srush marked this conversation as resolved.
Show resolved Hide resolved
pt = IToF (ordinal mpt)
lower = IToF (ordinal mlower)
upper = IToF (ordinal mupper)
for i'.
i = IToF (ordinal i')
if (i >= lower) && (i <= pt)
then (i - lower) / (pt - lower)
else if (i <= upper) && (i >= pt)
then (upper - i) / (upper - pt)
else 0.0

' The 25'th Mel bin is going to take in the mass from this range of hertz's bins.

:html showPlot $ xyPlot (domain hscale) (triangle_window mel_to_hscale.(25@_) mel_to_hscale.(24@_) mel_to_hscale.(26@_) hscale)

' Whereas the 19'th Mel bin is going to take in the mass from this range of hertz's bins.

:html showPlot $ xyPlot (domain hscale) (triangle_window mel_to_hscale.(19@_) mel_to_hscale.(18@_) mel_to_hscale.(20@_) hscale)


' Here is the full matrix.

start = 1@MelBins
end = 27@MelBins
MelBinsIn = Fin 26
H : MelBinsIn => Positive => Float = unsafeCastTable MelBinsIn $ for m' : (start..<end).
m = %inject m'
m1 = ((ordinal m) - 1)@MelBins
m2 = ((ordinal m) + 1)@MelBins
triangle_window mel_to_hscale.m mel_to_hscale.m1 mel_to_hscale.m2 hscale
srush marked this conversation as resolved.
Show resolved Hide resolved


' We can plot these over each other to get the triangles for each Mel bin.

d = for i. plotToDiagram $ xyPlot (domain hscale) H.i
:html renderScaledSVG $ concatDiagrams d

' The final Mel features are constructed applying a matrix multiply.

eps = 0.000001
melled : PostWindow => MelBinsIn => Float = for i. for m. log ((sum for j. fft_frames.i.j * H.m.j) + eps)
:html matshow for i j. melled.j.i

' ## Step 4: Transform the Mel spectrum

' The last step is to take a Discrete cosine transform in feature space. This will allow us to
further compress the signal down for usage. We start with 26 Mel features and after this step
we will end with 13 MFCCs.

' There is a nice property that the discrete cosine transform can be computed by running a
standard FFT on a transformed version of the data. We implement this based on

' [Fast Cosine Transform](https://dsp.stackexchange.com/questions/2807/fast-cosine-transform-via-fft)

AB = {A : Unit | B : Unit}

def dct (input : m=>Float) : m=>Complex =
d = for (k, i, j) : (AB & m & AB).
case j of
{|B=()|} -> case k of
{|A=()|} -> input.i
{|B=()|} -> input.(((size m)-1-(ordinal i))@_)
{|A=()|} -> 0.0
d' = unsafeCastTable (AB & AB & m) $ fft_real d
for i. d'.({|A=()|}, {|A=()|}, i)

' Now we just apply DCT along each Mel window to get our final values.
From this we drop the top DCT coefficient and keep the next 13.

MFCCStart = 1@MelBinsIn
MFCCLen = 13@MelBinsIn
mfcc = for i.
d = dct melled.i
for j': (MFCCStart..MFCCLen).
j = %inject j'
(MkComplex m _) = d.j
m

' Finally there is one last heuristic (lifting) that is applied. This readjusts the
coefficients based on higher-frequencies.

lift_coef = 22.0

mfcc_lifted = for i j.
-- Lifting
n = IToF (ordinal j)
lift = 1.0 + (lift_coef/2.)*sin(pi* n /lift_coef)
mfcc.i.j * lift


:html matshow $ for i j. mfcc_lifted.j.i


' We now have our MFCC features which are ready for training or analysing speech data.


Loading