Commit v. 0.3.2
solyarisoftware committed Dec 14, 2020
1 parent 38b0daa, commit 19a04c0
Showing 2 changed files with 127 additions and 56 deletions.
CHANGELOG.md (1 addition, 1 deletion)
# Changelog

## v. 0.3.2

- README documentation updated
- video demo added
README.md (126 additions, 55 deletions)

WeBAD supplies a solution for two specific scenarios:
You want to record the audio blob from when the user starts to talk
to when the user finishes the spoken utterance!

| [![](https://img.youtube.com/vi/aY1eZLPZhDw/0.jpg)](https://www.youtube.com/watch?v=aY1eZLPZhDw&feature=youtu.be "continuous mode speech detection on a mobile phone")|
|:--:|
| Instant gratification video demo: continuous mode speech detection on a mobile phone |


## What's a speech message?

Let's see some possible scenarios:
Currently this is considered the common way to push speech messages on a voice-interfaced system.
Wake word detection, especially if you want your own custom word sequences,
needs specialized training of a neural net and a CPU-intensive run-time engine
that has to run in the browser.

> WeBAD just escapes from the wake word approach. Some solutions are listed in [references](#references).
- (2) Push-to-talk

That's the traditional, reliable way to generate audio messages
(see mobile radios / walkie-talkies).
The user pushes a button, starts to talk, and releases the button when finished talking.
Note that push-to-talk could be implemented in the browser in two ways:
the user presses a key or touches a (button on the) screen to talk.
But that is not a touch-less / keyboard-less solution.

- (2.2) **Hardware push-button push-to-talk**

The user presses a real/hardware push-button that mutes/un-mutes an external mic.
Here is a simplified schematic of the mic setup:

- Normally-closed push-button

PTT push-button short-circuited to ground (default): the output signal is ~0

```
.-----------+-----------. (+)
| | |
.--+--. + .------>
| mic | | jack out male mono (mini jack 3.5mm)
.--+--. | .------>
| + |
.-----------+-----------. ground
^
|
normally-closed PTT push-button
```
- Open (pressed) push-button

When the user wants to talk, they press the PTT push-button.
The output signal becomes >> 0
```
.-----------+-----------. (+)
| | |
.--+--. + .------>
| mic | / jack out male mono (mini jack 3.5mm)
.--+--. / .------>
| + |
.-----------+-----------. ground
^
|
Open PTT push-button
```
- (3) **Continuous listening** (without wake-word detection)

A better voice-interface user experience is maybe a *continuous listening* mode,
where audio is detected in real time,
just talking in front of the PC (or the tablet / mobile phone / handset).
Namely: avoiding any wake-word detection algorithm.
WeBAD focuses on the last two scenarios, (2.2) and (3).
## Which are the possible applications?
Let's focus on these two specific application contexts:

- **Mobile device voice-interface client for operators that can't use the touch-screen**
The target scenario is a situation where the user can't easily touch the screen of a mobile device.
(or a mobile phone) in a pretty quiet environment,
for example in an apartment room or a quiet office, or inside a vehicle.

## WeBAD Event-bus API solution
and generates these javascript events:
| `mute` | audio volume is almost zero, the mic is off |
| `silence` | audio volume is pretty low, the mic is on but there is not speech |
| `signal` | audio volume is high, so probably user is speaking |
| `clipping` | audio volume is too high, clipping. **TODO** |
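Since WeBAD dispatches these as plain DOM events (as the `speechabort` example elsewhere in this README shows), one handler can subscribe to all of them generically. A minimal sketch, assuming only the event names from the table above and no particular event payload:

```javascript
// Sketch: subscribe one handler to all WeBAD audio volume events.
// Event names come from the table above; the assumption here is only
// that WeBAD dispatches them as plain DOM events with no special payload.
const AUDIO_VOLUME_EVENTS = ['mute', 'silence', 'signal', 'clipping']

function subscribeVolumeEvents(target, handler) {
  for (const name of AUDIO_VOLUME_EVENTS) {
    target.addEventListener(name, event => handler(name, event))
  }
}

// In the browser: subscribeVolumeEvents(document, name => console.log('volume:', name))
```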
- MICROPHONE STATUS EVENTS
that triggers events and maintains a current state, with these discrete values:
| `unmute` | The micro is open, or unmuted |
| `silence` | The microphone is open. Volume is almost silence (less than silence_threshold_value): just background noise, without sufficient signal power to probabilistically correspond to speech |
| `signal` | The signal level is pretty high, probabilistically corresponding to speech |
| `clipping` | The signal level is too high (volume is ~= 1). **TODO** |
```
volume
mute 0.0 +----------------------------------------------------------------------
unmutemic mutemic
```
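The thresholds sketched in the diagram above boil down to a simple classifier from a volume sample to a state. A hedged sketch: `silence_threshold_value` is the only name taken from this README, and the numeric defaults here are illustrative assumptions, not WeBAD's actual values.

```javascript
// Sketch: map a normalized volume sample [0..1] to the states above.
// The numeric thresholds are illustrative assumptions, not WeBAD's values.
function volumeToState(volume, thresholds = {}) {
  const {
    muteThreshold = 0.0001,   // assumption: below this the mic is considered off
    silenceThreshold = 0.01,  // assumption: stands in for silence_threshold_value
    clippingThreshold = 0.99  // volume ~= 1 means clipping
  } = thresholds
  if (volume < muteThreshold) return 'mute'
  if (volume < silenceThreshold) return 'silence'
  if (volume < clippingThreshold) return 'signal'
  return 'clipping'
}
```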

## Recording modes
Depending on the available microphone / hardware configuration,
there are different possible ways to record speech:
- Using an external microphone, bound to a push-to-talk hardware push-button
In this scenario, the continuous mode could be substituted by a push-to-talk experience,
where the user has to press a real push-button every time they want to submit a speech message,
releasing the button when they explicitly want to terminate recording.
To accomplish this case we use two different events:
- `prespeechstart` start speech recording
- `speechstop` stop speech recording
### Push-to-talk recording
document.addEventListener('speechabort', event => {
})
```
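Putting the events together, the sketch below drives a recorder from `prespeechstart` / `speechstop` / `speechabort`. The `createRecorder` and `onUtterance` names are hypothetical, injected here only so the outline stays self-contained; this is an assumption-laden sketch, not the demo's actual code.

```javascript
// Sketch: drive a recorder from the WeBAD speech events named in this README.
// createRecorder and onUtterance are injected, hypothetical helpers,
// not part of WeBAD's API.
function wireRecording(target, createRecorder, onUtterance) {
  let recorder = null

  target.addEventListener('prespeechstart', () => {
    recorder = createRecorder()   // e.g. wrap MediaRecorder in the browser
    recorder.start()
  })

  target.addEventListener('speechstop', () => {
    if (recorder) onUtterance(recorder.stop())  // stop() returns the finished utterance here
    recorder = null
  })

  target.addEventListener('speechabort', () => {
    if (recorder) recorder.stop()  // discard: recording was too short or just noise
    recorder = null
  })
}
```

In the browser, `createRecorder` could wrap `new MediaRecorder(stream)`, collecting `dataavailable` chunks and assembling a `Blob` on stop; see demoAudioRecorder.js in this repo for the real implementation.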

#### Preemptive-recording algorithm

The `speechstart` event could seem a good candidate to start speech recording,
as soon as a signal (exceeding a threshold) is detected in `audioDetection()`
The WeBAD algorithm is based upon a set of parameters:
```
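The preemptive-recording idea is to keep the last few audio timeslices in a small ring buffer, so the chunks captured just before speech was detected are not lost. A minimal sketch of that buffering; the class name and buffer length are illustrative assumptions, not WeBAD's internals:

```javascript
// Sketch of the preemptive-recording idea: keep the last few audio timeslices
// in a ring buffer, so the chunks captured just before speech detection
// are still available when recording really starts.
class PreSpeechBuffer {
  constructor(maxChunks = 3) {  // buffer length is an illustrative assumption
    this.maxChunks = maxChunks
    this.chunks = []
  }
  push(chunk) {
    // called for every recorded timeslice, speech or not
    this.chunks.push(chunk)
    if (this.chunks.length > this.maxChunks) this.chunks.shift()
  }
  flush() {
    // called when speech is detected: prepend these chunks to the recording
    const kept = this.chunks
    this.chunks = []
    return kept
  }
}
```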

## Installation

```bash
$ git clone https://github.com/solyarisoftware/webad
```

## Run the demo

On top of the WeBAD javascript library,
this repo supplies a demo web page that shows how to manage events generated by WeBAD.
A very basic web page that:

- shows event changes
- records speech in real time and plays the recorded audio as soon as the recording finishes.


You can run the demo on your localhost, for example using the Firefox browser (suggested choice):

```bash
Average Signal dB : -24

- Firefox 👏👏👏👏

Everything runs smoothly! Tested on:

- Windows 10 personal computer
- Linux Ubuntu 20.04 Desktop
- a recent Android v.10 mobile phone

- Chrome/Brave

`Failed to execute 'stop' on 'MediaRecorder': The MediaRecorder's state is 'inactive'`
A workaround was found (see demoAudioRecorder.js).



## Use WeBAD library in your application


## To do

- Demo web page

  - add input boxes for significant parameters,
    allowing parameters to be modified in real time


  - make a visually decent page

- Add clipping event

- Better explain the parameter tuning issues

- Please Giorgio, remove global vars and transform the ugly "all-see-all" into ES6 JS modules!
- WeBAD just triggers the events listed above. What is currently out of scope of this release:

  - how to use the events to record the audio
  - how to use/process the blob audio messages
    (probably you want to send them to a backend server via socket.io or websockets).
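For that last point, here is a hedged sketch of shipping a recorded blob over a WebSocket-like socket. The socket is passed in, so the same outline fits a browser `WebSocket` or a socket.io client; no actual backend endpoint is implied by this README.

```javascript
// Sketch: ship a recorded audio blob to a backend over a WebSocket-like socket.
// The socket is injected; sendUtterance is a hypothetical helper name.
function sendUtterance(blob, socket) {
  const OPEN = 1  // WebSocket.OPEN
  if (socket.readyState === OPEN) {
    socket.send(blob)  // WebSocket.send accepts a Blob directly
  } else {
    // not connected yet: send once the socket opens
    socket.addEventListener('open', () => socket.send(blob), { once: true })
  }
}
```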

## Discussion / Open points

- Does continuous listening include the wake-word UI?

An interesting advantage of continuous listening is that it "includes" the wake word mechanics.
In a standard wake-word approach, the voicebot is activated with a unique associated wake-word.
Two common examples are:

- *Alexa, do this...*
- *Ok Google, do that...*

But with continuous listening, the computer-side interlocutor
is no longer determined by a single wake word; instead, it's part of the utterance
to be elaborated by the voicebot.
That's smart because WeBAD is now a single interface for multiple *interlocutor-bots*.

Suppose a voice metabot (main bot) made of different component voicebots,
maybe each one dedicated to specific skills.
The user could invoke the different subsystems in a natural way:

- *Computer, please slow down velocity*
- *Alexa, what time is it?*
- *Ok Google, tell me a joke*


## How to contribute
Any contribution is welcome. Maybe you want to:
- open a new discussion on a specific topic by posting [here](https://github.com/solyarisoftware/WeBAD/discussions)
- contact me via [e-mail](mailto:giorgio.robino@gmail.com)

## References

- Hotkey

Article: [Speechly Guidelines for Creating Productive Voice-Enabled Apps](https://www.speechly.com/blog/voice-application-design-guide/)

- Silence detection

[Web Audio API: how can I detect speech and record until silence, with or without a Push-To-Talk button](https://stackoverflow.com/questions/62114251/web-audio-api-how-can-i-detect-speech-and-record-until-silence-with-or-without/62212935#62212935)

- Wake word detection - some solutions

- Porcupine
https://picovoice.ai/blog/offline-voice-ai-in-a-web-browser/

- Howl
https://github.com/castorini/howl
https://arxiv.org/pdf/2008.09606.pdf

- Snowboy
https://github.com/kitt-ai/snowboy

- Raven
https://github.com/rhasspy/rhasspy-wake-raven

- Mycroft Precise
https://github.com/MycroftAI/mycroft-precise



## Credits

- Foundation component: I used the volume-meter Web Audio API script processor
  written by Chris Wilson, available here: https://github.com/cwilso/volume-meter
  👏👏👏👏



## License

