A Brief Discussion on the Speech Processing API
Before GSoC, I was hardly aware of this area, so I had no idea how or where to start. I settled on CMUSphinx mainly because it is built for offline speech recognition. In the beginning I was just running commands like this:
```
pocketsphinx_continuous -inmic yes -lm characters.lm -dict characters.dic > decode.txt
```
to recognize continuous input from the microphone and write the decoded results to a file for later use. But I couldn't process that text file efficiently, because the decoder ran over the silence intervals too and left many blank lines in the output. My initial approach to adding a VUI to my project was to parse the file contents every 't' seconds or so and use them as input values for my games, but this was a faulty approach: I had no idea which word, if any, would be extracted from the file at any given moment. In effect, my speech-decoding process ran completely independently of my games, and I had to start and stop both processes manually. This approach was doomed to fail, and I had to think of a better way to manage the problem.
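For context, that flawed approach looked roughly like the following reconstruction (not the original script): poll the decoder's output file every 't' seconds and pass along whatever non-blank lines have appeared, with no way of knowing whether they line up with what the game expects at that moment.

```python
import time

# Reconstruction of the naive polling idea: read whatever new lines
# pocketsphinx_continuous has appended to decode.txt every t seconds.
def poll_decoded_words(path="decode.txt", t=2):
    seen = 0
    while True:
        with open(path) as f:
            new_lines = f.readlines()[seen:]
        seen += len(new_lines)
        for line in new_lines:
            word = line.strip()
            if word:          # decoded silence shows up as blank lines
                yield word    # nothing guarantees this matches the game's timing
        time.sleep(t)
```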
The solution was a script that waits until the first word is spoken and ignores the silence intervals in between. The games I was building during the project period needed regular input from the user, so even a slight mistiming of the input could cause significant errors. I decided to pause the game each time it asks for input and resume it only after the speech has been completely decoded into text. The input is live speech from the microphone, which is recorded for processing and split into separate audio files based on the silence intervals.
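The core of this idea can be sketched with PyAudio and the standard-library audioop module (removed in Python 3.13). This is an illustrative reconstruction rather than the project's recorder module; the constants mirror the Recorder defaults described below.

```python
import audioop
import wave

import pyaudio

RATE, CHUNK_SIZE, CHANNELS = 16000, 1024, 1
MIN_VOLUME = 1600   # per-chunk RMS threshold that counts as speech
SILENCE = 3         # seconds of quiet that end one utterance

def record_utterance(path="input.wav"):
    """Wait for the first loud chunk, then record until SILENCE seconds of quiet."""
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=CHANNELS, rate=RATE,
                     input=True, frames_per_buffer=CHUNK_SIZE)
    frames, started, quiet = [], False, 0
    max_quiet = int(SILENCE * RATE / CHUNK_SIZE)   # quiet chunks allowed in a row
    while True:
        data = stream.read(CHUNK_SIZE)
        loud = audioop.rms(data, 2) > MIN_VOLUME   # 2 bytes per 16-bit sample
        if not started:
            if loud:                    # first spoken word: start recording
                started = True
                frames.append(data)
            continue                    # otherwise drop the leading silence
        frames.append(data)
        quiet = 0 if loud else quiet + 1
        if quiet > max_quiet:           # long enough pause: utterance is over
            break
    stream.stop_stream()
    stream.close()
    pa.terminate()
    with wave.open(path, "wb") as wf:   # save for decoding with pocketsphinx
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(2)              # paInt16 samples are 2 bytes wide
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))
    return path
```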
The Speech Processing API mainly consists of the recorder and edit modules.
```
class API.recorder.Recorder(DEFAULT_LM_PATH, DEFAULT_AM_PATH, CHANNELS=1, RATE=16000, CHUNK_SIZE=1024, MIN_VOLUME=1600, OUTPUT_DIR='wav', SILENCE=3, TRIALS=None, MULTI=False, DECODE=False, L_LIB=None, A_LIB=None, TRANSCRIBE=False, OUTPUT_SHELL=None)
```
Here we discuss their definitions and basic usage. The class provides a recording interface that can be configured in several ways; a usage sketch follows the parameter list below.
- DEFAULT_LM_PATH -> This gives the relative path to the directory where the language models for speech decoding are kept.
- DEFAULT_AM_PATH -> This gives the relative path to the directory where the acoustic models for speech decoding are kept.
- CHANNELS -> Should generally be kept mono (1), since the decoder works best on single-channel audio.
- RATE -> The sampling rate used for audio recording and decoding. 16 kHz is preferred because the default CMUSphinx acoustic models are trained on 16 kHz audio.
- CHUNK_SIZE -> The number of frames per chunk that the signal is split into. Each chunk acts as a small buffer that can be kept or discarded. Utility:
  - We process chunks of data instead of a continuous stream because of limited processing power.
  - On embedded development boards like the BeagleBone, chunking keeps the audio stream flowing smoothly and avoids unbounded memory use.
  - We can also analyse each chunk individually and decide what to do with it.
- MIN_VOLUME -> The minimum volume (threshold) a chunk must reach to be recorded; it can be set according to the user's requirements.
- OUTPUT_DIR -> The output directory of the audio files generated.
- SILENCE -> The silence interval (in seconds) expected between successive audio inputs, which is used to segregate the audio into different files.
- TRIALS -> The number of times audio input is required for a specific task. If set to None, it records indefinitely, until a keyboard interrupt (Ctrl+C) is received.
- MULTI -> Determines whether the audio is saved across multiple files. If disabled, each new recording overwrites the previous audio file.
- DECODE -> Specifies whether the audio file should be decoded into text or just recorded.
- L_LIB -> Links to the reference language model, between the options 'commands', 'characters', and 'num' for this project.
- A_LIB -> Links to the reference acoustic model to be used for decoding.
- TRANSCRIBE -> Used to specify whether the user will enter the actual transcription of the audio too, for analysis. Mainly required in the accuracy_checker scripts which in turn help in improving the phonetic dictionary.
- OUTPUT_SHELL -> Specifies the output shell script to be run after the text is decoded from the audio. Though an optional parameter, it is mainly used to pass the decoded text on to some other process. For usage within the project, see the game_launcher scripts and the game code.
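Here is a minimal usage sketch based only on the constructor signature above. The model paths are placeholders, and the listen() call is hypothetical, since this page does not document the method that actually starts recording.

```python
from API.recorder import Recorder

# Placeholder paths; point them at the project's language/acoustic model directories.
rec = Recorder("models/lm", "models/am",
               MIN_VOLUME=1600,      # per-chunk volume threshold
               SILENCE=1,            # 1 s gap between game inputs
               TRIALS=1,             # ask for a single spoken input
               DECODE=True,          # convert the recording into text
               L_LIB='characters')   # expect a character/alphabet input
# rec.listen()  # hypothetical entry point; see the recorder module for the actual call
```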
These implementations were useful in designing the game structures. The interface can also be used to build a conversational UI, mainly for making speech assistants with the PocketBeagle. For example, to work through a series of commands sequentially, we can set MULTI to True so that the audio is split into different files to be processed later, and keep TRIALS at its default value of None so that speech input keeps being taken indefinitely, as in the sketch below. Improvements to the logic are gladly welcome.
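For that conversational-UI case, the same constructor could be configured as follows (again a sketch with placeholder model paths):

```python
from API.recorder import Recorder

assistant = Recorder("models/lm", "models/am",
                     MULTI=True,         # one audio file per spoken command
                     TRIALS=None,        # keep taking input until interrupted
                     SILENCE=3,          # 3 s gap between successive commands
                     DECODE=True,
                     L_LIB='commands')   # use the command language model
```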
- There are three types of input a user can speak: Commands, Characters (alphabets), and Numbers.
- There are different recommended silence times between successive inputs. For launching the games, the recommended silence duration is 3 seconds between successive commands; for the game inputs, it is 1 second.
- The user should speak clearly and, as far as possible, avoid noisy environments. The AGC microphone is very sensitive and often picks up background noise, which distorts the overall input.
- If the volume of an input chunk is greater than the threshold value, it gets recorded into the audio file, signified by '0' in the console logs. If the chunk is not recorded, it is signified by '-' in the console logs.
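The marker behaviour in the last point could be reproduced with a check like this (a sketch, not the project's actual logging code):

```python
import audioop
import sys

def log_chunk(chunk, min_volume=1600):
    """Print '0' for a chunk loud enough to record, '-' for a discarded one."""
    loud = audioop.rms(chunk, 2) > min_volume
    sys.stdout.write('0' if loud else '-')
    sys.stdout.flush()
    return loud
```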