EyalLavi edited this page Jun 12, 2019 · 12 revisions

Why make a framework?

Speech-to-text (STT) has become an important tool for organisations that process large volumes of audio-visual content. As the technology advances, choosing the right STT provider is becoming both more important and more difficult.

  • Speech-to-text can become very expensive at scale: media bodies often need to transcribe millions of hours of audio.
  • The STT market is expanding rapidly, with specialised models and vendors coming onto the market. Vendors compete on accuracy, speed, cost and unique features, which makes evaluations difficult.
  • Benchmarking is expensive: it requires large volumes of manually prepared test data, plus development work to integrate with multiple vendor APIs and to process multiple transcript formats.
  • Different use cases require different metrics for evaluation. For example, for some organisations multi-lingual support is important, while others are interested in the accuracy of word timings.
  • There are no agreed implementations for some of the metrics currently in use.
  • Some of the metrics can be improved. For example, Word Error Rate (WER) does not take into account the type of word replacement or its effect on comprehension. That said, applying metrics consistently is more important than defining a ‘perfect’ metric.
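
To make the point about Word Error Rate concrete, here is a minimal sketch of one common way to compute it: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. This is an illustration only, not the toolkit's implementation; the function name `word_error_rate` and the sample sentences are assumptions made for this example.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: tokenisation and normalisation are assumed
    to have been done by the caller.
    """
    if not reference:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(reference)


reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on a mat".split()
wer = word_error_rate(reference, hypothesis)
print(f"{wer:.3f}")  # 0.167 (one substitution in six reference words)
```

Every substitution costs the same here: replacing "the" with "a" is penalised exactly as much as replacing a name with a different name, which is the limitation described in the last bullet.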

This framework aims to address these difficulties and to make benchmarking easier by sharing resources and developing common techniques and guidelines. It recognises, however, that there is no one-size-fits-all STT: each use case will require its own data sets and metrics.

What do we mean by 'framework'?

Conceptually, the framework is composed of three parts: an open source toolkit (this repo); a set of guidelines and algorithms that contribute to the development of the toolkit; and organisational support and hosting. See below for details of each.

[Figure: framework]

Part 1: Toolkit

At the heart of the framework is this repo: a modular library for benchmarking STT vendors. The toolkit performs these tasks:

  • Prepare and normalise test data
  • Connect to STT vendors and retrieve the transcripts
  • Normalise the different transcript formats
  • Apply metrics consistently to the transcripts
  • Present the results as a matrix of metrics per vendor. The user will use this matrix to determine the best provider for their test data/use case. Here is what this matrix can look like:

| Metric              | Vendor 1 | Vendor 2 | Vendor 3 |
| ------------------- | -------- | -------- | -------- |
| WER                 | 89%      | 85%      | 79%      |
| Processing time     | x0.5     | x0.79    | x1.1     |
| Speakers identified | 1        | 3        | 1        |

It's important to note that the framework is not intended to provide a simple ranking of STT vendors. The user will have to decide, based on the combination of test data and metrics, which vendor best suits their use case.
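
As a rough sketch of the last two toolkit tasks (applying metrics consistently and presenting a matrix of metrics per vendor), the snippet below runs the same set of metric functions over every vendor's transcript. All names here (`build_matrix`, `format_matrix`, the two toy metrics and the sample transcripts) are hypothetical and chosen for illustration; they are not the toolkit's API, and the toy metrics stand in for real ones such as WER.

```python
from typing import Callable, Dict

# Assumed signature for this sketch: a metric maps (reference, hypothesis) to a number.
Metric = Callable[[str, str], float]


def length_ratio(reference: str, hypothesis: str) -> float:
    """Toy metric: hypothesis word count relative to the reference."""
    return len(hypothesis.split()) / len(reference.split())


def missing_words(reference: str, hypothesis: str) -> float:
    """Toy metric: distinct reference words that never appear in the hypothesis."""
    return float(len(set(reference.split()) - set(hypothesis.split())))


def build_matrix(
    reference: str,
    transcripts: Dict[str, str],
    metrics: Dict[str, Metric],
) -> Dict[str, Dict[str, float]]:
    """Apply every metric to every vendor transcript: one row per metric."""
    return {
        metric_name: {
            vendor: metric(reference, text) for vendor, text in transcripts.items()
        }
        for metric_name, metric in metrics.items()
    }


def format_matrix(matrix: Dict[str, Dict[str, float]]) -> str:
    """Render the matrix as tab-separated text with vendors as columns."""
    vendors = list(next(iter(matrix.values())))
    lines = ["\t".join(["Metric", *vendors])]
    for metric_name, row in matrix.items():
        lines.append("\t".join([metric_name, *(f"{row[v]:.2f}" for v in vendors)]))
    return "\n".join(lines)


# Made-up reference and vendor outputs, for illustration only.
reference = "please back up the archive tonight"
transcripts = {
    "Vendor 1": "please back up the archive tonight",
    "Vendor 2": "please back the archive",
}
metrics = {"Length ratio": length_ratio, "Missing words": missing_words}

matrix = build_matrix(reference, transcripts, metrics)
print(format_matrix(matrix))
```

The sketch deliberately stops at the matrix and does not combine the rows into a single score, in line with the point that the framework does not rank vendors.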

Part 2: Guidelines

This part provides the 'theoretical' input to the toolkit. The code will require:

  • A defined format for test data (e.g. how to express speakers)
  • Normalisation rules for input test data (e.g. ‘back up’/‘backup’/‘back-up’)
  • Normalisation rules for output transcripts
  • Consistent calculation of metrics
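
The normalisation rules mentioned above could be expressed along the following lines. This is a sketch under stated assumptions: the rule table `VARIANTS`, the function name `normalise` and the choice of ‘backup’ as the canonical form are all hypothetical, and a real rule set would be defined by the guidelines.

```python
import re

# Hypothetical rule table: each spelling variant maps to one canonical form.
VARIANTS = {
    "back up": "backup",
    "back-up": "backup",
}


def normalise(text: str, variants: dict = VARIANTS) -> str:
    """Lower-case, unify spelling variants, strip punctuation, collapse spaces."""
    text = text.lower()
    # Replace variants before stripping punctuation so hyphenated forms still match.
    for variant, canonical in variants.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    # Drop punctuation other than apostrophes and hyphens inside words.
    text = re.sub(r"[^\w\s'-]", " ", text)
    return " ".join(text.split())


print(normalise("Please Back-up the files."))  # please backup the files
```

Applying the same function to both the reference data and each vendor's transcript means that differences in spelling conventions do not count as recognition errors.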

With time, it is hoped that this set of guidelines will become widely used, promoting interoperability.

Part 3: Hosting and support

In addition to hosting this repo and providing resources through its members, the EBU may provide additional support such as:

  • Providing a hosted instance of the toolkit to members (e.g. web UI)
  • Commissioning test data sets, possibly using material from members (e.g. audio+subtitles)
  • Liaising between broadcasters and vendors
  • Centrally managing vendor deployments, access and licensing for the toolkit
  • Promoting the guidelines and undertaking related standardisation activities

Use cases for benchmarking STT

These are a few examples of use cases and relevant metrics for benchmarking.

| Use case                                    | Main metric(s)                                |
| ------------------------------------------- | --------------------------------------------- |
| Prepared subtitles                          | WER, timing                                   |
| Live subtitles                              | Latency, WER                                  |
| Re-synching of subtitles captured from live | Timing                                        |
| Keyword identification                      | Proper noun recognition                       |
| Archive metadata extraction                 | WER, noise tolerance, dialects                |
| News monitoring                             | Proper noun recognition, voice identification |
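
One possible way to use the use cases listed above is as a lookup that selects the relevant rows from a results matrix and reports which metrics have not been measured yet. The sketch below assumes a matrix stored as a nested dictionary and uses the illustrative figures from the example matrix in Part 1; the names `USE_CASE_METRICS` and `metrics_for` are hypothetical.

```python
from typing import Dict, List, Tuple

# The use cases and their main metrics, as listed above.
USE_CASE_METRICS: Dict[str, List[str]] = {
    "Prepared subtitles": ["WER", "Timing"],
    "Live subtitles": ["Latency", "WER"],
    "Re-synching of subtitles captured from live": ["Timing"],
    "Keyword identification": ["Proper noun recognition"],
    "Archive metadata extraction": ["WER", "Noise tolerance", "Dialects"],
    "News monitoring": ["Proper noun recognition", "Voice identification"],
}


def metrics_for(
    use_case: str, matrix: Dict[str, Dict[str, float]]
) -> Tuple[Dict[str, Dict[str, float]], List[str]]:
    """Return the matrix rows relevant to a use case, plus any metrics not yet measured."""
    wanted = USE_CASE_METRICS[use_case]
    selected = {name: matrix[name] for name in wanted if name in matrix}
    missing = [name for name in wanted if name not in matrix]
    return selected, missing


# Illustrative figures taken from the example matrix in Part 1.
matrix = {
    "WER": {"Vendor 1": 0.89, "Vendor 2": 0.85, "Vendor 3": 0.79},
    "Processing time": {"Vendor 1": 0.5, "Vendor 2": 0.79, "Vendor 3": 1.1},
    "Speakers identified": {"Vendor 1": 1, "Vendor 2": 3, "Vendor 3": 1},
}

selected, missing = metrics_for("Live subtitles", matrix)
print(selected)
print(missing)
```

Reporting the missing metrics makes it visible when a benchmark run does not yet cover everything a use case needs.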