Speech Data Processor (SDP) Toolkit

The Speech Data Processor (SDP) is a toolkit that simplifies the processing of speech datasets. It minimizes the boilerplate code required and makes processing steps easy to share. SDP represents each processing operation as a 'processor' class. A processor takes a path to a NeMo-style data manifest as input (or a path to the raw data directory if you do not yet have a NeMo-style manifest), applies some processing to it, and saves the result as an output manifest file.
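
The manifest-in, manifest-out idea can be sketched in plain Python. This is only an illustration and not SDP's actual base class: `LowercaseText`, its constructor arguments, and the `process` method are hypothetical names. It assumes a NeMo-style manifest stores one JSON object per line, with fields such as `audio_filepath`, `text`, and `duration`.

```python
import json
import os
import tempfile


class LowercaseText:
    """Hypothetical processor: read a manifest, lowercase `text`, write a new manifest."""

    def __init__(self, input_manifest_file, output_manifest_file):
        self.input_manifest_file = input_manifest_file
        self.output_manifest_file = output_manifest_file

    def process(self):
        with open(self.input_manifest_file, encoding="utf-8") as fin, \
             open(self.output_manifest_file, "w", encoding="utf-8") as fout:
            for line in fin:
                if not line.strip():
                    continue  # skip blank lines
                entry = json.loads(line)
                entry["text"] = entry["text"].lower()
                fout.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_manifest(path):
    """Load a JSON-lines manifest into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Build a one-line input manifest in a temporary directory and process it.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in_manifest.json")
dst = os.path.join(workdir, "out_manifest.json")
with open(src, "w", encoding="utf-8") as f:
    f.write(json.dumps({"audio_filepath": "a.wav", "text": "Hello World", "duration": 1.5}) + "\n")

LowercaseText(src, dst).process()
result = read_manifest(dst)
```

Because every step reads one manifest and writes another, steps of this shape can be chained by pointing each input at the previous output.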

Features

Creating Manifests: Generate manifests for your datasets.
Running ASR Inference: Automatically run ASR inference to remove utterances where the reference text differs greatly from ASR predictions.
Text Transformations: Apply text-based transformations to lines in the manifest.
Removing Inaccurate Transcripts: Remove lines from the manifest which may contain inaccurate transcripts.
Custom Processors: Write your own processor classes if the provided ones do not meet your needs.
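
As a rough illustration of removing utterances whose reference text differs greatly from the ASR prediction, the sketch below computes a word error rate (WER) and drops entries above a threshold. It is not SDP's implementation: the function names, the `pred_text` field name, and the 0.5 threshold are assumptions chosen for the example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def drop_high_wer(entries, threshold=0.5):
    """Keep entries whose WER between `text` and `pred_text` is at most `threshold`."""
    return [e for e in entries
            if word_error_rate(e["text"], e["pred_text"]) <= threshold]


entries = [
    {"text": "the cat sat", "pred_text": "the cat sat"},
    {"text": "the cat sat", "pred_text": "a dog ran away"},
]
kept = drop_high_wer(entries)  # only the first entry survives
```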

Installation

SDP officially supports Python 3.10. Other versions might work.

Clone the repository:

   git clone https://github.com/NVIDIA/NeMo-speech-data-processor.git
   cd NeMo-speech-data-processor

Install dependencies:

   pip install -r requirements/main.txt

Optional: If you need to use ASR, NLP parts, or NeMo Text Processing, follow the NeMo installation instructions:
- NeMo Installation

Example:

This example loads LibriSpeech using SDP. The command below uses config.yaml; to change how much data is used:
- To download all available data, replace config.yaml with all.yaml.
- To use the mini dataset, replace config.yaml with mini.yaml.

    python NeMo-speech-data-processor/main.py \
    --config-path="dataset_configs/english/librispeech" \
    --config-name="config.yaml" \
    processors_to_run="0:" \
    workspace_dir="librispeech_data_dir"

Usage

Create a Configuration YAML File:

Here is a simplified example of a config.yaml file:

processors:
  - _target_: sdp.processors.CreateInitialManifestMCV
    output_manifest_file: "${data_split}_initial_manifest.json"
    language_id: es
  - _target_: sdp.processors.ASRInference
    pretrained_model: "stt_es_quartznet15x5"
  - _target_: sdp.processors.SubRegex
    regex_params_list:
      - {"pattern": "!", "repl": "."}
      - {"pattern": "ó", "repl": "o"}
    test_cases:
      - {input: {text: "hey!"}, output: {text: "hey."}}
  - _target_: sdp.processors.DropNonAlphabet
    alphabet: "abcdefghijklmnopqrstuvwxyzáéíñóúüABCDEFGHIJKLMNOPQRSTUVWXYZÁÉÍÑÓÚÜ"
    test_cases:
      - {input: {text: "test Тест ¡"}, output: null}
      - {input: {text: "test"}, output: {text: "test"}}
  - _target_: sdp.processors.KeepOnlySpecifiedFields
    output_manifest_file: "${data_split}_final_manifest.json"
    fields_to_keep:
      - "audio_filepath"
      - "text"
      - "duration"

Run the Processor:

Use the following command to process your dataset:

   python <SDP_ROOT>/main.py \
     --config-path="dataset_configs/<lang>/<dataset>/" \
     --config-name="config.yaml" \
     processors_to_run="all" \
     data_split="train" \
     workspace_dir="<dir_to_store_processed_data>"

To learn more about SDP, have a look at our documentation.

Contributing

We welcome community contributions! Please refer to CONTRIBUTING.md for the process.
