Implement the speech-to-text training pipeline #77
Test Results: 308 tests ±0, 304 ✅ ±0, 1m 9s ⏱️ (−43s). Results for commit dc48fbd; comparison against base commit 6854c1d. This pull request removes 3 and adds 2 tests (renamed tests count towards both).
Pull request overview
This PR implements a comprehensive ML pipeline for speech-to-text model training using DVC (Data Version Control) for reproducibility. The pipeline generates synthetic training data from text phrases, applies audio augmentations (delays, background noise, microphone noise), trains a neural speech recognition model using TensorFlow/Keras with CTC loss, and evaluates the model's performance.
Changes:
- DVC pipeline configuration with 11 stages covering data generation, augmentation, training, and evaluation
- Python scripts for intent phrase generation with linguistic variations (pleasantries, hesitations, spelling variants)
- Speech synthesis pipeline using edge-tts to generate audio samples with randomized voice characteristics
- Audio augmentation scripts for adding realistic noise and delays to improve model robustness
- Neural model training with CNN+BiLSTM architecture and CTC loss for sequence-to-sequence learning
- Model evaluation scripts computing Word Error Rate (WER) and generating prediction reports (see the jiwer sketch after this list)
- Documentation describing the ML pipeline design and usage
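
For context, here is a minimal sketch of how WER can be computed with jiwer (listed in ml/scripts/requirements.txt); the transcripts below are made up for illustration, not taken from the pipeline's data:

```python
# Minimal WER sketch using jiwer (a dependency in ml/scripts/requirements.txt).
# The reference and prediction strings are hypothetical examples.
from jiwer import wer

references = ["turn on the living room lights"]
predictions = ["turn on living room lights"]

# wer() returns (substitutions + deletions + insertions) / reference word count
print(f"WER: {wer(references, predictions):.3f}")
```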
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 33 comments.
Summary per file:
| File | Description |
|---|---|
| ml/dvc.yaml | Defines 11-stage DVC pipeline orchestrating all data preparation, training, and evaluation steps (see the stage sketch after this table) |
| ml/dvc.lock | Lock file tracking exact versions and hashes of all pipeline dependencies and outputs |
| ml/_doc_ml.md | Comprehensive documentation of ML pipeline architecture, stages, and developer quickstart guide |
| .dvc/config | DVC remote configuration pointing to S3 bucket for artifact storage |
| .dvc/.gitignore | Git ignore rules for DVC local cache and temporary files |
| .dvcignore | DVC-specific ignore patterns for performance optimization |
| ml/scripts/requirements.txt | Python package dependencies for ML pipeline execution |
| ml/scripts/intent_prediction/01_input_phrases.csv | Base command phrases mapped to canonical labels for intent classification |
| ml/scripts/intent_prediction/01_generate_phrases.py | Generates 10,000 phrase variations with linguistic transformations for robust training |
| ml/scripts/speech_to_text/01a_generate_speech_sample_variations.py | Creates TTS parameter variations (voice, rate) for each phrase |
| ml/scripts/speech_to_text/01_generate_speech_samples.py | Synthesizes audio samples using edge-tts with randomized characteristics |
| ml/scripts/speech_to_text/02a_randomize_delay_variations.py | Generates random prefix/suffix silence durations for each sample |
| ml/scripts/speech_to_text/02_add_delays.py | Applies silence padding to audio files based on randomized delays |
| ml/scripts/speech_to_text/03a_download_background_noise.py | Downloads realistic background noise samples from Freesound |
| ml/scripts/speech_to_text/03_add_background_noise.py | Mixes background noise into audio samples at random volumes |
| ml/scripts/speech_to_text/04_add_microphone_noise.py | Adds synthetic microphone noise to simulate recording conditions |
| ml/scripts/speech_to_text/05_create_set_manifests.py | Splits augmented audio into train/val/test sets and creates manifest CSVs |
| ml/scripts/speech_to_text/06_create_vocab_list.py | Extracts unique vocabulary from training phrases for model output space |
| ml/scripts/speech_to_text/07_compute_spectrograms.py | Computes log-mel spectrograms and tokenizes transcriptions for model input |
| ml/scripts/speech_to_text/08_train_model.py | Trains CNN+BiLSTM model with CTC loss using custom training loop |
| ml/scripts/speech_to_text/09_evaluate_model.py | Evaluates model on validation set, computes WER, and saves predictions |
| ml/scripts/speech_to_text/10_evaluate_test_samples.py | Creates ZIP of successfully recognized test samples for quality inspection |
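
For orientation, a hypothetical dvc.yaml fragment showing the shape of one such stage; the stage name and paths are illustrative, not copied from the PR's actual configuration:

```yaml
# Hypothetical fragment showing DVC stage structure (names/paths are examples).
stages:
  compute_spectrograms:
    cmd: python scripts/speech_to_text/07_compute_spectrograms.py
    deps:
      - scripts/speech_to_text/07_compute_spectrograms.py
      - data/augmented_audio
    outs:
      - data/spectrograms
```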
```python
parser = argparse.ArgumentParser(description="Train speech-to-text model.")
parser.add_argument('--manifest', type=Path, required=True, help='Path to val_manifest.csv')
parser.add_argument('--model', type=Path, required=True, help='Path to model file (speech_to_text_model.keras)')
parser.add_argument('--vocab', type=Path, required=True, help='Path to vocab_list.txt')
parser.add_argument('--spectrogram-dir', type=Path, required=True, help='Directory with spectrogram npy files')
parser.add_argument('--output-dir', type=Path, required=True, help='Directory for output model')
```
The help text for the --output-dir argument says "Directory for output model", but this script doesn't output a model; it outputs evaluation results (predictions and metrics). The help text should read "Directory for evaluation results" or similar, and the parser description has the same copy-paste problem ("Train" instead of "Evaluate").
Suggested change:

```diff
-parser = argparse.ArgumentParser(description="Train speech-to-text model.")
+parser = argparse.ArgumentParser(description="Evaluate speech-to-text model.")
 parser.add_argument('--manifest', type=Path, required=True, help='Path to val_manifest.csv')
 parser.add_argument('--model', type=Path, required=True, help='Path to model file (speech_to_text_model.keras)')
 parser.add_argument('--vocab', type=Path, required=True, help='Path to vocab_list.txt')
 parser.add_argument('--spectrogram-dir', type=Path, required=True, help='Directory with spectrogram npy files')
-parser.add_argument('--output-dir', type=Path, required=True, help='Directory for output model')
+parser.add_argument('--output-dir', type=Path, required=True, help='Directory for evaluation results (predictions and metrics)')
```
```python
    return log_S


def compute_tokens(vocab_list, transcription):
    transcription = re.sub(r"[^a-z0-9\s]", " ", transcription.lower())
```
The compute_tokens function's normalization is inconsistent with vocabulary generation. 06_create_vocab_list.py normalizes with phrase.replace(',', ' '), which only replaces commas, while this function (line 48) uses re.sub(r"[^a-z0-9\s]", " ", ...), which strips all punctuation. Because the two scripts tokenize differently, words can fail vocabulary lookup even when they should be present. Both scripts should use the same tokenization and normalization logic.
Suggested change:

```diff
-    transcription = re.sub(r"[^a-z0-9\s]", " ", transcription.lower())
+    transcription = transcription.lower().replace(',', ' ')
```
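
One way to enforce that consistency is a single shared helper imported by both scripts; the function name and placement below are hypothetical, sketched to show the idea:

```python
import re

def normalize_words(text: str) -> list[str]:
    # Hypothetical shared helper: if both 06_create_vocab_list.py and
    # compute_tokens() use this, vocabulary and tokens cannot drift apart.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return text.split()
```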
```python
# Build the model
with open(paths.vocab, 'r', encoding='utf-8') as vocabfile:
    vocab_list = [line.strip() for line in vocabfile if line.strip()]
num_classes = len(vocab_list) + 1  # +1 for CTC blank token
```
The CTC blank token index is inconsistent across scripts. This script (line 38) sets num_classes = len(vocab_list) + 1, but the evaluation scripts (09 and 10) compute ctc_blank_idx = len(vocab_list) + 1 on line 59. With num_classes = len(vocab_list) + 1 model outputs, the CTC blank is conventionally the last 0-indexed class, i.e. len(vocab_list), not len(vocab_list) + 1. As written, the model is trained with one output layout but decoded with an out-of-range blank index, causing mismatches. Keep num_classes = len(vocab_list) + 1 (to include the blank) and set ctc_blank_idx = len(vocab_list) (the last index).
```python
ctc_blank_idx = len(vocab_list) + 1  # +1 for CTC blank token
print(f"Vocabulary size: {len(vocab_list)}, Number of classes (with CTC blank): {ctc_blank_idx}")
```
The CTC blank token index calculation is incorrect. Here ctc_blank_idx = len(vocab_list) + 1, but it should be ctc_blank_idx = len(vocab_list) to match the 0-indexed output of the model, which has num_classes = len(vocab_list) + 1 dimensions. The blank token is conventionally at the last index, which is len(vocab_list) with 0-based indexing.
Suggested change:

```diff
-ctc_blank_idx = len(vocab_list) + 1  # +1 for CTC blank token
-print(f"Vocabulary size: {len(vocab_list)}, Number of classes (with CTC blank): {ctc_blank_idx}")
+ctc_blank_idx = len(vocab_list)  # CTC blank token is conventionally at the last index
+print(f"Vocabulary size: {len(vocab_list)}, Number of classes (with CTC blank): {len(vocab_list) + 1}")
```
```python
input_layer = Input(shape=(n_mels, time_steps), name='input')
x = layers.Reshape((n_mels, time_steps, 1))(input_layer)
```
The model's input shape is (n_mels, time_steps) = (80, 360), matching the log-mel spectrograms saved by 07_compute_spectrograms.py via librosa's melspectrogram output, and the Reshape to (n_mels, time_steps, 1) on line 42 correctly adds a channel axis for the 2D convolutions. However, the training code loads spectrograms with np.load(spectrogram_file) and feeds them to the model unmodified, so verify that every saved array actually has shape (80, 360).
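
A cheap guard for that, assuming the (80, 360) layout described above; the file path is illustrative:

```python
import numpy as np

N_MELS, TIME_STEPS = 80, 360  # expected log-mel layout from 07_compute_spectrograms.py

spec = np.load("spectrograms/sample_0001.npy")  # hypothetical spectrogram file
assert spec.shape == (N_MELS, TIME_STEPS), (
    f"spectrogram shape {spec.shape} does not match model input {(N_MELS, TIME_STEPS)}"
)
```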
```python
import pandas as pd
import random
import os
import asyncio
```
Import of 'asyncio' is not used.
Suggested change:

```diff
-import asyncio
```
```python
import random
import os
import asyncio
import edge_tts
```
Import of 'edge_tts' is not used.
Suggested change:

```diff
-import edge_tts
```
```python
import os
import asyncio
import edge_tts
from tqdm import tqdm
```
Import of 'tqdm' is not used.
Suggested change:

```diff
-from tqdm import tqdm
```
```diff
@@ -0,0 +1,104 @@
+import argparse
+import csv
```
Import of 'csv' is not used.
Suggested change:

```diff
-import csv
```
ml/scripts/requirements.txt (outdated):
```text
edge-tts
pandas
tqdm
pydub
soundfile
librosa
tensorflow
tensorflow.keras
onnx
tf2onnx
jiwer
```
The ml/scripts/requirements.txt file lists multiple third-party Python packages without version pinning, which means pip install -r scripts/requirements.txt will always pull the latest, mutable versions from PyPI. This creates a supply-chain risk where a compromised or hijacked future release of any of these packages could execute attacker-controlled code in your training environment (with access to DVC credentials and artifacts). To reduce this risk, pin each dependency to a specific, known-good version (or immutable reference) and manage upgrades explicitly via review.
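
A pinned variant might look like the following; the version numbers are illustrative and should be replaced with audited releases (onnx 1.17.0 comes from the follow-up commit below, and tensorflow.keras is omitted since it is not a separate PyPI package):

```text
# Illustrative pins only; audit and update versions deliberately.
edge-tts==6.1.12
pandas==2.2.2
tqdm==4.66.4
pydub==0.25.1
soundfile==0.12.1
librosa==0.10.2
tensorflow==2.16.1
onnx==1.17.0
tf2onnx==1.16.1
jiwer==3.0.4
```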
@copilot address the code review comments in this PR.
…eline (#78)

* Initial plan
* Address code review comments - fix critical issues and clean up code
  Co-authored-by: jodavis <6740581+jodavis@users.noreply.github.com>
* Update onnx to 1.17.0 to fix security vulnerabilities
  Co-authored-by: jodavis <6740581+jodavis@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jodavis <6740581+jodavis@users.noreply.github.com>
Abandoning this PR so we can focus on a phoneme-based implementation.
Use DVC to create a reproducible pipeline to generate speech samples and train a neural ML model for speech recognition, which can be tuned to the individual user and command space for Adaptive Remote.