
Implement the speech-to-text training pipeline #77

Closed
jodavis wants to merge 2 commits into main from dev/jodavis/ADR-50-speech-to-text

Conversation

jodavis (Owner) commented Feb 3, 2026

Use DVC to create a reproducible pipeline to generate speech samples and train a neural ML model for speech recognition, which can be tuned to the individual user and command space for Adaptive Remote.

github-actions bot commented Feb 3, 2026

Test Results

308 tests  ±0   304 ✅ ±0   1m 9s ⏱️ -43s
  5 suites ±0     4 💤 ±0 
  5 files   ±0     0 ❌ ±0 

Results for commit dc48fbd. ± Comparison against base commit 6854c1d.

This pull request removes 3 and adds 2 tests. Note that renamed tests count towards both.

AdaptiveRemote.Services.ProgrammaticSettings.PersistSettingsTests ‑ PersistSettings_Set_ValidatesKeyNameAsync (Hello,False)
AdaptiveRemote.Services.ProgrammaticSettings.PersistSettingsTests ‑ PersistSettings_Set_ValidatesValueAsync (Invalid,False)
AdaptiveRemote.Services.ProgrammaticSettings.PersistSettingsTests ‑ PersistSettings_Set_ValidatesKeyNameAsync (Hello,False)
AdaptiveRemote.Services.ProgrammaticSettings.PersistSettingsTests ‑ PersistSettings_Set_ValidatesValueAsync (Invalid,False)

Copilot AI (Contributor) left a comment

Pull request overview

This PR implements a comprehensive ML pipeline for speech-to-text model training using DVC (Data Version Control) for reproducibility. The pipeline generates synthetic training data from text phrases, applies audio augmentations (delays, background noise, microphone noise), trains a neural speech recognition model using TensorFlow/Keras with CTC loss, and evaluates the model's performance.

Changes:

  • DVC pipeline configuration with 11 stages covering data generation, augmentation, training, and evaluation
  • Python scripts for intent phrase generation with linguistic variations (pleasantries, hesitations, spelling variants)
  • Speech synthesis pipeline using edge-tts to generate audio samples with randomized voice characteristics
  • Audio augmentation scripts for adding realistic noise and delays to improve model robustness
  • Neural model training with CNN+BiLSTM architecture and CTC loss for sequence-to-sequence learning
  • Model evaluation scripts computing Word Error Rate (WER) and generating prediction reports
  • Documentation describing the ML pipeline design and usage
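
The phrase-variation step described above (pleasantries, hesitations, spelling variants) can be sketched roughly as follows. The word lists and probabilities here are illustrative assumptions, not the actual values in 01_generate_phrases.py:

```python
import random

# Assumed word lists for illustration only; the real script's lists differ.
PLEASANTRIES = ["please", "could you", "hey"]
HESITATIONS = ["um", "uh"]

def vary_phrase(phrase: str, rng: random.Random) -> str:
    """Produce one linguistic variation of a base command phrase."""
    words = phrase.split()
    if rng.random() < 0.5:
        # Prepend a pleasantry half the time.
        words.insert(0, rng.choice(PLEASANTRIES))
    if rng.random() < 0.3:
        # Inject a hesitation at a random position.
        words.insert(rng.randrange(len(words) + 1), rng.choice(HESITATIONS))
    return " ".join(words)

rng = random.Random(0)  # fixed seed for reproducibility, mirroring the DVC goal
variations = {vary_phrase("turn volume up", rng) for _ in range(20)}
```

Because variations only insert words, every output still contains the base command words, which keeps the canonical intent label valid.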

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 33 comments.

File — Description
ml/dvc.yaml Defines 11-stage DVC pipeline orchestrating all data preparation, training, and evaluation steps
ml/dvc.lock Lock file tracking exact versions and hashes of all pipeline dependencies and outputs
ml/_doc_ml.md Comprehensive documentation of ML pipeline architecture, stages, and developer quickstart guide
.dvc/config DVC remote configuration pointing to S3 bucket for artifact storage
.dvc/.gitignore Git ignore rules for DVC local cache and temporary files
.dvcignore DVC-specific ignore patterns for performance optimization
ml/scripts/requirements.txt Python package dependencies for ML pipeline execution
ml/scripts/intent_prediction/01_input_phrases.csv Base command phrases mapped to canonical labels for intent classification
ml/scripts/intent_prediction/01_generate_phrases.py Generates 10,000 phrase variations with linguistic transformations for robust training
ml/scripts/speech_to_text/01a_generate_speech_sample_variations.py Creates TTS parameter variations (voice, rate) for each phrase
ml/scripts/speech_to_text/01_generate_speech_samples.py Synthesizes audio samples using edge-tts with randomized characteristics
ml/scripts/speech_to_text/02a_randomize_delay_variations.py Generates random prefix/suffix silence durations for each sample
ml/scripts/speech_to_text/02_add_delays.py Applies silence padding to audio files based on randomized delays
ml/scripts/speech_to_text/03a_download_background_noise.py Downloads realistic background noise samples from Freesound
ml/scripts/speech_to_text/03_add_background_noise.py Mixes background noise into audio samples at random volumes
ml/scripts/speech_to_text/04_add_microphone_noise.py Adds synthetic microphone noise to simulate recording conditions
ml/scripts/speech_to_text/05_create_set_manifests.py Splits augmented audio into train/val/test sets and creates manifest CSVs
ml/scripts/speech_to_text/06_create_vocab_list.py Extracts unique vocabulary from training phrases for model output space
ml/scripts/speech_to_text/07_compute_spectrograms.py Computes log-mel spectrograms and tokenizes transcriptions for model input
ml/scripts/speech_to_text/08_train_model.py Trains CNN+BiLSTM model with CTC loss using custom training loop
ml/scripts/speech_to_text/09_evaluate_model.py Evaluates model on validation set, computes WER, and saves predictions
ml/scripts/speech_to_text/10_evaluate_test_samples.py Creates ZIP of successfully recognized test samples for quality inspection
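
For reference, the Word Error Rate reported by the evaluation stage (computed via jiwer, per requirements.txt) is word-level edit distance divided by reference length. A minimal standalone sketch of the metric, not the project's implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    via the standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```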

Comment on lines 21 to 26
parser = argparse.ArgumentParser(description="Train speech-to-text model.")
parser.add_argument('--manifest', type=Path, required=True, help='Path to val_manifest.csv')
parser.add_argument('--model', type=Path, required=True, help='Path to model file (speech_to_text_model.keras)')
parser.add_argument('--vocab', type=Path, required=True, help='Path to vocab_list.txt')
parser.add_argument('--spectrogram-dir', type=Path, required=True, help='Directory with spectrogram npy files')
parser.add_argument('--output-dir', type=Path, required=True, help='Directory for output model')
Copilot AI commented Feb 3, 2026

The help text for the --output-dir argument says "Directory for output model" but this script doesn't output a model. It outputs evaluation results (predictions and metrics). The help text should be "Directory for evaluation results" or similar.

Suggested change
-parser = argparse.ArgumentParser(description="Train speech-to-text model.")
+parser = argparse.ArgumentParser(description="Evaluate speech-to-text model.")
 parser.add_argument('--manifest', type=Path, required=True, help='Path to val_manifest.csv')
 parser.add_argument('--model', type=Path, required=True, help='Path to model file (speech_to_text_model.keras)')
 parser.add_argument('--vocab', type=Path, required=True, help='Path to vocab_list.txt')
 parser.add_argument('--spectrogram-dir', type=Path, required=True, help='Directory with spectrogram npy files')
-parser.add_argument('--output-dir', type=Path, required=True, help='Directory for output model')
+parser.add_argument('--output-dir', type=Path, required=True, help='Directory for evaluation results (predictions and metrics)')

return log_S

def compute_tokens(vocab_list, transcription):
transcription = re.sub(r"[^a-z0-9\s]", " ", transcription.lower())
Copilot AI commented Feb 3, 2026

The compute_tokens function has an issue with vocabulary generation consistency. In 06_create_vocab_list.py, punctuation is replaced with spaces and words are tokenized, but in this function (line 48), punctuation is also replaced with spaces before tokenization. However, 06_create_vocab_list.py uses phrase.replace(',', ' ') which only replaces commas, while this uses re.sub(r"[^a-z0-9\s]", " ", ...) which removes all punctuation. This inconsistency means words might not be found in the vocabulary even when they should be present. Both scripts should use the same tokenization and normalization logic.

Suggested change
-transcription = re.sub(r"[^a-z0-9\s]", " ", transcription.lower())
+transcription = transcription.lower().replace(',', ' ')
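
The fix the comment describes — a single normalization helper shared by both scripts — might look like the following sketch (the function name is hypothetical; it is not from the PR):

```python
import re

def normalize_transcription(text: str) -> list[str]:
    """Lowercase, map everything outside [a-z0-9 and whitespace] to a space,
    then split into word tokens. Importing this one helper from both
    06_create_vocab_list.py and 07_compute_spectrograms.py would keep
    vocabulary generation and tokenization in sync."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
```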

# Build the model
with open(paths.vocab, 'r', encoding='utf-8') as vocabfile:
vocab_list = [line.strip() for line in vocabfile if line.strip()]
num_classes = len(vocab_list) + 1 # +1 for CTC blank token
Copilot AI commented Feb 3, 2026

The CTC blank token index calculation is inconsistent across scripts. In this script (line 38), num_classes = len(vocab_list) + 1 is used, but in the evaluation scripts (09 and 10), the blank index is calculated as ctc_blank_idx = len(vocab_list) + 1 on line 59. However, in CTC, the blank token is typically at index len(vocab_list) (0-indexed), not len(vocab_list) + 1. This means the model is trained with one number of output classes but evaluated with a different blank index, which will cause mismatches. The correct calculation should be num_classes = len(vocab_list) + 1 (to include the blank), and ctc_blank_idx = len(vocab_list) (the last index).

Comment on lines 59 to 60
ctc_blank_idx = len(vocab_list) + 1 # +1 for CTC blank token
print(f"Vocabulary size: {len(vocab_list)}, Number of classes (with CTC blank): {ctc_blank_idx}")
Copilot AI commented Feb 3, 2026

The CTC blank token index calculation is incorrect. Here ctc_blank_idx = len(vocab_list) + 1, but it should be ctc_blank_idx = len(vocab_list) to match the 0-indexed output from the model which has num_classes = len(vocab_list) + 1 dimensions. The blank token is conventionally at the last index, which would be len(vocab_list) when using 0-based indexing.

Suggested change
-ctc_blank_idx = len(vocab_list) + 1 # +1 for CTC blank token
-print(f"Vocabulary size: {len(vocab_list)}, Number of classes (with CTC blank): {ctc_blank_idx}")
+ctc_blank_idx = len(vocab_list) # CTC blank token is conventionally at the last index
+print(f"Vocabulary size: {len(vocab_list)}, Number of classes (with CTC blank): {len(vocab_list) + 1}")
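
The corrected indexing can be illustrated with a small greedy CTC decoder in pure NumPy. The vocabulary and logits below are toy values, and this is not the project's decoder — just a sketch of the convention the review recommends:

```python
import numpy as np

def ctc_greedy_decode(logits: np.ndarray, vocab: list[str]) -> list[str]:
    """Greedy CTC decode: take the best class per time step, collapse
    repeats, and drop blanks. With num_classes = len(vocab) + 1, the blank
    sits at the LAST 0-based index, len(vocab) -- not len(vocab) + 1."""
    blank_idx = len(vocab)
    best = logits.argmax(axis=-1)  # best class index per time step
    tokens, prev = [], None
    for idx in best:
        if idx != blank_idx and idx != prev:  # drop blanks, collapse repeats
            tokens.append(vocab[idx])
        prev = idx
    return tokens

vocab = ["up", "down"]                  # toy vocabulary; blank_idx == 2
logits = np.array([[0.9, 0.05, 0.05],   # "up"
                   [0.9, 0.05, 0.05],   # "up" repeated -> collapsed
                   [0.1, 0.1, 0.8],     # blank
                   [0.1, 0.8, 0.1]])    # "down"
decoded = ctc_greedy_decode(logits, vocab)
```

With blank_idx mistakenly set to len(vocab) + 1 (index 3 here), the argmax could never equal it, so blanks would be emitted as if they were real tokens.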

Comment on lines +41 to +42
input_layer = Input(shape=(n_mels, time_steps), name='input')
x = layers.Reshape((n_mels, time_steps, 1))(input_layer)
Copilot AI commented Feb 3, 2026

The input shape for the model is defined as (n_mels, time_steps) which is (80, 360), but in 07_compute_spectrograms.py, the spectrogram is saved as log_S which has shape (n_mels, time_steps) = (80, 360) via librosa's melspectrogram output. However, the model then reshapes this to (n_mels, time_steps, 1) on line 42. This is correct for a 2D conv operation, but the input data is being loaded as-is without being reshaped. The training code loads spectrograms with np.load(spectrogram_file) and passes them directly to the model, which expects shape (80, 360). This should work, but verify that the data shape matches what the model expects during training.
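
The shape contract discussed above can be checked with a quick NumPy sketch (shapes taken from the comment; the arrays are stand-ins for the real spectrograms):

```python
import numpy as np

# Spectrograms are saved as (n_mels, time_steps) = (80, 360); the model's
# Reshape layer expands each to (n_mels, time_steps, 1) for the 2D conv.
n_mels, time_steps = 80, 360
spectrogram = np.zeros((n_mels, time_steps), dtype=np.float32)  # stand-in for np.load(...)
batch = np.stack([spectrogram])                     # (1, 80, 360): what the model receives
reshaped = batch.reshape(-1, n_mels, time_steps, 1)  # equivalent of layers.Reshape
```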

import pandas as pd
import random
import os
import asyncio
Copilot AI commented Feb 3, 2026

Import of 'asyncio' is not used.

Suggested change
-import asyncio

import random
import os
import asyncio
import edge_tts
Copilot AI commented Feb 3, 2026

Import of 'edge_tts' is not used.

Suggested change
-import edge_tts

import os
import asyncio
import edge_tts
from tqdm import tqdm
Copilot AI commented Feb 3, 2026

Import of 'tqdm' is not used.

Suggested change
-from tqdm import tqdm

@@ -0,0 +1,104 @@
import argparse
import csv
Copilot AI commented Feb 3, 2026

Import of 'csv' is not used.

Suggested change
-import csv

Comment on lines 1 to 11
edge-tts
pandas
tqdm
pydub
soundfile
librosa
tensorflow
tensorflow.keras
onnx
tf2onnx
jiwer
Copilot AI commented Feb 3, 2026

The ml/scripts/requirements.txt file lists multiple third-party Python packages without version pinning, which means pip install -r scripts/requirements.txt will always pull the latest, mutable versions from PyPI. This creates a supply-chain risk where a compromised or hijacked future release of any of these packages could execute attacker-controlled code in your training environment (with access to DVC credentials and artifacts). To reduce this risk, pin each dependency to a specific, known-good version (or immutable reference) and manage upgrades explicitly via review.
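
A pinned variant of the requirements file might look like the following. The onnx version comes from the follow-up commit in this PR; the other version numbers are illustrative placeholders to be replaced with the versions actually tested. (Note also that tensorflow.keras is not a standalone PyPI package — Keras ships bundled with tensorflow.)

```
edge-tts==6.1.12
pandas==2.2.2
tqdm==4.66.4
pydub==0.25.1
soundfile==0.12.1
librosa==0.10.2
tensorflow==2.16.1
onnx==1.17.0
tf2onnx==1.16.1
jiwer==3.0.4
```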

jodavis (Owner, Author) commented Feb 4, 2026

@copilot address the code review comments in this PR.

Copilot AI (Contributor) commented Feb 4, 2026

@jodavis I've opened a new pull request, #78, to work on those changes. Once the pull request is ready, I'll request review from you.

…eline (#78)

* Initial plan

* Address code review comments - fix critical issues and clean up code

Co-authored-by: jodavis <6740581+jodavis@users.noreply.github.com>

* Update onnx to 1.17.0 to fix security vulnerabilities

Co-authored-by: jodavis <6740581+jodavis@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jodavis <6740581+jodavis@users.noreply.github.com>
jodavis closed this Feb 5, 2026

jodavis (Owner, Author) commented Feb 5, 2026

Abandoning this PR so we can focus on a phoneme-based implementation
