Implement the speech-to-text training pipeline #77
Test Results: 308 tests ±0, 304 ✅ ±0, 1m 9s ⏱️ (−43s). Results for commit dc48fbd; comparison against base commit 6854c1d. This pull request removes 3 and adds 2 tests (renamed tests count towards both).
Pull request overview
This PR implements a comprehensive ML pipeline for speech-to-text model training using DVC (Data Version Control) for reproducibility. The pipeline generates synthetic training data from text phrases, applies audio augmentations (delays, background noise, microphone noise), trains a neural speech recognition model using TensorFlow/Keras with CTC loss, and evaluates the model's performance.
Changes:
- DVC pipeline configuration with 11 stages covering data generation, augmentation, training, and evaluation
- Python scripts for intent phrase generation with linguistic variations (pleasantries, hesitations, spelling variants)
- Speech synthesis pipeline using edge-tts to generate audio samples with randomized voice characteristics
- Audio augmentation scripts for adding realistic noise and delays to improve model robustness
- Neural model training with CNN+BiLSTM architecture and CTC loss for sequence-to-sequence learning
- Model evaluation scripts computing Word Error Rate (WER) and generating prediction reports (see the jiwer sketch after this list)
- Documentation describing the ML pipeline design and usage
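
For context, here is a minimal sketch of how WER can be computed with jiwer (listed in ml/scripts/requirements.txt); the transcripts below are made up for illustration, not taken from the pipeline's data:

```python
# Minimal WER sketch using jiwer (a dependency in ml/scripts/requirements.txt).
# The reference and prediction strings are hypothetical examples.
from jiwer import wer

references = ["turn on the living room lights"]
predictions = ["turn on living room lights"]

# wer() returns (substitutions + deletions + insertions) / reference word count
print(f"WER: {wer(references, predictions):.3f}")
```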
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 33 comments.
Summary per file:
| File | Description |
|---|---|
| ml/dvc.yaml | Defines 11-stage DVC pipeline orchestrating all data preparation, training, and evaluation steps (see the stage sketch after this table) |
| ml/dvc.lock | Lock file tracking exact versions and hashes of all pipeline dependencies and outputs |
| ml/_doc_ml.md | Comprehensive documentation of ML pipeline architecture, stages, and developer quickstart guide |
| .dvc/config | DVC remote configuration pointing to S3 bucket for artifact storage |
| .dvc/.gitignore | Git ignore rules for DVC local cache and temporary files |
| .dvcignore | DVC-specific ignore patterns for performance optimization |
| ml/scripts/requirements.txt | Python package dependencies for ML pipeline execution |
| ml/scripts/intent_prediction/01_input_phrases.csv | Base command phrases mapped to canonical labels for intent classification |
| ml/scripts/intent_prediction/01_generate_phrases.py | Generates 10,000 phrase variations with linguistic transformations for robust training |
| ml/scripts/speech_to_text/01a_generate_speech_sample_variations.py | Creates TTS parameter variations (voice, rate) for each phrase |
| ml/scripts/speech_to_text/01_generate_speech_samples.py | Synthesizes audio samples using edge-tts with randomized characteristics |
| ml/scripts/speech_to_text/02a_randomize_delay_variations.py | Generates random prefix/suffix silence durations for each sample |
| ml/scripts/speech_to_text/02_add_delays.py | Applies silence padding to audio files based on randomized delays |
| ml/scripts/speech_to_text/03a_download_background_noise.py | Downloads realistic background noise samples from Freesound |
| ml/scripts/speech_to_text/03_add_background_noise.py | Mixes background noise into audio samples at random volumes |
| ml/scripts/speech_to_text/04_add_microphone_noise.py | Adds synthetic microphone noise to simulate recording conditions |
| ml/scripts/speech_to_text/05_create_set_manifests.py | Splits augmented audio into train/val/test sets and creates manifest CSVs |
| ml/scripts/speech_to_text/06_create_vocab_list.py | Extracts unique vocabulary from training phrases for model output space |
| ml/scripts/speech_to_text/07_compute_spectrograms.py | Computes log-mel spectrograms and tokenizes transcriptions for model input |
| ml/scripts/speech_to_text/08_train_model.py | Trains CNN+BiLSTM model with CTC loss using custom training loop |
| ml/scripts/speech_to_text/09_evaluate_model.py | Evaluates model on validation set, computes WER, and saves predictions |
| ml/scripts/speech_to_text/10_evaluate_test_samples.py | Creates ZIP of successfully recognized test samples for quality inspection |
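
For orientation, a hypothetical dvc.yaml fragment showing the shape of one such stage; the stage name and paths are illustrative, not copied from the PR's actual configuration:

```yaml
# Hypothetical fragment showing DVC stage structure (names/paths are examples).
stages:
  compute_spectrograms:
    cmd: python scripts/speech_to_text/07_compute_spectrograms.py
    deps:
      - scripts/speech_to_text/07_compute_spectrograms.py
      - data/augmented_audio
    outs:
      - data/spectrograms
```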
```python
parser = argparse.ArgumentParser(description="Train speech-to-text model.")
parser.add_argument('--manifest', type=Path, required=True, help='Path to val_manifest.csv')
parser.add_argument('--model', type=Path, required=True, help='Path to model file (speech_to_text_model.keras)')
parser.add_argument('--vocab', type=Path, required=True, help='Path to vocab_list.txt')
parser.add_argument('--spectrogram-dir', type=Path, required=True, help='Directory with spectrogram npy files')
parser.add_argument('--output-dir', type=Path, required=True, help='Directory for output model')
```
The help text for the --output-dir argument says "Directory for output model", but this script doesn't output a model; it outputs evaluation results (predictions and metrics). The help text should read "Directory for evaluation results" or similar, and the parser description has the same copy-paste problem ("Train" instead of "Evaluate").
Suggested change:

```diff
-parser = argparse.ArgumentParser(description="Train speech-to-text model.")
+parser = argparse.ArgumentParser(description="Evaluate speech-to-text model.")
 parser.add_argument('--manifest', type=Path, required=True, help='Path to val_manifest.csv')
 parser.add_argument('--model', type=Path, required=True, help='Path to model file (speech_to_text_model.keras)')
 parser.add_argument('--vocab', type=Path, required=True, help='Path to vocab_list.txt')
 parser.add_argument('--spectrogram-dir', type=Path, required=True, help='Directory with spectrogram npy files')
-parser.add_argument('--output-dir', type=Path, required=True, help='Directory for output model')
+parser.add_argument('--output-dir', type=Path, required=True, help='Directory for evaluation results (predictions and metrics)')
```
```python
    return log_S


def compute_tokens(vocab_list, transcription):
    transcription = re.sub(r"[^a-z0-9\s]", " ", transcription.lower())
```
The compute_tokens function's normalization is inconsistent with vocabulary generation. 06_create_vocab_list.py normalizes with phrase.replace(',', ' '), which only replaces commas, while this function (line 48) uses re.sub(r"[^a-z0-9\s]", " ", ...), which strips all punctuation. Because the two scripts tokenize differently, words can fail vocabulary lookup even when they should be present. Both scripts should use the same tokenization and normalization logic.
Suggested change:

```diff
-    transcription = re.sub(r"[^a-z0-9\s]", " ", transcription.lower())
+    transcription = transcription.lower().replace(',', ' ')
```
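
One way to enforce that consistency is a single shared helper imported by both scripts; the function name and placement below are hypothetical, sketched to show the idea:

```python
import re

def normalize_words(text: str) -> list[str]:
    # Hypothetical shared helper: if both 06_create_vocab_list.py and
    # compute_tokens() use this, vocabulary and tokens cannot drift apart.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return text.split()
```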
```python
# Build the model
with open(paths.vocab, 'r', encoding='utf-8') as vocabfile:
    vocab_list = [line.strip() for line in vocabfile if line.strip()]
num_classes = len(vocab_list) + 1  # +1 for CTC blank token
```
The CTC blank token index is inconsistent across scripts. This script (line 38) sets num_classes = len(vocab_list) + 1, but the evaluation scripts (09 and 10) compute ctc_blank_idx = len(vocab_list) + 1 on line 59. With num_classes = len(vocab_list) + 1 model outputs, the CTC blank is conventionally the last 0-indexed class, i.e. len(vocab_list), not len(vocab_list) + 1. As written, the model is trained with one output layout but decoded with an out-of-range blank index, causing mismatches. Keep num_classes = len(vocab_list) + 1 (to include the blank) and set ctc_blank_idx = len(vocab_list) (the last index).
```python
ctc_blank_idx = len(vocab_list) + 1  # +1 for CTC blank token
print(f"Vocabulary size: {len(vocab_list)}, Number of classes (with CTC blank): {ctc_blank_idx}")
```
The CTC blank token index calculation is incorrect. Here ctc_blank_idx = len(vocab_list) + 1, but it should be ctc_blank_idx = len(vocab_list) to match the 0-indexed output of the model, which has num_classes = len(vocab_list) + 1 dimensions. The blank token is conventionally at the last index, which is len(vocab_list) with 0-based indexing.
Suggested change:

```diff
-ctc_blank_idx = len(vocab_list) + 1  # +1 for CTC blank token
-print(f"Vocabulary size: {len(vocab_list)}, Number of classes (with CTC blank): {ctc_blank_idx}")
+ctc_blank_idx = len(vocab_list)  # CTC blank token is conventionally at the last index
+print(f"Vocabulary size: {len(vocab_list)}, Number of classes (with CTC blank): {len(vocab_list) + 1}")
```
```python
input_layer = Input(shape=(n_mels, time_steps), name='input')
x = layers.Reshape((n_mels, time_steps, 1))(input_layer)
```
The model's input shape is (n_mels, time_steps) = (80, 360), matching the log-mel spectrograms saved by 07_compute_spectrograms.py via librosa's melspectrogram output, and the Reshape to (n_mels, time_steps, 1) on line 42 correctly adds a channel axis for the 2D convolutions. However, the training code loads spectrograms with np.load(spectrogram_file) and feeds them to the model unmodified, so verify that every saved array actually has shape (80, 360).
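
A cheap guard for that, assuming the (80, 360) layout described above; the file path is illustrative:

```python
import numpy as np

N_MELS, TIME_STEPS = 80, 360  # expected log-mel layout from 07_compute_spectrograms.py

spec = np.load("spectrograms/sample_0001.npy")  # hypothetical spectrogram file
assert spec.shape == (N_MELS, TIME_STEPS), (
    f"spectrogram shape {spec.shape} does not match model input {(N_MELS, TIME_STEPS)}"
)
```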
```python
import pandas as pd
import random
import os
import asyncio
```
Import of 'asyncio' is not used.
Suggested change:

```diff
-import asyncio
```
```python
import random
import os
import asyncio
import edge_tts
```
Import of 'edge_tts' is not used.
Suggested change:

```diff
-import edge_tts
```
```python
import os
import asyncio
import edge_tts
from tqdm import tqdm
```
Import of 'tqdm' is not used.
Suggested change:

```diff
-from tqdm import tqdm
```
```diff
@@ -0,0 +1,104 @@
+import argparse
+import csv
```
Import of 'csv' is not used.
Suggested change:

```diff
-import csv
```
ml/scripts/requirements.txt (outdated):
```text
edge-tts
pandas
tqdm
pydub
soundfile
librosa
tensorflow
tensorflow.keras
onnx
tf2onnx
jiwer
```
The ml/scripts/requirements.txt file lists multiple third-party Python packages without version pinning, which means pip install -r scripts/requirements.txt will always pull the latest, mutable versions from PyPI. This creates a supply-chain risk where a compromised or hijacked future release of any of these packages could execute attacker-controlled code in your training environment (with access to DVC credentials and artifacts). To reduce this risk, pin each dependency to a specific, known-good version (or immutable reference) and manage upgrades explicitly via review.
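
A pinned variant might look like the following; the version numbers are illustrative and should be replaced with audited releases (onnx 1.17.0 comes from the follow-up commit below, and tensorflow.keras is omitted since it is not a separate PyPI package):

```text
# Illustrative pins only; audit and update versions deliberately.
edge-tts==6.1.12
pandas==2.2.2
tqdm==4.66.4
pydub==0.25.1
soundfile==0.12.1
librosa==0.10.2
tensorflow==2.16.1
onnx==1.17.0
tf2onnx==1.16.1
jiwer==3.0.4
```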
@copilot address the code review comments in this PR.
…eline (#78)

* Initial plan
* Address code review comments - fix critical issues and clean up code
  Co-authored-by: jodavis <6740581+jodavis@users.noreply.github.com>
* Update onnx to 1.17.0 to fix security vulnerabilities
  Co-authored-by: jodavis <6740581+jodavis@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jodavis <6740581+jodavis@users.noreply.github.com>
Abandoning this PR so we can focus on a phoneme-based implementation.
Use DVC to create a reproducible pipeline to generate speech samples and train a neural ML model for speech recognition, which can be tuned to the individual user and command space for Adaptive Remote.