Fix CTC loss implementation and address code quality issues in ML pipeline by Copilot · Pull Request #78 · jodavis/AdaptiveRemote

Copilot · 2026-02-04T06:22:04Z

Addressed 33 code review findings including critical bugs in the speech-to-text training pipeline's CTC loss implementation and multiple code quality issues.

Critical Fixes

CTC blank token index: Corrected the calculation from len(vocab_list) + 1 to len(vocab_list) in the training and evaluation scripts. The blank token sits at the last class index (0-based), not one past it.
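The off-by-one can be illustrated with a minimal sketch. It assumes the model emits one class per vocabulary entry plus one extra class for the blank; the vocabulary contents and the helper is_valid_class are hypothetical and not taken from the PR.

```python
# Hypothetical vocabulary; the real vocab_list is produced by the pipeline.
vocab_list = ["a", "b", "c"]

# Assumption: one output class per vocabulary entry, plus one for the CTC blank.
num_classes = len(vocab_list) + 1

blank_index_before = len(vocab_list) + 1  # one past the last valid class index
blank_index_after = len(vocab_list)       # the last valid class index


def is_valid_class(index: int, total_classes: int) -> bool:
    """Return True if index addresses one of the model's output classes."""
    return 0 <= index < total_classes


print(is_valid_class(blank_index_before, num_classes))
print(is_valid_class(blank_index_after, num_classes))
```

Under this assumption the old value points outside the output dimension, while the corrected value equals num_classes - 1.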

CTC label lengths: Fixed to compute actual sequence lengths by counting non-padding tokens instead of using the full padded length:

# Before: incorrect - treats padding as valid labels
lbl_len_reshaped = tf.fill([tf.shape(y_batch)[0], 1], tf.shape(y_batch)[1])

# After: correct - counts actual tokens
lbl_len = tf.math.count_nonzero(y_batch, axis=1, dtype=tf.int32)
lbl_len_reshaped = tf.expand_dims(lbl_len, axis=1)
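The difference between the two versions can be checked by hand on a small padded batch. The sketch below uses plain Python rather than TensorFlow so it runs without the framework; PAD_ID, label_lengths, and the batch values are illustrative assumptions. Like count_nonzero, it only works if padding uses id 0 and no real token is encoded as 0.

```python
from typing import List

PAD_ID = 0  # assumption: padding is id 0 and no real token uses id 0


def label_lengths(batch: List[List[int]]) -> List[int]:
    """Count non-padding tokens per row, mirroring count_nonzero along axis 1."""
    return [sum(1 for token in row if token != PAD_ID) for row in batch]


# Hypothetical batch of three label sequences, right-padded to length 5.
y_batch = [
    [5, 2, 9, 0, 0],
    [7, 0, 0, 0, 0],
    [3, 4, 1, 8, 6],
]

padded_length = len(y_batch[0])   # what the old code reported for every row
lengths = label_lengths(y_batch)  # what the corrected code reports per row

print(padded_length)
print(lengths)
```

Here the old approach would report a length of 5 for every row, whereas the per-row count gives 3, 1, and 5.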

Tokenization consistency: Aligned the text preprocessing in 06_create_vocab_list.py (comma replacement) and 07_compute_spectrograms.py (which had been using a regex that removed all punctuation), so both scripts tokenize transcripts the same way.
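A rough sketch of why the mismatch matters: when the vocabulary is built with one normalization and the labels with another, some label tokens never appear in the vocabulary. The function names, the choice to replace commas with spaces, and the sample sentence are assumptions for illustration; the actual scripts may differ.

```python
import re
from typing import Callable, List


def normalize_comma_replace(text: str) -> str:
    # Assumption: commas become spaces and other punctuation is kept.
    return text.replace(",", " ")


def normalize_strip_punctuation(text: str) -> str:
    # Removes every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", text)


def tokenize(text: str, normalize: Callable[[str], str]) -> List[str]:
    """Lowercase, normalize, then split on whitespace."""
    return normalize(text.lower()).split()


sample = "Volume up, then don't stop"

vocab_tokens = tokenize(sample, normalize_comma_replace)
label_tokens = tokenize(sample, normalize_strip_punctuation)

# Tokens produced for labels that the vocabulary has never seen.
missing = [token for token in label_tokens if token not in set(vocab_tokens)]

print(vocab_tokens)
print(label_tokens)
print(missing)
```

Routing both scripts through one shared normalization function is one way to keep them from drifting apart again.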

Code Quality

Removed unused imports: csv, tensorflow.keras layers/Model/Input, jiwer.wer, zipfile, xml.etree, asyncio, edge_tts, tqdm (10+ occurrences)
Removed the unused variable input_output_pairs and the unused function ctc_loss_fn
Fixed tabs-to-spaces indentation in 03_add_background_noise.py
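Fixed list.remove(index) → list.pop(index) in the variation generator (remove() deletes by value, pop() deletes by position)
Fixed return type annotation: -> (set | List[Dict]) → -> Tuple[set, List[Dict]] (the function returns both values, not one or the other)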
Fixed generate_samples() to track actual count instead of enumerate index
Corrected argparse descriptions (evaluation scripts said "Train" instead of "Evaluate")
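Three of the fixes above are easy to get wrong silently, so a short sketch may help. All names and data here (summarize, count_by_enumerate, count_actual, the sample lists) are hypothetical stand-ins and do not reproduce the pipeline's code.

```python
from typing import Dict, List, Optional, Tuple

# remove() deletes the first element equal to its argument;
# pop() deletes the element at the given position.
items = [10, 20, 30, 2]
by_value = list(items)
by_value.remove(2)   # drops the value 2
by_index = list(items)
by_index.pop(2)      # drops the element at index 2 (the value 30)


def summarize(rows: List[Dict]) -> Tuple[set, List[Dict]]:
    """Returns two values together, so the annotation is a Tuple, not a union."""
    keys = {key for row in rows for key in row}
    return keys, rows


def count_by_enumerate(candidates: List[Optional[str]]) -> int:
    """Overcounts when candidates are skipped: reports last index + 1."""
    produced = 0
    for index, candidate in enumerate(candidates):
        if candidate is None:
            continue
        produced = index + 1
    return produced


def count_actual(candidates: List[Optional[str]]) -> int:
    """Increments only when a sample is really produced."""
    produced = 0
    for candidate in candidates:
        if candidate is None:
            continue
        produced += 1
    return produced


candidates = ["a", None, "b", None]
print(by_value, by_index)
print(count_by_enumerate(candidates), count_actual(candidates))
```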

Reproducibility & Security

Added random.seed(42) before train/val/test split for reproducible data partitioning
Pinned all dependency versions in requirements.txt
Updated onnx 1.15.0 → 1.17.0 (fixes path traversal, arbitrary file overwrite, directory traversal vulnerabilities)
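The seeding step can be sketched as follows. The function split_dataset, the 80/10/10 fractions, and the file names are assumptions for illustration; only the call to random.seed(42) before splitting comes from the PR description.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[str],
    seed: int = 42,
    train_frac: float = 0.8,  # assumed fraction, not stated in the PR
    val_frac: float = 0.1,    # assumed fraction, not stated in the PR
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle with a fixed seed, then cut into train/val/test partitions."""
    random.seed(seed)  # seed before shuffling so the partition is repeatable
    shuffled = list(items)
    random.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical sample names.
samples = [f"clip_{i:03d}.wav" for i in range(20)]

first = split_dataset(samples)
second = split_dataset(samples)
print(first == second)
print([len(part) for part in first])
```

The split is only repeatable if the input order is also stable, so sorting file listings before splitting is worth considering when they come from the filesystem.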


Co-authored-by: jodavis <6740581+jodavis@users.noreply.github.com>

Initial plan

7b14c0c

Copilot AI assigned Copilot and jodavis Feb 4, 2026

Copilot AI mentioned this pull request Feb 4, 2026

Implement the speech-to-text training pipeline #77

Closed

Copilot started work on behalf of jodavis February 4, 2026 06:22

Copilot AI and others added 2 commits February 4, 2026 06:25

Address code review comments - fix critical issues and clean up code

68c4f10

Co-authored-by: jodavis <6740581+jodavis@users.noreply.github.com>

Update onnx to 1.17.0 to fix security vulnerabilities

92350d2

Co-authored-by: jodavis <6740581+jodavis@users.noreply.github.com>

Copilot AI changed the title from "[WIP] Implement speech-to-text training pipeline" to "Fix CTC loss implementation and address code quality issues in ML pipeline" Feb 4, 2026

Copilot AI requested a review from jodavis February 4, 2026 06:28

Copilot finished work on behalf of jodavis February 4, 2026 06:28

jodavis approved these changes Feb 5, 2026

jodavis marked this pull request as ready for review February 5, 2026 16:56

jodavis merged commit 7f369be into dev/jodavis/ADR-50-speech-to-text Feb 5, 2026

jodavis merged 3 commits into dev/jodavis/ADR-50-speech-to-text from copilot/sub-pr-77
