Fix CTC loss implementation and address code quality issues in ML pipeline #78

Merged
jodavis merged 3 commits into dev/jodavis/ADR-50-speech-to-text from copilot/sub-pr-77
Feb 5, 2026

Copilot AI commented Feb 4, 2026

Addressed 33 code review findings including critical bugs in the speech-to-text training pipeline's CTC loss implementation and multiple code quality issues.

Critical Fixes

CTC blank token index: Corrected the calculation from len(vocab_list) + 1 to len(vocab_list) in the training and evaluation scripts. With len(vocab_list) + 1 output classes, the blank token sits at the last 0-based index, len(vocab_list), not one beyond it.
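
The off-by-one is easier to see with concrete numbers. The following is a minimal sketch, not the pipeline's code: the four-symbol `vocab_list` is made up, and it assumes `vocab_list` enumerates every non-blank class with ids 0 through len(vocab_list) - 1.

```python
# Hypothetical vocabulary; the real vocab_list is built elsewhere in the pipeline.
vocab_list = ["a", "b", "c", " "]

# Assumption: non-blank classes occupy ids 0 .. len(vocab_list) - 1,
# and the model emits one extra class for the CTC blank.
num_classes = len(vocab_list) + 1   # width of the output layer, blank included
blank_index = len(vocab_list)       # last valid 0-based index

# The old value, len(vocab_list) + 1, points one past the final output unit.
old_blank_index = len(vocab_list) + 1

print(blank_index, num_classes - 1)    # both are the last valid index
print(old_blank_index >= num_classes)  # True: out of range
```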

CTC label lengths: Fixed to compute actual sequence lengths by counting non-padding tokens instead of using the full padded length:

# Before: incorrect - treats padding as valid labels
lbl_len_reshaped = tf.fill([tf.shape(y_batch)[0], 1], tf.shape(y_batch)[1])

# After: correct - counts actual tokens
lbl_len = tf.math.count_nonzero(y_batch, axis=1, dtype=tf.int32)
lbl_len_reshaped = tf.expand_dims(lbl_len, axis=1)
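
To illustrate what the corrected computation returns, here is a dependency-free sketch of the same counting logic in plain Python. It assumes, as tf.math.count_nonzero implicitly does, that id 0 is reserved for padding and never encodes a real token; the batch values and the `label_lengths` helper are invented for the example.

```python
from typing import List

PAD_ID = 0  # assumption: 0 is the padding id and no real token uses it


def label_lengths(y_batch: List[List[int]], pad_id: int = PAD_ID) -> List[int]:
    """Count the non-padding tokens in each row of a padded label batch."""
    return [sum(1 for token in row if token != pad_id) for row in y_batch]


# Hypothetical batch of three label sequences, right-padded to width 5.
y_batch = [
    [5, 2, 9, 0, 0],
    [3, 3, 0, 0, 0],
    [7, 1, 4, 8, 6],
]

padded_width = len(y_batch[0])   # what the old code reported for every row
lengths = label_lengths(y_batch)

print(padded_width)  # 5
print(lengths)       # [3, 2, 5]
```

If a real token were ever assigned id 0, this count would undercount, so the padding convention is worth verifying against the vocabulary.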

Tokenization consistency: Aligned preprocessing between 06_create_vocab_list.py (comma replacement) and 07_compute_spectrograms.py, which previously used a regex that removed all punctuation.
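
A small sketch of why the mismatch matters. Both functions below are hypothetical stand-ins, not the scripts' actual code: the first assumes "comma replacement" means replacing commas with spaces, the second assumes a regex of the form `[^\w\s]`, and whitespace-split tokens are assumed purely for illustration.

```python
import re
from typing import List


def normalize_transcript(text: str) -> str:
    """Hypothetical vocab-side normalizer: commas become spaces."""
    return " ".join(text.replace(",", " ").split())


def strip_all_punctuation(text: str) -> str:
    """Hypothetical label-side normalizer: every punctuation mark is removed."""
    return " ".join(re.sub(r"[^\w\s]", "", text).split())


sample = "turn left, then stop."

# Tokens the vocabulary would contain versus tokens the labels would use.
vocab_tokens = set(normalize_transcript(sample).split())
label_tokens: List[str] = strip_all_punctuation(sample).split()

# Any label token missing from the vocabulary cannot be encoded.
missing = [token for token in label_tokens if token not in vocab_tokens]
print(missing)  # ['stop']
```

Routing both scripts through one shared normalizer removes this class of mismatch.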

Code Quality

  • Removed unused imports: csv, tensorflow.keras layers/Model/Input, jiwer.wer, zipfile, xml.etree, asyncio, edge_tts, tqdm (10+ occurrences)
  • Removed the unused variable input_output_pairs and the unused function ctc_loss_fn
  • Converted tab indentation to spaces in 03_add_background_noise.py
  • Fixed list.remove(index) → list.pop(index) in variation generator
  • Fixed return type annotation: -> (set | List[Dict]) → -> Tuple[set, List[Dict]]
  • Fixed generate_samples() to track actual count instead of enumerate index
  • Corrected argparse descriptions (evaluation scripts said "Train" instead of "Evaluate")
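
Three of the fixes above are easy to misread, so here is a short illustrative sketch. All names and data are hypothetical and only mirror the shape of the issues described: remove() deletes by value while pop() deletes by position, a function returning two values needs a Tuple annotation rather than a union, and an enumerate index overcounts when some items are skipped.

```python
from typing import Dict, List, Tuple

# 1. list.remove(x) deletes the first element equal to x;
#    list.pop(i) deletes the element at position i.
variants = [10, 20, 30, 1]
by_value = list(variants)
by_value.remove(1)   # removes the value 1 -> [10, 20, 30]
by_index = list(variants)
by_index.pop(1)      # removes position 1 -> [10, 30, 1]


# 2. Returning two values means returning a tuple, so the annotation
#    is Tuple[...], not a union of the two element types.
def collect(rows: List[Dict]) -> Tuple[set, List[Dict]]:
    seen = {row["id"] for row in rows}
    return seen, rows


# 3. Track the number of samples actually produced instead of relying
#    on the enumerate index, which also advances for skipped items.
def generate_samples_sketch(candidates: List[str]) -> int:
    produced = 0
    for candidate in candidates:
        if not candidate:   # skipped candidates must not be counted
            continue
        produced += 1
    return produced


print(by_value, by_index)
print(generate_samples_sketch(["a", "", "b", ""]))  # 2, not 4
```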

Reproducibility & Security

  • Added random.seed(42) before train/val/test split for reproducible data partitioning
  • Pinned all dependency versions in requirements.txt
  • Updated onnx 1.15.0 → 1.17.0 (fixes path traversal, arbitrary file overwrite, directory traversal vulnerabilities)
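
As a sketch of the seeding fix, the snippet below shows how seeding before the shuffle makes the partition repeatable. The `split_dataset` helper, the 80/10/10 fractions, and the sample ids are assumptions for illustration; only random.seed(42) comes from the change list.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[str],
    seed: int = 42,
    train_frac: float = 0.8,   # assumed fraction, not taken from the PR
    val_frac: float = 0.1,     # assumed fraction, not taken from the PR
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle with a fixed seed, then slice into train/val/test."""
    shuffled = list(items)
    random.seed(seed)          # seed BEFORE shuffling so the order is repeatable
    random.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


sample_ids = [f"utt_{i:03d}" for i in range(100)]  # hypothetical utterance ids
first = split_dataset(sample_ids)
second = split_dataset(sample_ids)

print(first == second)                 # True: same partition on every run
print([len(part) for part in first])   # [80, 10, 10]
```

Seeding the global generator affects every later random call in the process; a dedicated random.Random(42) instance is a common alternative when that side effect is unwanted.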

Copilot AI and others added 2 commits February 4, 2026 06:25
Co-authored-by: jodavis <6740581+jodavis@users.noreply.github.com>
Co-authored-by: jodavis <6740581+jodavis@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement speech-to-text training pipeline" to "Fix CTC loss implementation and address code quality issues in ML pipeline" Feb 4, 2026
Copilot AI requested a review from jodavis February 4, 2026 06:28
@jodavis jodavis marked this pull request as ready for review February 5, 2026 16:56
@jodavis jodavis merged commit 7f369be into dev/jodavis/ADR-50-speech-to-text Feb 5, 2026