3 changes: 3 additions & 0 deletions .dvc/.gitignore
@@ -0,0 +1,3 @@
/config.local
/tmp
/cache
4 changes: 4 additions & 0 deletions .dvc/config
@@ -0,0 +1,4 @@
[core]
remote = adr-ml-training-data
['remote "adr-ml-training-data"']
url = s3://adr-ml-training-data/dvc
3 changes: 3 additions & 0 deletions .dvcignore
@@ -0,0 +1,3 @@
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore
66 changes: 59 additions & 7 deletions ml/_doc_ml.md
@@ -1,10 +1,62 @@
Speech Recognition Models
# ML Pipeline Design — AdaptiveRemote (current implementation)

=========================
## Purpose and Scope

## Folders
/ml/scripts -> Python scripts for implementing speech recognition training and evaluation.
/ml/data -> Contains datasets used for training and evaluating speech recognition models.
/ml/notebooks -> Jupyter notebooks for experimenting with different speech recognition techniques.
/ml/models -> Pre-trained speech recognition models and scripts for training new models.
This document describes the ML pipeline in the `ml` folder: the DVC stages, the scripts behind them, their inputs and outputs, how to run the pipeline locally for development, and a short list of next steps. The pipeline is implemented for local Windows development and uses DVC to manage data and artifacts.

## Repository layout (relevant paths)

- `ml/dvc.yaml` — pipeline orchestration and stage definitions.
- `ml/scripts/intent_prediction/` — scripts to generate intent phrase variations (`01_generate_phrases.py` and `01_input_phrases.csv`).
- `ml/scripts/speech_to_text/` — speech sample generation, augmentation, featurization, training, and evaluation scripts (`01*`–`09*`).
- `ml/data/` — DVC-tracked raw, intermediate, and output artifacts (manifests, spectrograms, models). This data is managed by DVC, not committed to the Git repo.

## Implemented DVC stages and scripts

The pipeline for both speech-to-text and intent prediction is defined with explicit stages in [`ml/dvc.yaml`](./dvc.yaml). See that file for stage names, inputs, and outputs.
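For orientation, a stage entry in `dvc.yaml` generally takes this shape (the stage name and paths below are illustrative, not copied from the repo's actual stage definitions):

```yaml
stages:
  compute_spectrograms:   # illustrative stage name; see ml/dvc.yaml for the real ones
    cmd: python scripts/speech_to_text/07_compute_spectrograms.py --manifest data/manifests/train.csv --output data/spectrograms
    deps:
      - scripts/speech_to_text/07_compute_spectrograms.py
      - data/manifests/train.csv
    outs:
      - data/spectrograms
```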

Each stage in `dvc.yaml` declares the exact command, dependencies, and outputs used by the pipeline. The scripts follow a consistent CLI convention (required `--input`/`--manifest`/`--output` arguments), and each stage saves its outputs into its own folder in the `ml/data` tree.
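A minimal sketch of that shared CLI convention, using `argparse`; the real scripts may accept additional flags, so check each script's `--help` before relying on these exact names:

```python
# Illustrative stage CLI following the --manifest/--output convention
# described above; flag names mirror the convention, not any one script.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Example pipeline-stage CLI")
    parser.add_argument("--manifest", required=True, help="input CSV manifest")
    parser.add_argument("--output", required=True, help="output directory for this stage")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"reading {args.manifest}, writing to {args.output}")
```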

## Implementation details (summary)

- Intent generation: `01_generate_phrases.py` reads `01_input_phrases.csv`, synthesizes surface-form variations (adding pleasantries, hesitations, spelling variants, and repeats), and writes `training_data.csv` used as input to both intent prediction and speech sample generation.
- Speech sample generation and augmentation: scripts `01*`–`04*` in speech_to_text generate TTS or synthetic samples, randomize delays, add background and microphone noise, and write clean/noisy audio files into DVC-backed directories.
- Manifests and vocab: `05_create_set_manifests.py` creates train/val CSV manifests referencing audio filepaths and expected transcripts; `06_create_vocab_list.py` builds `vocab_list.txt`.
- Featurization: `07_compute_spectrograms.py` reads manifests and `vocab_list.txt`, computes fixed-size log-Mel spectrograms and token arrays (`*.npy`), and stores them in the spectrogram output directory.
- Training: `08_train_model.py` loads training manifest and spectrogram/token `.npy` files, constructs a Keras model (Conv2D → BiLSTM → Dense), trains with a CTC-style loss loop, and saves `speech_to_text_model.keras`.
- Evaluation: `09_evaluate_model.py` loads the saved model, runs greedy CTC decoding on eval spectrograms, computes WER using `jiwer`, and writes an `evaluation_predictions.txt` report.
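The greedy CTC decoding mentioned in the evaluation step can be sketched as follows. This is a generic illustration (collapse repeated labels, drop blanks), assuming blank index 0 and a word-level vocab offset by one for the blank token; `09_evaluate_model.py` may index its vocab differently:

```python
import numpy as np

def greedy_ctc_decode(logits, vocab, blank=0):
    """Greedy CTC decode: take the argmax per frame, collapse repeats,
    and drop blank frames. logits has shape (time_steps, num_classes)."""
    best = np.argmax(logits, axis=-1)
    out, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:
            out.append(vocab[idx - 1])  # vocab is offset by 1 for the blank token
        prev = idx
    return " ".join(out)
```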

## Dependencies and environment

- See `ml/scripts/requirements.txt` for required Python packages.
- The code is written to run on Windows for development (CPU TensorFlow). Model training and larger-scale runs can be moved to Linux GPU hosts with minimal changes.

## How to run (developer quickstart)

1. Ensure DVC is installed and configured for your environment.
2. From the repository root:

```powershell
cd ml
pip install -r scripts/requirements.txt
dvc pull
dvc repro
```

`dvc repro` will execute the defined stages in the correct order and populate `ml/data` with outputs. Inspect the `dvc.yaml` file for per-stage commands and I/O when you need to run or debug a particular step.
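When debugging a single step, one quick way to see the per-stage commands without running anything is to parse `dvc.yaml` directly. This sketch assumes PyYAML is available (it ships as a DVC dependency):

```python
import yaml  # PyYAML, installed alongside DVC

def list_stage_commands(dvc_yaml_text):
    """Return (stage_name, command) pairs from a dvc.yaml document."""
    pipeline = yaml.safe_load(dvc_yaml_text)
    return [(name, stage.get("cmd"))
            for name, stage in pipeline.get("stages", {}).items()]
```

For example, `list_stage_commands(open("ml/dvc.yaml").read())` lists every stage and the command it runs.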

## Observations and current limitations

- The training loop in `08_train_model.py` implements a simple custom training loop with CTC loss; model hyperparameters (epochs, batch_size, time_steps) are hardcoded and could be parameterized.
- The dataset generation is entirely file-based; large intermediate artifacts (audio, spectrograms, models) are stored under `ml/data` and should be pushed to the DVC remote to share across machines.
- There is a simple greedy CTC decoder in `09_evaluate_model.py`; for production accuracy reporting, a beam search decoder could be added.
- Scripts assume certain manifest and file conventions (manifest columns include `filepath` and `speech_to_detect`). Changes to manifest format will require updating `07*`, `08*`, and `09*` scripts.
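A minimal reader for the manifest convention noted in the last point; the real manifests may carry extra columns beyond `filepath` and `speech_to_detect`:

```python
import csv

def load_manifest(path):
    """Read a stage manifest CSV into (filepath, transcript) pairs,
    using the column convention described above."""
    with open(path, newline="", encoding="utf-8") as f:
        return [(row["filepath"], row["speech_to_detect"])
                for row in csv.DictReader(f)]
```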

## Next steps (practical, minimal changes)

- Parameterize training and featurization hyperparameters via CLI args or a YAML config to avoid editing source for experiments.
- Pin dependency versions in `ml/scripts/requirements.txt` (or add `requirements.lock`) for reproducibility.
- Add a small-sample smoke dataset and a CI job that runs `dvc pull` + `dvc repro` on that sample to detect regressions.
- Add a minimal `model_registry.json` (template) that records model metadata (train commit, metrics, DVC path) when a training run completes.
- Add a simple cleanup helper to remove local intermediate files not referenced by DVC.
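The `model_registry.json` idea above could start as small as the sketch below; every field name here is a suggestion, not an existing convention in the repo:

```python
import json

def registry_entry(train_commit, wer, dvc_path):
    """Build a minimal registry record for a completed training run.
    Field names are illustrative placeholders."""
    return {
        "train_commit": train_commit,   # Git commit the model was trained at
        "metrics": {"wer": wer},        # e.g. WER from 09_evaluate_model.py
        "dvc_path": dvc_path,           # DVC-tracked location of the model file
    }

print(json.dumps(
    registry_entry("abc1234", 0.12, "ml/data/models/speech_to_text_model.keras"),
    indent=2,
))
```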
