Profile-first ML systems project optimizing a multi-camera end-to-end driving model for hardware efficiency using PyTorch, CUDA streams, NVTX instrumentation, and Nsight Systems.
This project simulates an autonomous driving-style input pipeline (8 cameras + command token) and demonstrates performance engineering techniques including:
- Batch-size scaling
- DataLoader worker tuning
- CUDA stream prefetching
- Pinned memory transfers
- NVTX instrumentation
- Nsight Systems profiling
- Mixed precision evaluation (AMP performance validation)
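Of these, CUDA stream prefetching is the least obvious, so here is a minimal sketch of the pattern (illustrative code, not the project's actual `src/prefetch.py`; assumes PyTorch). It stages the next batch's host-to-device copy on a side stream while the current batch is being consumed, and degrades to a plain pass-through on CPU-only machines:

```python
import torch

class CUDAPrefetcher:
    """Overlap host-to-device copies with compute via a side CUDA stream.

    Illustrative sketch, not the project's actual prefetch.py.
    """

    def __init__(self, loader, device):
        self.loader = loader
        self.device = device
        self.use_cuda = device.type == "cuda"
        self.stream = torch.cuda.Stream(device) if self.use_cuda else None

    def _preload(self, it):
        try:
            cpu_batch = next(it)
        except StopIteration:
            return None
        with torch.cuda.stream(self.stream):
            # non_blocking=True is only truly async from pinned host memory.
            return [t.to(self.device, non_blocking=True) for t in cpu_batch]

    def __iter__(self):
        it = iter(self.loader)
        if not self.use_cuda:
            # CPU fallback: plain pass-through iteration.
            yield from it
            return
        next_batch = self._preload(it)
        while next_batch is not None:
            # Make the compute stream wait for the async copy to finish.
            torch.cuda.current_stream(self.device).wait_stream(self.stream)
            batch = next_batch
            # Tell the caching allocator these tensors are now used on the
            # compute stream, so their memory is not reused prematurely.
            for t in batch:
                t.record_stream(torch.cuda.current_stream(self.device))
            next_batch = self._preload(it)  # kick off the next copy
            yield batch
```

Copies issued with `non_blocking=True` only overlap compute when the source tensors live in pinned host memory, which is why this technique pairs with `pin_memory=True`.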
Goal: hardware-efficient training with measurable GPU-utilization gains, achieved through systematic benchmarking and profiling.
Inspired by real-world ML systems roles in autonomous driving and large-scale model training infrastructure.
- Asynchronous data movement (CPU → GPU overlap)
- Throughput-driven batch tuning
- Identifying DataLoader multiprocessing overhead
- Profiling with NVIDIA Nsight Systems
- Empirical evaluation of AMP impact
- Clean checkpointing and reproducibility
| Batch Size | Throughput (samples/sec) |
|---|---|
| 8 | 126.9 |
| 16 | 147.7 |
| 32 | 148.0 (best) |
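Numbers like these come from a warmup-then-time loop; a generic sketch (pure Python, with `step_fn` standing in for one forward/backward pass):

```python
import time

def measure_throughput(step_fn, batch_size, steps, warmup=10):
    """Return samples/sec over `steps` timed iterations, skipping warmup.

    step_fn() stands in for one training step; with a real GPU workload,
    call torch.cuda.synchronize() before each timestamp so async kernels
    are not mistaken for finished work.
    """
    for _ in range(warmup):
        step_fn()
    t0 = time.perf_counter()
    for _ in range(steps):
        step_fn()
    elapsed = time.perf_counter() - t0
    return batch_size * steps / elapsed
```

For example, `measure_throughput(train_step, batch_size=32, steps=300)` yields the samples/sec figure for one row of the table above.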
- `num_workers=2` → slower (worker overhead dominates)
- `pin_memory=True` → slight improvement
- CUDA prefetcher → enabled overlap of H2D copies and compute
- AMP → significantly reduced throughput on the GTX 1650
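The AMP finding can be reproduced with a standard autocast + `GradScaler` step, sketched below (generic code, not the project's `train.py`). The GTX 1650 lacks tensor cores, so FP16 autocast adds casting overhead without matmul speedups, consistent with the throughput drop:

```python
import torch

def amp_step(model, batch, target, optimizer, scaler, device_type):
    """One mixed-precision training step with gradient scaling."""
    optimizer.zero_grad(set_to_none=True)
    # Autocast runs eligible ops in half precision (float16 on CUDA);
    # disabled on CPU here so the same code runs anywhere.
    with torch.autocast(device_type=device_type, enabled=(device_type == "cuda")):
        loss = torch.nn.functional.mse_loss(model(batch), target)
    # The scaler guards FP16 gradients against underflow.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

Timing this step against a plain FP32 step (with `measure_throughput`-style loops) is how the AMP impact was evaluated empirically rather than assumed.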
`python src/train.py --batch 32 --steps 300 --num_workers 0`
Final throughput: ~148 samples/sec
Nsight Systems was used together with NVTX markers:

`nsys profile -o profiles/e2e_driveperf --trace=cuda,nvtx python src/train.py --batch 16 --steps 120 --num_workers 0 --pin_memory`
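The NVTX markers that `nsys` picks up can be emitted from Python via `torch.cuda.nvtx`; a small helper (range names here are illustrative, not the project's actual annotations):

```python
import contextlib
import torch

@contextlib.contextmanager
def nvtx_range(name):
    """Emit an NVTX range visible in the Nsight Systems timeline.

    Falls back to a no-op when CUDA is unavailable, so annotated code
    still runs on CPU-only machines.
    """
    if torch.cuda.is_available():
        torch.cuda.nvtx.range_push(name)
        try:
            yield
        finally:
            torch.cuda.nvtx.range_pop()
    else:
        yield

# Typical use around pipeline phases:
#   with nvtx_range("data_load"): batch = next(it)
#   with nvtx_range("forward"):   out = model(*batch)
#   with nvtx_range("backward"):  loss.backward()
```

With `--trace=cuda,nvtx`, these ranges appear as labeled bars in the timeline, making it easy to see which phase owns any GPU idle gap.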
Key observations:
- GPU largely saturated during forward/backward
- Minimal idle gaps after warmup
- Worker multiprocessing introduced overhead on this workload
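The worker-overhead observation can be checked with a small sweep; a sketch using a synthetic stand-in for the 8-camera dataset (class and function names are illustrative, not the project's `dataset.py`):

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

class SyntheticDrive(Dataset):
    """Stand-in sample: 8 small 'camera' tensors plus a command token."""

    def __len__(self):
        return 256

    def __getitem__(self, idx):
        return torch.randn(8, 3, 32, 32), torch.tensor(idx % 4)

def bench_workers(worker_counts, batch_size=32):
    """Time one full pass over the dataset for each num_workers setting."""
    results = {}
    ds = SyntheticDrive()
    for nw in worker_counts:
        loader = DataLoader(ds, batch_size=batch_size, num_workers=nw,
                            pin_memory=torch.cuda.is_available())
        t0 = time.perf_counter()
        for _ in loader:
            pass
        results[nw] = time.perf_counter() - t0
    return results
```

On a lightweight synthetic workload like this, `bench_workers([0, 2])` tends to show the `num_workers=0` pass winning, since per-sample work is too cheap to amortize worker spawn and IPC costs.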
- `src/`: `dataset.py`, `model.py`, `train.py`, `prefetch.py`, `utils.py`
- `profiles/`: `e2e_driveperf.nsys-rep`
- `RESULTS.md`
Create a virtual environment and install PyTorch (CUDA-enabled), numpy, and tqdm. Then run:

`python src/train.py --batch 32 --steps 300 --num_workers 0`
Modern ML systems engineering requires more than model training; it also requires:
- Understanding GPU utilization
- Identifying bottlenecks
- Designing async pipelines
- Measuring performance scientifically
This repository demonstrates those principles in a compact, reproducible form.