Waymo-e2e-profiler

Profile-first ML systems project optimizing a multi-camera end-to-end driving model for hardware efficiency using PyTorch, CUDA streams, NVTX instrumentation, and Nsight Systems.

E2E-DrivePerf

Profile-first ML systems project focused on optimizing the hardware efficiency of a multi-camera end-to-end driving model.

This project simulates an autonomous driving-style input pipeline (8 cameras + command token) and demonstrates performance engineering techniques including:

  • Batch-size scaling
  • DataLoader worker tuning
  • CUDA stream prefetching
  • Pinned memory transfers
  • NVTX instrumentation
  • Nsight Systems profiling
  • Mixed precision evaluation (AMP performance validation)
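As a sketch of how the NVTX instrumentation can be wired in (an illustrative helper, not necessarily the repo's actual src/utils.py code), each phase of a training step is wrapped in a named range so it shows up as a labeled span on the Nsight Systems timeline:

```python
import contextlib

import torch


@contextlib.contextmanager
def nvtx_range(name: str):
    # Push/pop an NVTX range visible in Nsight Systems; no-op on CPU-only runs.
    if torch.cuda.is_available():
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if torch.cuda.is_available():
            torch.cuda.nvtx.range_pop()


def train_step(model, images, target, optimizer, loss_fn):
    # Each phase becomes a named span when profiled with --trace=cuda,nvtx.
    with nvtx_range("forward"):
        loss = loss_fn(model(images), target)
    with nvtx_range("backward"):
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
    with nvtx_range("optimizer"):
        optimizer.step()
    return loss.item()
```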

🎯 Goal

Enable hardware-efficient training with measurable GPU utilization improvements using systematic benchmarking and profiling.

Inspired by real-world ML systems roles in autonomous driving and large-scale model training infrastructure.


🧠 Key Engineering Concepts Demonstrated

  • Asynchronous data movement (CPU → GPU overlap)
  • Throughput-driven batch tuning
  • Identifying DataLoader multiprocessing overhead
  • Profiling with NVIDIA Nsight Systems
  • Empirical evaluation of AMP impact
  • Clean checkpointing and reproducibility
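The asynchronous data movement idea can be sketched as a small prefetcher class (illustrative only, not the repo's src/prefetch.py; it degrades to plain iteration when CUDA is unavailable):

```python
import torch


class CUDAPrefetcher:
    """Overlap host-to-device copies with compute via a side CUDA stream."""

    def __init__(self, loader, device):
        self.loader = iter(loader)
        self.device = device
        # Side stream for copies; None means fall back to synchronous iteration.
        self.stream = torch.cuda.Stream() if device.type == "cuda" else None
        self._preload()

    def _preload(self):
        try:
            batch = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        if self.stream is not None:
            # Issue the H2D copy on the side stream while compute proceeds.
            with torch.cuda.stream(self.stream):
                batch = tuple(t.to(self.device, non_blocking=True) for t in batch)
        self.next_batch = batch

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_batch is None:
            raise StopIteration
        if self.stream is not None:
            # Make the compute stream wait until the prefetched copy has landed.
            torch.cuda.current_stream().wait_stream(self.stream)
        batch = self.next_batch
        self._preload()  # start copying the next batch immediately
        return batch
```

Wrapping a DataLoader in such a prefetcher lets the next batch's transfer run while the current batch computes, which is where the H2D/compute overlap in the results comes from.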

πŸ“Š Results (GTX 1650)

Batch Sweep (num_workers=0)

| Batch Size | Throughput (samples/sec) |
|------------|--------------------------|
| 8          | 126.9                    |
| 16         | 147.7                    |
| 32         | 148.0 (best)             |
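A samples/sec figure like the ones above is typically measured along these lines (a hypothetical helper mirroring what src/train.py presumably does; `step_fn` runs one training step on a fixed batch):

```python
import time

import torch


def measure_throughput(step_fn, batch_size, steps, warmup=10):
    # Warm up so one-time costs (allocator, cuDNN autotune) don't skew timing.
    for _ in range(warmup):
        step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the last step's kernels to finish
    elapsed = time.perf_counter() - start
    return batch_size * steps / elapsed
```

The synchronize calls matter: CUDA kernels launch asynchronously, so timing without them measures launch overhead rather than actual work.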

Input Pipeline Experiments (batch=16)

  • num_workers=2 → slower (worker multiprocessing overhead dominates)
  • pin_memory=True → slight improvement
  • CUDA prefetcher → enabled overlap of H2D transfers and compute
  • AMP → significantly reduced throughput on GTX 1650 (the card lacks Tensor Cores, so FP16 brings cast overhead without the usual speedup)
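A minimal sketch of the kind of AMP step that was evaluated (an assumed shape, not the repo's exact code; the GradScaler is constructed disabled when CUDA is absent, making it a pass-through):

```python
import torch


def amp_train_step(model, images, target, optimizer, loss_fn, scaler):
    # Autocast runs eligible ops in reduced precision; the scaler guards
    # FP16 gradients against underflow (and is a no-op when disabled).
    device_type = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device_type == "cuda" else torch.bfloat16
    with torch.autocast(device_type=device_type, dtype=dtype):
        loss = loss_fn(model(images), target)
    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

Benchmarking a step like this against the FP32 baseline is how the AMP slowdown above was established; AMP should be validated empirically per GPU rather than assumed to help.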

Best Config

python src/train.py --batch 32 --steps 300 --num_workers 0

Final throughput: ~148 samples/sec


πŸ”¬ Profiling

Nsight Systems was used with NVTX markers:

nsys profile -o profiles/e2e_driveperf --trace=cuda,nvtx python src/train.py --batch 16 --steps 120 --num_workers 0 --pin_memory

Key observations:

  • GPU largely saturated during forward/backward
  • Minimal idle gaps after warmup
  • Worker multiprocessing introduced overhead on this workload

🧱 Project Structure

src/
  dataset.py
  model.py
  train.py
  prefetch.py
  utils.py
profiles/
  e2e_driveperf.nsys-rep
RESULTS.md


πŸš€ How To Run

Create a virtual environment and install PyTorch (CUDA-enabled), numpy, and tqdm.

Then run:

python src/train.py --batch 32 --steps 300 --num_workers 0
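To see roughly what the --num_workers and --pin_memory flags map to underneath, here is a standalone sketch with a random stand-in for the 8-camera + command-token dataset (the shapes and names are illustrative, not the repo's):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Random stand-in: 64 samples of 8 cameras x 3 channels, plus a command token.
images = torch.randn(64, 8, 3, 32, 32)
commands = torch.randint(0, 4, (64,))

loader = DataLoader(
    TensorDataset(images, commands),
    batch_size=32,
    num_workers=0,                       # best setting on this workload
    pin_memory=(device.type == "cuda"),  # page-locked host buffers
)

for imgs, cmds in loader:
    # non_blocking copies can overlap with compute only when memory is pinned
    imgs = imgs.to(device, non_blocking=True)
    cmds = cmds.to(device, non_blocking=True)
```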


πŸ“Œ Why This Project Matters

Modern ML systems engineering requires more than model training; it requires:

  • Understanding GPU utilization
  • Identifying bottlenecks
  • Designing async pipelines
  • Measuring performance scientifically

This repository demonstrates those principles in a compact, reproducible form.
