Waymo-e2e-profiler

Profile-first ML systems project optimizing a multi-camera end-to-end driving model for hardware efficiency using PyTorch, CUDA streams, NVTX instrumentation, and Nsight Systems.

E2E-DrivePerf

Profile-first ML systems project focused on optimizing the hardware efficiency of a multi-camera end-to-end driving model.

This project simulates an autonomous driving-style input pipeline (8 cameras + command token) and demonstrates performance engineering techniques including:

  • Batch-size scaling
  • DataLoader worker tuning
  • CUDA stream prefetching
  • Pinned memory transfers
  • NVTX instrumentation
  • Nsight Systems profiling
  • Mixed precision evaluation (AMP performance validation)
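As a sketch of how the NVTX instrumentation can be wired in (an illustrative helper, not necessarily the repo's actual src/utils.py code), each phase of a training step is wrapped in a named range so it shows up as a labeled span on the Nsight Systems timeline:

```python
import contextlib

import torch


@contextlib.contextmanager
def nvtx_range(name: str):
    # Push/pop an NVTX range visible in Nsight Systems; no-op on CPU-only runs.
    if torch.cuda.is_available():
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if torch.cuda.is_available():
            torch.cuda.nvtx.range_pop()


def train_step(model, images, target, optimizer, loss_fn):
    # Each phase becomes a named span when profiled with --trace=cuda,nvtx.
    with nvtx_range("forward"):
        loss = loss_fn(model(images), target)
    with nvtx_range("backward"):
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
    with nvtx_range("optimizer"):
        optimizer.step()
    return loss.item()
```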

🎯 Goal

Enable hardware-efficient training with measurable GPU utilization improvements using systematic benchmarking and profiling.

Inspired by real-world ML systems roles in autonomous driving and large-scale model training infrastructure.


🧠 Key Engineering Concepts Demonstrated

  • Asynchronous data movement (CPU → GPU overlap)
  • Throughput-driven batch tuning
  • Identifying DataLoader multiprocessing overhead
  • Profiling with NVIDIA Nsight Systems
  • Empirical evaluation of AMP impact
  • Clean checkpointing and reproducibility
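The asynchronous data movement idea can be sketched as a small prefetcher class (illustrative only, not the repo's src/prefetch.py; it degrades to plain iteration when CUDA is unavailable):

```python
import torch


class CUDAPrefetcher:
    """Overlap host-to-device copies with compute via a side CUDA stream."""

    def __init__(self, loader, device):
        self.loader = iter(loader)
        self.device = device
        # Side stream for copies; None means fall back to synchronous iteration.
        self.stream = torch.cuda.Stream() if device.type == "cuda" else None
        self._preload()

    def _preload(self):
        try:
            batch = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        if self.stream is not None:
            # Issue the H2D copy on the side stream while compute proceeds.
            with torch.cuda.stream(self.stream):
                batch = tuple(t.to(self.device, non_blocking=True) for t in batch)
        self.next_batch = batch

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_batch is None:
            raise StopIteration
        if self.stream is not None:
            # Make the compute stream wait until the prefetched copy has landed.
            torch.cuda.current_stream().wait_stream(self.stream)
        batch = self.next_batch
        self._preload()  # start copying the next batch immediately
        return batch
```

Wrapping a DataLoader in such a prefetcher lets the next batch's transfer run while the current batch computes, which is where the H2D/compute overlap in the results comes from.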

πŸ“Š Results (GTX 1650)

Batch Sweep (num_workers=0)

| Batch Size | Throughput (samples/sec) |
|------------|--------------------------|
| 8          | 126.9                    |
| 16         | 147.7                    |
| 32         | 148.0 (best)             |
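A samples/sec figure like the ones above is typically measured along these lines (a hypothetical helper mirroring what src/train.py presumably does; `step_fn` runs one training step on a fixed batch):

```python
import time

import torch


def measure_throughput(step_fn, batch_size, steps, warmup=10):
    # Warm up so one-time costs (allocator, cuDNN autotune) don't skew timing.
    for _ in range(warmup):
        step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the last step's kernels to finish
    elapsed = time.perf_counter() - start
    return batch_size * steps / elapsed
```

The synchronize calls matter: CUDA kernels launch asynchronously, so timing without them measures launch overhead rather than actual work.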

Input Pipeline Experiments (batch=16)

  • num_workers=2 → slower (worker multiprocessing overhead dominates)
  • pin_memory=True → slight improvement
  • CUDA prefetcher → enabled overlap of H2D transfers and compute
  • AMP → significantly reduced throughput on GTX 1650 (the card lacks Tensor Cores, so FP16 brings cast overhead without the usual speedup)
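A minimal sketch of the kind of AMP step that was evaluated (an assumed shape, not the repo's exact code; the GradScaler is constructed disabled when CUDA is absent, making it a pass-through):

```python
import torch


def amp_train_step(model, images, target, optimizer, loss_fn, scaler):
    # Autocast runs eligible ops in reduced precision; the scaler guards
    # FP16 gradients against underflow (and is a no-op when disabled).
    device_type = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device_type == "cuda" else torch.bfloat16
    with torch.autocast(device_type=device_type, dtype=dtype):
        loss = loss_fn(model(images), target)
    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

Benchmarking a step like this against the FP32 baseline is how the AMP slowdown above was established; AMP should be validated empirically per GPU rather than assumed to help.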

Best Config

python src/train.py --batch 32 --steps 300 --num_workers 0

Final throughput: ~148 samples/sec


πŸ”¬ Profiling

Nsight Systems was used with NVTX markers:

nsys profile -o profiles/e2e_driveperf --trace=cuda,nvtx python src/train.py --batch 16 --steps 120 --num_workers 0 --pin_memory

Key observations:

  • GPU largely saturated during forward/backward
  • Minimal idle gaps after warmup
  • Worker multiprocessing introduced overhead on this workload

🧱 Project Structure

src/
  dataset.py
  model.py
  train.py
  prefetch.py
  utils.py
profiles/
  e2e_driveperf.nsys-rep
RESULTS.md


πŸš€ How To Run

Create a virtual environment and install PyTorch (CUDA-enabled), numpy, and tqdm.

Then run:

python src/train.py --batch 32 --steps 300 --num_workers 0
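To see roughly what the --num_workers and --pin_memory flags map to underneath, here is a standalone sketch with a random stand-in for the 8-camera + command-token dataset (the shapes and names are illustrative, not the repo's):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Random stand-in: 64 samples of 8 cameras x 3 channels, plus a command token.
images = torch.randn(64, 8, 3, 32, 32)
commands = torch.randint(0, 4, (64,))

loader = DataLoader(
    TensorDataset(images, commands),
    batch_size=32,
    num_workers=0,                       # best setting on this workload
    pin_memory=(device.type == "cuda"),  # page-locked host buffers
)

for imgs, cmds in loader:
    # non_blocking copies can overlap with compute only when memory is pinned
    imgs = imgs.to(device, non_blocking=True)
    cmds = cmds.to(device, non_blocking=True)
```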


πŸ“Œ Why This Project Matters

Modern ML systems engineering requires more than model training; it requires:

  • Understanding GPU utilization
  • Identifying bottlenecks
  • Designing async pipelines
  • Measuring performance scientifically

This repository demonstrates those principles in a compact, reproducible form.
