A high-performance PPTX parser written in Rust. Extracts structured, semantically typed text, metadata, and images from PowerPoint files using parallel slide processing.
- Semantic element types: classifies text as Title, Subtitle, Paragraph, or ListItem with indent depth
- Heading detection: identifies title/subtitle placeholder shapes from slide XML
- Bullet/list detection: detects `buChar`, `buAutoNum`, and `buBlip` markers with nesting depth
- Markdown output: renders presentations as clean Markdown with YAML front matter
- Parallel processing: slides are parsed concurrently using Rayon
- Image extraction: extracts images from `ppt/media/` and maps them to the correct slide via relationship files
- Metadata parsing: title, author, dates, and application info from `docProps/`
- Fault tolerant: corrupted slides are skipped and a partial document is returned
- Three output formats: plain text, JSON, or Markdown
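To make the classification features above concrete, here is a minimal, standard-library-only sketch of how a paragraph might be mapped to a semantic type. The `ElementType` enum and the `classify` function are hypothetical stand-ins written for this example, not the crate's actual definitions. The placeholder names `title`, `ctrTitle`, and `subTitle` come from PresentationML; treating `ctrTitle` as a title is an assumption about how the parser behaves.

```rust
/// Hypothetical mirror of the semantic types listed above
/// (not the crate's actual definition).
#[derive(Debug, PartialEq)]
enum ElementType {
    Title,
    Subtitle,
    Paragraph,
    ListItem,
}

/// Classify one paragraph from what the slide XML reveals about it.
///
/// - `placeholder`: the `type` attribute of the enclosing `<p:ph>`, if any
/// - `has_bullet`: true when a `buChar`, `buAutoNum`, or `buBlip` marker was seen
/// - `level`: the paragraph's `lvl` attribute (absent means top level)
///
/// Returns the element type plus a 0-based depth for list items only.
fn classify(
    placeholder: Option<&str>,
    has_bullet: bool,
    level: Option<u32>,
) -> (ElementType, Option<u32>) {
    match placeholder {
        // Assumption: centered titles count as titles.
        Some("title") | Some("ctrTitle") => (ElementType::Title, None),
        Some("subTitle") => (ElementType::Subtitle, None),
        _ if has_bullet => (ElementType::ListItem, Some(level.unwrap_or(0))),
        _ => (ElementType::Paragraph, None),
    }
}

fn main() {
    println!("{:?}", classify(Some("title"), false, None));
    println!("{:?}", classify(Some("subTitle"), false, None));
    println!("{:?}", classify(None, true, Some(1)));
    println!("{:?}", classify(Some("body"), false, None));
}
```

Checking the placeholder before the bullet marker means a title that happens to carry a bullet is still reported as a title; whether the real parser makes the same choice is not stated in this README.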
Install the CLI from a local checkout:

    cargo install --path .

Or add as a dependency:

    [dependencies]
    rustypptx = { path = "." }

Command-line usage:

    # Plain text output (with element type annotations)
    rustypptx presentation.pptx
    # JSON output (includes element_type and depth fields)
    rustypptx presentation.pptx --json
    # Markdown output
    rustypptx presentation.pptx --markdown
    # Extract images to a directory
    rustypptx presentation.pptx --output-dir ./images
    # Combine flags
    rustypptx presentation.pptx --markdown --output-dir ./images

Example Markdown output:

    ---
    title: "Q4 Review"
    author: "Jane Doe"
    created: "2024-12-01T10:00:00Z"
    ---
    ## Slide 1: Quarterly Results
    ### Financial Overview
    Revenue grew 15% year-over-year.
    - North America: $12M
      - US: $9M
      - Canada: $3M
    - Europe: $8M

    ---
    ## Slide 2: Next Steps
    Action items for Q1.
    - Expand into APAC
    - Hire 20 engineers

Library usage:

    use std::path::Path;
    use rustypptx::ElementType;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let doc = rustypptx::parse_pptx(Path::new("presentation.pptx"))?;

        // Markdown output
        println!("{}", doc.to_markdown());

        // Or work with structured data
        for slide in &doc.slides {
            for elem in &slide.text_elements {
                match elem.element_type {
                    ElementType::Title => println!("# {}", elem.text),
                    ElementType::Subtitle => println!("## {}", elem.text),
                    ElementType::ListItem => {
                        // Two spaces per nesting level
                        let indent = "  ".repeat(elem.depth.unwrap_or(0) as usize);
                        println!("{}- {}", indent, elem.text);
                    }
                    ElementType::Paragraph => println!("{}", elem.text),
                }
            }
        }
        Ok(())
    }

You can also parse from bytes directly:
    let bytes = std::fs::read("presentation.pptx")?;
    let doc = rustypptx::parse_pptx_bytes(&bytes)?;

The JSON output includes semantic type information for each text element:

    {
      "element_type": "list_item",
      "text": "Sub-bullet item",
      "depth": 1
    }

Element types: `title`, `subtitle`, `paragraph`, `list_item`. The `depth` field is only present for `list_item` elements (0-based indent level).
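As an illustration of that JSON shape, the sketch below builds the same object by hand and emits `depth` only for list items. The `ElementType` and `TextElement` definitions here are hypothetical local stand-ins whose field names are taken from the JSON above; the crate's real types and serialization code may differ. The JSON is written manually only to keep the example free of dependencies; a serialization crate would normally do this work.

```rust
/// Hypothetical stand-ins for the crate's types, with fields named
/// after the JSON keys shown above.
#[derive(Clone, Copy)]
enum ElementType {
    Title,
    Subtitle,
    Paragraph,
    ListItem,
}

struct TextElement {
    element_type: ElementType,
    text: String,
    depth: Option<u32>,
}

/// Snake-case name used in the JSON output.
fn type_name(t: ElementType) -> &'static str {
    match t {
        ElementType::Title => "title",
        ElementType::Subtitle => "subtitle",
        ElementType::Paragraph => "paragraph",
        ElementType::ListItem => "list_item",
    }
}

/// Escape the characters that must be escaped inside a JSON string.
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Serialize one element; `depth` appears only for list items.
fn to_json(e: &TextElement) -> String {
    let mut json = format!(
        "{{\"element_type\":\"{}\",\"text\":\"{}\"",
        type_name(e.element_type),
        escape(&e.text)
    );
    if let (ElementType::ListItem, Some(d)) = (e.element_type, e.depth) {
        json.push_str(&format!(",\"depth\":{}", d));
    }
    json.push('}');
    json
}

fn main() {
    let elements = [
        TextElement { element_type: ElementType::Title, text: "Q4 Review".into(), depth: None },
        TextElement { element_type: ElementType::Subtitle, text: "Overview".into(), depth: None },
        TextElement { element_type: ElementType::Paragraph, text: "Revenue grew.".into(), depth: None },
        TextElement { element_type: ElementType::ListItem, text: "Sub-bullet item".into(), depth: Some(1) },
    ];
    for e in &elements {
        println!("{}", to_json(e));
    }
}
```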
Project layout:

    src/
    ├── lib.rs            # Public API
    ├── main.rs           # CLI (clap)
    ├── error.rs          # Error types
    ├── model.rs          # PptxDocument, Slide, TextElement, ElementType, ImageRef
    ├── parser.rs         # Orchestrator: unzip, parallel dispatch
    ├── metadata.rs       # docProps/core.xml & app.xml parsing
    ├── relationships.rs  # .rels file parsing (rId → media path)
    ├── slides.rs         # Slide XML parsing: text, placeholders, bullets, images
    ├── markdown.rs       # Markdown rendering
    └── images.rs         # Media file extraction from archive
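The `rId → media path` mapping noted for `relationships.rs` comes down to resolving each relationship's `Target`, which is relative to the directory of the slide part, into a full archive path. The following is a minimal sketch of that resolution using only the standard library. The function name `resolve_target` and the sample relationship data are hypothetical and are not taken from the crate.

```rust
use std::collections::HashMap;

/// Resolve a relationship `Target` against the directory of the part
/// that owns the `.rels` file (for slides, `ppt/slides`).
fn resolve_target(base_dir: &str, target: &str) -> String {
    // A target with a leading '/' is rooted at the archive, so the base is ignored.
    let mut parts: Vec<&str> = if target.starts_with('/') {
        Vec::new()
    } else {
        base_dir.split('/').filter(|s| !s.is_empty()).collect()
    };
    for segment in target.split('/') {
        match segment {
            "" | "." => {}
            ".." => {
                parts.pop();
            }
            other => parts.push(other),
        }
    }
    parts.join("/")
}

fn main() {
    // Hypothetical (Id, Target) pairs as they might appear in a slide's .rels file.
    let rels = [("rId2", "../media/image1.png"), ("rId3", "/ppt/media/image2.png")];

    let by_id: HashMap<&str, String> = rels
        .iter()
        .map(|(id, target)| (*id, resolve_target("ppt/slides", target)))
        .collect();

    println!("{:?}", by_id.get("rId2"));
    println!("{:?}", by_id.get("rId3"));
}
```

With a map like this, an `r:embed` value found in the slide XML can be looked up directly and matched against the extracted media bytes by archive path.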
Parsing works as follows:

- The entire PPTX (ZIP archive) is read into memory
- Metadata is parsed from `docProps/core.xml` and `docProps/app.xml`
- All media files under `ppt/media/` are extracted
- Slide XML files and their `.rels` files are loaded
- Slides are parsed in parallel via Rayon; each worker runs its own `quick-xml` reader
- Placeholder shapes (`<p:ph type="title"/>`, etc.) are detected to classify text semantically
- Bullet markers (`buChar`, `buAutoNum`, `buBlip`) and indent levels are tracked per paragraph
- Image relationship IDs (`r:embed="rId2"`) are resolved to archive paths and matched with extracted media bytes
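The parallel, fault-tolerant part of this pipeline can be sketched as below. To stay dependency-free, the example swaps Rayon for `std::thread::scope` and spawns one thread per slide, which is not how a work-stealing pool behaves. `Slide`, `parse_slide`, and `parse_all` are hypothetical names, and "empty input means corrupted" is an assumption standing in for real XML parsing errors.

```rust
use std::thread;

#[derive(Debug, PartialEq)]
struct Slide {
    index: usize,
    text: String,
}

/// Hypothetical stand-in for per-slide XML parsing:
/// empty input is treated as a corrupted slide.
fn parse_slide(index: usize, xml: &str) -> Result<Slide, String> {
    let trimmed = xml.trim();
    if trimmed.is_empty() {
        return Err(format!("slide {} is empty", index));
    }
    Ok(Slide { index, text: trimmed.to_string() })
}

/// Parse every slide concurrently and keep only the ones that succeeded,
/// preserving slide order.
fn parse_all(raw: &[&str]) -> Vec<Slide> {
    let results: Vec<Result<Slide, String>> = thread::scope(|s| {
        // One scoped thread per slide (Rayon would use a thread pool instead).
        let handles: Vec<_> = raw
            .iter()
            .enumerate()
            .map(|(i, xml)| s.spawn(move || parse_slide(i + 1, xml)))
            .collect();

        // Joining in spawn order keeps results aligned with slide numbers.
        handles
            .into_iter()
            .map(|h| h.join().unwrap_or_else(|_| Err("worker panicked".to_string())))
            .collect()
    });

    // Corrupted slides are dropped; the rest form a partial document.
    results.into_iter().filter_map(Result::ok).collect()
}

fn main() {
    let slides = parse_all(&["<a>one</a>", "", "<a>three</a>"]);
    for slide in &slides {
        println!("slide {}: {}", slide.index, slide.text);
    }
}
```

Collecting a `Result` per slide and filtering afterwards keeps one bad slide from aborting the whole document, which matches the fault-tolerance behavior described in the feature list.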
Run the tests:

    cargo test

License: MIT