# epstein-files

Scraper and downloader for the DOJ Epstein Files disclosures (Data Sets 1-12).

Uses Selenium (headless Chromium) to maintain a live browser session, avoiding token-expiration issues. A SQLite database tracks scrape/download state so interrupted operations resume cleanly.
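The resume behavior can be sketched as a small state table: each discovered URL gets a row whose status advances from `discovered` to `downloaded` (or `failed`), so a restart only touches pending rows. This is a minimal sketch with a hypothetical schema; the actual tables in `db.py` may differ.

```python
import sqlite3


def open_state_db(path=":memory:"):
    """Create (or open) the scrape/download state database.

    Hypothetical schema -- the real db.py may use different columns.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS files (
               url      TEXT PRIMARY KEY,
               data_set INTEGER NOT NULL,
               status   TEXT NOT NULL DEFAULT 'discovered'
           )"""
    )
    return conn


def record_url(conn, url, data_set):
    # INSERT OR IGNORE keeps re-scraping idempotent across restarts.
    conn.execute(
        "INSERT OR IGNORE INTO files (url, data_set) VALUES (?, ?)",
        (url, data_set),
    )


def pending(conn, data_set=None):
    """URLs still awaiting download, optionally filtered to one data set."""
    query = "SELECT url FROM files WHERE status = 'discovered'"
    args = ()
    if data_set is not None:
        query += " AND data_set = ?"
        args = (data_set,)
    return [row[0] for row in conn.execute(query, args)]


def mark(conn, url, status):
    conn.execute("UPDATE files SET status = ? WHERE url = ?", (status, url))
```

A crashed download run then resumes by iterating `pending(conn)` instead of re-downloading everything.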
## Roadmap

- Implement scraping of PDF URLs from the DOJ website
- Analyze PDFs with Llava-CoT for downstream processing
- OCR PDFs with Nanonets v2 for structured data extraction
- Infer candidates (names, locations, dates) for redacted inline text using LLaMA-3.1-70B-Instruct
- Generate a knowledge graph of entities and relationships using LLaMA-3.1-70B-Instruct
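The planned knowledge-graph output can be modeled as (subject, predicate, object) triples over typed entities. A minimal sketch — the class and field names here are hypothetical, not the actual `models.py` dataclasses:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str  # e.g. "person", "location", "date"


@dataclass(frozen=True)
class Relation:
    subject: Entity
    predicate: str  # e.g. "traveled_to", "met_with"
    obj: Entity     # "obj" rather than "object" to avoid shadowing the builtin


def to_triples(relations):
    """Flatten Relation objects into (subject, predicate, object) name triples."""
    return [(r.subject.name, r.predicate, r.obj.name) for r in relations]
```

Frozen dataclasses make entities hashable, so the same entity extracted from multiple documents deduplicates naturally in a set.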
## Installation

```bash
pip install .
```

## Usage

```bash
# Discover PDF URLs from the DOJ website
epstein-files scrape
epstein-files scrape -d 1        # scrape only Data Set 1

# Download discovered PDFs
epstein-files download
epstein-files download -d 1      # download only Data Set 1
epstein-files download --retry-failed

# View status summary
epstein-files status
epstein-files status -d 1
```

## Docker

```bash
docker build -t epstein-files .
```
```bash
# Scrape
docker run -v $(pwd)/data:/app/data epstein-files scrape

# Download
docker run -v $(pwd)/data:/app/data epstein-files download

# Status
docker run -v $(pwd)/data:/app/data epstein-files status
```

## Project structure

```
src/epstein_files/
├── cli.py         # Click CLI entry point
├── config.py      # Constants (URLs, delays, etc.)
├── db.py          # SQLite database layer
├── models.py      # Dataclasses
├── browser.py     # Selenium browser management
├── scraper.py     # Page scraping (Selenium + BS4)
└── downloader.py  # PDF download logic (Selenium CDP)
```

Downloaded files are stored in `data/` (gitignored), organized as `data/DataSet_N/`.