A production-ready data pipeline for processing Polygon.io S3 flat files into optimized formats for quantitative analysis and machine learning.
- Command-Line Interface: Complete CLI for all operations (`quantmini` command)
- Adaptive Processing: Automatically scales from 24GB workstations to 100GB+ servers
- 70%+ Compression: Optimized Parquet and binary formats
- Sub-Second Queries: Partitioned data lake with predicate pushdown
- Incremental Updates: Process only new data using watermarks (see the sketch after this list)
- Apple Silicon Optimized: 2-3x faster on M1/M2/M3 chips
- Production Ready: Monitoring, alerting, validation, and error recovery
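
The watermark mechanism can be pictured as a small per-data-type state file. The following is a minimal sketch, assuming a JSON layout under `data/metadata/`; the pipeline's actual format may differ:

```python
# Hedged sketch: skip already-processed dates using a per-data-type
# watermark. The JSON layout under data/metadata/ is an assumption.
import json
from datetime import date
from pathlib import Path

WATERMARKS = Path("data/metadata/watermarks.json")

def last_processed(data_type: str) -> date | None:
    """Return the last date ingested for this data type, if any."""
    if not WATERMARKS.exists():
        return None
    stamps = json.loads(WATERMARKS.read_text())
    value = stamps.get(data_type)
    return date.fromisoformat(value) if value else None

def advance(data_type: str, new_date: date) -> None:
    """Record that everything up to new_date has been processed."""
    stamps = json.loads(WATERMARKS.read_text()) if WATERMARKS.exists() else {}
    stamps[data_type] = new_date.isoformat()
    WATERMARKS.parent.mkdir(parents=True, exist_ok=True)
    WATERMARKS.write_text(json.dumps(stamps, indent=2))
```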
| Mode | Memory | Baseline Throughput | With Optimizations |
|---|---|---|---|
| Streaming | < 32GB | 100K rec/s | 500K rec/s |
| Batch | 32-64GB | 200K rec/s | 1M rec/s |
| Parallel | > 64GB | 500K rec/s | 2M rec/s |
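
The memory bands in the table map naturally onto a mode selector. Below is a minimal sketch using `psutil`; the real selection logic (see `src/core/` in the project structure) may weigh additional signals such as CPU count and disk speed:

```python
# Hedged sketch: pick a processing mode from total system memory,
# using the thresholds from the table above.
import psutil

def select_mode() -> str:
    total_gb = psutil.virtual_memory().total / 1024**3
    if total_gb < 32:
        return "streaming"   # bounded memory, ~100K rec/s baseline
    elif total_gb <= 64:
        return "batch"       # mid-range machines, ~200K rec/s baseline
    else:
        return "parallel"    # large servers, ~500K rec/s baseline
```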
- macOS (Apple Silicon or Intel) or Linux
- Python 3.10+
- 24GB+ RAM (recommended: 32GB+)
- 1TB+ storage (SSD recommended)
- Polygon.io account with S3 flat files access
- Install the uv package manager:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

- Clone and set up the project:

```bash
git clone <repository-url>
cd quantmini

# Create project structure
./create_structure.sh

# Create and activate virtual environment
uv venv
source .venv/bin/activate  # On macOS/Linux
```

- Install dependencies:

```bash
uv pip install qlib polygon boto3 aioboto3 polars duckdb pyarrow psutil pyyaml
```

- Configure credentials:

```bash
cp config/credentials.yaml.example config/credentials.yaml
# Edit config/credentials.yaml with your Polygon API keys
```

- Run the system profiler:

```bash
python -m src.core.system_profiler
# This will create config/system_profile.yaml
```
```bash
# Initialize configuration
quantmini config init

# Edit credentials (add your Polygon.io API keys)
nano config/credentials.yaml

# Run daily pipeline
quantmini pipeline daily --data-type stocks_daily

# Or backfill historical data
quantmini pipeline run --data-type stocks_daily --start-date 2024-01-01 --end-date 2024-12-31

# Query data
quantmini data query --data-type stocks_daily \
    --symbols AAPL MSFT \
    --fields date close volume \
    --start-date 2024-01-01 --end-date 2024-01-31
```

See CLI.md for complete CLI documentation.
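
For programmatic access, the same partitioned data lake can be queried directly with DuckDB, which pushes the symbol and date filters down into the Parquet scan. A minimal sketch; the glob and partition layout under `data/bronze/` are assumptions:

```python
# Hedged sketch: query the bronze Parquet lake directly with DuckDB.
# The directory layout and column names are illustrative; adjust the
# glob to match your actual partitioning scheme.
import duckdb

df = duckdb.sql(
    """
    SELECT symbol, date, close, volume
    FROM read_parquet('data/bronze/stocks_daily/**/*.parquet', hive_partitioning = true)
    WHERE symbol IN ('AAPL', 'MSFT')
      AND date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
    ORDER BY symbol, date
    """
).pl()  # materialize as a Polars DataFrame
print(df)
```

Predicate pushdown lets DuckDB skip Parquet row groups whose statistics rule out the filter, which is what keeps filtered queries fast on a large lake.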
```
quantmini/
├── config/                  # Configuration files
├── src/                     # Source code
│   ├── core/                # System profiling, memory monitoring
│   ├── download/            # S3 downloaders
│   ├── ingest/              # Data ingestion (landing → bronze)
│   ├── storage/             # Parquet storage management
│   ├── features/            # Feature engineering (bronze → silver)
│   ├── transform/           # Binary conversion (silver → gold)
│   ├── query/               # Query engine
│   └── orchestration/       # Pipeline orchestration
├── data/                    # Data storage (not in git)
│   ├── landing/             # Landing layer: raw source data
│   │   └── polygon-s3/      # CSV.GZ files from S3
│   ├── bronze/              # Bronze layer: validated Parquet
│   ├── silver/              # Silver layer: feature-enriched Parquet
│   ├── gold/                # Gold layer: ML-ready formats
│   │   └── qlib/            # Qlib binary format
│   └── metadata/            # Watermarks, indexes
├── scripts/                 # Command-line scripts
├── tests/                   # Test suite
└── docs/                    # Documentation
```
Edit `config/pipeline_config.yaml` to customize:
- Processing mode: `adaptive`, `streaming`, `batch`, or `parallel`
- Data types: Enable or disable stocks, options, daily, and minute data
- Compression: Choose `snappy` (fast) or `zstd` (better compression)
- Features: Configure which features to compute
- Optimizations: Enable Apple Silicon tuning, async downloads, etc.

See the Installation Guide for configuration details.
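
To illustrate how these options might be consumed, here is a minimal loader sketch using PyYAML. The key names mirror the options above (and the `optimizations.async_downloads.max_concurrent` key from the troubleshooting section), but the actual schema may differ:

```python
# Hedged sketch: load pipeline settings with PyYAML and fall back to
# defaults. Key names are illustrative; check the real schema in
# config/pipeline_config.yaml.
import yaml

with open("config/pipeline_config.yaml") as f:
    cfg = yaml.safe_load(f) or {}

mode = cfg.get("processing", {}).get("mode", "adaptive")         # adaptive|streaming|batch|parallel
compression = cfg.get("storage", {}).get("compression", "zstd")  # snappy (fast) or zstd (smaller)
max_concurrent = (
    cfg.get("optimizations", {})
       .get("async_downloads", {})
       .get("max_concurrent", 8)
)
print(f"mode={mode} compression={compression} max_concurrent={max_concurrent}")
```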
- Architecture Overview: System architecture and design
- Data Pipeline: Pipeline architecture details
- Changelog: Version history and updates
- Contributing Guide: Development guidelines
- Full documentation: https://quantmini.readthedocs.io/
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run specific test suite
pytest tests/unit/
pytest tests/integration/
pytest tests/performance/
```

Access monitoring dashboards:

```bash
# View health status
python scripts/check_health.py

# View performance metrics
cat logs/performance/performance_metrics.json

# Generate report
python scripts/generate_report.py
```

The pipeline processes four types of data from Polygon.io:
- Stock Daily Aggregates: Daily OHLCV for all US stocks
- Stock Minute Aggregates: Minute-level data per symbol
- Options Daily Aggregates: Daily options data per underlying
- Options Minute Aggregates: Minute-level options data (all contracts)
```
Landing Layer         Bronze Layer          Silver Layer          Gold Layer
(Raw Sources)         (Validated)           (Enriched)            (ML-Ready)
      ↓                    ↓                     ↓                     ↓
S3 CSV.GZ Files  →  Validated Parquet  →  Feature-Enriched  →   Qlib Binary
   (Polygon)         (Schema Check)        (Indicators)         (Backtesting)
```
Adaptive Ingestion: Streaming/Batch/Parallel based on available memory
Feature Engineering: DuckDB/Polars for calculated indicators
Binary Conversion: Optimized for ML training and backtesting
- Landing: Async S3 downloads to `landing/polygon-s3/`
- Bronze: Ingest and validate to `bronze/` - schema enforcement, type checking
- Silver: Enrich with features to `silver/` - calculated indicators, returns, alpha (see the sketch below)
- Gold: Convert to ML formats in `gold/qlib/` - optimized for backtesting
- Query: Fast access via DuckDB/Polars from any layer

Data Quality Progression: Landing (raw) → Bronze (validated) → Silver (enriched) → Gold (ML-ready)
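
To make the Silver step concrete, here is a minimal feature-enrichment sketch with Polars; the paths and column names are illustrative, not the pipeline's actual schema:

```python
# Hedged sketch: enrich bronze OHLCV Parquet with simple features using
# Polars lazy scans. Paths and column names are assumptions.
import polars as pl

bronze = pl.scan_parquet("data/bronze/stocks_daily/**/*.parquet")

silver = bronze.sort("symbol", "date").with_columns(
    pl.col("close").pct_change().over("symbol").alias("return_1d"),
    pl.col("close").rolling_mean(window_size=20).over("symbol").alias("sma_20"),
    (pl.col("volume") * pl.col("close")).alias("dollar_volume"),
)

# Materialize and write the enriched layer
silver.collect().write_parquet("data/silver/stocks_daily.parquet")
```

Lazy scans keep memory bounded: Polars only reads the columns the expressions touch and streams the rest.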
- Never commit `config/credentials.yaml` (it is in `.gitignore`)
- Store credentials in environment variables for production (see the sketch after this list)
- Use AWS Secrets Manager or similar for cloud deployments
- Rotate API keys regularly
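
For example, a loader can prefer environment variables and fall back to the YAML file. In this minimal sketch, the `POLYGON_API_KEY` variable and the YAML key names are assumptions:

```python
# Hedged sketch: resolve the Polygon API key from the environment first,
# falling back to config/credentials.yaml. The variable name and YAML
# key names are illustrative, not the pipeline's actual names.
import os
import yaml

def load_polygon_api_key() -> str:
    key = os.environ.get("POLYGON_API_KEY")
    if key:
        return key
    with open("config/credentials.yaml") as f:
        creds = yaml.safe_load(f)
    return creds["polygon"]["api_key"]
```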
```bash
# Reduce memory usage
export MAX_MEMORY_GB=16

# Force streaming mode
export PIPELINE_MODE=streaming
```
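
A sketch of how such overrides might be honored at startup; the variable names come from the commands above, but the override logic itself is an assumption:

```python
# Hedged sketch: let MAX_MEMORY_GB and PIPELINE_MODE override the
# auto-detected settings. The override behavior is illustrative.
import os
import psutil

def effective_limits() -> tuple[float, str]:
    detected_gb = psutil.virtual_memory().total / 1024**3
    max_gb = float(os.environ.get("MAX_MEMORY_GB", detected_gb))
    mode = os.environ.get("PIPELINE_MODE", "adaptive")
    return min(max_gb, detected_gb), mode
```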
```bash
# Reduce concurrent downloads
# Edit config/pipeline_config.yaml:
#   optimizations.async_downloads.max_concurrent: 4
```

```bash
# Enable profiling
# Edit config/pipeline_config.yaml:
#   monitoring.profiling.enabled: true
# Run and check logs/performance/
```

See the full documentation for more troubleshooting tips.
See Contributing Guide for development guidelines.
MIT License - see LICENSE file for details
- Polygon.io: S3 flat files data source
- Qlib: Quantitative investment framework
- Polars: High-performance DataFrame library
- DuckDB: Embedded analytical database
- Documentation: https://quantmini.readthedocs.io/
- Issues: GitHub Issues
- Email: zheyuan28@gmail.com
Built with: Python 3.10+, uv, qlib, polygon, polars, duckdb, pyarrow
Optimized for: macOS (Apple Silicon M1/M2/M3), 24GB+ RAM, SSD storage