Move news data to bronze layer and add Phase 3 download script #4

Merged
nittygritty-zzy merged 2 commits into main from
feature/layer-based-metadata-and-docs-consolidation
Oct 23, 2025

@nittygritty-zzy
Owner

Summary

Consolidates news data into the bronze layer following Medallion Architecture principles and adds a new Phase 3 download script for historical news backfill.

Changes

Path Updates

  • CLI Command (src/cli/commands/polygon.py): Changed default news output from news/ to bronze/news/
  • Ingestion Script (scripts/ingestion/ingest_news.py): Updated to use centralized get_quantlake_root() and bronze layer path
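A minimal sketch of how the bronze-layer news path might be resolved from a single root helper. The body of `get_quantlake_root()` shown here is a hypothetical stand-in (the real helper lives in the repository and may work differently), and the `QUANTLAKE_ROOT` environment variable and `news_output_dir` name are assumptions made for illustration only.

```python
import os
from pathlib import Path
from typing import Optional


def get_quantlake_root() -> Path:
    """Hypothetical stand-in for the project's helper.

    Assumption: the root comes from a QUANTLAKE_ROOT environment variable,
    falling back to a local "quantlake" directory.
    """
    return Path(os.environ.get("QUANTLAKE_ROOT", "quantlake")).expanduser()


def news_output_dir(root: Optional[Path] = None) -> Path:
    """Return the bronze-layer news directory under the given (or default) root."""
    base = root if root is not None else get_quantlake_root()
    return base / "bronze" / "news"
```

Deriving every path from one root function means a layout change such as news/ to bronze/news/ touches a single place rather than each script.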

New Feature

  • Phase 3 Download Script (scripts/download/phase3_news_download.py):
    • Downloads 10 years of historical news for all active tickers
    • 8 parallel workers with rate limiting
    • Progress tracking and comprehensive logging
    • Designed for bulk historical backfill
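The worker pattern described above could look roughly like the following. This is an illustrative sketch, not the contents of phase3_news_download.py: `RateLimiter`, `download_all`, and the `fetch` callback are hypothetical names, and the rate-limiting strategy (a shared minimum interval between calls) is an assumption.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


class RateLimiter:
    """Space calls at least `min_interval` seconds apart across all threads."""

    def __init__(self, min_interval: float) -> None:
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = time.monotonic()

    def wait(self) -> None:
        # Reserve the next slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self._min_interval
        if delay > 0:
            time.sleep(delay)


def download_all(tickers, fetch, max_workers=8, min_interval=0.0):
    """Run `fetch(ticker)` for every ticker on a thread pool, tracking outcomes."""
    limiter = RateLimiter(min_interval)
    results = {"ok": [], "failed": []}

    def task(ticker):
        limiter.wait()
        fetch(ticker)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(task, t): t for t in tickers}
        for done, future in enumerate(as_completed(futures), start=1):
            ticker = futures[future]
            try:
                future.result()
                results["ok"].append(ticker)
            except Exception:
                # A failed ticker is recorded, not fatal to the whole run.
                results["failed"].append(ticker)
            if done % 100 == 0 or done == len(futures):
                print(f"{done}/{len(futures)} tickers processed")
    return results
```

Collecting failures instead of aborting lets a long backfill finish and report a success rate at the end.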

Documentation

  • PROJECT_MEMORY.md:
    • Updated directory structure to show all bronze layer subdirectories
    • Added news data details: 12GB, 739,424 files, 10 years (2015-10-25 to 2025-10-22)
    • Updated size estimates for minute data (stocks: 34GB, options: 17GB)
    • Added comprehensive Polygon API endpoint coverage

Additional Feature

  • Financial Ratios Command (src/cli/commands/polygon.py):
    • New ratios command for downloading pre-calculated financial ratios from Polygon API
    • Supports date filtering for efficient downloads

Data Verification

News data verified in correct location:

  • Path: /Volumes/990EVOPLUS/quantlake/bronze/news/
  • Files: 739,424 parquet files
  • Size: 12GB
  • Tickers: 9,900 active stocks
  • Date Range: 2015-10-25 to 2025-10-22 (10 years)
  • Success Rate: 100%
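File counts and sizes like those above can be re-checked with a short script. The sketch below is one possible way to do it, assuming the files use a `.parquet` extension; `summarize_parquet` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path
from typing import Tuple


def summarize_parquet(root: Path) -> Tuple[int, int]:
    """Return (file_count, total_bytes) for all *.parquet files under root."""
    count = 0
    total_bytes = 0
    for path in root.rglob("*.parquet"):
        if path.is_file():
            count += 1
            total_bytes += path.stat().st_size
    return count, total_bytes
```

Calling `summarize_parquet(Path("/Volumes/990EVOPLUS/quantlake/bronze/news"))` would return the count and byte total to compare against the figures listed here.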

Testing

  • Verified news data already in correct bronze layer location
  • Confirmed all path references updated to use bronze/news/
  • Phase 3 download script executed successfully (100% success rate)
  • Documentation updated to reflect current data organization

Impact

  • Breaking Changes: None (data already in correct location)
  • New Dependencies: None
  • Configuration Changes: None required (uses existing path utilities)

zheyuan zhao and others added 2 commits October 21, 2025 18:04
Implemented comprehensive historical data loader script with 4 phases:

Phase 1: Corporate Actions + Fundamentals + Short Data (Parallel)
- Corporate actions: Dividends, splits, IPOs (2005-2025)
- Ticker events: Symbol changes/rebranding (all history)
- Fundamentals: Balance sheets, income, cash flow (2010-2025)
- Short data: Short interest & volume (2 years)

Phase 2: Daily Price Data (S3, 10 years, parallel)

Phase 3: News Data (3 years, parallel)

Phase 4: Minute Data (Stocks + Options, sequential, 5 years)

Features:
- Aggressive parallelization (8-10 concurrent jobs for API/S3)
- Sequential processing for Phase 4 (memory-safe for 500GB datasets)
- Skip flags: --skip-confirmation, --skip-minute
- Dry-run mode for execution preview
- Phase-specific execution: --phase N
- Progress tracking and monitoring
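
The flags listed above could be wired up roughly as follows. This is a sketch under assumptions, not the script's actual argument handling: the `--dry-run` flag name is assumed (the commit only says "dry-run mode"), and the rule that an explicit `--phase` takes precedence over `--skip-minute` is an illustrative choice.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Historical data loader (illustrative)")
    parser.add_argument("--phase", type=int, choices=[1, 2, 3, 4], default=None,
                        help="run only this phase")
    parser.add_argument("--skip-confirmation", action="store_true",
                        help="do not prompt before starting")
    parser.add_argument("--skip-minute", action="store_true",
                        help="skip Phase 4 (minute data)")
    # Assumed flag name; the commit message does not spell it out.
    parser.add_argument("--dry-run", action="store_true",
                        help="print the execution plan without downloading")
    return parser


def phases_to_run(args: argparse.Namespace) -> List[int]:
    """Resolve which phases to execute from the parsed flags."""
    if args.phase is not None:
        return [args.phase]  # assumption: explicit --phase wins
    phases = [1, 2, 3, 4]
    if args.skip_minute:
        phases.remove(4)
    return phases
```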

Removed: scripts/bulk_download_all_data.sh (consolidated into this script)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Update news output path from news/ to bronze/news/ in CLI and scripts
- Add phase3_news_download.py for historical news backfill (10 years)
- Update PROJECT_MEMORY.md with comprehensive bronze layer structure
- Add financial ratios download command to Polygon CLI
- Verify 739K news files (12GB) already in correct bronze location

Data verified:
- 739,424 parquet files (12GB)
- 9,900 active tickers
- 10 years of history (2015-10-25 to 2025-10-22)
- 100% success rate
nittygritty-zzy merged commit 883bdf0 into main on Oct 23, 2025
6 checks passed