fix(ccusage): Fix underreporting output tokens #835
jstasiak wants to merge 1 commit into ryoppippi:main
Conversation
Claude Code writes multiple JSONL entries per API response during streaming. Each entry shares the same `messageId:requestId` hash, but `output_tokens` accumulates incrementally, starting near 0 and reaching the final count when the response completes.

The old dedup logic kept the first entry per hash, so some responses were counted with those low, incomplete `output_tokens` values. The impact varied: in my Claude Code sessions there was sometimes no difference at all (presumably because no incremental streaming occurred), and sometimes a ~15% difference (in cost terms) in a given 5-hour block.
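For illustration, here is a minimal sketch of what a Map-based dedup could look like under the assumptions above. The names and shapes (`UsageEntry`, `uniqueHash`, `deduplicate`) are hypothetical, not ccusage's actual identifiers; the sketch keeps the entry with the highest `output_tokens` per hash, which for an accumulating counter is the completed response.

```ts
// Hypothetical types/names for illustration; not ccusage's real loader code.
type UsageEntry = {
	message: { id?: string; usage: { input_tokens: number; output_tokens: number } };
	requestId?: string;
};

// Entries written for the same API response share messageId:requestId.
function uniqueHash(entry: UsageEntry): string | null {
	if (entry.message.id == null || entry.requestId == null) {
		return null;
	}
	return `${entry.message.id}:${entry.requestId}`;
}

function deduplicate(entries: UsageEntry[]): UsageEntry[] {
	// Old behavior (the bug): a Set of seen hashes with an early skip, which
	// kept the FIRST streamed entry and its near-zero output_tokens.
	// New idea: a Map keyed by hash, keeping the entry with the largest
	// output_tokens, i.e. the completed response.
	const byHash = new Map<string, UsageEntry>();
	const unhashable: UsageEntry[] = []; // entries without messageId/requestId
	for (const entry of entries) {
		const hash = uniqueHash(entry);
		if (hash == null) {
			unhashable.push(entry);
			continue;
		}
		const prev = byHash.get(hash);
		if (prev == null || entry.message.usage.output_tokens >= prev.message.usage.output_tokens) {
			byHash.set(hash, entry);
		}
	}
	// Note: this sketch does not preserve the original entry order.
	return [...unhashable, ...byHash.values()];
}
```

Keeping the max rather than the last-written entry makes the result independent of JSONL line order, at the cost of assuming the counter only ever grows within a response.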
📝 Walkthrough
This PR refactors the deduplication logic in the data loader from per-entry helper functions to a hash-based Map approach.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 3 checks passed
It's more than likely that this fix, while being an improvement, is incomplete: #705 (comment)
Review note: I'm not sure the change to the `should deduplicate entries across sessions` test makes sense and matches the goals of the project. I figure we'll have a problem regardless of which session (session 1 vs session 2) the tokens are assigned to? Is there something more sophisticated we can do here?
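To make that concern concrete, a hypothetical scenario (field names are illustrative, not the project's actual schema):

```ts
// Two sessions contain entries with the same messageId:requestId pair,
// e.g. a conversation resumed under a new session ID.
const entryInSession1 = { sessionId: 's1', messageId: 'm1', requestId: 'r1', outputTokens: 5 };
const entryInSession2 = { sessionId: 's2', messageId: 'm1', requestId: 'r1', outputTokens: 500 };

// Cross-session dedup keeps exactly one of them: keep-first credits 5 tokens
// to s1, while keep-last/keep-max credits 500 to s2. Either way one session's
// totals are off, which is the open question above.
```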