@Priyamanjare54
Contributor

@Priyamanjare54 Priyamanjare54 commented Dec 27, 2025

Purpose

Linked issue: closes #2256

This pull request adds documentation for the COMPACTED table format on the Fluss
website to help users understand what it is, how to configure it, and when it
should be used.

Brief change log

  • Added documentation for the COMPACTED table format under table design
  • Explained supported table types (Log and KV tables)
  • Documented usage with table.changelog.image=WAL and its performance benefits
  • Added guidance on recommended use cases and limitations

Tests

Not applicable. This change only affects documentation.

API and Format

No. This change does not affect any public API or storage format.

Documentation

Yes. This PR introduces new user-facing documentation for the COMPACTED table
format on the Fluss website.

@wuchong
Member

wuchong commented Dec 30, 2025

Hi @Priyamanjare54, it seems this change is empty.

@wuchong
Member

wuchong commented Dec 31, 2025

Thanks @Priyamanjare54, could you update the pull request according to the discussion and proposed structure in #2256 (comment)?

@Priyamanjare54
Contributor Author

Thanks for the feedback!
I’ve updated the PR to introduce a dedicated Data Encodings page and moved the COMPACTED documentation under it as discussed in #2256.
The earlier standalone COMPACTED page has been removed accordingly.

Please let me know if you’d like any further adjustments.

@wuchong
Member

wuchong commented Dec 31, 2025

@polyzos could you help to review this doc?

@polyzos
Contributor

polyzos commented Dec 31, 2025

@wuchong Regarding the Indexed format, is it going to be deprecated, or should we document it as well?
@Priyamanjare54 Thank you for your contribution.
The documentation is a great start. However, we want to keep it as simple as possible, so that every user can easily understand the formats and when to use each one. With the current approach, I’m concerned that less technical users might find it harder to follow.

For example, we can say that the ARROW format is the default, describe its benefits, and note that it allows operations such as column pruning and predicate pushdown.

However, for tables that don’t have such requirements (for example large vector tables, aggregates, or joined tables where all columns are selected), the COMPACTED format might be a better fit for disk and CPU efficiency.

WDYT? If you need more context, I can help craft this.

@Priyamanjare54
Contributor Author

Thanks for the feedback @polyzos! I appreciate the guidance on making this more accessible. I agree that simplifying the explanation will help users better understand when to use each format.

@wuchong, regarding the Indexed format: could you clarify whether it should be included or whether it is being deprecated? I want to make sure I'm documenting the right formats.
I'll work on a revision based on your suggestions.

@Priyamanjare54
Contributor Author

Priyamanjare54 commented Jan 3, 2026

Hi @polyzos @wuchong, just following up on my question from 2 days ago. Could you please confirm the status of the Indexed format so I can proceed with the revision? Thanks!

@wuchong
Member

wuchong commented Jan 4, 2026

+1 to remove the indexed format from the doc.

@Priyamanjare54
Contributor Author

Thanks for confirming! I’ll proceed with removing the Indexed format from the documentation and update the PR accordingly.

@wuchong wuchong requested a review from polyzos January 4, 2026 13:32
@polyzos
Contributor

polyzos commented Jan 4, 2026

@Priyamanjare54 this is great work 👌 I think before merging we can just add a few things as “summaries”. For example, at the beginning, add a quick section on “how to think about encodings”:

How to Think About Encodings in Fluss

In Fluss, a data encoding primarily determines:

  • How data is laid out on disk (columnar vs row-oriented)
  • How efficiently data can be filtered, projected, and scanned
  • Whether the encoding is optimized for streaming scans or key-based access

The choice of encoding also reflects these tradeoffs:

  • CPU vs IO tradeoffs
  • Scan-heavy vs lookup-heavy workloads
  • Analytical vs operational access patterns
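
To make the layout distinction in the lists above concrete, here is a minimal Python sketch. It is a toy model only: the `row_layout` and `column_layout` structures and all four helper functions are hypothetical illustrations and do not reflect how Fluss actually encodes ARROW or COMPACTED data.

```python
# Toy illustration only: these structures are NOT Fluss's on-disk formats.
rows = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 2, "name": "b", "score": 20},
    {"id": 3, "name": "c", "score": 30},
]

# Row-oriented layout: all fields of one record are stored together.
row_layout = [tuple(r.values()) for r in rows]

# Columnar layout: all values of one column are stored together.
column_layout = {key: [r[key] for r in rows] for key in rows[0]}


def project_from_rows(layout, column_index):
    """Read one column from a row layout: every full row is touched."""
    return [row[column_index] for row in layout]


def project_from_columns(layout, column_name):
    """Read one column from a columnar layout: only that column is touched."""
    return layout[column_name]


def full_row_from_rows(layout, position):
    """Read one full record from a row layout: a single contiguous read."""
    return layout[position]


def full_row_from_columns(layout, position):
    """Read one full record from a columnar layout: one value per column."""
    return tuple(column[position] for column in layout.values())
```

In this sketch, projecting a single column is cheap on the columnar layout, while fetching a full record is cheap on the row layout, which mirrors the scan-heavy vs lookup-heavy distinction above.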

And then we can add a table with the exact tradeoffs, maybe at the bottom of the page.

ARROW vs COMPACTED

| Encoding               | ARROW                           | COMPACTED                        |
|------------------------|---------------------------------|----------------------------------|
| Physical layout        | Columnar                        | Row-oriented                     |
| Typical access pattern | Scans with projection & filters | Full-row reads or key lookups    |
| Column pruning         | ✅ Yes                          | ❌ No                            |
| Predicate pushdown     | ✅ Yes                          | ❌ No                            |
| Storage efficiency     | Good                            | Excellent                        |
| CPU efficiency         | Better for selective reads      | Better for full-row reads        |
| Log encoding           | ✅ Yes                          | ✅ Yes                           |
| KV encoding            | ❌ No                           | ✅ Yes                           |
| Best suited for        | Analytics, streaming analytics  | State tables, materialized views |
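
The tradeoffs in the table above could be captured as a small decision helper. This is a sketch under stated assumptions: `suggest_encoding` and its parameters are hypothetical names, not part of any Fluss API, and the rules simply restate the table rows.

```python
def suggest_encoding(table_kind, needs_column_pruning, needs_predicate_pushdown):
    """Suggest an encoding name based only on the comparison table.

    table_kind is "log" or "kv" (hypothetical labels for this sketch).
    """
    if table_kind not in ("log", "kv"):
        raise ValueError(f"unknown table kind: {table_kind!r}")

    # Per the table, ARROW is not available as a KV encoding.
    if table_kind == "kv":
        return "COMPACTED"

    # Selective reads benefit from column pruning and predicate pushdown.
    if needs_column_pruning or needs_predicate_pushdown:
        return "ARROW"

    # Full-row reads without those requirements favor the row-oriented format.
    return "COMPACTED"
```

For example, a log table that is always read with all columns and no filters would map to COMPACTED here, matching the earlier suggestion in this thread that such tables might be a better fit for that format.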

WDYT?

@Priyamanjare54
Contributor Author

Thanks for the suggestions! I’ve added a short “How to Think About Encodings in Fluss” section near the top and included an ARROW vs COMPACTED comparison table summarizing the trade-offs.

@polyzos polyzos merged commit eb75d55 into apache:main Jan 5, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

[Docs] Document the Compacted format support for Log and PK Tables

3 participants