Conversation
Incorporated official announcement from Jan 22, 2026:

Completed Milestones:
- ✅ M1: AO Core
- ✅ M2: Native Execution & TEE Support
- ✅ M3: LegacyNet Migration (100x performance gains)

M4 Official Features:
- Decentralized Schedulers
- LiveNet Staking Marketplace
- Streaming Token Distributions

Added comprehensive branch-to-PR mapping:
- 57 open PRs with owners and status
- 70+ merged PRs since release
- Branch ownership for all active development

Key contributors working on M4:
- samcamwilliams: Core protocol, native tokens (expr/1.5, feat/native-tokens)
- speeddragon: Cryptography, fixes (feat/ecdsa_support, PR permaweb#574)
- JamesPiechota: Indexing (feat/arweave-id-offset-indexing, PR permaweb#616)
- noahlevenson: Security testing (impr/secure-actions)
- PeterFarber: TEE attestation (feat/c_snp)
… (i.e. true TX headers). Specify `exclude-data=1` to exclude the data.
…ests. neo-arweave has a round-robin scheme where it will try several nodes looking for a chunk, while arweave.net delegates to a single node regardless of whether or not it has the chunk; this can yield unreliable results (the same query sometimes returns data and sometimes 404s).
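A minimal sketch of the round-robin idea described above (illustrative only, not neo-arweave's actual code; `request_chunk/2` is a hypothetical helper):

```erlang
%% Try each configured node in turn until one returns the chunk.
%% request_chunk/2 is hypothetical; it stands in for whatever HTTP call
%% fetches the chunk at the given weave offset from a node.
fetch_chunk_round_robin(_Offset, []) ->
    {error, not_found};
fetch_chunk_round_robin(Offset, [Node | Rest]) ->
    case request_chunk(Node, Offset) of
        {ok, Chunk} -> {ok, Chunk};
        _Failure -> fetch_chunk_round_robin(Offset, Rest)
    end.
```

By contrast, delegating every request to a single fixed node means a miss on that node is a miss for the whole query, which matches the intermittent 404s noted above.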
c1a32eb to bfce13b
src/dev_copycat_arweave.erl
Outdated
```erlang
%% it).
TestStore = hb_test_utils:test_store(),
StoreOpts = #{ <<"index-store">> => [TestStore] },
Store = [
```
Is there a better way to have a test use a test store for all stores? If I don't do this, the test will use the default (mainnet) store some of the time and the test store other times, which breaks the test.
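For reference, a minimal sketch of the pattern in question, under the assumption that the node's main store is configured via a `store` key; the `arweave_index_store` shape mirrors the config.json example later in this PR:

```erlang
%% Point both the main store and the index store at the per-test store so
%% the test never falls back to the default mainnet cache.
%% The `store` key is an assumption about the Opts layout.
TestStore = hb_test_utils:test_store(),
Opts = #{
    store => [TestStore],
    arweave_index_store => #{ <<"index-store">> => [TestStore] }
},
```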
| <<"node">> => | ||
| #{ | ||
| <<"match">> => <<"^/arweave">>, | ||
| <<"with">> => <<"https://neo-arweave.zephyrdev.xyz">>, |
Route `GET /chunk` to neo-arweave for now, as it is more reliable for this specific endpoint.
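As a rough sketch of the resulting route entry — an assumption pieced together from the hunk above and the `routes => #{ <<"template">> => <<"/chunk">> }` option described in the PR body, not a verbatim copy of the change:

```erlang
%% Assumed shape: send /chunk requests to the neo-arweave node instead of
%% arweave.net. Field names are taken from the hunk and the PR description.
Route = #{
    <<"template">> => <<"/chunk">>,
    <<"node">> =>
        #{
            <<"match">> => <<"^/arweave">>,
            <<"with">> => <<"https://neo-arweave.zephyrdev.xyz">>
        }
},
```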
```erlang
% TODO:
% - should this return composite for any indexed L1 bundles?
% - if so, I guess we need to implement list/2?
% - for now we don't index nested bundle children, but once we
%   do we may also need to return composite for them.
```
Calling this TODO out. Not sure if some of this must be addressed before we merge, or whether it can all wait for a future PR.
Sorry, I was focused on my work and didn't look into this until now.

`composite` is used as a definition for a folder. The information (content type, data, etc.) is under a signature, which is a folder (`composite`). I think `read` provides everything we need to read, so we don't need to define `composite`.
Main change was implementing `hb_store_arweave:type/2`.
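A minimal sketch of what a `type/2` along these lines could look like; this is not the PR's implementation, and the argument order and return atoms are assumptions based on the usual `hb_store` conventions:

```erlang
%% Assumed convention: store callbacks take the store opts first, then the
%% key, and return composite | simple | not_found. Indexed IDs are treated
%% as plain readable values, so no composite handling is needed.
type(StoreOpts, Key) ->
    case read(StoreOpts, Key) of
        {ok, _Value} -> simple;
        not_found -> not_found
    end.
```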
353a1eb to 73b4fad
…indexed

- Old behavior: Count was exclusive and would keep going if `from` was less than `to`. e.g. `from=1000001&to=1000000` will index only block `1000001`, `from=999999&to=1000000` will index all blocks 999999 and lower.
- New behavior: Count is inclusive and stops when `from` is less than `to`. e.g. `from=1000001&to=1000000` will index blocks `1000001` and `1000000`, `from=999999&to=1000000` will index no blocks.
```erlang
%% Terminal clause: counting is inclusive, so stop once Current has passed
%% below the target block height To.
fetch_blocks(Req, Current, To, _Opts) when Current < To ->
    ?event(copycat_arweave,
        {arweave_block_indexing_completed,
            {reached_target, To},
            {initial_request, Req}
        }
    ),
    {ok, To};
%% Recursive clause: resolve the block at the current height, process it,
%% then step down to the next lower block.
fetch_blocks(Req, Current, To, Opts) ->
    BlockRes =
        hb_ao:resolve(
            <<
                ?ARWEAVE_DEVICE/binary,
                "/block=",
                (hb_util:bin(Current))/binary
            >>,
            Opts
        ),
    process_block(BlockRes, Req, Current, To, Opts),
    fetch_blocks(Req, Current - 1, To, Opts).
```
- Old behavior: Count was exclusive and would keep going if `from` was less than `to`. e.g. `from=1000001&to=1000000` will index only block `1000001`, `from=999999&to=1000000` will index all blocks `999999` and lower.
- New behavior: Count is inclusive and stops when `from` is less than `to`. e.g. `from=1000001&to=1000000` will index blocks `1000001` and `1000000`, `from=999999&to=1000000` will index no blocks.
Rarely, we find non-4096-bit RSA-signed transactions in the blockchain.
f8617de to 7ba7275
An example config.json:
```json
{
"arweave_index_ids": true,
"arweave_index_store": {
"index-store": [
{
"store-module": "hb_store_lmdb",
"name": "cache-mainnet/lmdb",
"ao-types": "store-module=\"atom\""
}
]
}
}
```
Enable `copycat_perf` to see metrics logged; the metrics are also written to `copycat_perf.csv`.
Configure with `arweave_index_workers`.
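A minimal sketch of how the indexing options might sit in a node's Opts map. The worker count is a made-up example and the atom key style is an assumption (see the atoms-vs-binaries question in the PR description); how `copycat_perf` is enabled is not shown, since the mechanism isn't spelled out above.

```erlang
%% Illustrative only. arweave_index_workers presumably controls how many
%% indexing workers run; 8 is a made-up example, not a documented default.
Opts = #{
    arweave_index_ids => true,
    arweave_index_workers => 8
},
```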
There is a function to convert from string to atom (…
We need all transactions in order to ensure we can build the correct offsets for any of them.
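A small worked sketch of why every transaction matters here (simplified model with made-up sizes): weave offsets are cumulative, so the absolute offset of any one transaction depends on the sizes of everything that precedes it.

```erlang
%% Simplified illustration: compute cumulative end offsets from a list of
%% tx sizes. Missing even one size would shift every later offset.
Sizes = [256, 1024, 512],
{EndOffsets, _Total} =
    lists:mapfoldl(
        fun(Size, Acc) -> {Acc + Size, Acc + Size} end,
        0,
        Sizes
    ),
%% EndOffsets =:= [256, 1280, 1792]
```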
If an L1 or ans104 transaction includes any tags whose names clash with built-in fields (e.g. anchor, target, data), the tags will be preserved as original-tags. For fields other than `data`, the tag values may be promoted to top-level message keys if the built-in field has a default value (i.e. only promoted if there's no value clash). A `data` tag will never be promoted to a top-level message key, but it will be preserved via original-tags.

Note: in some situations this can create some redundant data. E.g. a tag may be preserved in original-tags *and* as a top-level message key. This redundancy already occurs in other situations, though, so is assumed to be acceptable.
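A minimal sketch of the promotion rule, using simplified shapes rather than the exact HyperBEAM message encoding (the map layout and `original-tags` representation here are illustrative assumptions):

```erlang
%% The tx carries tags that clash with the built-in `target` and `data`
%% fields, and its own `target` field is unset (default value).
Tags = [
    {<<"target">>, <<"some-id">>},
    {<<"data">>, <<"tag-value">>}
],
%% Sketch of the resulting message: the `target` tag is promoted because
%% the built-in field held its default value; the `data` tag is never
%% promoted. Both tags are preserved via original-tags, so the promoted
%% value appears twice -- the redundancy noted above.
Msg = #{
    <<"target">> => <<"some-id">>,
    <<"data">> => <<"...tx data payload...">>,
    <<"original-tags">> => Tags
},
```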
Add support for indexing all transactions and bundled ans104 data items in a block. The index maps the tx or item ID to an offset in the weave. When loading the tx or item, `hb_store_arweave` will query the range of weave data from the configured chunk node and deserialize it.

New options:

- `arweave_index_ids`: when `true`, `dev_copycat_arweave` will index the transactions and ans104 items in a block.
- `arweave_index_store`: configure the store to use for maintaining the index.
- `routes => #{ <<"template">> => <<"/chunk">> }`: configure the gateway to use for `GET /chunk` requests.

Index format (a small decoding sketch follows the notes below):

`<<"ID">> -> <<"IsTX:Offset:Length">>`

Questions/Notes

- `~copycat@1.0`: I updated how it iterates through the range of blocks to be indexed. Let me know if I should revert.
  - Old behavior: Count was exclusive and would keep going if `from` was less than `to`. e.g. `from=1000001&to=1000000` will index only block `1000001`, `from=999999&to=1000000` will index all blocks 999999 and lower.
  - New behavior: Count is inclusive and stops when `from` is less than `to`. e.g. `from=1000001&to=1000000` will index blocks `1000001` and `1000000`, `from=999999&to=1000000` will index no blocks.
- `hb_ao:resolve` vs. `hb_ao:get`: this PR primarily uses `hb_ao:resolve` and only uses `hb_ao:get` when querying a key from a map.
- Atoms (e.g. `arweave_index_store`) vs. binaries (e.g. `<<"arweave-index-store">>`) for new option names? I tried to mimic the conventions already in use.
- Added an `<<"exclude-data">>` arg to `dev_arweave` to allow it to query only the TX header without also downloading the data. I had initially omitted the flag and just forced the data download to be a separate operation, but this created some complexity around the overlap between L2 and L1 IDs. An L2 ID always maps to the full data item, but an L1 ID would only map to the TX header, and then the client would have to do a second `resolve` to get the data payload. The current approach keeps legacy behavior the same (both L2 and L1 IDs map to the full payload), with the option of only querying the TX header where needed.
- `data_root`: this computation depends on how the serialized `data` was "chunked". Unfortunately this information is not currently preserved in HB messages. The majority of transactions likely follow the arweave-js chunking scheme. This PR implements that chunking scheme as the default. In the future we may need to either track chunk boundaries (e.g. as commitment fields), or support multiple chunking schemes (and track those as commitment fields).
- When `dev_arweave` queries the gateway's `/chunk` endpoint it assumes the gateway is running a recent commit from the `arweave` repo (4de096e20028df01f61002620bd7d39297064a5b). This commit has not yet (as of Jan 25, 2026) been included in any formal arweave releases.
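As promised above, a minimal sketch of unpacking an index value of the form `<<"IsTX:Offset:Length">>`. The helper name is hypothetical and the encoding of the `IsTX` flag isn't specified here, so it is passed through untouched:

```erlang
%% Hypothetical helper, not part of the PR: split an index value into its
%% three fields. Offset and Length are numeric; IsTX is left as the raw
%% binary because its exact encoding isn't stated above.
decode_index_value(Value) ->
    [IsTX, Offset, Length] = binary:split(Value, <<":">>, [global]),
    {IsTX, binary_to_integer(Offset), binary_to_integer(Length)}.
```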