Skip to content

Namanmahor/plat 265 support file size info reading#8663

Open
NamanMahor wants to merge 15 commits intomainfrom
namanmahor/plat-265-support-file-size-info-reading
Open

Namanmahor/plat 265 support file size info reading#8663
NamanMahor wants to merge 15 commits intomainfrom
namanmahor/plat-265-support-file-size-info-reading

Conversation

@NamanMahor
Copy link
Contributor

@NamanMahor NamanMahor commented Jan 16, 2026

PLAT-265: Support file size info reading

Checklist:

  • Covered by tests
  • Ran it and it works as intended
  • Reviewed the diff before requesting a review
  • Checked for unhandled edge cases
  • Linked the issues it closes
  • Checked if the docs need to be updated. If so, create a separate Linear DOCS issue
  • Intend to cherry-pick into the release branch
  • I'm proud of this work!

@NamanMahor NamanMahor marked this pull request as draft January 16, 2026 18:29
@NamanMahor NamanMahor marked this pull request as ready for review January 23, 2026 08:57
Comment on lines +76 to +80
entries, err := pagination.CollectAll(ctx,
func(ctx context.Context, pz uint32, tk string) ([]drivers.ObjectStoreEntry, string, error) {
return b.ListObjectsForGlob(ctx, opts.Glob, pz, tk)
},
1000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this add new network overhead?

I'm wondering if it would be better to still also support full listings in ListObjectsForGlob (e.g. pageSize == 0 could mean return all).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, there is no new network overhead because List also use ListPage internally with pagesize 1000 for all s3,gcs and azure.

Our all other listing apis have default page size if pagesize is zero. changing it for this api feels inconsistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our all other listing apis have default page size if pagesize is zero. changing it for this api feels inconsistent.

But this isn't an API, this is an internal function, right? Maybe I'm missing something.

Usually we try to have default values only at the "outer" level, i.e. in the API handlers, but require explicit config for inner functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@begelundmuller is it okay to handle it in separately PR? we will need to change to other ListObject and ListBuckets too with this ListBucketsForGlobs

Comment on lines 86 to 102
// Handle GCS
var q *storage.Query
if as(&q) {
// Only fetch the fields we need.
_ = q.SetAttrSelection([]string{"Name", "Size", "Created", "Updated"})
if startAfter != "" {
q.StartOffset = startAfter
}
}
// Handle S3
var s3Input *s3.ListObjectsV2Input
if as(&s3Input) {
if startAfter != "" {
s3Input.StartAfter = aws.String(startAfter)
}
}
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about Azure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

azure does not support startAfter or offset. so we have to iterate to skip it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. So does that mean it needs to re-scan the full listing on each page fetch? Assume there is 1M objects, the glob matches all of them, and the page size is 1000 – then how many Azure calls would it end up making?
  2. I'm not sure if this applies here, but this page seems to indicate there's a startAfter option: https://learn.microsoft.com/en-us/rest/api/storageservices/list-blobs?tabs=microsoft-entra-id
    • If that doesn't work, is there some other offset/page token from Azure we can use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. No. we are using page token so we were not rescan all pages only the last page for startAfter(but now added startAfter too for azure). 1M/1000 calls.

  2. upgraded azure blog storage to latest. now we can use startAfter.

} else if lastProcessedIdx != -1 {
startAfter = retval[lastProcessedIdx].Key
}
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When len(entries) == validPageSize, how can lastProcessedIdx not always be len(retval)-1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes — suppose the first call to b.bucket.ListPage returns some objects that are filtered out by the glob pattern. We may need additional calls to collect enough matching entries to reach the requested page size, so we don’t necessarily consume the entire retval page.

Comment on lines +156 to +157
driverPageToken = nextDriverPageToken
startAfter = ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if all paths on the page were before startAfter? Doesn't it then need to keep startAfter so it can skip the necessary objects on the next page?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

startAfter is used when we return the page token which is half processed. so then the page is fully read we do not need startAfter next page is read from start.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants