Namanmahor/plat 265 support file size info reading #8663
NamanMahor wants to merge 15 commits into main from …upport-file-size-info-reading
Conversation
```go
entries, err := pagination.CollectAll(ctx,
	func(ctx context.Context, pz uint32, tk string) ([]drivers.ObjectStoreEntry, string, error) {
		return b.ListObjectsForGlob(ctx, opts.Glob, pz, tk)
	},
	1000)
```
Does this add new network overhead?
I'm wondering if it would be better to still also support full listings in ListObjectsForGlob (e.g. pageSize == 0 could mean return all).
No, there is no new network overhead: List also uses ListPage internally with a page size of 1000 for S3, GCS, and Azure.
All of our other listing APIs fall back to a default page size when the page size is zero; changing that behavior for this one API feels inconsistent.
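For reference, a CollectAll-style helper is roughly the following shape. This is a sketch (the helper name, generics, and signature are assumptions, not the repo's actual code), but it shows why collecting all pages at size 1000 issues the same requests as an internal full listing that pages at 1000 under the hood:

```go
// Sketch of a CollectAll-style pagination helper (assumed shape, not the
// repo's implementation). It calls fetch repeatedly with the given page size
// until the backend stops returning a continuation token. Requires "context".
func collectAll[T any](
	ctx context.Context,
	fetch func(ctx context.Context, pageSize uint32, pageToken string) ([]T, string, error),
	pageSize uint32,
) ([]T, error) {
	var all []T
	pageToken := ""
	for {
		items, nextToken, err := fetch(ctx, pageSize, pageToken)
		if err != nil {
			return nil, err
		}
		all = append(all, items...)
		if nextToken == "" {
			return all, nil
		}
		pageToken = nextToken
	}
}
```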
> All of our other listing APIs fall back to a default page size when the page size is zero; changing that behavior for this one API feels inconsistent.
But this isn't an API, this is an internal function, right? Maybe I'm missing something.
Usually we try to have default values only at the "outer" level, i.e. in the API handlers, but require explicit config for inner functions.
@begelundmuller is it okay to handle that in a separate PR? We would also need to change the other ListObject and ListBuckets functions along with this ListBucketsForGlobs.
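For illustration, the convention described above would look roughly like this; the function names and signatures below are hypothetical, not code from this PR:

```go
// Hypothetical outer API handler: the default page size is applied here,
// at the API boundary. Requires "context" and "errors".
func handleListObjects(ctx context.Context, glob string, pageSize uint32, pageToken string) ([]string, string, error) {
	if pageSize == 0 {
		pageSize = 1000 // default lives only in the handler
	}
	return listObjectsForGlob(ctx, glob, pageSize, pageToken)
}

// Hypothetical inner function: it requires an explicit, non-zero page size
// instead of silently choosing a default.
func listObjectsForGlob(ctx context.Context, glob string, pageSize uint32, pageToken string) ([]string, string, error) {
	if pageSize == 0 {
		return nil, "", errors.New("pageSize must be provided explicitly")
	}
	// ... paged listing and glob filtering would go here ...
	return nil, "", nil
}
```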
```go
// Handle GCS
var q *storage.Query
if as(&q) {
	// Only fetch the fields we need.
	_ = q.SetAttrSelection([]string{"Name", "Size", "Created", "Updated"})
	if startAfter != "" {
		q.StartOffset = startAfter
	}
}
// Handle S3
var s3Input *s3.ListObjectsV2Input
if as(&s3Input) {
	if startAfter != "" {
		s3Input.StartAfter = aws.String(startAfter)
	}
}
return nil
```
Azure does not support startAfter or an offset, so we have to iterate and skip those objects ourselves.
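For reference, the iterate-to-skip fallback amounts to something like this (illustrative helper, not the PR's code):

```go
// skipUntilAfter drops keys that are lexicographically at or before
// startAfter. Without native startAfter support, every fetched page has to
// be trimmed like this until the cursor has been passed, which is where the
// extra iteration comes from.
func skipUntilAfter(keys []string, startAfter string) []string {
	if startAfter == "" {
		return keys
	}
	for i, k := range keys {
		if k > startAfter {
			return keys[i:]
		}
	}
	return nil // the entire page is at or before the cursor
}
```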
- So does that mean it needs to re-scan the full listing on each page fetch? Assume there are 1M objects, the glob matches all of them, and the page size is 1000 – then how many Azure calls would it end up making?
- I'm not sure if this applies here, but this page seems to indicate there's a startAfter option: https://learn.microsoft.com/en-us/rest/api/storageservices/list-blobs?tabs=microsoft-entra-id
- If that doesn't work, is there some other offset/page token from Azure we can use?
- No. We use the driver's page token, so we do not rescan all pages; only the last (partially processed) page needs startAfter (and startAfter has now been added for Azure too). That works out to 1M / 1000 = 1000 calls.
- Upgraded Azure Blob Storage to the latest version; now we can use startAfter.
```go
} else if lastProcessedIdx != -1 {
	startAfter = retval[lastProcessedIdx].Key
}
break
```
When len(entries) == validPageSize, how can lastProcessedIdx not always be len(retval)-1?
Yes — suppose the first call to b.bucket.ListPage returns some objects that are filtered out by the glob pattern. We may need additional calls to collect enough matching entries to reach the requested page size, so we don’t necessarily consume the entire retval page.
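In other words, matches accumulate across driver pages, and the requested page size can be reached partway through the last one. A simplified sketch (hypothetical names; stdlib path.Match stands in for the real glob matcher, requires "path"):

```go
// collectMatches consumes keys from one driver page, keeping those that
// match the pattern, until pageSize matches have been collected in total.
// It returns the updated matches and the index of the last key consumed
// from page (-1 if none), which can be well before len(page)-1.
func collectMatches(matches, page []string, pattern string, pageSize int) ([]string, int) {
	lastProcessedIdx := -1
	for i, key := range page {
		lastProcessedIdx = i
		if ok, _ := path.Match(pattern, key); !ok {
			continue
		}
		matches = append(matches, key)
		if len(matches) == pageSize {
			break // the rest of this page is left for the next request
		}
	}
	return matches, lastProcessedIdx
}
```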
```go
driverPageToken = nextDriverPageToken
startAfter = ""
```
What if all paths on the page were before startAfter? Doesn't it then need to keep startAfter so it can skip the necessary objects on the next page?
startAfter is only used when the page token we return points at a half-processed driver page. Once a driver page has been fully read, we don't need startAfter; the next driver page is read from its start.
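Put differently, the returned token only needs startAfter for a half-processed driver page. A sketch of such a composite token (the encoding and field names are assumptions, not the PR's actual format; requires "encoding/base64" and "encoding/json"):

```go
// pageToken sketches the state a returned token carries: the driver's own
// continuation token, plus a resume key that is set only when the driver
// page it points at was consumed partway. When a driver page is read to the
// end, StartAfter stays empty and the next page is read from its beginning.
type pageToken struct {
	DriverPageToken string `json:"dpt"`
	StartAfter      string `json:"sa,omitempty"`
}

func encodePageToken(t pageToken) (string, error) {
	b, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(b), nil
}

func decodePageToken(s string) (pageToken, error) {
	var t pageToken
	b, err := base64.StdEncoding.DecodeString(s)
	if err != nil {
		return t, err
	}
	err = json.Unmarshal(b, &t)
	return t, err
}
```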
PLAT-265: Support file size info reading
Checklist: