Cache the result of the zero-check #3628
Conversation
Force-pushed from ccf00ea to a35f878
Force-pushed from a35f878 to 26e38d4
Ok, you are all sophisticated enough to catch my hack solution ^_^:
Not sure I have time to troubleshoot this or to think of a worthy high-performance solution.
there's no way
if there's some operation
That is my understanding; you have to start to call it. I don't see a way through this without help in defining "likely_not_equal". Thoughts there appreciated. If I have an MRI image, do I really need to tell you that it is exactly zero? You probably just care that it is "very likely non-zero" in this use case, but the strict equality check makes it hard to take any shortcuts.
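For concreteness, here is a minimal sketch of what a hypothetical `likely_not_equal` helper could look like. The name and the sampling strategy are my own illustration, not an existing zarr or numpy API: sampling a handful of elements means a chunk with any non-fill data is usually rejected without a full scan.

```python
import numpy as np


def likely_not_equal(arr: np.ndarray, value, n_samples: int = 64) -> bool:
    """Cheap probabilistic pre-check (hypothetical helper, not a zarr API).

    If any sampled element differs from `value`, the chunk is definitely
    not uniform and the full scan can be skipped. If all samples match,
    the result is inconclusive and a full equality check is still needed
    for a strict answer.
    """
    flat = arr.reshape(-1)
    if flat.size == 0:
        return False
    rng = np.random.default_rng()
    idx = rng.integers(0, flat.size, size=min(n_samples, flat.size))
    return bool((flat[idx] != value).any())
```

In the MRI example, any chunk containing tissue would almost always be caught by the first few samples, leaving the expensive full scan only for chunks that are plausibly all fill value.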
For zarr, we really do need to check every single element in a chunk before concluding that all elements in the chunk are equal to the fill value. But if the performance of this check is a serious concern, then at least when reading data we could check whether the compressed chunk is byte-wise identical to a compressed fill_value chunk.
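A rough sketch of that read-path shortcut, assuming a deterministic compressor (the helper names are illustrative, not zarr internals; Zstd from numcodecs stands in for whatever codec the array uses):

```python
import numpy as np
from numcodecs import Zstd  # assumption: any deterministic codec works here


def compressed_fill_value_bytes(codec, shape, dtype, fill_value) -> bytes:
    # Compress one canonical fill-value chunk; cache this per (codec, chunk spec).
    return codec.encode(np.full(shape, fill_value, dtype=dtype).tobytes())


def chunk_matches_fill(chunk_bytes, fill_bytes: bytes) -> bool:
    # Byte-identical compressed data proves the chunk equals the fill value.
    # A mismatch proves nothing, since the same data can compress differently,
    # so this is a fast positive test, not a replacement for the element scan.
    return bytes(chunk_bytes) == fill_bytes


fill_bytes = compressed_fill_value_bytes(Zstd(), (64, 64), "uint8", 0)
```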
For the codepath that I describe in #3627: really? zarr will crash if a chunk is identically 0 and saved?
I haven't really used zarr in depth before (my fork was like 6 years out of date), but after making this change and running the tests, it seems that only a few fail, and they seem to be checking that you discard zero chunks. I don't disagree that it is "nice" to ensure that things are zero, but it is not strictly required. In many cases, I would be OK with compressing some extra zeros. Your idea of checking that the compressed data matches that of compressed zeros is fine, but again, not strictly necessary, and it adds a level of complexity that might not be required in all cases.

```diff
diff --git a/src/zarr/core/codec_pipeline.py b/src/zarr/core/codec_pipeline.py
index fd557ac4..18b96911 100644
--- a/src/zarr/core/codec_pipeline.py
+++ b/src/zarr/core/codec_pipeline.py
@@ -413,12 +413,12 @@ class BatchedCodecPipeline(CodecPipeline):
             if chunk_array is None:
                 chunk_array_batch.append(None)  # type: ignore[unreachable]
             else:
-                if not chunk_spec.config.write_empty_chunks and chunk_array.all_equal(
-                    fill_value_or_default(chunk_spec)
-                ):
-                    chunk_array_batch.append(None)
-                else:
-                    chunk_array_batch.append(chunk_array)
+                # if not chunk_spec.config.write_empty_chunks and chunk_array.all_equal(
+                #     fill_value_or_default(chunk_spec)
+                # ):
+                #     chunk_array_batch.append(None)
+                # else:
+                chunk_array_batch.append(chunk_array)
         chunk_bytes_batch = await self.encode_batch(
             [
```
The reason why we avoid saving chunks that are identical to the fill value is to avoid unnecessary IO operations and reduce the number of stored objects. This only works for chunks that are identical to the fill value because the zarr spec instructs implementations to treat missing chunks as equal to chunks filled with the fill value.
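That spec behavior is easy to see in a quick sketch (using `zarr.create`; treat the exact signature here as an assumption):

```python
import zarr

# A freshly created array has no chunks stored yet; reading any region
# returns the fill value, exactly as if fill-value chunks had been written.
z = zarr.create(shape=(4, 4), chunks=(2, 2), dtype="f8", fill_value=-1.0)
print(z[:2, :2])  # all -1.0, even though no chunk object exists in the store
```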
Aha, I didn't see that this was closed; I accidentally re-opened it. Feel free to re-close it!
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #3628      +/-   ##
==========================================
- Coverage   60.90%   60.44%    -0.46%
==========================================
  Files          86       86
  Lines       10174    10175        +1
==========================================
- Hits         6196     6150       -46
- Misses       3978     4025       +47
```
Exactly, and I don't know how to resolve the GPU use case. Today, I'm purely exploring what it would mean to convert some of my datasets from HDF5 to zarr. Even if it is slow, the conversion will complete.
I understand, and I believe that holds in practice. But do consider the flip side: I think the semantics you are looking for just aren't there in the "numpy" API. What you are trying to do is "optimize away one chunk of data on disk", but perhaps I want to save 100 ms. Let me skip the 0 check, or get a quick answer?
You can skip this check by changing the `write_empty_chunks` setting in the zarr config (see zarr-python/docs/user-guide/config.md, line 32 at b873691).
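If I read the linked config docs correctly, that would be something like the following (the exact config key is an assumption on my part, taken from the linked config.md):

```python
import zarr

# Assumed config key from config.md: when write_empty_chunks is True, chunks
# are written unconditionally, so the short-circuit in codec_pipeline.py
# never runs the all-equal-to-fill-value scan.
zarr.config.set({"array.write_empty_chunks": True})
```

Note how this lines up with the diff above: `if not chunk_spec.config.write_empty_chunks and chunk_array.all_equal(...)` short-circuits, so with the setting enabled, `all_equal` is never called.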
Historically, checking whether a chunk is empty has not been a serious performance bottleneck compared to reading, decompression, recompression, and writing. Do you have some performance metrics from your use case that suggest otherwise?
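For anyone wanting to put rough numbers on that comparison, a micro-benchmark sketch along these lines (using numcodecs' Zstd as a stand-in compressor; not zarr's own tooling) contrasts the zero-check against compression alone:

```python
import timeit

import numpy as np
from numcodecs import Zstd

# One 1 MiB chunk of non-zero data, so the check scans every element.
chunk = np.random.randint(0, 255, size=(1024, 1024), dtype=np.uint8)
codec = Zstd()

t_check = timeit.timeit(lambda: np.all(chunk == 0), number=10) / 10
t_compress = timeit.timeit(lambda: codec.encode(chunk.tobytes()), number=10) / 10
print(f"zero-check: {t_check * 1e3:.2f} ms, compress: {t_compress * 1e3:.2f} ms")
```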
As I described in #3627, this is a problem when the underlying reads are slow: each check might trigger a terribly slow read, on top of the read that is triggered when reading the data the final time to compress it.
Leaving this example here since it helped me find https://mail.python.org/archives/list/numpy-discussion@python.org/thread/YY2OXO4SGCSXYSAYVOVWTLU25PCUQJT7/

Ultimately, this "zero" check creates temporary memory on the order of the entire zarr store. It may be created and deleted very quickly, but it's still pretty underwhelming. When going from store to store, there needs to be a faster mechanism to shortcut this.

Memory benchmark script:

```python
import multiprocessing
import time
from multiprocessing import Queue

import numpy as np
import psutil


def monitor_memory(pid, queue, interval=0.01):
    # Sample the RSS of the target process until it exits, tracking the peak.
    proc = psutil.Process(pid)
    peak_rss = 0
    samples = []
    while True:
        try:
            rss = proc.memory_info().rss / 1024**2
            samples.append((time.time(), rss))
            if rss > peak_rss:
                peak_rss = rss
            time.sleep(interval)
        except psutil.NoSuchProcess:
            break
    queue.put(('done', peak_rss, samples))


def run_benchmark():
    # 2 GiB of real zeros vs. an all-ones array broadcast from a single
    # element (zero strides, so it occupies almost no memory itself).
    a = np.full((2, 1024, 1024, 1024), fill_value=0, dtype='uint8')
    b = np.broadcast_to(
        np.full(1, fill_value=1, dtype='uint8'),
        a.shape
    )
    assert all(s == 0 for s in b.strides)
    np.array_equal(a, b)


if __name__ == '__main__':
    queue = Queue()
    child_process = multiprocessing.Process(target=run_benchmark)
    child_process.start()
    monitor_process = multiprocessing.Process(
        target=monitor_memory,
        args=(child_process.pid, queue)
    )
    monitor_process.start()

    child_process.join()
    time.sleep(0.2)  # let the monitor notice the child exited and report
    monitor_process.terminate()
    monitor_process.join()

    status, peak_rss, samples = queue.get()
    print("\n=== Memory Monitoring Results ===")
    print(f"Peak RSS: {peak_rss:.1f} MB")
    print(f"Total samples: {len(samples)}")

    N_samples = 30
    if samples:
        print(f"\nMemory timeline (first {N_samples} and last {N_samples} samples):")
        for i, (timestamp, rss) in enumerate(samples[:N_samples]):
            print(f"  {timestamp:.2f}s: {rss:.1f} MB")
        if len(samples) > N_samples * 2:
            print("  ...")
        for i, (timestamp, rss) in enumerate(samples[-N_samples:], start=len(samples) - N_samples):
            print(f"  {timestamp:.2f}s: {rss:.1f} MB")
```

I have a few next steps: I think that an optimization for array_equal will go a long way, and hopefully it has positive reception at numpy upstream.
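As a sketch of the kind of optimization I mean (my own illustration, not an existing numpy API): comparing block by block bounds the temporary boolean array to the block size and allows an early exit on the first mismatch, instead of materializing a comparison result the size of the whole array.

```python
import numpy as np


def all_equal_scalar(arr: np.ndarray, value, block: int = 1 << 20) -> bool:
    """Memory-bounded replacement for `np.all(arr == value)` (illustrative).

    The temporary boolean array is at most `block` elements, and the loop
    exits on the first block that contains a mismatch.
    """
    flat = arr.reshape(-1)
    for start in range(0, flat.size, block):
        if not (flat[start:start + block] == value).all():
            return False
    return True
```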
Closes #3627
I ended up moving around my suggestion so that it could be more targeted to the one use case I care about.
LMK what you think.
TODO:
- docs/user-guide/*.md
- changes/