Skip to content

Apply filters to variable-length data#6178

Open
crusaderky wants to merge 6 commits intoHDFGroup:developfrom
crusaderky:ignore_filters
Open

Apply filters to variable-length data#6178
crusaderky wants to merge 6 commits intoHDFGroup:developfrom
crusaderky:ignore_filters

Conversation

@crusaderky
Copy link

@crusaderky crusaderky commented Jan 28, 2026

  • Closes can_apply and set_local filter callbacks skipped for H5T_VARIABLE dtypes #5942. A more detailed explanation of the rationale for this PR is in the linked issue.
  • Mandatory filters can be now applied to variable-length data. This notably includes h5py object strings and NpyStrings.
  • Optional filters applied to variable-length data no longer skip can_apply and set_local, only to jump directly to filter. This fixes hdf5_blosc and HDF5_Blosc2, which rely on set_local. Note that all filters which have neither can_apply nor set_local have always worked fine with vlen data, as long as they were tagged as optional. Notable examples are deflate, lzf, bzip2 (from hdf5plugin), and lz4 (also from hdf5plugin).
  • Lossy filters, such as scale-offset, are and remain untested and most likely broken, as I expect them to treat the vlen metadata as numbers and corrupt it. The right place to ensure proper behaviour is their can_apply function, which however is broken (filters: can_apply does nothing for optional filters #6161).

Important

Enhance filter application to variable-length data, fixing optional filter handling and adding tests.

  • Behavior:
    • Mandatory filters can now be applied to variable-length data, including h5py strings.
    • Optional filters no longer skip can_apply and set_local steps, fixing issues with hdf5_blosc and HDF5_Blosc2.
    • Lossy filters remain untested and potentially broken for vlen data.
  • Code Changes:
    • Modify H5Z_ignore_filters() in H5Z.c to handle optional filters correctly.
    • Add tests in dsets.c for scalar and null datasets with optional filters, and for set_local updates with vlen data.
  • Tests:
    • Add test_optional_filters_scalar, test_optional_filters_null, test_set_local_updates_cd, test_set_local_updates_cd_vlen, and test_deflate_vlen in dsets.c.

This description was created by Ellipsis for 70ab78a. You can customize this summary. It will automatically update as commits are pushed.

} /* end test_optional_filters_scalar() */

/*-------------------------------------------------------------------------
* Function: test_optional_filters_null
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was previously untested.

} /* end test_set_local_updates_cd() */

/*-------------------------------------------------------------------------
* Function: test_set_local_updates_cd_vlen
Copy link
Author

@crusaderky crusaderky Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test fails in the main branch, complaining that filters can't be applied to vlen data.
If I replace H5Z_FLAG_MANDATORY with H5Z_FLAG_OPTIONAL below, it crashes on assert(0) inside the filter callback, highlighting that set_local has not been called.

} /* end test_set_local_updates_cd_vlen() */

/*-------------------------------------------------------------------------
* Function: test_deflate_vlen
Copy link
Author

@crusaderky crusaderky Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tests what happens when filter actually processes the contiguous raw chunk data.

This test fails in the main branch, complaining that filters can't be applied to vlen data.
However, if I replace H5Z_FLAG_MANDATORY with H5Z_FLAG_OPTIONAL below, it passes in the main branch, thanks to the fact that the deflate filter has no set_local callback.

Running deflate on more substantial data shows that the output file size is much smaller than the uncompressed version.

crusaderky added a commit to crusaderky/hdf5-pixi that referenced this pull request Jan 28, 2026
crusaderky added a commit to crusaderky/hdf5-blosc that referenced this pull request Jan 28, 2026
crusaderky added a commit to crusaderky/hdf5-blosc that referenced this pull request Jan 28, 2026
@crusaderky
Copy link
Author

Downstream tests in hdf5-blosc: Blosc/hdf5-blosc#38


/* Create dataset with optional filter */
if ((dsid = H5Dcreate2(file, DSET_OPTIONAL_SCALAR, strtid, sid, H5P_DEFAULT, dcplid, H5P_DEFAULT)) < 0)
/* Write data to the dataset */
Copy link
Author

@crusaderky crusaderky Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checks that the filter callback is never invoked, which the previous version of this test left untested.

@hyoklee hyoklee added the Component - C Library Core C library issues (usually in the src directory) label Jan 28, 2026
@crusaderky
Copy link
Author

crusaderky commented Jan 29, 2026

@hyoklee is it possible to avoid having CI re-authorised by a maintainer every time I need to fix a trivial error? This is hurting velocity quite badly.

crusaderky added a commit to crusaderky/hdf5-pixi that referenced this pull request Jan 29, 2026
crusaderky added a commit to crusaderky/hdf5-blosc that referenced this pull request Jan 29, 2026
@jhendersonHDF
Copy link
Collaborator

@crusaderky I believe this is simply a requirement due to our current GitHub configuration. I've approved our workflows to run again, but I believe we'll need to continue re-approving them as necessary. I'll keep an eye on this PR, but feel free to reach out again as needed.

@crusaderky
Copy link
Author

@jhendersonHDF ready for another go

@crusaderky
Copy link
Author

CI is green; this is ready for review and potentially for merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component - C Library Core C library issues (usually in the src directory)

Projects

Status: To be triaged

Development

Successfully merging this pull request may close these issues.

can_apply and set_local filter callbacks skipped for H5T_VARIABLE dtypes

4 participants