@tarang-jain (Contributor) commented Jul 9, 2025

Depends on rapidsai/raft#2770

Implements a binary IVF-Flat index, i.e. support for the BitwiseHamming metric in the IVF-Flat index.

Key Features

1. Binary Index Structure

  • Added binary_centers_ field to store cluster centers as packed uint8_t arrays for binary data
  • Index automatically detects BitwiseHamming metric and configures itself for binary operation
  • Only uint8_t inputs are supported with BitwiseHamming, and each newly added kernel gets a single instantiation

2. K-means Clustering for Binary Data

The clustering approach for binary data required special handling:

  • Expanded Space Clustering: Binary data (uint8_t) is expanded to signed representation (int8_t) where each bit becomes ±1

    • 0 → -1, 1 → +1 transformation enables meaningful centroid computation
    • Clustering performed using L2 distance in the expanded dimensional space
  • Centroid Quantization: After computing float centroids in expanded space, they are converted back to binary format:

    • Centroids are stored as packed uint8_t arrays
    • KMeans (coarse) prediction is done on these quantized centroids with the BitwiseHamming distance.
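
The expand, average, and quantize steps above can be illustrated on the host with a minimal C++ sketch. This is not the code from the PR: the function names are hypothetical, the LSB-first bit order within a byte is an assumption, and the quantization rule (a positive mean maps to bit 1, anything else to bit 0) is one plausible choice that the description does not confirm.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Expand packed bits into +/-1 values: bit 0 -> -1, bit 1 -> +1.
// Assumption: bits are read LSB first within each byte.
std::vector<int8_t> expand_bits(const std::vector<uint8_t>& packed)
{
  std::vector<int8_t> out;
  out.reserve(packed.size() * 8);
  for (uint8_t byte : packed) {
    for (int b = 0; b < 8; ++b) {
      out.push_back(((byte >> b) & 1u) ? 1 : -1);
    }
  }
  return out;
}

// Mean of the expanded members of one cluster (the float centroid in expanded space).
// All members are assumed to have the same number of bytes.
std::vector<float> expanded_centroid(const std::vector<std::vector<uint8_t>>& members)
{
  if (members.empty()) { return {}; }
  std::vector<float> centroid(members[0].size() * 8, 0.0f);
  for (const auto& m : members) {
    const std::vector<int8_t> e = expand_bits(m);
    for (std::size_t i = 0; i < centroid.size(); ++i) { centroid[i] += e[i]; }
  }
  for (float& v : centroid) { v /= static_cast<float>(members.size()); }
  return centroid;
}

// Quantize a float centroid back to packed bits.
// Assumed rule: a positive component becomes bit 1, otherwise bit 0.
std::vector<uint8_t> quantize_centroid(const std::vector<float>& centroid)
{
  std::vector<uint8_t> packed((centroid.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < centroid.size(); ++i) {
    if (centroid[i] > 0.0f) { packed[i / 8] |= static_cast<uint8_t>(1u << (i % 8)); }
  }
  return packed;
}
```

Under this rule, quantization amounts to a per-bit majority vote over the cluster members, which is why the ±1 expansion yields a meaningful binary centroid.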

3. Distance Kernels

Coarse Search (Cluster Selection)

  • Implemented a specialized bitwise_hamming_distance_op so that the pairwise-distance computation can produce query-to-centroid distances
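
As a host-side reference for what the coarse search computes, the following C++ sketch fills a query-by-centroid matrix of bitwise Hamming distances. The names `hamming_bytes` and `pairwise_hamming` and the row-major layouts are assumptions made for illustration; they are not the interface of bitwise_hamming_distance_op.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise Hamming distance between two packed byte vectors of n_bytes each.
uint32_t hamming_bytes(const uint8_t* x, const uint8_t* y, std::size_t n_bytes)
{
  uint32_t dist = 0;
  for (std::size_t i = 0; i < n_bytes; ++i) {
    // XOR marks the differing bits; count them per byte.
    dist += static_cast<uint32_t>(std::bitset<8>(static_cast<unsigned>(x[i] ^ y[i])).count());
  }
  return dist;
}

// Row-major output of shape [n_queries, n_centers].
// queries: [n_queries, n_bytes], centers: [n_centers, n_bytes], both row-major.
std::vector<uint32_t> pairwise_hamming(const std::vector<uint8_t>& queries,
                                       const std::vector<uint8_t>& centers,
                                       std::size_t n_bytes)
{
  const std::size_t n_queries = queries.size() / n_bytes;
  const std::size_t n_centers = centers.size() / n_bytes;
  std::vector<uint32_t> out(n_queries * n_centers);
  for (std::size_t q = 0; q < n_queries; ++q) {
    for (std::size_t c = 0; c < n_centers; ++c) {
      out[q * n_centers + c] =
        hamming_bytes(&queries[q * n_bytes], &centers[c * n_bytes], n_bytes);
    }
  }
  return out;
}
```

Cluster selection would then pick, for each query row, the centroids with the smallest entries.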

Fine-Grained Search (Within Clusters)

Extended the interleaved scan kernel (ivf_flat_interleaved_scan.cuh) with specialized templates for BitwiseHamming:

  • Veclen-based optimization: Different code paths based on vectorization width

    • Veclen=16,8,4: Load data as uint32_t, use __popc(x ^ y) for 4-byte Hamming distance
    • Veclen=1,2: Byte-wise XOR and population count
  • Efficient memory access patterns:

    • Maintains interleaved data layout for coalesced memory access
    • Specialized loadAndComputeDist templates for uint8_t that leverage vectorized loads
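
The two code paths can be compared with a small host-side C++ sketch. It assumes that `popc32` stands in for the CUDA intrinsic `__popc`, that the wide path is only used when the byte count is a multiple of 4, and that the function names are hypothetical. It does not reproduce the interleaved layout or the actual loadAndComputeDist templates.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Host stand-in for the CUDA intrinsic __popc (number of set bits in a 32-bit word).
inline int popc32(uint32_t v) { return static_cast<int>(std::bitset<32>(v).count()); }

// Wide path (analogous to Veclen = 16, 8, 4): process 4 bytes per step as one uint32_t.
// Assumption: n_bytes is a multiple of 4.
int hamming_wide(const uint8_t* x, const uint8_t* y, std::size_t n_bytes)
{
  int dist = 0;
  for (std::size_t i = 0; i < n_bytes; i += 4) {
    uint32_t xw, yw;
    std::memcpy(&xw, x + i, sizeof(xw));  // alignment-safe 4-byte load
    std::memcpy(&yw, y + i, sizeof(yw));
    dist += popc32(xw ^ yw);
  }
  return dist;
}

// Narrow path (analogous to Veclen = 1, 2): XOR one byte at a time,
// masking to the low 8 bits before the population count.
int hamming_narrow(const uint8_t* x, const uint8_t* y, std::size_t n_bytes)
{
  int dist = 0;
  for (std::size_t i = 0; i < n_bytes; ++i) {
    const uint32_t xor_val = static_cast<uint32_t>(x[i] ^ y[i]);
    dist += popc32(xor_val & 0xffu);
  }
  return dist;
}
```

Both functions return the same distance for the same input; the wide path simply issues a quarter as many XOR and popcount operations.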

Binary size increase (as of 10/17/2025): +18.637 MB, about 1.5%

  • branch-25.12 (CUDA 12.9 + X86): 1232.414 MB
  • This PR (CUDA 12.9 + X86): 1251.051 MB

copy-pr-bot bot commented Jul 9, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@tfeher (Contributor) left a comment

Hi Tarang, thank you for your work on this PR, it looks good to me!

}
};

template <int Veclen, typename T, typename AccT>

We do not have any input or output of type T, so I do not see why we need this parameter. Or is the static assert participating in SFINAE logic?

In any case, this does not need to hold up the PR.

Comment on lines +45 to +46
uint32_t masked_val = xor_val & 0xffu;
int popcount = __popc(masked_val);

Do we expect use cases where padding the dimensions to a multiple of 32 would be a problem? This discussion need not hold up this PR; I have created #1613 to continue it.

@tfeher (Contributor) commented Dec 3, 2025

/ok to test 91c6734

@tarang-jain (Contributor, Author)

/ok to test 07354d1

@tarang-jain (Contributor, Author)

/ok to test 07e1837

@tarang-jain tarang-jain changed the base branch from main to release/26.02 January 20, 2026 21:42

Labels

  • cpp
  • feature request: New feature or request
  • non-breaking: Introduces a non-breaking change

4 participants