Skip to content

Conversation

@davidberenstein1957
Copy link
Contributor

@davidberenstein1957 davidberenstein1957 commented Jan 28, 2025

I added a basic integration for the Hub.

import numpy as np

from vicinity import Backend, Metric, Vicinity

# Create some dummy data
items = [
    {"name": "triforce", "id": 0},
    {"name": "master sword", "id": 1},
    {"name": "hylian shield", "id": 2},
    {"name": "boomerang", "id": 3},
    {"name": "hookshot", "id": 4},
]
vectors = np.random.rand(len(items), 128)

# Initialize the Vicinity instance (using the basic backend and cosine metric)
vicinity = Vicinity.from_vectors_and_items(vectors=vectors, items=items)
vicinity.push_to_hub("davidberenstein1957/my-vicinity-repo")
vicinity = Vicinity.load_from_hub("davidberenstein1957/my-vicinity-repo")

Resulting in https://huggingface.co/datasets/davidberenstein1957/my-vicinity-repo
You can also find datasets here: https://huggingface.co/datasets?other=vicinity

davidberenstein1957 and others added 11 commits January 20, 2025 17:22
- Updated README.md to clarify that items can be strings or other serializable objects.
- Modified the Vicinity class to accept a broader range of item types by changing type hints from `str` to `Any` in several methods.
- Enhanced the insert and delete methods to handle non-string tokens appropriately, ensuring that items can be checked and managed regardless of their type.
- Simplified the logic for checking and appending tokens in the insert method, ensuring that duplicate tokens are properly managed.
- Updated the `items` fixture to return a mix of dictionaries and strings based on index parity.
- Modified `test_vicinity_insert_duplicate` to use the updated `items` fixture for inserting items.
- Adjusted `test_vicinity_delete_and_query` to reference items by their indices instead of hardcoded values.
- Enhanced the Vicinity class to streamline token management, ensuring proper handling of duplicates and improving error messaging for token deletions.
Co-authored-by: Stephan Tulkens <stephantul@gmail.com>
…ling

- Replaced the nested loop for checking duplicates with a single extend operation for tokens.
- Improved efficiency by directly appending tokens to the items list, ensuring proper management of duplicates.
…ling

- Replaced the nested loop for token matching with a more efficient list comprehension.
- Enhanced error messaging to specify which tokens were not found in the vector space.
- Added a try-except block around the JSON serialization process to catch JSONEncodeError.
- Introduced a new pytest fixture `non_serializable_items` that generates a list of non-serializable objects for testing.
- Added a test case `test_vicinity_save_and_load_non_serializable_items` to verify that saving a Vicinity instance with non-serializable items raises a JSONEncodeError.
- Updated the Vicinity class documentation to specify that JSONEncodeError may be raised if items are not serializable.
- Introduced HuggingFaceMixin to enable saving and loading Vicinity instances to/from Hugging Face Hub
- Added optional import of HuggingFaceMixin based on huggingface_hub and datasets library availability
- Implemented methods for pushing Vicinity instances to the Hub, including dataset and metadata upload
- Created a method to load Vicinity instances from Hugging Face repositories
@codecov-commenter
Copy link

codecov-commenter commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 64.19753% with 29 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
vicinity/integrations/huggingface.py 54.68% 29 Missing ⚠️
Files with missing lines Coverage Δ
tests/integrations/test_huggingface.py 100.00% <100.00%> (ø)
vicinity/vicinity.py 98.63% <100.00%> (+0.95%) ⬆️
vicinity/integrations/huggingface.py 54.68% <54.68%> (ø)

@davidberenstein1957 davidberenstein1957 marked this pull request as draft January 28, 2025 21:20
@Pringled Pringled self-requested a review February 15, 2025 08:17
Copy link
Member

@Pringled Pringled left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice integration, would be awesome to have this in Vicinity 🎉 Left some minor feedback

@Pringled
Copy link
Member

Also feel free to add an example to the main README (e.g. in the quickstart there's a saving/loading part, it can be added under there)

…aset card template

- Added a dataset card template for Hugging Face Hub uploads
- Improved error handling for Hugging Face integration with custom import error
- Updated `push_to_hub` method to include model name/path in configuration
- Removed conditional import of Hugging Face libraries in `vicinity.py`
- Added `huggingface` optional dependency in `pyproject.toml`
… and Hugging Face integration

- Added new optional dependency groups for integrations and backends in pyproject.toml
- Updated README.md with new installation instructions for specific integrations and backends
- Added documentation for pushing and loading vector stores from Hugging Face Hub
- Simplified and clarified installation options in README
- Implemented a new test case for loading a Vicinity instance from Hugging Face Hub
- Added test to verify the print statement when loading from a repository
- Introduced a constant for the print statement in the Hugging Face integration module
- Updated the print statement to use string formatting for better flexibility
@davidberenstein1957 davidberenstein1957 marked this pull request as ready for review February 25, 2025 21:38
- Deleted `tests/test_utils.py` containing tests for normalization utility functions
- Removed `tests/test_vicinity.py` with comprehensive test cases for the Vicinity class
- These test files are no longer needed, likely due to refactoring or migration of tests
@davidberenstein1957
Copy link
Contributor Author

davidberenstein1957 commented Feb 25, 2025

@Pringled

  • I resolved the comments.
  • Added a basic test for testing load_from_hub. Did not add the push_to_hub integration because I did not want to fidget with tokens. Perhaps we can use the dataset we include for testing and transfer it to your Hub org?
  • Added a model_name_or_path argument because vector indices aren't sharable/usable otherwise.

- Implemented `test_utils.py` with tests for vector normalization functions
- Created `test_vicinity.py` with extensive test cases covering Vicinity class methods
- Added `test_huggingface.py` to test Hugging Face integration functionality
- Included tests for various scenarios such as:
  * Initialization and vector handling
  * Querying and thresholding
  * Insertion and deletion of vectors
  * Saving and loading vector stores
  * Handling non-serializable items
  * Hugging Face Hub integration
Copy link
Member

@Pringled Pringled left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, very nice! I will put the test dataset on our org as well and do a small followup to change the path. Merging!

@Pringled Pringled merged commit e15511d into MinishLab:main Feb 28, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants