
Conversation

@frayle-ons (Collaborator) commented Dec 8, 2025

✨ Summary

This PR builds on and utilises changes introduced in #85.

This PR introduces a flexible hook system that allows developers to inject custom logic before and after key pipeline stages in the ClassifAI framework, such as VectorStore.search() and Vectoriser.transform(), i.e. preprocessing and post-processing functions.

The logic utilises the Pydantic type checking on the Vectoriser and VectorStore methods introduced in #85, where the inputs and outputs of the package's module methods are validated against Pydantic models and must be consistent with their model definitions.

Example usage

def my_preprocess_hook(model):
    # Custom logic to modify the model
    model.query = [x.lower() for x in model.query]
    return model

vectorstore.hooks = {
    "search_preprocess": my_preprocess_hook
}

The above code snippet registers a function that will execute when the VectorStore.search() method is called. It takes the Pydantic object validated from the arguments passed to the method, modifies the query field of that validated object to convert all queries to lowercase, and then returns the whole Pydantic object, which is validated again and then sent to the main search function.
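As a minimal sketch of that lifecycle (the apply_hook helper and its name are assumptions, not the framework's actual internals, and the Pydantic v2 API is assumed):

def apply_hook(hooks: dict, name: str, model):
    # Hypothetical dispatch step: run the user's hook on the validated model,
    # then re-validate so a misbehaving hook cannot pass invalid data on to
    # the main search logic.
    if name in hooks:
        model = hooks[name](model)
        model = type(model).model_validate(model.model_dump())
    return model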

Underlying logic

Firstly, the user must understand the expected input and output types of the module classes and their methods in the ClassifAI package. For example, if the user knows that the VectorStore.search() method accepts query, ids, n_results and batch_size, they can create a function that modifies any of those arguments on input to the method. The Pydantic models created in #85 can help users understand the input requirements:

from pydantic import BaseModel, Field

class SearchInput(BaseModel):
    query: str | list[str]
    ids: list[str | int] | None = None
    n_results: int = Field(gt=0)
    batch_size: int = Field(gt=0)

Without any declared hooks, the input arguments of the VectorStore.search() method are passed to the Pydantic model SearchInput, creating a validated_input, which is used in the main code logic. The expected output, a results dataframe, is passed to the SearchOutput Pydantic object and, on successful validation, is returned.
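In outline, this no-hook path looks something like the following (the _run_search name is a hypothetical stand-in for the internal search logic):

# Validate the raw arguments into the SearchInput model
validated_input = SearchInput(query=query, ids=ids, n_results=n_results, batch_size=batch_size)

# Run the main search logic, then validate the dataframe on the way out
results_df = self._run_search(validated_input)  # hypothetical internal method
validated_output = SearchOutput(dataframe=results_df)
return validated_output.dataframe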


But if a user adds a function under the search_preprocess hook, an additional logic step will occur: the validated SearchInput data is modified by the hook's logic and then revalidated with the same Pydantic model.

Take the function from before that converts the queries to lowercase:

def my_preprocess_hook(model):
    # Custom logic to modify the model
    model.query = [x.lower() for x in model.query]
    return model

vectorstore.hooks = {
    "search_preprocess": my_preprocess_hook
}

Similarly, if they add a search_postprocess hook, similar logic is executed, but on the expected dataframe output of the method: it is modified according to the user's hook and then revalidated with the Pydantic model to ensure that the data flow rules of the package are not broken:

def my_second_hook_for_postprocessing(model):
    # Custom logic to modify the model
    model.dataframe = model.dataframe.drop_duplicates(subset=["query_id"])  # or some other function to modify the data
    return model

vectorstore.hooks = {
    "search_preprocess": my_preprocess_hook,
    "search_postprocess": my_second_hook_for_postprocessing,
}
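As a usage sketch, with the hooks registered above, both fire transparently around an ordinary search call (the query strings and n_results value are illustrative):

results = vectorstore.search(query=["Software Engineer!", "DATA scientist"], n_results=5)
# The preprocess hook lowercases both queries before validation and search,
# and the postprocess hook deduplicates the validated results dataframe
# before it is returned.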

Writing preprocessing and post-processing hooks

Users should write their functions to take one argument, the validated Pydantic object, and to return one object: that same Pydantic object with its fields modified. They can see which fields they are able to modify by studying the Pydantic model classes themselves, which also clarifies what kinds of modifications are not acceptable (for example, the rank column of the results dataframe must be of type int regardless of how the user chooses to implement their hook).
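For instance, a hook that breaks those rules should fail at the re-validation step; a sketch, assuming the SearchOutput model enforces the rank column type via a validator:

def bad_postprocess_hook(model):
    # Overwrites the rank column with strings; if SearchOutput enforces an
    # int rank, re-validation raises pydantic.ValidationError instead of
    # silently returning malformed results.
    model.dataframe["rank"] = "first"
    return model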

Hooks added:

While many of the public-facing methods have Pydantic type checking, only the following support hooks, with the corresponding hook dictionary names:

Vectoriser.transform() -> transform_preprocess, transform_postprocess
VectorStore.search() -> search_preprocess, search_postprocess
VectorStore.reverse_search() -> reverse_search_preprocess, reverse_search_postprocess

This means that hooks for init methods and class methods have been removed.
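For example, a preprocess hook on Vectoriser.transform() might look like the sketch below; the texts field name is an assumption about the transform input model, so check the Pydantic models from #85 for the real field names:

def strip_whitespace(model):
    # Assumed field name `texts` on the transform input model
    model.texts = [t.strip() for t in model.texts]
    return model

vectoriser.hooks = {"transform_preprocess": strip_whitespace}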

📜 Changes Introduced

  • feat: Custom Hooks: Users can define functions (hooks) that modify the validated Pydantic input or output models for supported methods.
  • Integration: Hooks are registered via a hooks dictionary attribute on the VectorStore or Vectoriser instance, using specific keys like search_preprocess and search_postprocess.
  • Validation: After a hook function runs, its returned object is re-validated against the same Pydantic model to ensure type safety and consistency.
  • Extensibility: This enables users to customize, enrich, or filter data at critical points in the pipeline without modifying the core framework.
  • Demo: A new notebook shows how a user can modify the data flow in a variety of ways: removing punctuation from queries, deduplicating rows that share a doc_id for the same query_id, and injecting extra information about the sample's label as a new column in the returned dataframe.

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code passes linting with Ruff
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed

🔍 How to Test

I've created a DEMO/custom_preprocessing_and_postprocessing_hooks.ipynb notebook which demonstrates how to create hooks. Start a virtual environment, install ClassifAI with the huggingface optional dependency, and execute the code in the notebook; if it runs end to end, the tutorial works as expected.

Additionally, it would be a valuable exercise to test the hooks on the other methods, such as VectorStore.reverse_search() and Vectoriser.transform(), and to intentionally write hook functions that should break the Pydantic validation, to confirm that a Pydantic error is raised.

@frayle-ons frayle-ons marked this pull request as ready for review December 8, 2025 17:00
@github-actions github-actions bot added the enhancement New feature or request label Dec 9, 2025
@frayle-ons frayle-ons changed the base branch from main to 79-pydantic-boundaries December 9, 2025 15:15
@matweldon (Collaborator)

@frayle-ons would it be possible to rebase this onto main so that I can review independently of the pydantic classes PR?

@frayle-ons frayle-ons changed the base branch from 79-pydantic-boundaries to main December 11, 2025 16:17
@frayle-ons (Collaborator, Author)

I've removed the Pydantic classes from this ticket entirely at @matweldon's request, and reverted to the simpler checks, such as converting input strings to a list of strings in the vectoriser transform.

Since the pre-processing and post-processing hooks previously relied on Pydantic objects, I've now implemented a solution that works with dictionaries of the input arguments.

The user writing a pre- or post- processing function should expect to work with a dictionary object, where the keys are:

  1. the names of the input arguments to that function - in the case of pre-processing functions
  2. the variable name of the returned object - in the case of the post-processing functions.

The user flow for a pre-processing punctuation-removal hook on the VectorStore could now be implemented as follows.

First, create the function. Its input_data parameter is expected to be a dictionary with keys ['query', 'ids', 'n_results', 'batch_size'], each holding the value of the corresponding argument passed to the search() method at runtime:

import string

def remove_punctuation(input_data):
    # We want to modify the 'query' field in the input_data dictionary,
    # which is a list of query strings.
    # This line removes punctuation from each string with a list comprehension
    sanitized_texts = [x.translate(str.maketrans("", "", string.punctuation)) for x in input_data["query"]]

    input_data["query"] = sanitized_texts

    # Return the dictionary with its modified field
    return input_data

Then instantiate the VectorStore with the hooks:

my_vector_store_with_hooks = VectorStore(
    file_name="data/fake_soc_dataset.csv",
    data_type="csv",
    vectoriser=vectoriser,
    overwrite=True,
    hooks={
        "search_preprocess": remove_punctuation,
        "search_postprocess": drop_duplicates,
    },
)
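The drop_duplicates postprocess hook registered above isn't shown in this comment; a minimal sketch, assuming the postprocess dictionary exposes the results dataframe under a hypothetical "results" key:

def drop_duplicates(output_data):
    # "results" is a hypothetical key name; per the convention above, the real
    # key is the variable name of the object returned by search().
    df = output_data["results"]
    output_data["results"] = df.drop_duplicates(subset=["query_id", "doc_id"])
    return output_data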

Finally, when the search method is executed on a running VectorStore object, the following code passes the input arguments through the user's function:

# Check if there is a user-defined preprocess hook for the VectorStore search method
if "search_preprocess" in self.hooks:
    # Pass the args as a dictionary to the preprocessing function
    hook_output = self.hooks["search_preprocess"](
        {"query": query, "ids": ids, "n_results": n_results, "batch_size": batch_size}
    )

    # Unpack the dictionary back into the argument variables
    query = hook_output.get("query", query)
    ids = hook_output.get("ids", ids)
    n_results = hook_output.get("n_results", n_results)
    batch_size = hook_output.get("batch_size", batch_size)

If there is no hook, this block does not execute, and the original values passed as arguments for query, ids, n_results and batch_size are used in the code pipeline as normal.
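A symmetric block would presumably handle the postprocess side; a sketch, with the "results" key name again an assumption rather than the actual implementation:

# Check if there is a user-defined postprocess hook for the search method
if "search_postprocess" in self.hooks:
    hook_output = self.hooks["search_postprocess"]({"results": results})
    results = hook_output.get("results", results)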

This implementation leaves open the ability to re-add Pydantic type checking at a later date.

@frayle-ons frayle-ons requested a review from matweldon December 11, 2025 17:03
matweldon previously approved these changes Dec 15, 2025

@matweldon left a comment:

LGTM!

@frayle-ons frayle-ons merged commit 5b52a4a into main Dec 17, 2025
5 checks passed
@matweldon matweldon deleted the 86-pre-post-processing-functions branch December 17, 2025 13:10