
Conversation

@frayle-ons (Collaborator) commented Dec 8, 2025

✨ Summary

This PR builds on and utilises changes introduced in #85.

This PR introduces a flexible hook system that allows developers to inject custom logic before and after key pipeline stages in the ClassifAI framework, such as VectorStore.search() and Vectoriser.transform(), i.e. preprocessing and post-processing functions.

The logic utilises the Pydantic type checking on the Vectoriser and VectorStore methods introduced in #85, where the inputs and outputs of the package's module methods are validated against Pydantic models and must be consistent with their model definitions.

Example usage

def my_preprocess_hook(model):
    # Custom logic to modify the model
    model.query = [x.lower() for x in model.query]
    return model

vectorstore.hooks = {
    "search_preprocess": my_preprocess_hook
}

The above code snippet registers a function that will execute when the VectorStore.search() method is called. It takes the Pydantic object validated from the arguments passed to the method, modifies the query field of that validated object to convert all queries to lowercase, and then returns the whole Pydantic object, which is validated again and then sent to the main search function.
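As a minimal sketch of that lifecycle (the apply_hook helper and its name are assumptions, not the framework's actual internals, and the Pydantic v2 API is assumed):

def apply_hook(hooks: dict, name: str, model):
    # Hypothetical dispatch step: run the user's hook on the validated model,
    # then re-validate so a misbehaving hook cannot pass invalid data on to
    # the main search logic.
    if name in hooks:
        model = hooks[name](model)
        model = type(model).model_validate(model.model_dump())
    return model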

Underlying logic

Firstly, the user must understand the expected input and output types of the module classes and their methods in the ClassifAI package. For example, if the user knows that the VectorStore.search() method accepts query, ids, n_results and batch_size, they can create a function that modifies any of those arguments on input to the method. The Pydantic models created in #85 can help users understand the input requirements:

from pydantic import BaseModel, Field

class SearchInput(BaseModel):
    query: str | list[str]
    ids: list[str | int] | None = None
    n_results: int = Field(gt=0)
    batch_size: int = Field(gt=0)

Without any declared hooks, the input arguments of the VectorStore.search() method are passed to the Pydantic model SearchInput, creating a validated_input, which is used in the main code logic. The expected output, a results dataframe, is passed to the SearchOutput Pydantic object and, on successful validation, is returned.
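In outline, this no-hook path looks something like the following (the _run_search name is a hypothetical stand-in for the internal search logic):

# Validate the raw arguments into the SearchInput model
validated_input = SearchInput(query=query, ids=ids, n_results=n_results, batch_size=batch_size)

# Run the main search logic, then validate the dataframe on the way out
results_df = self._run_search(validated_input)  # hypothetical internal method
validated_output = SearchOutput(dataframe=results_df)
return validated_output.dataframe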


But if a user adds a function under the search_preprocess hook, an additional logic step will occur: the validated SearchInput data is modified by the hook's logic and then revalidated with the same Pydantic model.

Take the function from before that converts the queries to lowercase:

def my_preprocess_hook(model):
    # Custom logic to modify the model
    model.query = [x.lower() for x in model.query]
    return model

vectorstore.hooks = {
    "search_preprocess": my_preprocess_hook
}

Similarly, if they add a search_postprocess hook, similar logic is executed, but on the expected dataframe output of the method: it is modified according to the user's hook and then revalidated with the Pydantic model to ensure that the data flow rules of the package are not broken:

def my_second_hook_for_postprocessing(model):
    # Custom logic to modify the model
    model.dataframe = model.dataframe.drop_duplicates(subset=["query_id"])  # or some other function to modify the data
    return model

vectorstore.hooks = {
    "search_preprocess": my_preprocess_hook,
    "search_postprocess": my_second_hook_for_postprocessing,
}
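As a usage sketch, with the hooks registered above, both fire transparently around an ordinary search call (the query strings and n_results value are illustrative):

results = vectorstore.search(query=["Software Engineer!", "DATA scientist"], n_results=5)
# The preprocess hook lowercases both queries before validation and search,
# and the postprocess hook deduplicates the validated results dataframe
# before it is returned.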

Writing preprocessing and post-processing hooks

Users should write their functions to take one argument, the validated Pydantic object, and to return one object: that same Pydantic object with its fields modified. They can see which fields they are able to modify by studying the Pydantic model classes themselves, which also clarifies what kinds of modifications are not acceptable (for example, the rank column of the results dataframe must be of type int regardless of how the user chooses to implement their hook).
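For instance, a hook that breaks those rules should fail at the re-validation step; a sketch, assuming the SearchOutput model enforces the rank column type via a validator:

def bad_postprocess_hook(model):
    # Overwrites the rank column with strings; if SearchOutput enforces an
    # int rank, re-validation raises pydantic.ValidationError instead of
    # silently returning malformed results.
    model.dataframe["rank"] = "first"
    return model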

Hooks added:

While many of the public-facing methods have Pydantic type checking, only the following support hooks, with the corresponding hook dictionary names:

Vectoriser.transform() -> transform_preprocess, transform_postprocess
VectorStore.search() -> search_preprocess, search_postprocess
VectorStore.reverse_search() -> reverse_search_preprocess, reverse_search_postprocess

This means that hooks for init methods and class methods have been removed.
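For example, a preprocess hook on Vectoriser.transform() might look like the sketch below; the texts field name is an assumption about the transform input model, so check the Pydantic models from #85 for the real field names:

def strip_whitespace(model):
    # Assumed field name `texts` on the transform input model
    model.texts = [t.strip() for t in model.texts]
    return model

vectoriser.hooks = {"transform_preprocess": strip_whitespace}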

📜 Changes Introduced

  • feat: Custom Hooks: Users can define functions (hooks) that modify the validated Pydantic input or output models for supported methods.
  • Integration: Hooks are registered via a hooks dictionary attribute on the VectorStore or Vectoriser instance, using specific keys like search_preprocess and search_postprocess.
  • Validation: After a hook function runs, its returned object is re-validated against the same Pydantic model to ensure type safety and consistency.
  • Extensibility: This enables users to customize, enrich, or filter data at critical points in the pipeline without modifying the core framework.
  • Demo: A new notebook shows how a user can modify the data flow in a variety of ways: removing punctuation from queries, deduplicating rows that share a doc_id for the same query_id, and injecting extra information about the sample's label as a new column in the returned dataframe.

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code passes linting with Ruff
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed

🔍 How to Test

I've created a DEMO/custom_preprocessing_and_postprocessing_hooks.ipynb notebook which demonstrates how to create hooks. Start a virtual environment, install ClassifAI with the huggingface optional dependency, and execute the code in the notebook; if it runs end to end, the tutorial works as expected.

Additionally, it would be a valuable exercise to test the hooks on the other methods, such as VectorStore.reverse_search() and Vectoriser.transform(), and to intentionally write hook functions that should break the Pydantic validation, to confirm that a Pydantic error is raised.

@frayle-ons frayle-ons marked this pull request as ready for review December 8, 2025 17:00
@github-actions github-actions bot added the enhancement New feature or request label Dec 9, 2025
@frayle-ons frayle-ons changed the base branch from main to 79-pydantic-boundaries December 9, 2025 15:15
@matweldon (Collaborator)

@frayle-ons would it be possible to rebase this onto main so that I can review independently of the pydantic classes PR?

@frayle-ons frayle-ons changed the base branch from 79-pydantic-boundaries to main December 11, 2025 16:17
@frayle-ons (Collaborator, Author)

I've removed the Pydantic classes from this ticket entirely at @matweldon's request, and reverted to the simpler checks, such as converting input strings to a list of strings in the vectoriser transform.

Since the pre-processing and post-processing hooks previously relied on Pydantic objects, I've now implemented a solution that works with dictionaries of the input arguments.

The user writing a pre- or post- processing function should expect to work with a dictionary object, where the keys are:

  1. the names of the input arguments to that function - in the case of pre-processing functions
  2. the variable name of the returned object - in the case of the post-processing functions.

The user flow for a pre-processing punctuation-removal hook on the VectorStore could now be implemented as follows.

First, create the function. Its input_data parameter is expected to be a dictionary with keys ['query', 'ids', 'n_results', 'batch_size'], each holding the value of the corresponding argument passed to the search() method at runtime:

import string

def remove_punctuation(input_data):
    # We want to modify the 'query' field in the input_data dictionary,
    # which is a list of query strings.
    # This line removes punctuation from each string with a list comprehension
    sanitized_texts = [x.translate(str.maketrans("", "", string.punctuation)) for x in input_data["query"]]

    input_data["query"] = sanitized_texts

    # Return the dictionary with its modified field
    return input_data

Then instantiate the VectorStore with the hooks:

my_vector_store_with_hooks = VectorStore(
    file_name="data/fake_soc_dataset.csv",
    data_type="csv",
    vectoriser=vectoriser,
    overwrite=True,
    hooks={
        "search_preprocess": remove_punctuation,
        "search_postprocess": drop_duplicates,
    },
)
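The drop_duplicates postprocess hook registered above isn't shown in this comment; a minimal sketch, assuming the postprocess dictionary exposes the results dataframe under a hypothetical "results" key:

def drop_duplicates(output_data):
    # "results" is a hypothetical key name; per the convention above, the real
    # key is the variable name of the object returned by search().
    df = output_data["results"]
    output_data["results"] = df.drop_duplicates(subset=["query_id", "doc_id"])
    return output_data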

Finally, when the search method is executed on a running VectorStore object, the following code passes the input arguments through the user's function:

# Check if there is a user-defined preprocess hook for the VectorStore search method
if "search_preprocess" in self.hooks:
    # Pass the args as a dictionary to the preprocessing function
    hook_output = self.hooks["search_preprocess"](
        {"query": query, "ids": ids, "n_results": n_results, "batch_size": batch_size}
    )

    # Unpack the dictionary back into the argument variables
    query = hook_output.get("query", query)
    ids = hook_output.get("ids", ids)
    n_results = hook_output.get("n_results", n_results)
    batch_size = hook_output.get("batch_size", batch_size)

If there is no hook, this block does not execute, and the original values passed as arguments for query, ids, n_results and batch_size are used in the code pipeline as normal.
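A symmetric block would presumably handle the postprocess side; a sketch, with the "results" key name again an assumption rather than the actual implementation:

# Check if there is a user-defined postprocess hook for the search method
if "search_postprocess" in self.hooks:
    hook_output = self.hooks["search_postprocess"]({"results": results})
    results = hook_output.get("results", results)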

This implementation leaves open the ability to re-add Pydantic type checking at a later date.

@frayle-ons frayle-ons requested a review from matweldon December 11, 2025 17:03
matweldon previously approved these changes Dec 15, 2025

@matweldon left a comment:

LGTM!

@frayle-ons frayle-ons merged commit 5b52a4a into main Dec 17, 2025
5 checks passed
@matweldon matweldon deleted the 86-pre-post-processing-functions branch December 17, 2025 13:10