feat: pre- and post- processing Hooks #91
Conversation
@frayle-ons would it be possible to rebase this onto main so that I can review independently of the pydantic classes PR?
I've removed the Pydantic classes from this ticket entirely at @matweldon's request, and reverted to the simpler checks, such as converting input strings to a list of strings in the vectoriser transform. Since the pre-processing and post-processing hooks previously relied on Pydantic objects, I've now implemented a solution that works with dictionaries of the input arguments. A user writing a pre- or post-processing function should expect to work with a dictionary object whose keys are the argument names of the hooked method (for VectorStore.search(), these are query, ids, n_results and batch_size).
The user flow for a pre-processing punctuation-removal hook on the VectorStore could now be implemented as follows. First, the user creates their function, where the input_data parameter is expected to be a dictionary with those keys:

```python
import string

def remove_punctuation(input_data):
    # We want to modify the 'query' field in the input_data dictionary,
    # which is a list of texts. This removes punctuation from each string
    # with a list comprehension.
    sanitized_texts = [x.translate(str.maketrans("", "", string.punctuation)) for x in input_data["query"]]
    input_data["query"] = sanitized_texts
    # Return the dictionary with its modified field
    return input_data
```

Instantiating the VectorStore with the hooks:

```python
my_vector_store_with_hooks = VectorStore(
    file_name="data/fake_soc_dataset.csv",
    data_type="csv",
    vectoriser=vectoriser,
    overwrite=True,
    hooks={
        "search_preprocess": remove_punctuation,
        "search_postprocess": drop_duplicates,
    },
)
```

Finally, when the search method is executed on a running VectorStore object, the following code will execute to pass the input arguments through the user's function:

```python
# Check if there is a user-defined preprocess hook for the VectorStore search method
if "search_preprocess" in self.hooks:
# Pass the args as a dictionary to the preprocessing function
hook_output = self.hooks["search_preprocess"](
{"query": query, "ids": ids, "n_results": n_results, "batch_size": batch_size}
)
# Unpack the dictionary back into the argument variables
query = hook_output.get("query", query)
ids = hook_output.get("ids", ids)
n_results = hook_output.get("n_results", n_results)
batch_size = hook_output.get("batch_size", batch_size)If there is no hook, then this subroutine of code will not execute, and the original values passed as arguments for query, ids, n_results and batch_size will be used in the code pipeline as normal. This implentation leaves open the ability to re-add pyndatic type checking at a later date |
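Because the unpacking above uses dict.get() with the original arguments as defaults, a hook only has to return the keys it actually modified. As a minimal sketch (the function name cap_n_results is hypothetical):

```python
def cap_n_results(input_data):
    # Return only the key we changed; the dispatch code falls back to the
    # original values for any keys missing from the returned dictionary.
    return {"n_results": min(input_data["n_results"], 10)}
```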
matweldon left a comment:
LGTM!
✨ Summary
This PR has merged in, and utilises, the changes introduced in #85.
This PR introduces a flexible hook system that allows developers to inject custom logic before and after key pipeline stages in the ClassifAI framework, such as VectorStore.search() and Vectoriser.transform(), i.e. preprocessing and post-processing functions.
The logic utilises the Pydantic type checking on the Vectoriser and VectorStore methods introduced in #85, where the inputs and outputs of the package module methods are validated against Pydantic models that expect them to be consistent with their model definitions.
Example usage
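As a minimal sketch (the function name queries_to_lower is hypothetical, and the field access assumes the SearchInput shape sketched later in this summary):

```python
def queries_to_lower(validated_input):
    # Modify the 'query' field of the validated Pydantic object so that
    # every query string is lowercased before the search runs.
    validated_input.query = [q.lower() for q in validated_input.query]
    # Return the whole object; it is validated again before being passed
    # on to the main search logic.
    return validated_input
```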
The above code snippet includes a function that will execute when the VectorStore.search() method is called: taking the Pydantic object validated from the arguments passed to the method, modifying the query field of that validated object to convert all queries to lowercase, and then returning the whole Pydantic object, which will be validated again and then sent to the main search function.

Underlying logic
Firstly, the user must understand the expected input and output types of the module classes and their methods in the ClassifAI package. For example, if the user understands that the VectorStore.search() method accepts query, ids, n_results and batch_size, they can create a function that modifies any of those arguments on input to the method. Indeed, the Pydantic models created in #85 can help users better understand the input requirements:
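A minimal sketch of what such a model might look like; the field types and defaults here are illustrative assumptions, not the actual definitions from #85:

```python
from typing import List, Optional
from pydantic import BaseModel

class SearchInput(BaseModel):
    # Field names mirror the VectorStore.search() arguments; the types
    # and defaults are assumptions for illustration.
    query: List[str]
    ids: Optional[List[str]] = None
    n_results: int = 10
    batch_size: Optional[int] = None
```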
Without any declared hooks, the Pydantic models will pass the input arguments of the VectorStore.search() method to the Pydantic model SearchInput, creating a validated_input, which is used in the main code logic. The expected output, which is a results dataframe, is passed to the SearchOutput Pydantic object and, on successful validation, is returned.

But if a user adds a hook to search_preprocess, an additional logic step will occur, modifying the SearchInput validated data with the logic of the hook and then revalidating with the same Pydantic model. Take, for example, the function from before that converts the queries to lowercase (see the sketch under Example usage above).
Similarly, if they add a search_postprocess hook, similar logic will be executed, but on the expected dataframe output object from the method, modifying it according to the user's hook and then revalidating with the Pydantic model to ensure that the data flow rules of the package are not broken.

Method of writing preprocessing and post-processing hooks
Users should write their functions to take in one argument, the validated Pydantic object, and to return one object, the modified Pydantic object whose fields they have changed. They can understand which fields they are able to modify by studying the Pydantic model classes themselves, and this will also help them understand what kinds of modifications are not acceptable (for example, the rank column of the results dataframe has to be of type int regardless of how the user chooses to implement their hook).
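Under this Pydantic design, a post-processing hook like the drop_duplicates referenced earlier might look something like this minimal sketch; the results field name on the validated output object is an assumption for illustration:

```python
def drop_duplicates(validated_output):
    # Drop duplicate rows from the results dataframe. The 'results' field
    # name is an assumption; the actual SearchOutput field may differ.
    validated_output.results = validated_output.results.drop_duplicates()
    # Return the whole object so it can be revalidated against the model.
    return validated_output
```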
Hooks added:
While many of the public-facing methods have Pydantic type checking, only the following are available for hooks, with their corresponding hook dictionary names:
Vectoriser.transform() -> transform_preprocess, transform_postprocess
VectorStore.search() -> search_preprocess, search_postprocess
VectorStore.reverse_search() -> reverse_search_preprocess, reverse_search_postprocess
This means that init methods and class methods have been excluded from the hook system.
📜 Changes Introduced
✅ Checklist
🔍 How to Test
I've created a DEMO/custom_preprocessing_and_postprocessing_hooks.ipynb notebook which provides a demo of how to create hooks. Starting a virtual environment, installing ClassifAI with the huggingface optional dependency, and then executing the code in this notebook should confirm that it works as expected for the tutorial.
Additionally, testing some of the other methods' hooks, such as reverse_search on the VectorStore and transform on the Vectoriser, as well as intentionally writing hook functions that should break the Pydantic validation to ensure a Pydantic error is generated, would be a valuable exercise.
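A minimal sketch of such an intentionally invalid hook, based on the rank-column constraint described above (the results field name is again an assumption); revalidation should raise a Pydantic ValidationError:

```python
def break_rank_types(validated_output):
    # Intentionally violate the model: the rank column must be of type int,
    # so casting it to strings should fail revalidation. The 'results'
    # field name is an assumption for illustration.
    validated_output.results["rank"] = validated_output.results["rank"].astype(str)
    return validated_output
```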