Add run_async to LlamaCppChatGenerator. #2821

kudos07 wants to merge 3 commits into deepset-ai:main
Conversation
- Implement `run_async` (wraps `run()` in `asyncio.to_thread`).
- Add async unit tests and an optional integration test.
- Add `pytest-asyncio` config and a CHANGELOG entry.
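Based on this description, the core of the change is presumably a thin awaitable wrapper along these lines (a sketch, not the PR's actual diff; the class skeleton and the `run` signature are assumptions):

```python
import asyncio
from typing import Any, Dict, List, Optional

from haystack.dataclasses import ChatMessage


class LlamaCppChatGenerator:
    # ...existing synchronous component code...

    async def run_async(
        self, messages: List[ChatMessage], generation_kwargs: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        # llama-cpp-python is synchronous, so offload the blocking run() call to a worker thread.
        return await asyncio.to_thread(self.run, messages=messages, generation_kwargs=generation_kwargs)
```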
@kudos07 thank you for the contribution! I'll take a look in the next few days...
anakin87 left a comment
I left some comments on possible improvements...
> ### 🚀 Features
>
> - Add `run_async` to `LlamaCppChatGenerator` for AsyncPipeline support
this file is automatically generated at release time, so please remove the addition
> return generator
>
> @pytest.mark.integration
> async def test_live_run_async(self, generator):
Suggested change:

> - async def test_live_run_async(self, generator):
> + @pytest.mark.parametrize("streaming_callback", [None, print_streaming_chunk])
> + async def test_live_run_async(self, generator):
let's also verify that async+streaming works
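For reference, a minimal sketch of what the parametrized test could look like (the `generator` fixture comes from the diff above; the prompt, the `pytest.mark.asyncio` marker, and the way the callback is wired onto the instance are assumptions, and the exact API for passing `streaming_callback` may differ):

```python
import pytest

from haystack.components.generators.utils import print_streaming_chunk
from haystack.dataclasses import ChatMessage


class TestLlamaCppChatGeneratorAsync:
    @pytest.mark.integration
    @pytest.mark.asyncio
    @pytest.mark.parametrize("streaming_callback", [None, print_streaming_chunk])
    async def test_live_run_async(self, generator, streaming_callback):
        # Exercise the async path both with and without a streaming callback.
        generator.streaming_callback = streaming_callback  # assumes the callback can be set on the instance
        result = await generator.run_async(
            messages=[ChatMessage.from_user("Write a one-sentence greeting.")]
        )
        assert "replies" in result
        assert len(result["replies"]) > 0
        assert result["replies"][0].text
```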
> :returns: A dictionary with the following keys:
>     - `replies`: The responses from the model
> """
> return await asyncio.to_thread(
my impression is that since llama.cpp python is not thread-safe (abetlen/llama-cpp-python#951), this could be problematic.
A simple idea to fix this is the following:

1. Create a lock in `__init__` (with an explanatory comment):
   `self._inference_lock = asyncio.Lock()`
2. Use the lock in `run_async`:
   `async with self._inference_lock:`
   `return await asyncio.to_thread(self.run, ...)`

Ofc, this means only performing one generation at a time in case of multiple requests, but it is thread-safe and exposes an async interface.
Summary
Add `run_async` to `LlamaCppChatGenerator`.

- Implement `run_async` (wraps `run()` in `asyncio.to_thread`, since `llama-cpp-python` is synchronous).
- Add async unit tests and an optional integration test.
- Add `pytest-asyncio` config and a CHANGELOG entry.

Related Issues
- `run_async` for `LlamaCppChatGenerator` #1890

Proposed Changes
- Add a `run_async` method on `LlamaCppChatGenerator` with an identical signature to `run`.
- Add async unit tests:
  - `TestLlamaCppChatGeneratorAsync::test_run_async`
  - `TestLlamaCppChatGeneratorAsync::test_run_async_with_params`
  - `TestLlamaCppChatGeneratorAsync::test_run_async_with_empty_message`
- Add `test_live_run_async` (marked `integration`).
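A hypothetical usage sketch of the new method (the model path, context size, and generation parameters are placeholders; the import path assumes the llama.cpp Haystack integration package):

```python
import asyncio

from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator


async def main() -> None:
    generator = LlamaCppChatGenerator(
        model="models/my-model.Q4_K_M.gguf",  # placeholder path to a local GGUF file
        n_ctx=2048,
        generation_kwargs={"max_tokens": 128},
    )
    generator.warm_up()  # load the model before running inference

    # Same signature and return shape as run(), but awaitable.
    result = await generator.run_async(messages=[ChatMessage.from_user("Say hello in one sentence.")])
    print(result["replies"][0].text)


if __name__ == "__main__":
    asyncio.run(main())
```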
How did you test it?

- Integration test not run locally (`llama-cpp-python` on Windows).
- `python -m pytest tests/test_chat_generator.py::TestLlamaCppChatGeneratorAsync -v -m "not integration"` — all async unit tests passed.

Notes for the reviewer
- `llama-cpp-python` is synchronous, so `run_async` offloads `run()` to a worker thread (`asyncio.to_thread`).
- `run_async` currently uses the same streaming semantics as `run` (the streaming callback runs from the worker thread). If the project prefers a queue-based async streaming bridge (like HF local), that can be implemented in a follow-up.
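For context, the "queue-based async streaming bridge" idea could look roughly like the following hypothetical helper (the function name `run_with_async_streaming` and the callback signature are illustrative, not part of this PR): the worker thread hands chunks to the event loop via `loop.call_soon_threadsafe`, and the async side awaits them and invokes an async callback.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

from haystack.dataclasses import StreamingChunk


async def run_with_async_streaming(
    sync_run: Callable[..., Dict[str, Any]],
    streaming_callback: Callable[[StreamingChunk], Awaitable[None]],
    **kwargs: Any,
) -> Dict[str, Any]:
    """Bridge a synchronous, callback-based run() to an async streaming callback."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    done = object()  # sentinel marking the end of the stream

    def thread_callback(chunk: StreamingChunk) -> None:
        # Called from the worker thread: hand the chunk to the event loop safely.
        loop.call_soon_threadsafe(queue.put_nowait, chunk)

    async def consume() -> None:
        while True:
            item = await queue.get()
            if item is done:
                break
            await streaming_callback(item)

    consumer = asyncio.create_task(consume())
    try:
        # The synchronous run() streams through thread_callback while we await it.
        result = await asyncio.to_thread(sync_run, streaming_callback=thread_callback, **kwargs)
    finally:
        queue.put_nowait(done)  # unblock the consumer even if run() raised
    await consumer
    return result
```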
Checklist

- PR title uses a conventional commit type (`feat:`).