
Add run_async to LlamaCppChatGenerator. #2821

Open
kudos07 wants to merge 3 commits into deepset-ai:main from kudos07:feat/llamacpp-add-run-async

Conversation

kudos07 (Contributor) commented Feb 9, 2026

Summary

Add run_async to LlamaCppChatGenerator.

  • Implement run_async (wraps run() in asyncio.to_thread since llama-cpp-python is synchronous); a rough sketch follows this list.
  • Add async unit tests and an optional integration test.
  • Add pytest-asyncio config and a CHANGELOG entry.
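
A rough sketch of the approach, a minimal wrapper whose parameter names mirror the existing run method and are assumptions here:

    import asyncio

    async def run_async(self, messages, generation_kwargs=None):
        # llama-cpp-python is synchronous, so delegate to the existing run()
        # in a worker thread instead of blocking the event loop.
        return await asyncio.to_thread(
            self.run, messages=messages, generation_kwargs=generation_kwargs
        )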

Related Issues

Proposed Changes

  • New run_async method on LlamaCppChatGenerator with identical signature to run.
  • Unit tests:
    • TestLlamaCppChatGeneratorAsync::test_run_async
    • TestLlamaCppChatGeneratorAsync::test_run_async_with_params
    • TestLlamaCppChatGeneratorAsync::test_run_async_with_empty_message
  • Optional integration test test_live_run_async (marked integration).
  • pytest config updated to enable asyncio mode (a config sketch follows this list).
  • CHANGELOG updated with an Unreleased note.
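
A minimal sketch of the pytest-asyncio setting, assuming it lives in the integration's pyproject.toml (both the file location and the mode value are assumptions):

    [tool.pytest.ini_options]
    asyncio_mode = "auto"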

How did you test it?

  • Created local venv and installed dependencies (used CPU wheels for llama-cpp-python on Windows).
  • Ran unit tests (mocked) locally:
    • python -m pytest tests/test_chat_generator.py::TestLlamaCppChatGeneratorAsync -v -m "not integration" — all async unit tests passed.
  • Integration test included but may be skipped locally (downloads a GGUF model). CI will run full integration tests.

Notes for the reviewer

  • Implementation is intentionally minimal and consistent with existing patterns (Fallback uses asyncio.to_thread).
  • Streaming behavior is unchanged; run_async currently uses the same streaming semantics as run (the streaming callback runs from the worker thread). If the project prefers a queue-based async streaming bridge (like the HF local generator), that can be implemented in a follow-up; a rough sketch of such a bridge follows this list.
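
For reference, such a bridge could look roughly like the sketch below. It assumes the streaming callback can be supplied per call, that the caller passes an async callback, and that run forwards chunks to whatever callback it receives; all names are illustrative:

    import asyncio

    async def run_async(self, messages, generation_kwargs=None, streaming_callback=None):
        loop = asyncio.get_running_loop()
        queue: asyncio.Queue = asyncio.Queue()
        _DONE = object()  # sentinel marking the end of generation

        def thread_callback(chunk):
            # Runs in the worker thread; hand the chunk back to the event loop.
            loop.call_soon_threadsafe(queue.put_nowait, chunk)

        async def consume():
            # Forward chunks to the caller's async callback as they arrive.
            while True:
                chunk = await queue.get()
                if chunk is _DONE:
                    break
                if streaming_callback is not None:
                    await streaming_callback(chunk)

        def generate():
            try:
                return self.run(
                    messages=messages,
                    generation_kwargs=generation_kwargs,
                    streaming_callback=thread_callback,
                )
            finally:
                loop.call_soon_threadsafe(queue.put_nowait, _DONE)

        consumer = asyncio.create_task(consume())
        result = await asyncio.to_thread(generate)
        await consumer
        return result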

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used an appropriate conventional commit type for the PR title (e.g., feat:)

- Implement run_async (wraps run() in asyncio.to_thread).
- Add async unit tests and optional integration test.
- Add pytest-asyncio config and CHANGELOG entry.
@kudos07 kudos07 requested a review from a team as a code owner February 9, 2026 06:57
@kudos07 kudos07 requested review from anakin87 and removed request for a team February 9, 2026 06:57
@github-actions github-actions bot added the integration:llama_cpp and type:documentation labels Feb 9, 2026
anakin87 (Member) commented Feb 9, 2026

@kudos07 thank you for the contribution!

I'll take a look in the next few days...

@anakin87 anakin87 left a comment


I left some comments on possible improvements...

### 🚀 Features

- Add `run_async` to `LlamaCppChatGenerator` for AsyncPipeline support

anakin87 (Member):

this file is automatically generated at release time, so please remove the addition

return generator

@pytest.mark.integration
async def test_live_run_async(self, generator):
anakin87 (Member):

Suggested change:
-    async def test_live_run_async(self, generator):
+    @pytest.mark.parametrize("streaming_callback", [None, print_streaming_chunk])
+    async def test_live_run_async(self, generator):

let's also verify that async+streaming works

:returns: A dictionary with the following keys:
- `replies`: The responses from the model
"""
return await asyncio.to_thread(
anakin87 (Member):

my impression is that since llama-cpp-python is not thread-safe (abetlen/llama-cpp-python#951), this could be problematic.

A simple idea to fix this is the following:

  1. Lock in __init__ (with an explanatory comment)
    self._inference_lock = asyncio.Lock()

  2. Use the lock in run_async

    async with self._inference_lock:
        return await asyncio.to_thread(self.run, ...)

Of course, this means only one generation at a time in case of multiple concurrent requests, but it is thread-safe and still exposes an async interface.
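
Put together, a rough sketch of that suggestion (parameter names again mirror the existing run method and are assumptions):

    import asyncio

    class LlamaCppChatGenerator:
        def __init__(self):
            # ... existing initialization ...
            # llama-cpp-python is not thread-safe (abetlen/llama-cpp-python#951),
            # so serialize generations issued through the async interface.
            self._inference_lock = asyncio.Lock()

        def run(self, messages, generation_kwargs=None):
            ...  # existing synchronous implementation (elided)

        async def run_async(self, messages, generation_kwargs=None):
            # One generation at a time, executed in a worker thread so the
            # event loop stays responsive while llama.cpp generates.
            async with self._inference_lock:
                return await asyncio.to_thread(
                    self.run, messages=messages, generation_kwargs=generation_kwargs
                )

One detail worth double-checking: creating asyncio.Lock() in __init__ (outside a running event loop) is fine on Python 3.10+, where the lock binds to a loop lazily on first use.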


Development

Successfully merging this pull request may close these issues.

add run_async for LlamaCppChatGenerator
