fix(llm): sanitize control characters in function call JSON arguments #4196
base: main
LLMs sometimes generate JSON with literal control characters (e.g., newlines, tabs) inside string values. These violate the JSON spec and cause pydantic_core's from_json() to fail with:

    ValueError: control character (\u0000-\u001F) found while parsing a string

This adds a sanitization step before parsing that escapes control characters (\n, \r, \t, etc.) within JSON string values while preserving already-escaped sequences.

Fixes an issue where function tools with multi-line content in arguments would fail to parse.
592c79e to 7bf3b00
Pull request overview
This PR adds sanitization for control characters in LLM-generated JSON function call arguments to prevent parsing errors. The main issue addressed is that LLMs sometimes generate JSON with literal control characters (newlines, tabs, etc.) inside string values, which violates the JSON specification and causes pydantic_core.from_json() to fail.
Key changes:
- Added `_sanitize_json_control_chars()` helper function that escapes control characters within JSON string values
- Modified `prepare_function_arguments()` to sanitize JSON before parsing
- Updated mistralai dependency from 1.9.3 to 1.9.11 (with new invoke and pyyaml dependencies)
Reviewed changes
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| uv.lock | Updated mistralai dependency from 1.9.3 to 1.9.11 and added invoke 2.2.1 and pyyaml dependencies |
| livekit-agents/livekit/agents/llm/utils.py | Added _sanitize_json_control_chars() function and integrated it into prepare_function_arguments() to escape control characters before JSON parsing |
    if not json_str:
        return json_str

    def escape_control_chars_in_string(match: re.Match[str]) -> str:
could you add some tests for the parsing function? it's important to ensure that it's handling the wide variety of valid JSON inputs without breaking them.
also, isn't it better to run a regex replacement to strip them out?
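The trade-off behind the stripping question can be seen in a small sketch. The payload and variable names here are hypothetical, and the standard-library `json` module stands in for `pydantic_core.from_json()`; this is an illustration, not the code under review.

```python
import json
import re

# Hypothetical tool-call payload with a raw newline inside the string value.
raw = '{"note": "line 1\nline 2"}'

# Stripping: remove every control character before parsing.
stripped = re.sub(r"[\x00-\x1f]", "", raw)

# Escaping: turn the raw newline into the two-character sequence \n.
# A plain str.replace is only safe here because this payload has no
# newlines outside the string value.
escaped = raw.replace("\n", "\\n")

stripped_note = json.loads(stripped)["note"]
escaped_note = json.loads(escaped)["note"]

print(repr(stripped_note))  # the line break is gone
print(repr(escaped_note))   # the line break survives
```

Both variants parse, but stripping silently joins the two lines, which matters when the tool argument is multi-line content. Stripping does leave valid JSON unaffected, since control characters can only appear there as insignificant whitespace between tokens.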
Description
Problem
LLMs sometimes generate function call JSON with literal control characters (e.g., newlines, tabs) inside string values. For example:

    {"prompt": "A timeline showing:
    - Event 1
    - Event 2"}

The literal newline violates the JSON spec, causing `pydantic_core.from_json()` to fail with:

    ValueError: control character (\u0000-\u001F) found while parsing a string

This breaks function tool execution when the LLM outputs multi-line content in tool arguments.
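The failure can be reproduced in isolation. The sketch below uses the standard-library `json` module as a stand-in for `pydantic_core.from_json()` (an assumption made to keep the example dependency-free; the error message differs, but strict JSON parsers reject the same input). The helper name `is_valid_json` is hypothetical.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON under the default strict rules."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Tool-call arguments as an LLM might emit them: the "\n" in this Python
# literal is a real newline character sitting inside the JSON string value.
raw = '{"prompt": "A timeline showing:\n- Event 1\n- Event 2"}'

# The same payload with each newline written as the JSON escape sequence \n.
escaped = '{"prompt": "A timeline showing:\\n- Event 1\\n- Event 2"}'

print(is_valid_json(raw))
print(is_valid_json(escaped))
```

Only the escaped form parses, and it decodes back to the intended multi-line prompt.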
Solution
Add a `_sanitize_json_control_chars()` helper that escapes control characters within JSON string values before parsing:
- `\n` → `\\n`
- `\r` → `\\r`
- `\t` → `\\t`
- other control characters → `\\uXXXX`

The function preserves already-escaped sequences and only modifies content inside JSON string values.
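One way such a helper could work is sketched below. This is not the code from the PR: the function name `sanitize_json_control_chars`, the regular expressions, and the module-level constants are all assumptions made for illustration, and `json.loads` stands in for `pydantic_core.from_json()`.

```python
from __future__ import annotations

import json
import re

# One JSON string literal, including any raw control characters inside it.
# A backslash plus the following character is consumed as a pair, so an
# escaped quote (\") does not end the match early.
_JSON_STRING = re.compile(r'"(?:[^"\\]|\\.)*"', re.DOTALL)
_CONTROL_CHAR = re.compile(r"[\x00-\x1f]")
_SHORT_ESCAPES = {"\n": "\\n", "\r": "\\r", "\t": "\\t"}


def _escape_char(match: re.Match[str]) -> str:
    """Map one raw control character to its JSON escape sequence."""
    ch = match.group(0)
    return _SHORT_ESCAPES.get(ch) or f"\\u{ord(ch):04x}"


def _escape_string(match: re.Match[str]) -> str:
    """Escape the raw control characters inside one matched string literal."""
    return _CONTROL_CHAR.sub(_escape_char, match.group(0))


def sanitize_json_control_chars(json_str: str) -> str:
    """Escape raw control characters that appear inside JSON string values."""
    if not json_str:
        return json_str
    # Text between string literals (structural whitespace) is left untouched.
    return _JSON_STRING.sub(_escape_string, json_str)


raw = '{"prompt": "line 1\nline 2\tend", "n": 1}'
print(json.loads(sanitize_json_control_chars(raw)))
```

Because the substitution only runs on matched string literals, whitespace between tokens and sequences that are already escaped pass through unchanged, so valid JSON should come back byte-for-byte identical.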
Changes
- Added `_sanitize_json_control_chars()` helper function in `utils.py`
- Modified `prepare_function_arguments()` to sanitize JSON before calling `from_json()`

Testing
Tested with real-world LLM output containing multi-line prompts that previously caused the error.