fix: surrogate pair (emoji) handling for a, r, s, ~, ga commands #9929

Open
k1832 wants to merge 1 commit into VSCodeVim:master from k1832:fix/surrogate-pairs

k1832 commented Feb 10, 2026

What this PR does / why we need it:

When the cursor is on a character encoded as a UTF-16 surrogate pair — emojis (😄), rare CJK characters (𩸽), mathematical symbols (𝒟, 𝔸), musical symbols (𝄞), etc. — several character-level commands break.

position.getRight() increments by 1 UTF-16 code unit, but these characters are 2 code units. Moving by 1 lands between the pair, and VSCode's validatePosition clamps it back to the start — so the cursor effectively goes backward.
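The off-by-one described above can be reproduced on a plain string, without any editor API. This is only an illustrative sketch: `clampToCharacterStart` is a hypothetical stand-in for the clamping behavior the description attributes to `validatePosition`, not the real VSCode function.

```typescript
// Plain-string illustration; no VSCode or VSCodeVim API is used here.
const line = 'a😄b';

// 😄 (U+1F604) is stored as two UTF-16 code units: a high and a low surrogate.
const codeUnitCount = '😄'.length;
const high = line.charCodeAt(1); // 0xD83D
const low = line.charCodeAt(2); // 0xDE04

const isHigh = (u: number): boolean => u >= 0xd800 && u <= 0xdbff;
const isLow = (u: number): boolean => u >= 0xdc00 && u <= 0xdfff;

// Hypothetical model of the clamping: an offset that falls between the two
// halves of a pair is moved back to the start of the pair.
function clampToCharacterStart(text: string, offset: number): number {
  const insidePair =
    offset > 0 &&
    offset < text.length &&
    isHigh(text.charCodeAt(offset - 1)) &&
    isLow(text.charCodeAt(offset));
  return insidePair ? offset - 1 : offset;
}

// Cursor on the emoji (offset 1). Stepping by one code unit gives offset 2,
// which is inside the pair, so the clamp returns it to offset 1.
const cursor = 1;
const naiveRight = cursor + 1;
const afterClamp = clampToCharacterStart(line, naiveRight);
```

Under this model `afterClamp` equals the starting offset, so a command that expected to end up after the character ends up before it.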

| Command | Broken behavior |
| --- | --- |
| a (Append) | Cursor lands before the character instead of after |
| r (Replace) | Only replaces half the character, corrupting the text |
| s (Change char) | Only deletes half the character before entering Insert mode |
| ~ (Toggle case) | Corrupts the character into a lone surrogate |
| ga (Unicode info) | Shows the half-surrogate value instead of the full codepoint |
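Two of the symptoms listed above (`ga` and `r`) can be shown with standard string methods alone. The sketch below operates on plain strings and does not reflect how the extension edits documents; the variable names are illustrative.

```typescript
const emoji = '😄';

// charCodeAt reads a single UTF-16 code unit, so it reports only the high surrogate.
const half = emoji.charCodeAt(0).toString(16).toUpperCase();

// codePointAt combines the surrogate pair into the full code point.
const full = emoji.codePointAt(0)!.toString(16).toUpperCase();

// Replacing one code unit (a one-unit-wide replace) leaves the low surrogate
// behind; replacing both units replaces the whole character.
const lineText = 'a😄b';
const replacedOneUnit = lineText.slice(0, 1) + 'x' + lineText.slice(2);
const replacedWholeChar = lineText.slice(0, 1) + 'x' + lineText.slice(3);

console.log(`U+${half}`, `U+${full}`, replacedWholeChar);
```

Here `half` is the surrogate value `D83D`, while `full` is `1F604`, the code point a user would expect `ga` to report. `replacedOneUnit` still contains the orphaned low surrogate `0xDE04` at index 2.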

Previous fixes addressed insert mode (PR #7977 / #6046), motions (l/h), and operators (x/X/y), but these 5 commands were missed. This PR completes the surrogate pair handling for character-level commands and adds 12 regression tests to prevent future breakage.

Adds getSurrogateAwareRight()/getSurrogateAwareLeft() helpers on the Position prototype that skip past surrogate pairs, following the same pattern already used by MoveRight (l) / MoveLeft (h). Uses these helpers in the 5 affected commands. Also switches ga from charCodeAt() to codePointAt() to report the full Unicode codepoint.
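The idea behind the helpers can be sketched over a plain `(text, offset)` pair. This is an assumption-laden illustration, not the PR's code: the real helpers live on the Position prototype and their exact signatures are not shown in this description, so `surrogateAwareRight` and `surrogateAwareLeft` below are hypothetical free functions.

```typescript
function isHighSurrogate(unit: number): boolean {
  return unit >= 0xd800 && unit <= 0xdbff;
}

function isLowSurrogate(unit: number): boolean {
  return unit >= 0xdc00 && unit <= 0xdfff;
}

// Hypothetical: move one character right, stepping over a full surrogate pair.
function surrogateAwareRight(text: string, offset: number): number {
  if (offset >= text.length) {
    return text.length;
  }
  const startsPair =
    offset + 1 < text.length &&
    isHighSurrogate(text.charCodeAt(offset)) &&
    isLowSurrogate(text.charCodeAt(offset + 1));
  return offset + (startsPair ? 2 : 1);
}

// Hypothetical: move one character left, stepping over a full surrogate pair.
function surrogateAwareLeft(text: string, offset: number): number {
  if (offset <= 0) {
    return 0;
  }
  const endsPair =
    offset >= 2 &&
    isLowSurrogate(text.charCodeAt(offset - 1)) &&
    isHighSurrogate(text.charCodeAt(offset - 2));
  return offset - (endsPair ? 2 : 1);
}

const sample = 'a😄b';
console.log(surrogateAwareRight(sample, 1), surrogateAwareLeft(sample, 3));
```

With the cursor on the emoji at offset 1, the right helper returns 3 (just past the pair) rather than 2, and the left helper maps 3 back to 1. Characters inside the Basic Multilingual Plane still advance by a single code unit.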

Which issue(s) this PR fixes

Fixes #9931
Partially addresses #8321 — this PR fixes r and other character-level commands reported there. Remaining issues not addressed: easymotion uses raw character arithmetic for match positioning and marker decorations; xp (transpose) on surrogate pairs is also still broken (TODO added in put.ts).

Special notes for your reviewer:

  • 19/19 surrogate pair tests passing (7 existing + 12 new)
  • Full test suite passes (3157 passing, 0 regressions)
  • Tests cover emojis, rare CJK (𩸽), and mathematical symbols (𝒟, 𝔸) from the Mathematical Alphanumeric Symbols block (starting at U+1D400)
  • Building the tests requires #9928 (fix: add missing @types/minimatch devDependency) when dependencies are re-resolved

k1832 changed the title from "Fix surrogate pair (emoji) handling for a, r, s, ~, ga commands" to "fix: surrogate pair (emoji) handling for a, r, s, ~, ga commands" on Feb 10, 2026
k1832 force-pushed the fix/surrogate-pairs branch from 0cdb8e1 to 14abd9a on February 10, 2026 03:14
k1832 mentioned this pull request on Feb 10, 2026
k1832 force-pushed the fix/surrogate-pairs branch from 14abd9a to 164974e on February 10, 2026 03:26
`position.getRight()` increments by 1 UTF-16 code unit, but emojis
outside the Basic Multilingual Plane are encoded as 2 code units
(a surrogate pair). Moving by 1 code unit lands between the pair,
and VSCode's `validatePosition` clamps it back to the start — so
the cursor effectively goes backward.

Add `getSurrogateAwareRight()`/`getSurrogateAwareLeft()` helpers on
the Position prototype that skip past surrogate pairs, and use them
in the affected commands:

- `a` (Append): cursor now lands after the emoji
- `r` (ReplaceCharacter): replaces the full emoji, not half
- `s` (ChangeOperator): deletes the full emoji before entering insert
- `~` (ToggleCase): advances past the emoji without corrupting it
- `ga` (UnicodeInfo): shows full codepoint (e.g. U+1F604) via
  codePointAt() instead of the half-surrogate from charCodeAt()

Commands already protected (no changes needed): x/X (DeleteOperator),
l/h (MoveRight/MoveLeft), y (YankOperator).
Development

Successfully merging this pull request may close these issues.

Surrogate pair characters (emoji, rare CJK, etc.) break a, r, s, ~, ga commands